Different precision calling .astype(str) on float numbers #11302

Closed
marcomayer opened this issue Oct 12, 2015 · 18 comments · Fixed by #11309
Labels: Bug · Numeric Operations (Arithmetic, Comparison, and Logical operations) · Output-Formatting (__repr__ of pandas objects, to_string)

@marcomayer

With pandas 0.16.2:

import pandas as pd
pd.DataFrame([1.12345678901234567890]).astype(str)
               0
0  1.12345678901

With pandas 0.17:

import pandas as pd
pd.DataFrame([1.12345678901234567890]).astype(str)
                    0
0  1.1234567890123457

I read the 0.17 release log but couldn't figure out why that is. Is it a bug or a new feature, and if it's a new feature how can I re-activate the old behavior?

@jreback
Contributor

jreback commented Oct 12, 2015

what version of numpy?

@marcomayer
Author

numpy 1.10.0

@jreback
Contributor

jreback commented Oct 12, 2015

in both cases?

@marcomayer
Author

in both cases yes. I updated with conda update pandas, which also updated numpy. Then I downgraded pandas with conda install pandas=0.16.2 and it worked again.

@jreback
Contributor

jreback commented Oct 12, 2015

this might be just a printing thing, e.g. the display.precision changed in 0.17.0

@marcomayer
Author

0.16.2:

pd.DataFrame([1.12345678901234567890]).astype(str).to_dict()
{0: {0: '1.12345678901'}}

0.17:

pd.DataFrame([1.12345678901234567890]).astype(str).to_dict()
{0: {0: '1.1234567890123457'}}
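
The two string forms reported here can be reproduced in plain Python without pandas. The sketch below only illustrates the difference between 12 significant digits and the shortest round-trip representation; it makes no claim about which code path pandas takes internally, and the variable names are illustrative.

```python
# The literal is stored as the nearest 64-bit float.
x = 1.12345678901234567890

# 12 significant digits, matching the shorter string seen in this thread.
short = format(x, ".12g")

# Shortest string that converts back to exactly the same float.
full = repr(x)

print(short)              # 1.12345678901
print(full)               # 1.1234567890123457
print(float(short) == x)  # False: digits were dropped
print(float(full) == x)   # True: the value round-trips
```

Only the longer form preserves the stored value exactly, which is why stored string baselines differ even though the underlying float is the same.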

@jreback
Contributor

jreback commented Oct 12, 2015

no, see if the actual numbers are the same

e.g. df.at[0,0]

@marcomayer
Author

0.16.2:

pd.DataFrame([1.12345678901234567890]).at[0,0]
1.1234567890123457
pd.DataFrame([1.12345678901234567890]).astype(str).at[0,0]
'1.12345678901'

0.17:

pd.DataFrame([1.12345678901234567890]).at[0,0]
1.1234567890123457
pd.DataFrame([1.12345678901234567890]).astype(str).at[0,0]
'1.1234567890123457'

@jreback
Contributor

jreback commented Oct 12, 2015

0.16.2

In [2]: pd.__version__
Out[2]: '0.16.2'

In [3]: np.__version__
Out[3]: '1.10.0'

In [4]: pd.DataFrame([1.12345678901234567890]).astype(str)
Out[4]: 
               0
0  1.12345678901

0.17.0

In [1]: pd.__version__
Out[1]: u'0.17.0'

In [2]: np.__version__
Out[2]: '1.10.0'

In [3]: pd.DataFrame([1.12345678901234567890]).astype(str)
Out[3]: 
               0
0  1.12345678901

This is Python 2.7 on Mac OS X. Please be more specific about your Python version and OS.

@marcomayer
Author

do you get the same when using .to_dict()?

Also, I used the plain Python console instead of IPython/notebook to make sure it's not a display issue caused by IPython.

I'm running Python 3.4.3 :: Anaconda 2.3.0 (x86_64) on macosx.

@jreback
Contributor

jreback commented Oct 12, 2015

Python 3.4.3 |Continuum Analytics, Inc.| (default, Mar  6 2015, 12:07:41) 
[GCC 4.2.1 (Apple Inc. build 5577)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.__version__
'0.16.2'
>>> import numpy as np
>>> np.__version__
'1.10.1'
>>> pd.DataFrame([1.12345678901234567890]).astype(str)
               0
0  1.12345678901
>>> pd.DataFrame([1.12345678901234567890]).astype(str).to_dict()
{0: {0: '1.12345678901'}}
>>> quit()

(py3.4_1)bash-3.2$ source deactivate
discarding /Users/jreback/miniconda/envs/py3.4_1/bin from PATH
bash-3.2$ source activate py3.4_2
discarding /Users/jreback/miniconda/bin from PATH
prepending /Users/jreback/miniconda/envs/py3.4_2/bin to PATH
(py3.4_2)bash-3.2$ python
Python 3.4.3 |Continuum Analytics, Inc.| (default, Mar  6 2015, 12:07:41) 
[GCC 4.2.1 (Apple Inc. build 5577)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.__version__
'0.17.0'
>>> import numpy as np
>>> np.__version__
'1.10.1'
>>> pd.DataFrame([1.12345678901234567890]).astype(str)
                    0
0  1.1234567890123457
>>> pd.DataFrame([1.12345678901234567890]).astype(str).to_dict()
{0: {0: '1.1234567890123457'}}

(numpy 1.10.1 just released, but doesn't have anything to do with this)

@jreback
Contributor

jreback commented Oct 12, 2015

so this is just on py3 looks like.

@jreback
Contributor

jreback commented Oct 12, 2015

so this goes through a slightly different path than in 0.16.2, but not really sure why this would have changed.

I'll mark it as a bug, though odd that you actually rely on this behavior?

@jreback added the Bug, Numeric Operations, and Output-Formatting labels Oct 12, 2015
@jreback added this to the 0.17.1 milestone Oct 12, 2015
@marcomayer
Author

thank you. I'm not sure about the "output-formatting" label though, isn't this more of a type-conversion/casting issue (float to str)?

I rely on astype(str) for two things:

  • To cast decimal.Decimal values to strings before saving them in HDF5 files, which is faster than having HDF5 store them as non-optimized objects (at least it was in the past). This still works; the issue only appears with floats.
  • I've built hundreds of unit tests that take DataFrames, call astype(str).to_dict() on them, and pickle the resulting dicts to files. When a test runs, I load those pickles and compare the contents of each DataFrame. There is probably a better way to do this, but that's what I came up with at some point. Because of this I also had issues with the new date format, since it prints differently, but that was documented in the release notes, so I could adjust for it with data['date'] = pd.to_datetime(data.date).map(lambda x: str(x.to_datetime64()).replace('NaT','nan')). Once I have verified that the results are fine, I'll be able to rewrite the pickle files without those conversions, but first I have to make sure that no number differs at any decimal place (or figure out and understand why it does).

So I'll try now to find a way to make it through the unittests with 0.17 since I'd like to update due to the new features/optimizations. If you have an idea for a quick workaround let me know...

@marcomayer
Author

Regarding a workaround, this helps me for now to get through the unit-tests:

df.applymap(lambda x: str(x)).to_dict() instead of df.astype(str).to_dict()

Another difference I noticed is when np.NaN is converted to strings:

pd.__version__
'0.16.2'
np.__version__
'1.10.1'
pd.DataFrame([np.NaN]).astype(str).to_dict()
{0: {0: 'nan'}}

pd.__version__
'0.17.0'
np.__version__
'1.10.1'
pd.DataFrame([np.NaN]).astype(str).to_dict()
{0: {0: ''}}

To be honest I wonder if it wouldn't be a good idea to get the same results with astype(str) as with the standard python str() function? For me there's a significant difference between an empty string and np.NaN.
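
The difference between 'nan' and an empty string can be shown with the standard library alone. This is a small sketch of the point made above, not of pandas behavior: the string produced by str() for a NaN can be parsed back into a NaN, while an empty string cannot be parsed as a float at all. The variable names are illustrative.

```python
import math

nan = float("nan")

# Python's str() spells a NaN as 'nan'.
as_text = str(nan)

# That text converts back to a float that is still NaN.
restored = float(as_text)

# An empty string carries no such information and cannot be parsed.
try:
    float("")
    empty_parses = True
except ValueError:
    empty_parses = False

print(as_text)               # nan
print(math.isnan(restored))  # True
print(empty_parses)          # False
```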

@jreback
Contributor

jreback commented Oct 13, 2015

@marcomayer ok, should be fixed in #11309

a better way to compare things is just to use np.allclose (or array_equivalent).
converting to string to compare is not generally a good idea
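
A minimal sketch of the np.allclose suggestion, assuming the stored baseline and the fresh result are both available as float arrays. The array contents and the tolerance values are made up for illustration; equal_nan=True is used so that NaNs in matching positions count as equal.

```python
import numpy as np

# Hypothetical baseline (as if saved with fewer digits) and fresh result.
expected = np.array([1.12345678901, np.nan, 2.5])
actual = np.array([1.1234567890123457, np.nan, 2.5])

# String comparison fails because the first values are spelled differently.
same_strings = [repr(float(v)) for v in expected] == [repr(float(v)) for v in actual]

# Numeric comparison with an explicit relative tolerance succeeds.
close = np.allclose(actual, expected, rtol=1e-9, atol=0.0, equal_nan=True)

print(same_strings)  # False
print(close)         # True
```

Comparing numerically keeps the test independent of how a given pandas or Python version happens to format floats.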

@marcomayer
Author

that fixed it for me! thanks a lot! I'll also consider np.allclose() for the future.

Marco

jreback added a commit that referenced this issue Oct 13, 2015
REGR: change in output formatting for long floats/nan, #11302