pd.to_numeric(series, downcast='integer') does not properly handle floats over 10,000 #14941

Closed
gryBox opened this issue Dec 21, 2016 · 8 comments
Labels
Bug Numeric Operations Arithmetic, Comparison, and Logical operations
Milestone

Comments

@gryBox

gryBox commented Dec 21, 2016

Hi - I came across this issue on Stack Overflow while testing pd.to_numeric()

If all floats in a column are over 10000, it loses precision and converts them to integers.

import datetime

import pandas as pd

tst_df = pd.DataFrame({'colA': ['a', 'b', 'c', 'a', 'z', 'q'],
                       'colB': pd.date_range(end=datetime.datetime.now(), periods=6),
                       'colC': ['a1', 'b2', 'c3', 'a4', 'z5', 'q6'],
                       'colD': [10000.0, 20000, 3000, 40000.36, 50000, 50000.00]})

pd.to_numeric(tst_df['colD'], downcast='integer')

[screenshot of the output]

This doesn't seem like the desired behavior.

@jreback
Contributor

jreback commented Dec 21, 2016

so this is correct, it is effectively doing an .astype(int). though I suppose this is too aggressive and the downcast from float -> int should only succeed if they are actually equal.

@gfyoung thoughts
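
A minimal sketch of the stricter rule suggested above, where the float -> int downcast only succeeds if every value survives the round trip unchanged. The helper name `downcast_if_integral` is hypothetical and is not part of pandas; it always targets int64 rather than picking the smallest integer dtype, and it does not handle NaN.

```python
import numpy as np
import pandas as pd


def downcast_if_integral(values: pd.Series) -> pd.Series:
    """Hypothetical helper: cast floats to int64 only when no value changes."""
    as_int = values.astype(np.int64)  # truncates any fractional part
    if (as_int == values).all():      # exact comparison, no tolerance
        return as_int
    return values                     # leave the floats untouched


lossy = pd.Series([10000.0, 20000.0, 40000.36])
exact = pd.Series([10000.0, 20000.0, 40000.0])

print(downcast_if_integral(lossy).dtype)  # stays float64
print(downcast_if_integral(exact).dtype)  # becomes int64
```

The exact comparison is what makes this safe for large values: a difference of 0.36 is rejected regardless of how big the numbers are.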

@jreback jreback added the Numeric Operations Arithmetic, Comparison, and Logical operations label Dec 21, 2016
@gryBox
Author

gryBox commented Dec 21, 2016

Just so I am clear, this is the behavior expected? Not this?

tst_df = pd.DataFrame({'colA':['a','b','c','a','z', 'q'],
                      'colB': pd.date_range(end=datetime.datetime.now() , periods=6),
                      'colC' : ['a1','b2','c3','a4','z5', 'q6'],
                      'colD': [1000.0, 2000, 3000, 4000.36, 5000, 50000.00]})

pd.to_numeric(tst_df['colD'],  downcast='integer')

[screenshot of the output]

@jreback
Contributor

jreback commented Dec 21, 2016

hmm, this looks buggy. if you want to have a look inside and see what's going on, that would be great.

@gryBox
Author

gryBox commented Dec 21, 2016

@jreback I will try - I am sort of new to this. Is this the module I should be looking at?
pandas/pandas/tools/util.py

@jreback
Contributor

jreback commented Dec 21, 2016

yes

@gfyoung
Member

gfyoung commented Dec 21, 2016

@gryBox : First of all, thanks for pointing this out! If you follow the code from where you started, the bug traces to an np.allclose call. You can see the problem yourself:

>>> import numpy as np
>>>
>>> arr = [1000000]
>>> arr2 = [1000000.5]
>>>
>>> np.allclose(arr, arr2)  # This is what we do now
True
>>> np.allclose(arr, arr2, rtol=0)  # This is what we probably should do
False

I think passing rtol=0 should let you patch this behavior. It also explains why it breaks with large numbers: the relative tolerance scales with the values, so once the numbers are large enough a fractional difference falls inside it and they count as "close".
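
To make the threshold concrete, here is a small sketch of the documented np.allclose criterion, |a - b| <= atol + rtol * |b|, with the default rtol=1e-05 and atol=1e-08. The helper `within_allclose_tolerance` is only an illustration written for this example, not something from pandas or numpy.

```python
import numpy as np


def within_allclose_tolerance(a, b, rtol=1e-05, atol=1e-08):
    """Illustrative scalar version of the np.allclose test."""
    return abs(a - b) <= atol + rtol * abs(b)


# Difference is 0.36; tolerance is about 0.04, so the values are not close.
print(np.allclose([4000.0], [4000.36]))            # False
# Difference is still 0.36, but tolerance is now about 0.40, so they pass.
print(np.allclose([40000.0], [40000.36]))          # True
# With rtol=0 only atol=1e-08 remains, so the fractional part is detected.
print(np.allclose([40000.0], [40000.36], rtol=0))  # False
```

This matches the reports in this thread: 4000.36 is kept as a float, while 40000.36 is treated as equal to its truncated integer.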

@jreback jreback added this to the Next Major Release milestone Dec 21, 2016
@gryBox
Author

gryBox commented Dec 22, 2016

@gfyoung : The credit goes to tworec at stack.

@gfyoung
Member

gfyoung commented Dec 22, 2016

@gryBox : I'm not sure there is such a method. Also, numpy dtype checking is not 100% compatible with all of our pandas objects (deliberate), hence we prefer to stay away from such methods in numpy.

3 participants