BUG: to_numeric incorrectly converting values >= 16,777,217 with downcast='float' #43693
The issue seems to be within `pandas/pandas/core/dtypes/cast.py` (lines 351 to 356 at d113a3b), when it is called from `pandas/pandas/core/tools/numeric.py` (lines 209 to 218 at d113a3b).

NumPy raises no errors despite the wrong result:

```python
np.array([608202549], dtype='float32').astype(int)
# array([608202560])

np.array([608202549], dtype='float64').astype(int)
# array([608202549])
```

Since pandas' Series can hold float64 values, I assume this is not intentional behavior?
This appears to be as documented. If you want a 64-bit float with no precision loss (32-bit floats have 6-8 significant figures of precision), then it seems you don't want downcasting, so you just wouldn't supply the kwarg specifying it. Arguably, however, it is a bug that when passed a scalar, `maybe_downcast_float` returns a system float.
@Liam3851 The documentation seems to indicate that float32 is the minimum type, not the only type. Also, within the function, the for loop tries 32, 64, and 128 (see `pandas/pandas/core/tools/numeric.py`, lines 202 to 211 at d113a3b).

I see your point, but in the example I provided the result is not some nth-figure precision loss. The result is a distortion at the whole-number level (608202560 vs 608202549). Should there be a stricter check on whether the 64 -> 32 downcast is allowed, using additional tolerance checks?
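The size of that distortion follows from float32 spacing at this magnitude (a quick check):

```python
import numpy as np

# The gap between adjacent representable float32 values ("ulp") near
# 608202549 is 64, so nearby integers get snapped to multiples of 64
np.spacing(np.float32(608202549.0))   # 64.0
608202560 % 64                        # 0 -- the rounded result is such a multiple
```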
Yes, but those are 9-digit integers. Those numbers can't be represented exactly as 32-bit floats because they're too large (above 2**24, the limit for exact integers in float32), so there is no way to use a 32-bit float to represent numbers of that size to an accuracy of within 1. You have to use a 64-bit number for that purpose. I think maybe you're misapprehending the purpose of "downcast". 64-bit numbers are the default, generally. If you want to downcast (make the data fit in less memory), you'd basically have to be downcasting to float32. As the documentation states:

> 'float': smallest float dtype (min.: np.float32)
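The cutoff is the 24-bit significand of float32; a quick demonstration:

```python
import numpy as np

np.float32(2**24) == 2**24          # True:  16777216 is exactly representable
np.float32(2**24 + 1) == 2**24 + 1  # False: 16777217 rounds back down
float(np.float32(2**24 + 1))        # 16777216.0
```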
And importantly, there are use cases where such a loss of precision is totally fine and exactly what is desired, even at the expense of being off by more than 1. For example, perhaps a user is storing scientific data measured in trillions but with only 4-5 significant figures. float32 is fine for that purpose even though the errors will be clearly larger than 1.0. Setting downcast='float' allows users with such data to avoid unnecessarily allocating a 64-bit array that would take twice as much memory. For your use case, though, if you need accuracy of integers in the billions to within 1, the only way to do that is with a 64-bit number, so you don't want to downcast.
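A sketch of that trade-off, using illustrative numbers rather than anything from the thread:

```python
import numpy as np

reading = np.float64(3.1415e12)   # trillions, ~5 significant figures
stored = np.float32(reading)

abs(stored - reading)             # absolute error on the order of 1e4 here
abs(stored - reading) / reading   # relative error stays below ~6e-8 (float32 eps / 2)
```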
Right, but I understood the downcast to still be subject to a "does it fit" check. Using `pandas/pandas/core/dtypes/cast.py` (lines 348 to 349 at d113a3b), the int path only accepts a downcast when the values survive the cast. However, for floats no equivalent check is made. Which is why I'm suggesting that if a 64 -> 32 downcast is not within the tolerances expected given the precision loss (i.e. the number cannot be represented in 32-bit form at all), then the function should behave as it does for int and opt to retain the 64-bit representation. I think this would be more consistent behavior? A rough sketch of the check I have in mind is below.
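A rough sketch of that check (a hypothetical helper for illustration, not code from pandas or the PR; `downcast_if_exact` is a made-up name):

```python
import numpy as np

def downcast_if_exact(values: np.ndarray) -> np.ndarray:
    """Only accept the float64 -> float32 downcast when it round-trips
    exactly, mirroring the equality check on the integer side."""
    candidate = values.astype(np.float32)
    if np.array_equal(candidate.astype(np.float64), values):
        return candidate
    return values  # otherwise keep the 64-bit representation

downcast_if_exact(np.array([608202549.0])).dtype   # float64 -- kept
downcast_if_exact(np.array([1.5, 2.25])).dtype     # float32 -- safe to shrink
```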
Wouldn't object to a PR to make float32 test for equivalence.
I don't believe the proposed PR solves the original confusion here. Consider the following call under the new code:

```python
pd.to_numeric([6082025490, 6082025491], downcast='float')
```

This is isomorphic to the problem above; it just adds one digit to the number. The PR proposes an atol of 5e-8 for float32 dtypes. But in that case the proposed check returns True and allows downcasting, even though the two numbers above are distinct integers:

```python
a = np.array([6082025490, 6082025491], dtype=np.float64)
b = a.astype(np.float32)

np.allclose(a, b, atol=5e-8)
# True

list(b)
# [6082025500.0, 6082025500.0]

b - a
# array([-18., -19.])
```

Of course, this is because the numbers are not representable to an accuracy within 1: they are too large to represent exactly in 32 bits. But that was also true in the motivating example. The point being, the downcast already being performed produces values that are exactly equal within the precision of 32-bit floats, and no downcast is possible in which 64-bit floating point numbers retain their full precision during conversion to 32-bit. I don't see how the confusion in the original problem is avoidable, unless downcast were restricted to only downcast values between -16777216 and 16777215.
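For context on why the default tolerance passes here: `np.allclose` accepts when `|a - b| <= atol + rtol * |b|`, and `rtol` defaults to 1e-5, which dominates at this magnitude. A quick check:

```python
import numpy as np

a = np.float64(6082025490.0)
b = np.float64(np.float32(a))        # value after the 64 -> 32 -> 64 round-trip

# the rtol term allows ~6e4 of slack at this magnitude, dwarfing atol=5e-8
abs(a - b) <= 5e-8 + 1e-5 * abs(b)   # True
```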
In your example, you didn't set the `rtol`:

```python
a = np.array([6082025490, 6082025491], dtype=np.float64)
b = a.astype(np.float32)

np.allclose(a, b, atol=5e-8, rtol=0.0)
# False
```

Indeed, this is what the PR is doing: preventing the downcast because these values are not downcastable within the expected precision tolerance.

```python
s = pd.Series([6082025490, 6082025491], dtype=np.float64)

pd.to_numeric(s, downcast='float').dtype  # With PR applied
# dtype('float64')
```

I think this is what the PR is proposing, since the function has the same restriction on the int side: if the int does not fit, the next size up is used. Similarly, if the float does not fit, then the next size up should be used.
Got it on the atol; I was having a brain freeze and interpreting it as rtol. I see your point.
Got it. To be clear, this changes behavior for people who are (legitimately) trying to store, say, 3.14e12 as a 32-bit float, though I do see how this eliminates a footgun for other (possibly/probably more numerous) users.
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
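A minimal case at the 2**24 boundary, consistent with the description below:

```python
import pandas as pd

s = pd.Series([16777216, 16777217])
pd.to_numeric(s, downcast="float")
# 0    16777216.0
# 1    16777216.0    <- 16777217 silently became 16777216 (dtype: float32)
```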
Issue Description
When attempting to use `to_numeric()` with `downcast='float'` on a pandas Series or list with values >= 16,777,217, the resulting float does not match the input number. This behavior does not occur if the value is passed as a scalar, but it does appear if a list is passed. By comparing values within a range, the issue starts from 16,777,217 onward. This seems to have roots in the discussion about 32-bit floating point in this Stack Overflow question, as well as another pandas issue on float problems with rank (#18271).

Edit: I updated this issue to make clear that this affects both integers and floats larger than 16,777,217.
Expected Behavior
`to_numeric()` should yield the same functional number in float form as the original value(s).

Installed Versions
INSTALLED VERSIONS
commit : 73c6825
python : 3.9.4.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:27 PDT 2021; root:xnu-7195.141.2~5/RELEASE_ARM64_T8101
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.3.3
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.0.1
setuptools : 54.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None