BUG: to_numeric incorrectly converting values >= 16,777,217 with downcast='float' #43693

Closed
3 tasks done
jbencina opened this issue Sep 21, 2021 · 9 comments · Fixed by #43710
Labels
Bug, Dtype Conversions (Unexpected or buggy dtype conversions)
Milestone
1.5
Comments

@jbencina
Contributor

jbencina commented Sep 21, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

# Series: Result 608202560.0
pd.to_numeric(pd.Series([608202549]), downcast='float')

# Series: Result 608202560.0
pd.to_numeric(pd.Series([608202549.0]), downcast='float')

# Single: Result 608202549
pd.to_numeric(608202549, downcast='float')

# List: Result array([608202560, 608202560])
pd.to_numeric([608202548, 608202549], downcast='float').astype(int)

# Range: 1st issue at 16777217
arr = np.arange(0, 6e8, dtype='int64')
a = pd.Series(arr)
b = pd.to_numeric(a, downcast='float').astype(int)
np.where(a != b)[0]

Issue Description

When attempting to use to_numeric() on a pandas Series or list with values >= 16777217, the resulting float does not match the input number.

>>> pd.to_numeric(pd.Series([608202549]), downcast='float')
0    608202560.0

>>> pd.to_numeric(pd.Series([608202549.0]), downcast='float')
0    608202560.0

This behavior does not occur if the value is passed as a scalar, but it does also appear when a list is passed. By comparing values across a range, it seems the issue starts from 16,777,217 onward. This seems to have roots in a Stack Overflow discussion about 32-bit floating point precision, as well as another pandas issue about float problems with rank (#18271).
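For reference, 16,777,217 is 2**24 + 1. float32 has a 24-bit significand, so that is the first integer it cannot represent exactly, which lines up with where the range test above first reports a mismatch:

import numpy as np

int(np.float32(16777216))  # 16777216 (2**24, still exact)
int(np.float32(16777217))  # 16777216 (2**24 + 1 rounds back down; first affected value)
int(np.float32(16777218))  # 16777218 (even values stay exact until the float32 spacing grows again)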

Edit: I updated this issue to make clear that this affects both integers and floats of 16,777,217 and above.

Expected Behavior

to_numeric() should yield the same effective number in float form as the original value(s).

Installed Versions

pd.show_versions()

INSTALLED VERSIONS

commit : 73c6825
python : 3.9.4.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:27 PDT 2021; root:xnu-7195.141.2~5/RELEASE_ARM64_T8101
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.3
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.0.1
setuptools : 54.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@jbencina added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on Sep 21, 2021
@jbencina changed the title from "BUG: to_numeric incorrectly converting ints to float for Pandas Series" to "BUG: to_numeric incorrectly converting values >= 16,777,217 with downcast='float'" on Sep 22, 2021
@jbencina
Contributor Author

jbencina commented Sep 22, 2021

The issue seems to be that within to_numeric(), pandas calls maybe_downcast_numeric(), which simply downcasts all float values to float32, even when that changes the values:

elif (
    issubclass(dtype.type, np.floating)
    and not is_bool_dtype(result.dtype)
    and not is_string_dtype(result.dtype)
):
    return result.astype(dtype)

In to_numeric(), pandas iterates through float32, float64, and float128 in sequence when calling maybe_downcast_numeric(), but float32 is always satisfied because the conversion happens in NumPy without any warning being raised:

if typecodes is not None:
    # from smallest to largest
    for typecode in typecodes:
        dtype = np.dtype(typecode)
        if dtype.itemsize <= values.dtype.itemsize:
            values = maybe_downcast_numeric(values, dtype)

            # successful conversion
            if values.dtype == dtype:
                break
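For reference, the candidate typecodes here come from NumPy's typecode table (pandas then slices off everything below float32, as shown further down):

import numpy as np

np.typecodes["Float"]  # 'efdg' -> float16, float32, float64, longdouble/float128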

NumPy raises no error or warning despite the wrong result:

np.array([608202549], dtype='float32').astype(int)
array([608202560])

np.array([608202549], dtype='float64').astype(int)
array([608202549])

Since Pandas' Series can hold float64 values, I assume this is not intentional behavior?

@Liam3851
Contributor

This appears to be as documented:

> If not None, and if the data has been successfully cast to a numerical dtype (or if the data was numeric to begin with), downcast that resulting data to the smallest numerical dtype possible according to the following rules:
> ‘integer’ or ‘signed’: smallest signed int dtype (min.: np.int8)
> ‘unsigned’: smallest unsigned int dtype (min.: np.uint8)
> ‘float’: smallest float dtype (min.: np.float32)

If you want a 64-bit float with no precision loss (32-bit floats have 6-8 significant figures of precision), then it seems you don't want downcasting, so you just wouldn't supply the kwarg specifying it.
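For illustration, float32 carries roughly seven significant decimal digits:

import numpy as np

np.float32(3.141592653589793)  # 3.1415927 (about 7 significant digits survive)
np.float32(608202549)          # 608202560.0 (the motivating value, rounded at the 8th digit)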

Arguably, however, it is a bug that when passed a scalar, maybe_downcast_float returns a system float rather than np.float32, and thus returns something with 64-bit precision even though this is documented as returning 32-bit precision.

@jbencina
Contributor Author

jbencina commented Sep 22, 2021

@Liam3851 The documentation seems to indicate that float32 is the minimum type, not the only type. Also, within the function, the for loop tries 32, 64, and 128 because typecodes contains all three. The issue is that the check doesn't consider whether the values are materially different, only that the conversion succeeded, so the loop stops at 32.

# pandas support goes only to np.float32,
# as float dtypes smaller than that are
# extremely rare and not well supported
float_32_char = np.dtype(np.float32).char
float_32_ind = typecodes.index(float_32_char)
typecodes = typecodes[float_32_ind:]

if typecodes is not None:
    # from smallest to largest
    for typecode in typecodes:

I see your point though, but in the example I provided, the result is not some nth-figure precision loss. The result is a distortion at the whole-number level (608202560 vs 608202549). Should there be a stricter check on whether the 64 -> 32 downcast is allowed, using additional tolerance checks?

@Liam3851
Contributor

> I see your point though, but in the example I provided, the result is not some nth-figure precision loss. The result is a distortion at the whole-number level (608202560 vs 608202549).

Yes, but those are 9-digit integers. They cannot be represented exactly as 32-bit floats because they exceed the 24-bit significand (2**24 = 16,777,216), so there is no way to use 32 bits to represent a number in the hundreds of millions to an accuracy of within 1. You have to use a 64-bit number for that purpose.

I think maybe you're misapprehending the purpose of "downcast". 64-bit numbers are the default, generally; if you want to downcast (make the data fit in less memory), you basically have to downcast to float32. As the documentation states:

> The default return dtype is float64 or int64 depending on the data supplied. Use the downcast parameter to obtain other dtypes.

And importantly, there are use cases where such a loss of precision is totally fine and exactly what is desired, even at the expense of being off by more than 1. For example, perhaps a user is storing scientific data measured in the trillions but with only 4-5 significant figures; float32 is fine for that purpose even though the errors will clearly be larger than 1.0. Setting downcast='float' allows users with such data to avoid unnecessarily allocating a 64-bit array that would take twice as much memory.
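For example (the numbers here are illustrative only):

import numpy as np

np.float32(3.14e12)                   # fine when only 4-5 significant figures matter
float(np.float32(3.14e12)) - 3.14e12  # absolute error on the order of 1e5, relative error ~3e-8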

For your use case though, if you need accuracy of integers in the billions to within 1, the only way to do that is with a 64-bit number, so you don't want to downcast.

@jbencina
Contributor Author

Right, but I understood to_numeric() to attempt a downcast only if downcasting produces the same values (the function is even called maybe_downcast_numeric).

Using int in the same function, for example: when we attempt to downcast, there is a check that only allows the downcast if the resulting values are still equivalent. If the int downcast produces different values, the function does not return a downcast array and instead keeps the original size. See these lines:

if np.allclose(new_result, result, rtol=0):
    return new_result

However for float, no such check is performed: all floats are always downcast to 32-bit, regardless of what happens to the values. The nature of floats is of course more approximate, so the concept of "same" cannot be a simple equality.
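As I understand the current behavior (pandas 1.3.3), the asymmetry looks like this:

import pandas as pd

pd.to_numeric(pd.Series([608202549]), downcast='integer').dtype  # int32: fits, value preserved
pd.to_numeric(pd.Series([2**40]), downcast='integer').dtype      # int64: too big for int32, downcast skipped
pd.to_numeric(pd.Series([608202549]), downcast='float').dtype    # float32: downcast anyway, value changed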

Which is why I'm suggesting that if a 64 -> 32 downcast is not within the tolerances expected given the precision loss (i.e. the number cannot be represented in 32-bit form at all), then the function should behave as it does for int and opt to retain the 64-bit representation.

I think this would be more consistent behavior?
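A minimal sketch of the kind of check I have in mind (a hypothetical helper, not pandas code; the rtol=0.0 / atol=5e-8 tolerance matches what is discussed later in this thread):

import numpy as np

def maybe_downcast_to_float32(values: np.ndarray) -> np.ndarray:
    # Only keep the float32 result if round-tripping preserves the values,
    # mirroring the strict np.allclose(..., rtol=0) gate used on the int side.
    candidate = values.astype(np.float32)
    if np.allclose(candidate, values, rtol=0.0, atol=5e-8, equal_nan=True):
        return candidate
    return values  # otherwise retain the 64-bit representation

With a gate like this, the Series in the examples above would stay float64 (its values cannot round-trip through float32), while values that do fit would still be downcast.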

@jreback
Contributor

jreback commented Sep 23, 2021

wouldn't object to a PR to make float32 test for equivalence

@Liam3851
Contributor

I don't believe the proposed PR solves the original confusion here. Consider the following call in the new code:

pd.to_numeric([6082025490, 6082025491], downcast='float')

This is isomorphic to the problem above; it just adds one digit to each number.

The PR proposes an atol of 5e-8 for float32 dtypes. But in that case the proposed check returns True and allows downcasting, even though the two numbers above are distinct integers:

a = np.array([6082025490, 6082025491], dtype=np.float64)
b = a.astype(np.float32)

np.allclose(a, b, atol=5e-8)
> True

list(b)
> [6082025500.0, 6082025500.0]

b - a
> array([-18., -19.])

Of course, this is because the numbers are not representable to an accuracy within 1, since they are too large to represent exactly in 32 bits. But that was also true in the motivating example.

The point being that the downcast already being performed produces values that are exactly equal within the precision of 32-bit floats, and no downcast is possible in which 64-bit floating-point numbers of this magnitude retain their precision during conversion to 32-bit. I don't see how the confusion in the original problem is avoidable, unless downcast were restricted to only downcast values between -16777216 and 16777215.

@jbencina
Contributor Author

jbencina commented Sep 23, 2021

In your example, you didn't set the rtol parameter to 0.0 as the PR does. The default value is 1e-05, which at the scale of the example (6082025490) equates to +/- ~60820; the PR considers only the absolute difference. So in your example:

a = np.array([6082025490, 6082025491], dtype=np.float64)
b = a.astype(np.float32)

np.allclose(a, b, atol=5e-8, rtol=0.0)
> False
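By contrast, the default rtol scales with the magnitude of the values being compared, which is why the check appeared to pass without rtol=0.0:

np.allclose(a, b)  # True: default rtol=1e-05 tolerates ~6e4 at this scale, hiding the 18-19 unit error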

Indeed, this is exactly what the PR does: it prevents the downcast because these values are not downcastable within the expected precision tolerance:

s = pd.Series([6082025490, 6082025491], dtype=np.float64)
pd.to_numeric(s, downcast='float').dtype # With PR applied
> dtype('float64')

> I don't see how the confusion in the original problem is avoidable, unless downcast were restricted to only downcast values between -16777216 and 16777215.

I think this is what the PR is proposing since the function has the same restrictions on the Int side. If the int does not fit, the next size up is used. Similarly, if the float does not fit, then the next size up should be used.

@Liam3851
Contributor

Got it on the atol; I was having a brain freeze and interpreting it as rtol. I see your point.

> I don't see how the confusion in the original problem is avoidable, unless downcast were restricted to only downcast values between -16777216 and 16777215.

> I think this is what the PR is proposing since the function has the same restrictions on the Int side. If the int does not fit, the next size up is used. Similarly, if the float does not fit, then the next size up should be used.

Got it. To be clear, this changes behavior for people who are (legitimately) trying to store, say, 3.14e12 as a 32-bit float, though I do see how this eliminates a footgun for other (probably more numerous) users.

@mroeschke added the Dtype Conversions (Unexpected or buggy dtype conversions) label and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label on Oct 1, 2021
@jreback added this to the 1.5 milestone on Jan 17, 2022