
BUG: Float values get corrupted with df.astype(), for values with no overflow error #34618

Closed
2 of 3 tasks
ketanhdoshi opened this issue Jun 6, 2020 · 5 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Duplicate Report Duplicate issue or pull request

Comments

@ketanhdoshi

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1270., 570., 14130., 620., 29910.]})
# Cast column 'b' from float64 down to float16
df['b'] = df['b'].astype('float16')
df

Problem description

Converting a 'float64' to 'float16' should return the original values, as long as they don't cause any overflow. In the example above, two values (14130. and 29910.) get corrupted to 14128. and 29904. respectively. Both values fit into a 'float16' without overflow, which can be confirmed via np.finfo(np.float16).min < 29910. < np.finfo(np.float16).max.

Expected Output

The values (14130. and 29910.) should be preserved.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.104+
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.4
numpy : 1.18.4
pytz : 2018.9
dateutil : 2.8.1
pip : 19.3.1
setuptools : 47.1.1
Cython : 0.29.19
pytest : 3.6.4
hypothesis : None
sphinx : 1.8.5
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.2.6
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.7.6.1 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 5.5.0
pandas_datareader: 0.8.1
bs4 : 4.6.3
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.2.6
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : 2.5.9
pandas_gbq : 0.11.0
pyarrow : 0.14.1
pytables : None
pytest : 3.6.4
pyxlsb : None
s3fs : 0.4.2
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : 3.4.4
tabulate : 0.8.7
xarray : 0.15.1
xlrd : 1.1.0
xlwt : 1.3.0
xlsxwriter : None
numba : 0.48.0

@ketanhdoshi ketanhdoshi added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 6, 2020
@ketanhdoshi
Author

Here is a similar but slightly different example

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [4., 4., 2., 3., 3.]})
# 'c' is a float16 copy of 'b'; the same arithmetic is then applied to both
df['c'] = df['b'].astype('float16')
df['b'] = (df['b'] - 3.52) / 1.86
df['c'] = (df['c'] - 3.52) / 1.86
df

Problem description

In this case, data doesn't get corrupted in the .astype('float16') call, but the arithmetic operations after that produce different results for the 'float16' column 'c' versus the 'float64' column 'b'. Since all the numbers fit comfortably into the 'float16' numeric range, they should produce identical results.

@jreback
Contributor

jreback commented Jun 7, 2020

pandas has almost 0 support for float16

you are welcome to contribute patches

@miccoli
Contributor

miccoli commented Jun 7, 2020

This is definitely not a bug.

Half-precision floating-point format has only 11 bits of significand precision. This means that integers between 8192 and 16384 round to a multiple of 8: 14128 and 14130 round to the same number in half precision.
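The spacing argument above can be checked directly. This is a minimal sketch using only NumPy; the variable names are illustrative and not part of the original report.

```python
import numpy as np

# float16 has 10 explicitly stored significand bits plus 1 implicit bit.
explicit_bits = np.finfo(np.float16).nmant

# Gap between adjacent representable float16 values near each input.
spacing_14130 = float(np.spacing(np.float16(14130)))
spacing_29910 = float(np.spacing(np.float16(29910)))

# Values actually stored after rounding to the nearest representable number.
stored_14130 = float(np.float16(14130))
stored_29910 = float(np.float16(29910))

print(explicit_bits, spacing_14130, spacing_29910)
print(stored_14130, stored_29910)
```

Near 14130 the representable values are 8 apart and near 29910 they are 16 apart, so 14130 and 29910 land on the nearest multiples, 14128 and 29904. The np.finfo range check in the report only rules out overflow; it says nothing about precision.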

And of course, regarding the second example, it is expected that arithmetic with half precision gives different results from arithmetic in double precision.
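For the second example, a small sketch with plain NumPy arrays (an assumption: the pandas columns behave like the underlying arrays here) shows that the cast itself is exact and the difference comes from the arithmetic being carried out in half precision.

```python
import numpy as np

x64 = np.array([4., 4., 2., 3., 3.], dtype=np.float64)
x16 = x64.astype(np.float16)

# Small integers are exactly representable, so the cast loses nothing.
exact_cast = bool((x16.astype(np.float64) == x64).all())

# Same expression on both arrays; Python float constants do not upcast
# a float16 array, so r16 stays float16 and every result is rounded to it.
r64 = (x64 - 3.52) / 1.86
r16 = (x16 - 3.52) / 1.86

# Compare in float64 so the comparison itself adds no rounding.
max_abs_diff = float(np.max(np.abs(r16.astype(np.float64) - r64)))

print(r16.dtype, exact_cast, max_abs_diff)
```

The constants 3.52 and 1.86 are not exactly representable in float16 either, so the two columns diverge slightly even though every input fits in the float16 range.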

The only thing to note here is that numpy and pandas display half precision floating point numbers in a different way:

>>> pd.Series([14128, 14130]).astype('float16')
0    14128.0
1    14128.0
dtype: float16
>>> np.array([14128, 14130], dtype='float16')
array([14130., 14130.], dtype=float16)

but please, be assured, there is no corruption here, only correct rounding:

>>> np.float16(14130) == np.float16(14128)
True
>>> pd.Series([14128, 14130]).astype('float16') == np.array([14128, 14130], dtype='float16')
0    True
1    True
dtype: bool
>>> pd.Series([14128, 14130]).astype('float16') == np.float16(14128)
0    True
1    True
dtype: bool
>>> pd.Series([14128, 14130]).astype('float16') == np.float16(14130)
0    True
1    True
dtype: bool

Please see also numpy/numpy#12613 regarding how half-precision values are displayed.
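One way to see which value is actually stored, independent of how either library chooses to print it, is to widen the float16 to float64 (an exact conversion) or to compare the raw 16-bit patterns. A minimal sketch with illustrative variable names:

```python
import numpy as np

h = np.float16(14130)

# Widening to a Python float (float64) is exact, so this is the stored value.
stored = float(h)

# Both literals round to the same float16, hence the same 16-bit pattern.
bits_14130 = int(np.float16(14130).view(np.uint16))
bits_14128 = int(np.float16(14128).view(np.uint16))

# NumPy's own string form of the scalar, for comparison with the output above.
print(str(h), stored, bits_14130 == bits_14128)
```

The widened value is 14128.0, which matches what pandas printed in the session above; the string NumPy prints is just a different decimal rendering of the same stored number.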

@jreback
Contributor

jreback commented Jun 8, 2020

duplicates of #9220 and #22841

@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Duplicate Report Duplicate issue or pull request and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 8, 2020
@jreback jreback added this to the No action milestone Jun 8, 2020
@jreback
Contributor

jreback commented Jun 8, 2020

closing as a duplicate

@jreback jreback closed this as completed Jun 8, 2020