Wrong result for float32 Series when using bottleneck #25307

Closed
lumbric opened this issue Feb 13, 2019 · 8 comments
Labels
Needs Info Clarification about behavior needed to assess issue

Comments

@lumbric
lumbric commented Feb 13, 2019

Minimal example

Requires bottleneck, numpy and pandas to be installed:

>>> import numpy as np
>>> import pandas as pd
>>> import bottleneck as bn
>>> data = np.ones(2**25, dtype=np.float32)
>>> pd.Series(data).mean()  # wrong
0.5
>>> bn.nanmean(data)  # wrong
0.5
>>> data.mean()  # correct
1.0

Problem description

The mean() of large float32 Series is wrong when bottleneck is used. Uninstalling bottleneck or using float64 is a valid workaround. xarray is or has been affected too, see pydata/xarray#1346.
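The two workarounds mentioned above can be sketched as follows. This is a minimal illustration, not a recommendation from the pandas maintainers. It assumes the `compute.use_bottleneck` pandas option is available, and it uses a smaller array than the one in the report to keep the example light.

```python
import numpy as np
import pandas as pd

# Smaller than the 2**25 elements in the report, only to keep memory use low.
data = np.ones(2**20, dtype=np.float32)
series = pd.Series(data)

# Workaround 1: upcast to float64 before reducing.
mean_upcast = series.astype(np.float64).mean()

# Workaround 2: switch bottleneck off for this block only,
# instead of uninstalling it.
with pd.option_context("compute.use_bottleneck", False):
    mean_without_bottleneck = series.mean()

print(mean_upcast, mean_without_bottleneck)
```

Upcasting doubles the memory needed for the data, so disabling bottleneck may be the cheaper option for very large Series.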

Bottleneck's documentation explicitly mentions that no error is raised in case of an overflow, so I'm not sure whether this should still be considered a bug in bottleneck. Anyhow, since it seems quite severe, I want to raise attention here too.

Update: This is not an overflow, it's a numerical error (which is very high because bottleneck does not use pairwise summation).
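A small sketch of why the result could come out as exactly 0.5, assuming the sum is accumulated element by element in a float32 variable (this is an illustration with NumPy, not bottleneck's actual code): float32 has a 24-bit significand, so once a running total reaches 2**24, adding 1.0 no longer changes it.

```python
import numpy as np

# Above 2**24 only even integers are representable in float32,
# so 2**24 + 1 rounds back to 2**24.
limit = np.float32(2**24)
stuck = limit + np.float32(1)

# A running float32 total that has reached 2**24 ignores every further 1.0.
values = np.concatenate(([2**24], np.ones(1000))).astype(np.float32)
running_total = np.cumsum(values, dtype=np.float32)[-1]  # sequential float32 adds
wide_total = values.sum(dtype=np.float64)                # float64 accumulator

print(stuck == limit)   # True
print(running_total)    # 16777216.0
print(wide_total)       # 16778216.0
```

With 2**25 ones, such a running total would stop growing at 2**24, and 2**24 / 2**25 is 0.5, which matches the reported output.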

Bottleneck's implementation of mean().

Related issues

Expected Output

>>> data = np.ones(2**25, dtype=np.float32)
>>> pd.Series(data).mean()
1.0

Output of pd.show_versions()

$ pip3 freeze
Bottleneck==1.2.1
numpy==1.16.1
pandas==0.24.1
...
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.0-13-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: None
pip: 19.0.2
setuptools: 40.8.0
Cython: None
numpy: 1.16.1
scipy: None
pyarrow: None
xarray: 0.11.3
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
@WillAyd
Member

WillAyd commented Feb 14, 2019

What exactly are you suggesting pandas should do here? Your example shows a Series matches numpy behavior and the "problem" is a documented limitation of the third-party tool you are using, so I'm unclear on what pandas should be doing (though admittedly I'm not terribly familiar with bottleneck).

@WillAyd added the "Needs Info" label Feb 14, 2019
@lumbric
Author

lumbric commented Feb 14, 2019

I'm not sure what you mean by "Series matches numpy behavior". In the example above, np.mean() returns the correct result 1.0, but pd.Series.mean() returns 0.5.

Bottleneck's documentation warns about overflows, but a naive user (like me) reading only the pandas documentation would not think about the dangers involved. When stumbling upon this behavior, I didn't even know that bottleneck is used behind the scenes (nor that it exists or that it is installed on my machine). To be honest, even after reading Bottleneck's code, I still don't understand the wrong result. How can a float32 overflow?

Update: it's not an overflow, it is a numerical error caused by naive summation in bottleneck.

I think the best option in this case is for pandas to avoid using bottleneck for dtype=float32. At least this is what was discussed in the xarray issue. Disclaimer: I'm not sure I understand all the details well enough to suggest the best solution.

@lumbric
Author

lumbric commented Feb 16, 2019

Oh, I think I misinterpreted the situation slightly. The wrong result is actually caused by numerical instability in bottleneck's implementation. That's still not good, but it is a different problem than an overflow. The hint about overflow in the documentation is not related.

Still, I think it might make sense not to use bottleneck for 32bit.
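To illustrate the difference between naive and pairwise summation, here is a rough sketch. The helpers `naive_sum` and `pairwise_sum` are hypothetical and written only for this example; they are not the code used by bottleneck or NumPy. To keep the pure-Python loops fast, the sketch uses float16, where the same effect already appears at 2**11 instead of 2**24.

```python
import numpy as np


def naive_sum(values, dtype):
    """Add elements one by one, rounding to `dtype` after every step."""
    total = dtype(0)
    for v in values:
        total = dtype(total + dtype(v))
    return total


def pairwise_sum(values, dtype, block=8):
    """Split the input in halves recursively and add the partial sums."""
    if len(values) <= block:
        return naive_sum(values, dtype)
    mid = len(values) // 2
    left = pairwise_sum(values[:mid], dtype, block)
    right = pairwise_sum(values[mid:], dtype, block)
    return dtype(left + right)


# float16 analogue of the report: 2**12 ones, running total sticks at 2**11.
ones = np.ones(4096, dtype=np.float16)
naive_mean = float(naive_sum(ones, np.float16)) / len(ones)
pairwise_mean = float(pairwise_sum(ones, np.float16)) / len(ones)

print(naive_mean)     # 0.5
print(pairwise_mean)  # 1.0
```

In the pairwise version the partial sums stay small relative to the values being added, so no single addition has to absorb a tiny increment into a huge total.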

@lumbric
Author

lumbric commented Feb 25, 2019

Is there still information missing here? Feel free to close the ticket if you are sure that this is not a pandas bug. I simply think it puts pandas users in danger of wrong results, so something should be done. Let me know if I can help somehow.

@TomAugspurger
Contributor

Has xarray decided what to do for 32-bit?

@lumbric
Author

lumbric commented Feb 25, 2019

Kind of. xarray's master is no longer affected; it has been fixed in pydata/xarray@0b9ab2d1. I'm not 100% sure whether fixing this was the intention of the commit: it is titled "Refactor nanops" and the diff is more than 1000 lines, so I'm not sure this was a clear decision. The discussion is here, the ticket is still open.

@lumbric
Author

lumbric commented Feb 28, 2019

I think I found a hint as to why xarray users might be more affected than pandas users: the package python3-xarray recommends python3-bottleneck in Debian and Ubuntu, but python3-pandas does not recommend python3-bottleneck (neither in Debian nor in Ubuntu).

@mroeschke
Member

Closing as I think this is more of an issue with bottleneck
