Wrong result for float32 Series when using bottleneck #25307

lumbric · 2019-02-13T19:29:17Z

Minimal example

Requires bottleneck, numpy and pandas to be installed:

>>> import numpy as np
>>> import pandas as pd
>>> import bottleneck as bn
>>> data = np.ones(2**25, dtype=np.float32)
>>> pd.Series(data).mean()  # wrong
0.5
>>> bn.nanmean(data)  # wrong
0.5
>>> data.mean()  # correct
1.0

Problem description

The mean() of large float32 Series is wrong when bottleneck is used. Uninstalling bottleneck or using float64 is a valid workaround. xarray is or has been affected too, see pydata/xarray#1346.
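
The two workarounds mentioned above could look roughly like the sketch below. It assumes the pandas option `compute.use_bottleneck` is available in the installed version, and it uses a small stand-in array (not the 2**25-element one from the example) so that it runs quickly; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Small stand-in for the large float32 array from the minimal example.
data = np.ones(1000, dtype=np.float32)
series = pd.Series(data)

# Workaround 1: upcast to float64 before reducing.
mean_float64 = series.astype(np.float64).mean()

# Workaround 2: ask pandas not to dispatch to bottleneck for this block.
# (Assumption: the "compute.use_bottleneck" option exists in this pandas version.)
with pd.option_context("compute.use_bottleneck", False):
    mean_without_bottleneck = series.mean()

print(mean_float64, mean_without_bottleneck)
```

Upcasting costs twice the memory for the temporary copy, while disabling bottleneck keeps the data as float32 and only changes which reduction routine pandas calls.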

~~Bottleneck's documentation explicitly mentions that no error is raised in case of an overflow, not sure if this is still to be considered as bug in bottleneck.~~ Anyhow since it seems quite severe, I want to raise attention here too.

Update: This is not an overflow, it's a numerical error (which is very high because bottleneck does not use pairwise summation).
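
A minimal sketch of why a single float32 accumulator produces 0.5 here. It does not call bottleneck; it only imitates a naive running sum near the point where float32 can no longer represent every integer. The starting value `2**24 - 4` is chosen purely to keep the loop short.

```python
import numpy as np

# float32 has a 24-bit significand, so above 2**24 the representable
# values are 2 apart and 2**24 + 1 cannot be stored exactly.
acc = np.float32(2**24 - 4)
history = []
for _ in range(8):
    acc = acc + np.float32(1.0)  # float32 + float32 stays float32
    history.append(float(acc))

# The accumulator climbs to 2**24 and then stops changing, because
# 2**24 + 1 rounds back down to 2**24 (round-half-to-even).
print(history)

# A naive sum of 2**25 ones would therefore end at 2**24,
# and dividing by the element count gives the reported mean.
stalled_sum = float(acc)
print(stalled_sum / 2**25)  # 0.5
```

No overflow is involved: 2**24 is far below the largest float32 value. The additions past that point are simply rounded away.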

Bottleneck's implementation of mean().

Related issues

same thing in xarray: bottleneck : Wrong mean for float32 array pydata/xarray#1346
same thing in aospy: Inaccuracy of some operations (e.g. stacked average) when underlying data is float32 spencerahill/aospy#217
similar bug in pandas with bottleneck, but related to int not float: BUG: int64 overflow/wrap around with sum() #15453 and int64 overflow/wrap around with nansum() pydata/bottleneck#163
another very similar but older bug in pandas with bottleneck but related to int not float: Bug in pd.Series.mean() #6915 BUG: nansum platform overflow pydata/bottleneck#83
something different, not to be confused - race conditions when reading netcdf files: Issues (wrong result) when computing the mean on a NetCDF file dask/dask#2095
probably not related, but who knows: Unexpected behavior for bn.move_std with float32 array pydata/bottleneck#164

Expected Output

>>> data = np.ones(2**25, dtype=np.float32)
>>> pd.Series(data).mean()
1.0

Output of `pd.show_versions()`

$ pip3 freeze
Bottleneck==1.2.1
numpy==1.16.1
pandas==0.24.1
...

>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.0-13-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: None
pip: 19.0.2
setuptools: 40.8.0
Cython: None
numpy: 1.16.1
scipy: None
pyarrow: None
xarray: 0.11.3
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None


WillAyd · 2019-02-14T06:30:56Z

What exactly are you suggesting pandas do here? Your example shows that a Series matches numpy behavior, and the "problem" is a documented limitation of the third-party tool you are using, so I'm unclear on what pandas should be doing (though admittedly I'm not terribly familiar with bottleneck).

lumbric · 2019-02-14T10:37:37Z

I'm not sure what you mean by "Series matches numpy behavior". In the example above, np.mean() returns the correct result 1.0, but pd.Series.mean() returns 0.5.

Bottleneck's documentation warns about overflows, but a naive user (like me) reading only the pandas documentation would not think about the dangers involved. When stumbling upon this behavior, I didn't even know that bottleneck is used behind the scenes (nor that it exists or that it is installed on my machine). To be honest, even after reading Bottleneck's code, I still don't understand the wrong result. How can a float32 overflow?

Update: it's not an overflow, it is a numerical error caused by naive summation in bottleneck.

I think the best option in this case is for pandas to avoid using bottleneck for dtype=float32. At least this is what was discussed in the xarray issue. Disclaimer: I'm not sure I understand all the details well enough to suggest the best solution.

lumbric · 2019-02-16T12:20:14Z

Oh, I think I misinterpreted the situation slightly. The wrong result is actually caused by numerical instability in bottleneck's implementation. That's still not good, but it is different from an overflow. The hint about overflow in the documentation is not related.
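
To illustrate the difference between naive and pairwise summation, here is a scaled-down sketch. It uses float16 instead of float32 (an assumption made only so that a pure-Python loop stays fast): float16 stalls at 2**11 the same way float32 stalls at 2**24. The helper names `naive_sum` and `pairwise_sum` are hypothetical and are not taken from bottleneck or numpy, whose real implementations differ in detail.

```python
import numpy as np

def naive_sum(values):
    """Sequential sum with a single accumulator in the array's own dtype."""
    acc = values.dtype.type(0)
    for x in values:
        acc = values.dtype.type(acc + x)
    return acc

def pairwise_sum(values):
    """Recursive pairwise sum: add two half-sums of similar magnitude."""
    if len(values) == 0:
        return values.dtype.type(0)
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return values.dtype.type(pairwise_sum(values[:mid]) + pairwise_sum(values[mid:]))

# float16 has an 11-bit significand, so a running sum of ones stalls at 2**11.
data16 = np.ones(2**12, dtype=np.float16)

naive_mean = float(naive_sum(data16)) / len(data16)
pairwise_mean = float(pairwise_sum(data16)) / len(data16)
print(naive_mean, pairwise_mean)  # 0.5 1.0
```

In the pairwise version every addition combines two partial sums of equal size, so no small term is ever added to a much larger accumulator and nothing is rounded away in this example.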

Still, I think it might make sense not to use bottleneck for 32bit.

lumbric · 2019-02-25T14:46:13Z

Is there still information missing here? Feel free to close the ticket if you are sure that this is not a pandas bug. I simply think it puts pandas users in danger of wrong results, so something should be done. Let me know if I can help somehow.

TomAugspurger · 2019-02-25T14:56:37Z

Has xarray decided what to do for 32-bit?

lumbric · 2019-02-25T15:15:44Z

Kind of. xarray's master is no longer affected; it has been fixed in pydata/xarray@0b9ab2d1. I'm not 100% sure whether fixing this was the intention of the commit: it is titled "Refactor nanops" and the diff is more than 1000 lines, so I'm not sure this was a clear decision. The discussion is here, the ticket is still open.

lumbric · 2019-02-28T20:14:36Z

I think I found a hint as to why xarray users might be more affected than pandas users: the package python3-xarray recommends python3-bottleneck in Debian and Ubuntu, but python3-pandas does not recommend python3-bottleneck (neither in Debian nor in Ubuntu).

mroeschke · 2020-03-08T18:13:08Z

Closing as I think this is more of an issue with bottleneck

WillAyd added the "Needs Info" label (clarification about behavior needed to assess issue) on Feb 14, 2019

mroeschke closed this as completed Mar 8, 2020

jreback mentioned this issue Aug 5, 2021

BUG: np.mean(pd.Series) != np.mean(pd.Series.values) #42878

Closed

3 tasks
