BUG: sum vs groupby.sum errors #38778
Thanks for the report @cipriantrofin! The current implementation used for groupby sums is in pandas/pandas/_libs/groupby.pyx (line 476 at e752928), and it does not account for floating point error. Per the notes at https://numpy.org/doc/stable/reference/generated/numpy.sum.html, numpy may be doing a partial pairwise summation, which gives a more accurate result. I'm not sure of the best way to handle this: pandas could implement something similar, but it would likely be slower, so a more accurate algorithm would probably have to be opt-in to avoid slowing down code that doesn't care about floating point error accumulation.
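Not pandas internals, just a minimal sketch with synthetic data (made up for illustration, standing in for the attached dataset's two-decimal values) showing how the accumulation strategy changes the result:

```python
import math
import numpy as np

# Synthetic stand-in for the reported data: many values with exactly
# two decimal places.
rng = np.random.default_rng(42)
values = rng.integers(0, 10_000_000, size=100_000) / 100.0

naive = 0.0
for v in values:          # plain left-to-right accumulation: error grows with n
    naive += v

pairwise = float(np.sum(values))  # numpy uses blocked pairwise summation
exact = math.fsum(values)         # correctly rounded reference

print(f"naive    : {naive:.6f}")
print(f"pairwise : {pairwise:.6f}")
print(f"fsum     : {exact:.6f}")
```

The naive and pairwise results typically drift apart in the last few decimals, with math.fsum as the correctly rounded reference point.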
Thank you @mzeitlin11
Interesting! Would have to look into it more to confirm, but would guess ...
I found that using math.fsum as an aggregate function gives better (best?) results.
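A sketch of that workaround, assuming the layout described in the report (a grouping column "cont" and a value column "rul_c"; the CSV path is illustrative, since the attachment is a zip):

```python
import math
import pandas as pd

df = pd.read_csv("data_source.csv")  # illustrative path to the attached data

# Passing math.fsum as the aggregation function makes each group's sum
# correctly rounded, at the cost of a slower Python-level reduction.
accurate = df.groupby("cont")["rul_c"].agg(math.fsum)
print(accurate)
```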
The issue is that we need to use Kahan summation in both of these (the 1d case actually does, if dispatched through bottleneck), so groupby needs it too. We recently implemented this for rolling.
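For context, a minimal pure-Python sketch of Kahan (compensated) summation, the technique referred to above (the actual pandas implementation for rolling lives in Cython):

```python
def kahan_sum(values):
    """Compensated summation: carry the low-order bits lost by each add."""
    total = 0.0
    comp = 0.0                   # running compensation for lost low-order bits
    for v in values:
        y = v - comp             # re-inject the bits lost on the previous add
        t = total + y            # big + small: low-order bits of y may be lost
        comp = (t - total) - y   # algebraically zero; captures what was lost
        total = t
    return total
```

On the classic example, sum([0.1] * 10) gives 0.9999999999999999 while kahan_sum([0.1] * 10) gives 1.0.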
The regular sum is wrong too. None of the values in the dataset has more than two decimal places. I checked this with a snippet that returns the same result, which is obviously wrong.
@cipriantrofin Could you adjust your expected output?
You are right, of course, but we are dealing with floating point operations and some "errors" are expected. However, rounding the regular sum to two digits returns the right answer, and that is fine with me.
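That rounding workaround, spelled out (assuming df holds the attached data, as in the sketch above):

```python
# Round the accumulated-error result back to the data's two-decimal precision.
print(round(df["rul_c"].sum(), 2))  # 30880496049.43
```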
I would like to, but I am quite aware that using the regular sum function will not help me in that regard.
I am currently fixing the groupby bug; this will return .43 in the future.
Interesting to observe the following:

```python
>>> np.sum(df["rul_c"])
30880496049.429993
>>> df["rul_c"].sum()
30880496049.429993
>>> sum(df["rul_c"].values.tolist())
30880496049.45165
>>> math.fsum(df["rul_c"].values.tolist())
30880496049.43
```

Environment: Python 3.8.12 (conda-forge, Clang 11.1.0) on darwin, numpy 1.20.3, pandas 1.3.2.
It is relevant to me, many thanks Sir.
data_source.zip
The attached data source is a subset of a larger set. I included just enough rows to show the error (when the data source is larger, the difference between the sums grows larger too).
Code Sample, a copy-pastable example
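A minimal sketch of the reported comparison, assuming the attached data is loaded from CSV (the path is illustrative; column names come from the description below):

```python
import pandas as pd

df = pd.read_csv("data_source.csv")  # illustrative path to the attached data

# "cont" contains a single non-null value, so both results should agree:
print(df["rul_c"].sum())                  # 30880496049.429993
print(df.groupby("cont")["rul_c"].sum())  # 30880496049.45165 on pandas 1.2.0
```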
Problem description
I know about floating point math and the small "errors" that come with it, but the "cont" column contains a single unique, non-null value, which means all of the sums above should be the same (about 30880496049.43). However, the groupby.sum result ends in .45165, quite different from the expected value.
The larger the dataset, the larger the difference.
Expected Output
All sums should be 30880496049.429993
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 3e89b4c
python : 3.8.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Romanian_Romania.1252
pandas : 1.2.0
numpy : 1.19.4
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 47.1.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.3.4
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None