BUG: sum vs groupby.sum errors #38778


Closed
cipriantrofin opened this issue Dec 29, 2020 · 12 comments · Fixed by #38903
Labels: Enhancement, Groupby, Needs Discussion, Numeric Operations

Comments

@cipriantrofin

data_source.zip

The attached data source is a subset of a larger set. I included just enough rows to show the error (with a larger data source, the difference between the sums grows as well).

Code Sample, a copy-pastable example

import pandas as pd
import numpy as np

pd.options.display.float_format = '{:.5f}'.format

df = pd.read_csv("data_source.csv", index_col=0)
df.reset_index(inplace=True)

print("Regular sum: %s\n" % df["rul_c"].sum())
print("Regular sum on filtered column: %s\n" % df[df["cont"] == 20]["rul_c"].sum())
print("GroupBy sum:\n%s" % df.groupby("cont")["rul_c"].sum())

Problem description

I know about floating point math and the small "errors" that come with it, but the "cont" column has a single unique, non-null value, which means all of the sums above should be identical (about 30880496049.43). However, the groupby.sum result ends in .45165, quite different from the expected value.

The larger the dataset, the larger the difference.

Regular sum: 30880496049.429993

Regular sum on filtered column: 30880496049.429993

GroupBy sum:
cont
20   30880496049.45165
Name: rul_c, dtype: float64

Expected Output

All sums should be 30880496049.429993

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 3e89b4c
python : 3.8.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Romanian_Romania.1252

pandas : 1.2.0
numpy : 1.19.4
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 47.1.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.3.4
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None

@cipriantrofin added the Bug and Needs Triage labels Dec 29, 2020
@mzeitlin11
Member

Thanks for the report @cipriantrofin! The current implementation being used for groupby sums is here:

def _group_add(complexfloating_t[:, :] out,

which does not account for floating point error. Looking at the notes here (https://numpy.org/doc/stable/reference/generated/numpy.sum.html), numpy may be doing a partial pairwise summation, leading to a more accurate result. I'm not sure of the best way to handle this: pandas could implement something similar, but it would likely be slower, so a more accurate algorithm would probably have to be opt-in to avoid slowing down code that doesn't care about floating point error accumulation.
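
For context, here is a minimal Python sketch of pairwise summation, the general technique the numpy notes describe. It is an illustration only, not numpy's actual C implementation; the base-case cutoff of 128 is just a plausible choice for the sketch.

def pairwise_sum(values, block=128):
    # Recursively split and add the halves: rounding error grows roughly
    # O(log n) instead of the O(n) of naive left-to-right accumulation.
    n = len(values)
    if n <= block:
        total = 0.0
        for v in values:
            total += v  # naive accumulation on small base cases
        return total
    mid = n // 2
    return pairwise_sum(values[:mid], block) + pairwise_sum(values[mid:], block)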

@mzeitlin11 added the Needs Discussion, Groupby, Enhancement, and Numeric Operations labels and removed the Bug and Needs Triage labels Dec 29, 2020
@cipriantrofin
Author

Thank you @mzeitlin11
I don't know if it's relevant, but using np.sum:
df.groupby("cont")["rul_c"].agg(np.sum)
has the same output as
df.groupby("cont")["rul_c"].sum()

@mzeitlin11
Member

Thank you @mzeitlin11
I don't know if it's relevant, but using np.sum:
df.groupby("cont")["rul_c"].agg(np.sum)
has the same output as
df.groupby("cont")["rul_c"].sum()

Interesting! I'd have to look into it more to confirm, but my guess is that np.sum is not using partial pairwise summation in this case; the doc notes say "Technically, to provide the best speed possible, the improved precision is only used when the summation is along the fast axis in memory."

@cipriantrofin
Author

I found that using math.fsum as an aggregate function gives better (best?) results.
Of course, it is slower.
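
For example, a minimal sketch of that approach, reusing the df from the code sample above (math.fsum accepts any iterable of floats, so it can be passed directly as an aggregation function):

import math

# math.fsum keeps exact partial sums, so the result is correctly
# rounded; the trade-off is a Python-level call per group.
df.groupby("cont")["rul_c"].agg(math.fsum)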

@jreback
Contributor

jreback commented Dec 29, 2020

The issue is that we need to use Kahan summation in both of these (the 1d case actually does if dispatched through bottleneck), so groupby needs it too.

We recently implemented this for rolling.
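
For reference, a minimal sketch of scalar Kahan (compensated) summation, the technique being referenced here; the actual pandas fix lives in the Cython groupby code, not in Python like this:

def kahan_sum(values):
    # Compensated summation: carry the low-order bits lost at each add.
    total = 0.0
    comp = 0.0  # running compensation for lost low-order bits
    for v in values:
        y = v - comp            # apply the correction from the previous step
        t = total + y           # low-order digits of y may be lost here
        comp = (t - total) - y  # recover what was just lost
        total = t
    return total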

@phofl
Member

phofl commented Jan 2, 2021

The regular sum is wrong too. None of the values in the dataset has more than two decimal places. I checked this with

df["rounded"] = df["rul_c"].round(2)
print("Regular sum: %s\n" % df["rounded"].sum())

and it returns the same result, which is obviously wrong.

@phofl
Member

phofl commented Jan 2, 2021

@cipriantrofin Could you adjust your expected output to 30880496049.43?

@cipriantrofin
Author

The regular sum is wrong too. None of the values in the dataset has more than two decimal places. I checked this with

df["rounded"] = df["rul_c"].round(2)
print("Regular sum: %s\n" % df["rounded"].sum())

and it returns the same result, which is obviously wrong.

You are right, of course, but we are dealing with floating point operations and some "errors" are expected. However, rounding the regular sum to two digits returns the right answer, and that is fine with me.
On the other hand, the groupby sum is significantly off from the expected answer.

@cipriantrofin
Author

@cipriantrofin Could you adjust your expected output to 30880496049.43?

I would like to, but I am quite aware that using the regular sum function will not help me in that regard.

@phofl
Member

phofl commented Jan 2, 2021

I am currently fixing the groupby bug; this will return .43 in the future.

@liketheflower

liketheflower commented Feb 17, 2023

It is interesting to observe the following:

>>> np.sum(df["rul_c"])
30880496049.429993
>>> df["rul_c"].sum()
30880496049.429993
>>> sum(df["rul_c"].values.tolist())
30880496049.45165
>>> math.fsum(df["rul_c"].values.tolist())
30880496049.43

Python version:

Python 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:50:38) 
[Clang 11.1.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> np.__version__
'1.20.3'
>>> pd.__version__
'1.3.2'
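
A small aside on why math.fsum lands exactly on .43: it maintains multiple exact partial sums, so the result is the correctly rounded float. The classic example from the math.fsum docs:

import math

print(sum([0.1] * 10))        # 0.9999999999999999 (naive accumulation)
print(math.fsum([0.1] * 10))  # 1.0 (exactly rounded)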

@GBlanch

GBlanch commented Dec 16, 2023

Thank you @mzeitlin11 I don't know if it's relevant, but using np.sum: df.groupby("cont")["rul_c"].agg(np.sum) has the same output as df.groupby("cont")["rul_c"].sum()

It is relevant to me, many thanks Sir
