BUG: sum vs groupby.sum errors #38778


Closed
cipriantrofin opened this issue Dec 29, 2020 · 12 comments · Fixed by #38903
Labels: Enhancement, Groupby, Needs Discussion, Numeric Operations

Comments

@cipriantrofin

data_source.zip

The attached data source is a subset of a larger set. I included just enough rows to show the error (with a larger data source, the difference between the sums grows as well).

Code Sample, a copy-pastable example

import pandas as pd
import numpy as np

pd.options.display.float_format = '{:.5f}'.format

df = pd.read_csv("data_source.csv", index_col=0)
df.reset_index(inplace=True)

print("Regular sum: %s\n" % df["rul_c"].sum())
print("Regular sum on filtered column: %s\n" % df[df["cont"] == 20]["rul_c"].sum())
print("GroupBy sum:\n%s" % df.groupby("cont")["rul_c"].sum())

Problem description

I know about floating point math and the small "errors" that come with it, but the "cont" column has a single unique, non-null value, which means all of the sums above should be identical (about 30880496049.43). However, the groupby.sum result ends in .45165, quite different from the expected value.

The larger the dataset, the larger the difference.

Regular sum: 30880496049.429993

Regular sum on filtered column: 30880496049.429993

GroupBy sum:
cont
20   30880496049.45165
Name: rul_c, dtype: float64

Expected Output

All sums should be 30880496049.429993

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 3e89b4c
python : 3.8.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Romanian_Romania.1252

pandas : 1.2.0
numpy : 1.19.4
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 47.1.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.3.4
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None

@cipriantrofin added the Bug and Needs Triage labels Dec 29, 2020
@mzeitlin11
Member

Thanks for the report @cipriantrofin! The current implementation being used for groupby sums is here:

def _group_add(complexfloating_t[:, :] out,

which does not account for floating point error. Looking at the notes here (https://numpy.org/doc/stable/reference/generated/numpy.sum.html), numpy may be doing a partial pairwise summation, leading to a more accurate result. I'm not sure of the best way to handle this: pandas could implement something similar, but it would likely be slower, so a more accurate algorithm would probably have to be opt-in to avoid slowing down code that doesn't care about floating point error accumulation.
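
For context, here is a minimal Python sketch of pairwise summation, the general technique the numpy notes describe. It is an illustration only, not numpy's actual C implementation; the base-case cutoff of 128 is just a plausible choice for the sketch.

def pairwise_sum(values, block=128):
    # Recursively split and add the halves: rounding error grows roughly
    # O(log n) instead of the O(n) of naive left-to-right accumulation.
    n = len(values)
    if n <= block:
        total = 0.0
        for v in values:
            total += v  # naive accumulation on small base cases
        return total
    mid = n // 2
    return pairwise_sum(values[:mid], block) + pairwise_sum(values[mid:], block)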

@mzeitlin11 added the Needs Discussion, Groupby, Enhancement, and Numeric Operations labels and removed the Bug and Needs Triage labels Dec 29, 2020
@cipriantrofin
Author

Thank you @mzeitlin11
I don't know if it's relevant, but using np.sum:
df.groupby("cont")["rul_c"].agg(np.sum)
has the same output as
df.groupby("cont")["rul_c"].sum()

@mzeitlin11
Member

Thank you @mzeitlin11
I don't know if it's relevant, but using np.sum:
df.groupby("cont")["rul_c"].agg(np.sum)
has the same output as
df.groupby("cont")["rul_c"].sum()

Interesting! I'd have to look into it more to confirm, but my guess is that np.sum is not using partial pairwise summation in this case; the doc notes say "Technically, to provide the best speed possible, the improved precision is only used when the summation is along the fast axis in memory."

@cipriantrofin
Author

I found that using math.fsum as an aggregate function gives better (best?) results.
Of course, it is slower.
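
For example, a minimal sketch of that approach, reusing the df from the code sample above (math.fsum accepts any iterable of floats, so it can be passed directly as an aggregation function):

import math

# math.fsum keeps exact partial sums, so the result is correctly
# rounded; the trade-off is a Python-level call per group.
df.groupby("cont")["rul_c"].agg(math.fsum)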

@jreback
Contributor

jreback commented Dec 29, 2020

The issue is that we need to use Kahan summation in both of these (the 1d case actually does if dispatched through bottleneck), so groupby needs it too.

We recently implemented this for rolling.
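
For reference, a minimal sketch of scalar Kahan (compensated) summation, the technique being referenced here; the actual pandas fix lives in the Cython groupby code, not in Python like this:

def kahan_sum(values):
    # Compensated summation: carry the low-order bits lost at each add.
    total = 0.0
    comp = 0.0  # running compensation for lost low-order bits
    for v in values:
        y = v - comp            # apply the correction from the previous step
        t = total + y           # low-order digits of y may be lost here
        comp = (t - total) - y  # recover what was just lost
        total = t
    return total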

@phofl
Member

phofl commented Jan 2, 2021

The regular sum is wrong too. None of the values in the dataset has more than two decimal places. I checked this with

df["rounded"] = df["rul_c"].round(2)
print("Regular sum: %s\n" % df["rounded"].sum())

and it returns the same result, which is obviously wrong.

@phofl
Member

phofl commented Jan 2, 2021

@cipriantrofin Could you adjust your expected output to 30880496049.43?

@cipriantrofin
Author

The regular sum is wrong too. None of the values in the dataset has more than two decimal places. I checked this with

df["rounded"] = df["rul_c"].round(2)
print("Regular sum: %s\n" % df["rounded"].sum())

and it returns the same result, which is obviously wrong.

You are right, of course, but we are dealing with floating point operations and some "errors" are expected. However, rounding the regular sum to two digits returns the right answer, and that is fine with me.
On the other hand, the groupby sum is significantly off from the expected answer.

@cipriantrofin
Author

@cipriantrofin Could you adjust your expected output to 30880496049.43?

I would like to, but I am quite aware that using the regular sum function will not help me in that regard.

@phofl
Member

phofl commented Jan 2, 2021

I am currently fixing the groupby bug; this will return .43 in the future.

@liketheflower

liketheflower commented Feb 17, 2023

It is interesting to observe the following:

>>> np.sum(df["rul_c"])
30880496049.429993
>>> df["rul_c"].sum()
30880496049.429993
>>> sum(df["rul_c"].values.tolist())
30880496049.45165
>>> math.fsum(df["rul_c"].values.tolist())
30880496049.43

Python version:

Python 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:50:38) 
[Clang 11.1.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> np.__version__
'1.20.3'
>>> pd.__version__
'1.3.2'
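
A small aside on why math.fsum lands exactly on .43: it maintains multiple exact partial sums, so the result is the correctly rounded float. The classic example from the math.fsum docs:

import math

print(sum([0.1] * 10))        # 0.9999999999999999 (naive accumulation)
print(math.fsum([0.1] * 10))  # 1.0 (exactly rounded)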

@GBlanch

GBlanch commented Dec 16, 2023

Thank you @mzeitlin11 I don't know if it's relevant, but using np.sum: df.groupby("cont")["rul_c"].agg(np.sum) has the same output as df.groupby("cont")["rul_c"].sum()

It is relevant to me, many thanks Sir
