BUG: GroupBy.std floating point error #51332

jbrockmendel · 2023-02-11T20:49:54Z

import pandas as pd

values = pd.timedelta_range("1 Day", periods=10_000).asi8
ser = pd.Series(values)
gb = ser.groupby(list(range(5)) * 2000)

result = gb.std()
expected = pd.Series([gb.get_group(i).std() for i in gb.groups])

>>> result - expected
0    192.0
1    192.0
2    192.0
3    192.0
4    192.0
dtype: float64

In the groupby std method we cast from ints to floats, which I don't think we do in the relevant Series code (in nanops). This is my best guess for the culprit in this mismatch.
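As a rough illustration of why an int-to-float cast can matter at this magnitude, the sketch below shows that float64 cannot represent every int64 above 2**53, and that nanosecond values for timedeltas of this length are in that range. It only demonstrates the general mechanism; it does not establish that the cast is what causes the 192.0 difference above.

```python
import numpy as np

# 2**53 + 1 is the smallest positive integer that float64 cannot represent exactly.
big = np.array([2**53, 2**53 + 1], dtype=np.int64)
as_float = big.astype(np.float64)

# Two distinct integers collapse to the same float after the cast.
ints_differ = bool(big[0] != big[1])
floats_equal = bool(as_float[0] == as_float[1])
print(ints_differ, floats_equal)

# 10_000 days expressed in nanoseconds is well beyond 2**53, so integer
# arithmetic and float arithmetic on such values can disagree.
ns_per_day = 86_400 * 10**9
exceeds_exact_range = 10_000 * ns_per_day > 2**53
print(exceeds_exact_range)
```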

matt-hoskins · 2023-07-10T08:24:43Z

While writing test cases for some of my code, I hit an oddity. I computed the std of some values directly to work out what a routine that performs multiple aggregations over groups should output, and for one set of values the std computed via groupby gave a very slightly different result.

When I searched the list of open issues I saw this bug report and wondered if what I'm observing is related (or a misunderstanding on my part). I pared it down to a short bit of code to illustrate:

import pandas

score_list = [1,1,2]

stdseries = pandas.Series(score_list).std()
stdgb = pandas.DataFrame({'A':['a' for x in score_list],'B':score_list}).groupby('A').std()['B']['a']

print(stdseries==stdgb)
print(stdseries)
print(stdgb)

And the output:

False
0.5773502691896257
0.5773502691896258
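For test code, one common way to cope with a last-digit difference like this is to compare with a tolerance rather than with `==`. A minimal sketch using the two values printed above as literals (the variable names are only for illustration):

```python
import math

# The two results reported above, copied as literals.
std_series = 0.5773502691896257
std_groupby = 0.5773502691896258

exact_match = std_series == std_groupby
close_match = math.isclose(std_series, std_groupby)  # default rel_tol=1e-09

print(exact_match)
print(close_match)
```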

jbrockmendel · 2023-07-10T17:36:10Z

> When I searched the list of open issues I saw this bug report and wondered if what I'm observing is related (or a misunderstanding on my part)

I expect the difference is due to having different implementations of std for Series vs GroupBy, xref #53261
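To illustrate how two correct implementations of std can disagree in the last bits, here is a sketch comparing a two-pass computation with a one-pass (Welford-style) update. Both are textbook algorithms chosen for illustration; this thread does not state which algorithms the Series and GroupBy code paths actually use, and the function names are hypothetical.

```python
import math


def std_two_pass(values, ddof=1):
    """Compute the mean first, then sum the squared deviations from it."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)
    return math.sqrt(sum_sq / (n - ddof))


def std_welford(values, ddof=1):
    """Update a running mean and sum of squared deviations in one pass."""
    n = 0
    mean = 0.0
    m2 = 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return math.sqrt(m2 / (n - ddof))


data = [1, 1, 2]
two_pass = std_two_pass(data)
welford = std_welford(data)

# The results agree mathematically; any difference is rounding in the last bits.
print(two_pass, welford, two_pass == welford)
```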

matt-hoskins · 2023-07-14T22:22:48Z

> > When I searched the list of open issues I saw this bug report and wondered if what I'm observing is related (or a misunderstanding on my part)
>
> I expect the difference is due to having different implementations of std for Series vs GroupBy, xref #53261

I guess the implementation difference in std for GroupBy is also why the GroupBy version of it seems to give different results depending on the order of the values fed into it...

import pandas

score_list2 = [2.142857142857143, 2, 1.75]
stdgb1 = pandas.DataFrame({'A':['a' for x in score_list2],'B':score_list2}).groupby('A').std(ddof=1)['B']['a']
stdgb2 = pandas.DataFrame({'A':['a' for x in score_list2],'B':sorted(score_list2)}).groupby('A').std(ddof=1)['B']['a']
print(stdgb1 == stdgb2)
print(stdgb1)
print(stdgb2)

...

False
0.19884872724392919
0.19884872724392935

It took me a while to work out why a test for code of mine that aggregates via groupby was computing a different std in some cases, even though the test itself used groupby. Only when I double-checked the problematic group of values did I notice that the order was the only difference.
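Order dependence of this kind is expected from any computation that accumulates floats sequentially, because floating-point addition is not associative. A minimal sketch of the effect, independent of pandas:

```python
# Summing the same three numbers with a different grouping gives a
# different float64 result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)
print(left, right)
```

An accumulator that visits the values in a different order performs its additions in a different grouping, so the last digits of the result can change.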

jbrockmendel added the Bug and Needs Triage labels Feb 11, 2023

phofl added the Groupby label and removed the Needs Triage label Feb 23, 2023

jbrockmendel added the Reduction Operations (sum, mean, min, max, etc.) label Jun 7, 2023
