
BUG: GroupBy.std floating point error #51332


Open
jbrockmendel opened this issue Feb 11, 2023 · 3 comments
Labels
Bug, Groupby, Reduction Operations (sum, mean, min, max, etc.)

Comments

@jbrockmendel
Member

jbrockmendel commented Feb 11, 2023

import pandas as pd

values = pd.timedelta_range("1 Day", periods=10_000).asi8
ser = pd.Series(values)
gb = ser.groupby(list(range(5)) * 2000)

result = gb.std()
expected = pd.Series([gb.get_group(i).std() for i in gb.groups])

>>> result - expected
0    192.0
1    192.0
2    192.0
3    192.0
4    192.0
dtype: float64

In the groupby std method we cast from ints to floats, which I don't think we do in the relevant Series code (in nanops). This is my best guess for the culprit in this mismatch.
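As a general illustration of why an int-to-float cast can matter (a minimal sketch, not a reproduction of the pandas internals): float64 has a 53-bit significand, so some int64 values above 2**53 change when cast to float. The array `big` below is a made-up example, and this alone does not establish that the cast causes the mismatch reported here.

```python
import numpy as np

# Hypothetical values chosen to sit right at the float64 precision limit.
big = np.array([2**53, 2**53 + 1, 2**53 + 2], dtype=np.int64)

# Cast to float64 and back, as an int -> float conversion step would do.
as_float = big.astype(np.float64)
round_trip = as_float.astype(np.int64)

# True where the value did not survive the round trip.
lost = big != round_trip
print(lost)  # [False  True False]
```

Any computation that stays in integer arithmetic for longer would not see this rounding, which is one way two code paths can disagree.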

@jbrockmendel added the Bug and Needs Triage labels Feb 11, 2023
@phofl added the Groupby label and removed the Needs Triage label Feb 23, 2023
@jbrockmendel added the Reduction Operations label Jun 7, 2023
@matt-hoskins

matt-hoskins commented Jul 10, 2023

While writing test cases for some of my code, I hit an oddity. I computed the std of some values directly to work out what a routine that performs multiple aggregations over groups should output, and for one set of values the std from the groupby came out very slightly different.

When I searched the list of open issues I saw this bug report and wondered if what I'm observing is related (or a misunderstanding on my part). I pared it down to a short bit of code to illustrate:

import pandas

score_list = [1,1,2]

stdseries = pandas.Series(score_list).std()
stdgb = pandas.DataFrame({'A':['a' for x in score_list],'B':score_list}).groupby('A').std()['B']['a']

print(stdseries==stdgb)
print(stdseries)
print(stdgb)

And the output:

False
0.5773502691896257
0.5773502691896258

@jbrockmendel
Member Author

> When I searched the list of open issues I saw this bug report and wondered if what I'm observing is related (or a misunderstanding on my part)

I expect the difference is due to having different implementations of std for Series vs GroupBy, xref #53261
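To illustrate how two correct implementations of std can disagree in the last bits, here is a minimal pure-Python sketch comparing a two-pass formula with Welford's online update. The functions `two_pass_std` and `welford_std` are hypothetical helpers written for this example; they are not claimed to match the actual Series or GroupBy code paths in pandas.

```python
import math


def two_pass_std(values, ddof=1):
    """Compute the mean first, then sum the squared deviations."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((x - mean) ** 2 for x in values)
    return math.sqrt(ss / (n - ddof))


def welford_std(values, ddof=1):
    """Update a running mean and sum of squares one value at a time."""
    count = 0
    mean = 0.0
    m2 = 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return math.sqrt(m2 / (count - ddof))


scores = [1, 1, 2]
a = two_pass_std(scores)
b = welford_std(scores)

# Exact equality may or may not hold; both agree to within rounding error.
print(a, b, a == b)
print(math.isclose(a, b, rel_tol=1e-12))
```

Both functions compute the same mathematical quantity, but they round at different intermediate steps, so a test that compares their results with `==` is fragile.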

@matt-hoskins

> > When I searched the list of open issues I saw this bug report and wondered if what I'm observing is related (or a misunderstanding on my part)
>
> I expect the difference is due to having different implementations of std for Series vs GroupBy, xref #53261

I guess the implementation difference in std for GroupBy is also why the GroupBy version of it seems to give different results depending on the order of the values fed into it...

import pandas

score_list2 = [2.142857142857143, 2, 1.75]
stdgb1 = pandas.DataFrame({'A':['a' for x in score_list2],'B':score_list2}).groupby('A').std(ddof=1)['B']['a']
stdgb2 = pandas.DataFrame({'A':['a' for x in score_list2],'B':sorted(score_list2)}).groupby('A').std(ddof=1)['B']['a']
print(stdgb1 == stdgb2)
print(stdgb1)
print(stdgb2)

...

False
0.19884872724392919
0.19884872724392935

It took me a while to work out why a test I wrote, for code that aggregates via groupby, computed a different std in some cases even when the expected value also came from groupby. Only when I double-checked the problematic group of values did I notice that the order was the only difference.
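Order dependence of this kind is expected whenever a result is accumulated in floating point, because floating-point addition is not associative. The sketch below shows that with a textbook example and then compares the two std values printed above using a tolerance instead of `==`. The variable names are made up for illustration, and `math.isclose` is just one possible choice of tolerant comparison for a test.

```python
import math

# Floating-point addition is not associative: grouping changes the rounding.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# The two results reported above for the same values in different order.
std_original_order = 0.19884872724392919
std_sorted_order = 0.19884872724392935

print(std_original_order == std_sorted_order)  # False
# A tolerant comparison treats them as equal (default rel_tol is 1e-09).
print(math.isclose(std_original_order, std_sorted_order))  # True
```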
