BUG: aggfunc use different default arguments in pivot_table #36508

Closed
2 of 3 tasks
dcsaba89 opened this issue Sep 20, 2020 · 5 comments
Labels
Docs; Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

dcsaba89 commented Sep 20, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['a', 'a'], 'X': [5, 9]})

# standard deviation of X:
np.std(df['X'])
# 2.0

# aggfunc returns the same standard deviation as np.std as expected with this trick
pd.pivot_table(df, index='A', values='X', aggfunc=lambda x: np.std(x))
#        X
#  A
#  a     2

# aggfunc returns the unbiased standard deviation unexpectedly
pd.pivot_table(df, index='A', values='X', aggfunc=np.std)

#               X
#  A
#  a     2.828427

Problem description

The default value of ddof for np.std is 0:
numpy.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>)

When np.std is passed to aggfunc to calculate the standard deviation of X, it returns the unbiased (sample) standard deviation, because ddof=1 is used instead of np.std's default ddof=0.
However, for any function f, the expected behavior is that aggfunc=f and aggfunc=lambda x: f(x) return exactly the same result.

Expected Output

  1. When np.std is passed to aggfunc directly, it uses ddof=1 by default (this is unexpected, as the default ddof for np.std is 0).
  2. When np.std is passed to aggfunc as lambda x: np.std(x), it correctly uses numpy's default ddof=0, as expected.

To summarize, the expected behavior is that a function passed to aggfunc in pd.pivot_table is called with its own default arguments.
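
One way to sidestep the ambiguity, sketched here as an illustration and not as anything from the pandas docs, is to pass an aggregation function that sets ddof explicitly. The helper names pop_std and sample_std are hypothetical, and the data is the same two-row frame as in the code sample above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["a", "a"], "X": [5, 9]})


def pop_std(x):
    """Population standard deviation (ddof=0). Hypothetical helper."""
    return np.std(np.asarray(x), ddof=0)


def sample_std(x):
    """Sample standard deviation (ddof=1). Hypothetical helper."""
    return np.std(np.asarray(x), ddof=1)


# With ddof spelled out, the result does not depend on how pandas
# treats a bare np.std passed as aggfunc.
population = pd.pivot_table(df, index="A", values="X", aggfunc=pop_std)
sample = pd.pivot_table(df, index="A", values="X", aggfunc=sample_std)

print(population.loc["a", "X"])  # 2.0
print(sample.loc["a", "X"])      # about 2.828427
```

Because ddof is fixed inside the function, the two results stay distinguishable whatever pandas does with a bare np.std.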

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 2a7d332
python : 3.8.5.final.0
python-bits : 32
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 1.1.2
numpy : 1.19.2

dcsaba89 added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels Sep 20, 2020
Contributor

jreback commented Sep 20, 2020

there have been several issues about this - pls search the tracker

dsaxton removed the Needs Triage (issue that has not been reviewed by a pandas team member) label Sep 21, 2020

denck007 commented Oct 9, 2020

@dcsaba89 Apparently it is using the sample (unbiased) standard deviation, see #34437 for info.
@jreback This behavior is unexpected and is not documented. The expected behavior is for aggfunc to be called with its own defaults. This seems to happen only when using np.std. I cannot find any reference to this behavior in the docs or even in the code base itself. If this behavior were documented it might be acceptable, but it is in no way intuitive.

This example demonstrates why this is an issue:

import numpy as np
import pandas as pd

def population_std(x):
    # population standard deviation: divides by N (equivalent to ddof=0)
    mean = np.mean(x)
    std = (1/len(x) * np.sum((x-mean)**2))**.5
    return std

df = pd.DataFrame({"image":[3,3,3,3,3],'value':[0,1,1,1,1]})
df_std = df.pivot_table(values=['value'],index='image',aggfunc=np.std)
df_std_lambda = df.pivot_table(values=['value'],index='image',aggfunc=lambda x:np.std(x))

# this assertion passes because by default np.std is population based
assert np.std(df['value'].to_numpy()) == population_std(df['value'].to_numpy()), "np.std should be based on the population by default"

# this assertion fails because pandas overrides the default behavior
assert df_std['value'].iloc[0] == df_std_lambda['value'].iloc[0], "pandas pivot table changed the default behavior of np.std"

# this assertion fails because the agg function overrides the default behavior of np.std
assert np.std(df['value'].to_numpy()) == df_std['value'].iloc[0], "pandas pivot table agg function changed the default behavior of np.std"

edit: Added lambda method showing how aggregating with np.std vs lambda x: np.std(x) creates different results
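
A side note on the comparison itself: exact == on floating-point results can be fragile, since two algebraically equal formulas may round differently. The variant below is only a sketch of the same population check using a tolerance. It uses the same values as the example above, and names such as matches_numpy are illustrative.

```python
import numpy as np


def population_std(x):
    """Population standard deviation: divide by N (ddof=0)."""
    x = np.asarray(x, dtype=float)
    return np.sqrt(np.mean((x - x.mean()) ** 2))


values = np.array([0, 1, 1, 1, 1])

# np.std defaults to ddof=0, so it should agree with the hand-written
# population formula up to floating-point rounding.
matches_numpy = bool(np.isclose(population_std(values), np.std(values)))

# The sample estimate (ddof=1) divides by N - 1 and is strictly larger here.
sample = np.std(values, ddof=1)
differs_from_sample = not np.isclose(population_std(values), sample)

print(matches_numpy, differs_from_sample)  # True True
```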

Contributor

jreback commented Oct 9, 2020

see #13344 and #31782 and #18734

if you want to submit a doc fix more than happy to take it

though pls search first iirc it's already there

Author

dcsaba89 commented Dec 7, 2020

I agree with you @denck007, this behavior is unexpected and furthermore undocumented. However, I think this behavior should be fixed to return the biased std_dev as this is what np.std returns by default.
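
For reference, the two libraries simply choose different defaults: np.std on an array uses ddof=0, while pandas' Series.std uses ddof=1. The sketch below only illustrates those documented defaults on the data from the original report. It does not show how pivot_table dispatches np.std internally.

```python
import numpy as np
import pandas as pd

values = pd.Series([5, 9])

# NumPy default: population standard deviation (ddof=0).
numpy_default = np.std(values.to_numpy())

# pandas default: sample standard deviation (ddof=1).
pandas_default = values.std()

# Either library reproduces the other's result once ddof is given.
pandas_population = values.std(ddof=0)
numpy_sample = np.std(values.to_numpy(), ddof=1)

print(numpy_default, pandas_population)  # 2.0 2.0
print(pandas_default, numpy_sample)      # both about 2.828427
```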

mroeschke added the Docs label and removed the Bug label Aug 13, 2021
jbrockmendel added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) label Feb 10, 2023
@rhshadrach
Member

Closed by #57444

7 participants