BUG: aggfunc use different default arguments in pivot_table #36508

dcsaba89 · 2020-09-20T20:19:47Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['a', 'a'], 'X': [5, 9]})

# standard deviation of X:
np.std(df['X'])
# 2.0

# aggfunc returns the same standard deviation as np.std as expected with this trick
pd.pivot_table(df, index='A', values='X', aggfunc=lambda x: np.std(x))
#        X
#  A
#  a     2

# aggfunc returns the unbiased standard deviation unexpectedly
pd.pivot_table(df, index='A', values='X', aggfunc=np.std)

#               X
#  A
#  a     2.828427

Problem description

The default value of ddof for np.std is 0:
numpy.std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<no value>)

When np.std is passed to aggfunc to calculate the standard deviation of X, it returns the unbiased (sample) standard deviation, because ddof=1 is used instead of the np.std default ddof=0.
The expected behavior is that, for any function f,
aggfunc=f and aggfunc=lambda x: f(x) return exactly the same result.
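
The two results above differ only in the divisor. Here is a minimal sketch of that difference for the sample [5, 9], using only Python's standard library rather than NumPy or pandas. The helper name std_with_ddof is hypothetical and introduced purely for illustration; it is not the pandas or NumPy implementation.

```python
import math
import statistics


def std_with_ddof(values, ddof=0):
    """Standard deviation with divisor (n - ddof), mirroring the meaning of ddof."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_deviations / (n - ddof))


x = [5, 9]
population = std_with_ddof(x, ddof=0)  # divisor n, the np.std default
sample = std_with_ddof(x, ddof=1)      # divisor n - 1, what aggfunc=np.std returned here

print(population)  # 2.0
print(sample)      # about 2.828427

# Cross-check against the standard library's own implementations.
assert math.isclose(population, statistics.pstdev(x))
assert math.isclose(sample, statistics.stdev(x))
```

With only two rows in the group, the choice of divisor changes the result by a factor of sqrt(2), which is why the discrepancy is so visible in this example.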

Expected Output

When np.std is passed to aggfunc directly, it is evaluated with ddof=1 (this is unexpected, as the default ddof for np.std is 0).
When np.std is passed to aggfunc wrapped as lambda x: np.std(x), it correctly uses numpy's default ddof=0, as expected.

To summarize, the expected behavior is for pd.pivot_table to use the function's own default arguments when that function is passed as aggfunc.
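
Until the behavior is consistent, one way to avoid depending on it is to state ddof explicitly. The following is a sketch under the assumption that a user-defined wrapper is called as written, just like the lambda above; the names population_std and sample_std are hypothetical helpers, not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["a", "a"], "X": [5, 9]})


def population_std(x):
    # Explicit ddof=0: divide by n (the np.std default).
    return float(np.std(np.asarray(x, dtype=float), ddof=0))


def sample_std(x):
    # Explicit ddof=1: divide by n - 1.
    return float(np.std(np.asarray(x, dtype=float), ddof=1))


pop = pd.pivot_table(df, index="A", values="X", aggfunc=population_std)
samp = pd.pivot_table(df, index="A", values="X", aggfunc=sample_std)

print(pop.loc["a", "X"])   # 2.0
print(samp.loc["a", "X"])  # about 2.828427
```

Converting the group to a plain array with np.asarray before calling np.std keeps the computation inside NumPy, so the result does not depend on how pandas treats np.std itself.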

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 2a7d332
python : 3.8.5.final.0
python-bits : 32
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 1.1.2
numpy : 1.19.2


jreback · 2020-09-20T21:31:28Z

there have been several issues about this - pls search the tracker

denck007 · 2020-10-09T15:02:00Z

@dcsaba89 Apparently it is using the sample (unbiased) standard deviation, see #34437 for info.
@jreback This behavior is unexpected and is not documented. The expected behavior should be to use the defaults of the function passed as aggfunc. This seems to only happen when using np.std. I cannot find any reference to this behavior in the docs or even in the code base itself. If this behavior is documented it may be acceptable, but it is in no way intuitive.

This example demonstrates why this is an issue:

import numpy as np
import pandas as pd

def population_std(x):
    # Population standard deviation: divide by n (ddof=0)
    mean = np.mean(x)
    return (np.sum((x - mean) ** 2) / len(x)) ** 0.5

df = pd.DataFrame({"image": [3, 3, 3, 3, 3], "value": [0, 1, 1, 1, 1]})
df_std = df.pivot_table(values=["value"], index="image", aggfunc=np.std)
df_std_lambda = df.pivot_table(values=["value"], index="image", aggfunc=lambda x: np.std(x))

# This assertion passes because np.std is population based (ddof=0) by default
assert np.isclose(np.std(df["value"].to_numpy()), population_std(df["value"].to_numpy())), "np.std should be based on the population by default"

# This assertion fails because passing np.std directly gives a different result than the lambda wrapper
assert df_std["value"].iloc[0] == df_std_lambda["value"].iloc[0], "pandas pivot table changed the default behavior of np.std"

# This assertion fails because the pivot table aggregation overrides the default behavior of np.std
assert np.std(df["value"].to_numpy()) == df_std["value"].iloc[0], "pandas pivot table agg function changed the default behavior of np.std"

edit: Added the lambda method, showing how aggregating with np.std vs lambda x: np.std(x) creates different results

jreback · 2020-10-09T23:43:54Z

see #13344 and #31782 and #18734

if you want to submit a doc fix more than happy to take it

though pls search first iirc it's already there

dcsaba89 · 2020-12-07T15:28:43Z

I agree with you @denck007, this behavior is unexpected and furthermore undocumented. However, I think this behavior should be fixed to return the biased std_dev as this is what np.std returns by default.

rhshadrach · 2024-05-28T21:39:02Z

Closed by #57444

dcsaba89 added the Bug and Needs Triage labels Sep 20, 2020

dsaxton removed the Needs Triage label Sep 21, 2020

mroeschke added Docs and removed Bug labels Aug 13, 2021

jbrockmendel added the Reshaping label Feb 10, 2023

rhshadrach closed this as completed May 28, 2024
