BUG: pivot_table calculates incorrect standard deviation when np.std passed to aggfunc #34437

Closed
2 of 3 tasks
dcsaba89 opened this issue May 28, 2020 · 3 comments

@dcsaba89

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


`
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a'], 'X': [2.8597, 2.8503]})
pivot = pd.pivot_table(df, index='A', values='X', aggfunc=(np.mean, np.std))
my_std = np.std(df.X) # 0.0047000000000001485
pivot_std = pivot.loc['a', 'std'] # 0.006646803743153757
`

Problem description

pivot_table calculates an incorrect standard deviation when np.std is passed to aggfunc.

Expected Output

NumPy calculates the correct standard deviation in the given example.

dcsaba89 added the Bug and Needs Triage labels May 28, 2020
@dsaxton
Member

dsaxton commented May 29, 2020

pivot_table is using the unbiased estimate of variance where we divide by n - 1 instead of n:

import numpy as np

np.std([2.8597, 2.8503], ddof=1)
# 0.006646803743153757
np.std([2.8597, 2.8503], ddof=0)
# 0.0047000000000001485
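The gap between the two numbers comes down to the default `ddof`. As a minimal sketch (the variable names are illustrative, not from the original report), the following compares NumPy's default with the pandas `Series.std` default on the same two values:

```python
import numpy as np
import pandas as pd

values = [2.8597, 2.8503]

# NumPy defaults to ddof=0: divide by n (population standard deviation).
population_std = np.std(values)

# Passing ddof=1 divides by n - 1 instead (sample standard deviation).
sample_std = np.std(values, ddof=1)

# pandas defaults to ddof=1, so Series.std() matches the n - 1 result.
series_std = pd.Series(values).std()

print(population_std, sample_std, series_std)
```

With only two observations the two estimates differ by a factor of sqrt(n / (n - 1)) = sqrt(2), which is why the discrepancy is so visible in this example.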

dsaxton added the Usage Question label and removed the Bug and Needs Triage labels May 29, 2020
@mroeschke
Member

Going to close as a usage question.

@dcsaba89
Author

This difference is unexpected:

df = pd.DataFrame({'A': ['a', 'a'], 'X': [2.8597, 2.8503]})
pivot = pd.pivot_table(df, index='A', values='X', aggfunc=(np.std, lambda x: np.std(x)))
lambda_std = pivot.loc['a', '<lambda_0>'] # 0.0047000000000001485
std = pivot.loc['a', 'std'] # 0.006646803743153757

In the example above, the standard deviation is calculated twice. In the first case, np.std is passed directly to aggfunc; in the second, a lambda that calls the same np.std function is passed. The results in these two cases should be equal.
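One way to sidestep the ambiguity, following the `ddof` explanation earlier in the thread, is to state `ddof` explicitly instead of relying on how a bare `np.std` is handled. The sketch below is an illustration only: the helper names `std_population` and `std_sample` are hypothetical, and the string `"std"` is used to request the pandas default rather than passing `np.std` itself.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a'], 'X': [2.8597, 2.8503]})


def std_population(x):
    # Hypothetical helper: divide by n (NumPy's default, ddof=0).
    return np.std(x, ddof=0)


def std_sample(x):
    # Hypothetical helper: divide by n - 1 (ddof=1).
    return np.std(x, ddof=1)


# "std" asks pandas for its own std, which defaults to ddof=1.
pivot = pd.pivot_table(
    df, index='A', values='X',
    aggfunc=['std', std_population, std_sample],
)

# With a list of aggregations, columns are (aggregation name, value column).
pandas_default = pivot.loc['a', ('std', 'X')]
population = pivot.loc['a', ('std_population', 'X')]
sample = pivot.loc['a', ('std_sample', 'X')]

print(pandas_default, population, sample)
```

Naming the helpers also gives the result columns readable labels instead of `<lambda_0>`, and the choice of estimator is visible at the call site.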
