ddof for np.std in df.agg changes depending on how given & lambda expression does not work correctly in a list of functions #18734

Closed
yuzie007 opened this issue Dec 11, 2017 · 3 comments · Fixed by #52371
Labels: good first issue, Needs Tests (unit test(s) needed to prevent regressions)

Comments

@yuzie007

Code Sample, a copy-pastable example if possible

In [31]: import numpy as np

In [32]: import pandas as pd

In [33]: df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])

In [34]: df
Out[34]:
   A  B
0  0  1
1  2  3
2  4  5

In [35]: df.agg(np.std)  # Behavior of ddof=0
Out[35]:
A    1.632993
B    1.632993
dtype: float64

In [36]: df.agg([np.std])  # Behavior of ddof=1
Out[36]:
       A    B
std  2.0  2.0

In [37]: # So how to get the ddof=0 behavior when giving a list of functions?

In [39]: df.agg([lambda x: np.std(x)])  # This gives a numerically unexpected result.
Out[39]:
         A        B
  <lambda> <lambda>
0      0.0      0.0
1      0.0      0.0
2      0.0      0.0

In [40]: df.agg([np.mean, lambda x: np.std(x)])  # This gives an error.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-40-52f4ec4195b5> in <module>()
----> 1 df.agg([np.mean, lambda x: np.std(x)])

/Users/ikeda/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/pandas/core/frame.py in aggregate(self, func, axis, *args, **kwargs)
   4740         if axis == 0:
   4741             try:
-> 4742                 result, how = self._aggregate(func, axis=0, *args, **kwargs)
   4743             except TypeError:
   4744                 pass

/Users/ikeda/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/pandas/core/base.py in _aggregate(self, arg, *args, **kwargs)
    537             return self._aggregate_multiple_funcs(arg,
    538                                                   _level=_level,
--> 539                                                   _axis=_axis), None
    540         else:
    541             result = None

/Users/ikeda/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/pandas/core/base.py in _aggregate_multiple_funcs(self, arg, _level, _axis)
    594         # if we are empty
    595         if not len(results):
--> 596             raise ValueError("no results")
    597
    598         try:

ValueError: no results

Problem description

With df.agg, the ddof (delta degrees of freedom) value used by np.std depends on how the function is passed: a single function gives the ddof=0 result, while a list of functions gives the ddof=1 result. This is confusing, and I believe the behavior should be unified in some way.
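To make the two numbers in the session above concrete, here is a small worked sketch of the ddof arithmetic in plain Python. The helper name `std_with_ddof` is hypothetical (it is not a NumPy or pandas function); it only follows the usual definition, in which the sum of squared deviations is divided by N - ddof.

```python
import math


def std_with_ddof(values, ddof=0):
    """Standard deviation with divisor (N - ddof).

    Hypothetical helper for illustration only; not part of NumPy or pandas.
    """
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared_deviations / (n - ddof))


column_a = [0, 2, 4]  # column A of the example DataFrame

population_std = std_with_ddof(column_a, ddof=0)  # sqrt(8 / 3), about 1.632993
sample_std = std_with_ddof(column_a, ddof=1)      # sqrt(8 / 2) = 2.0
```

The first value matches Out[35] and the second matches Out[36], so the two calls differ only in the divisor.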

Furthermore, I could not find a way to obtain the np.std result with ddof=0 when np.std is supplied as one member of a list of functions. A lambda expression does not work in a list of functions: it gives numerically unexpected results or even raises an error. This prevents us from using many useful methods such as df.agg, df.apply, and df.describe when we want the ddof=0 behavior.

From #13344, I gather that the developers prefer the ddof=1 behavior in pandas, so the expected behavior would be as below.

Expected Output

In [35]: df.agg(np.std)  # Behavior of ddof=1
Out[35]:
A    2.0
B    2.0
dtype: float64

In [38]: df.agg([lambda x: np.std(x)])  # To obtain the ddof=0 results
Out[38]:
                     A             B
<lambda>      1.632993      1.632993

In [41]: df.agg([np.mean, lambda x: np.std(x)])
                     A             B
mean          2.0           3.0
<lambda>      1.632993      1.632993

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None

pandas: 0.21.0
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.3
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung added the Numeric Operations (Arithmetic, Comparison, and Logical operations) label Dec 15, 2017
@mroeschke
Member

This appears to work on master. It could use a test.

In [15]: df.agg(np.std)
Out[15]:
A    2.0
B    2.0
dtype: float64

@mroeschke added the good first issue and Needs Tests (unit test(s) needed to prevent regressions) labels and removed the Numeric Operations (Arithmetic, Comparison, and Logical operations) label Jun 12, 2021
@jcrvz

jcrvz commented Dec 5, 2021

It works by doing:
df.agg([lambda x: np.mean(x), lambda x: np.std(x, ddof=0)])
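As an alternative that does not depend on how agg dispatches lambdas in a given pandas version, the sketch below builds the same kind of summary from DataFrame.mean and DataFrame.std(ddof=0). This is not the approach from the comment above, just one possible workaround; the row labels `mean` and `std_ddof0` and the variable name `summary` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Same frame as in the original report.
df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["A", "B"])

# Compute each statistic with the DataFrame methods, passing ddof explicitly,
# then stack them into one table with a row per statistic.
summary = pd.DataFrame(
    {
        "mean": df.mean(),
        "std_ddof0": df.std(ddof=0),  # population standard deviation
    }
).T

print(summary)
```

Passing ddof explicitly keeps the choice of divisor visible at the call site instead of relying on whichever default the dispatch path picks.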

@sappersapper

sappersapper commented Dec 9, 2021

I encountered a related inconsistency:

import pandas as pd
import numpy as np

df = pd.DataFrame({'id': ['a', 'a', 'b', 'b'], 'value': [1, 3, 5, 7]})
df.groupby('id').agg(np.var)  # calc with ddof=1
# id  value
# a   2
# b   2
df.groupby('id').apply(np.var)  # calc with ddof=0
# id  value
# a   1
# b   1
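One way to sidestep this ambiguity, sketched here under the assumption that only the variance of the `value` column is needed, is to call the groupby `var` method directly and state ddof explicitly. The variable names below are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"id": ["a", "a", "b", "b"], "value": [1, 3, 5, 7]})

grouped = df.groupby("id")["value"]

# Sample variance: divides by (N - 1); this is the pandas default.
sample_var = grouped.var(ddof=1)

# Population variance: divides by N, matching NumPy's default for np.var.
population_var = grouped.var(ddof=0)

print(sample_var)
print(population_var)
```

For group a, the values 1 and 3 have squared deviations summing to 2, so the sample variance is 2 / 1 = 2.0 and the population variance is 2 / 2 = 1.0, which are the two results reported above.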
