ddof for np.std in df.agg changes depending on how given & lambda expression does not work correctly in a list of functions #18734

yuzie007 · 2017-12-11T22:19:25Z

Code Sample, a copy-pastable example if possible

In [31]: import numpy as np

In [32]: import pandas as pd

In [33]: df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])

In [34]: df
Out[34]:
   A  B
0  0  1
1  2  3
2  4  5

In [35]: df.agg(np.std)  # Behavior of ddof=0
Out[35]:
A    1.632993
B    1.632993
dtype: float64

In [36]: df.agg([np.std])  # Behavior of ddof=1
Out[36]:
       A    B
std  2.0  2.0

In [37]: # So how to get the ddof=0 behavior when giving a list of functions?

In [39]: df.agg([lambda x: np.std(x)])  # This gives a numerically unexpected result.
Out[39]:
         A        B
  <lambda> <lambda>
0      0.0      0.0
1      0.0      0.0
2      0.0      0.0

In [40]: df.agg([np.mean, lambda x: np.std(x)])  # This gives an error.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-40-52f4ec4195b5> in <module>()
----> 1 df.agg([np.mean, lambda x: np.std(x)])

/Users/ikeda/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/pandas/core/frame.py in aggregate(self, func, axis, *args, **kwargs)
   4740         if axis == 0:
   4741             try:
-> 4742                 result, how = self._aggregate(func, axis=0, *args, **kwargs)
   4743             except TypeError:
   4744                 pass

/Users/ikeda/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/pandas/core/base.py in _aggregate(self, arg, *args, **kwargs)
    537             return self._aggregate_multiple_funcs(arg,
    538                                                   _level=_level,
--> 539                                                   _axis=_axis), None
    540         else:
    541             result = None

/Users/ikeda/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/pandas/core/base.py in _aggregate_multiple_funcs(self, arg, _level, _axis)
    594         # if we are empty
    595         if not len(results):
--> 596             raise ValueError("no results")
    597
    598         try:

ValueError: no results

Problem description

When a function is passed to, e.g., df.agg, the ddof (delta degrees of freedom) value used for np.std changes depending on how the function is given (as a single function or in a list of functions), which is confusing. I believe the behavior should be unified in some way.
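
For reference, the two numbers in the session above correspond to the two ddof conventions. The following is a minimal sketch using column A of the example frame; the variable names are illustrative only and are not part of the original report.

```python
import numpy as np
import pandas as pd

values = np.array([0, 2, 4])  # column A of the example frame

# np.std divides the summed squared deviations by N - ddof and defaults to ddof=0.
population_std = np.std(values)
# ddof=1 divides by N - 1 instead.
sample_std = np.std(values, ddof=1)

# pandas' own Series.std defaults to ddof=1, the opposite of NumPy's default.
pandas_default_std = pd.Series(values).std()
pandas_population_std = pd.Series(values).std(ddof=0)

print(population_std, sample_std)
print(pandas_default_std, pandas_population_std)
```

So 1.632993 is the ddof=0 value and 2.0 is the ddof=1 value, whichever library computes them.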

Furthermore, I could not find a way to obtain the np.std result with ddof=0 when supplying it as one member of a list of functions. A lambda expression does not work well in a list of functions (it gives numerically unexpected results or even raises an error). This prevents us from using many useful methods such as df.agg, df.apply, and df.describe when we want the ddof=0 behavior.

From #13344, I gather that the developers prefer the ddof=1 behavior in pandas, so the expected behavior should be as below.

Expected Output

In [35]: df.agg(np.std)  # Behavior of ddof=1
Out[35]:
A    2.0
B    2.0
dtype: float64

In [38]: df.agg([lambda x: np.std(x)])  # To obtain the ddof=0 results
Out[38]:
                     A             B
<lambda>      1.632993      1.632993

In [41]: df.agg([np.mean, lambda x: np.std(x)])
                     A             B
mean          2.0           3.0
<lambda>      1.632993      1.632993

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None

pandas: 0.21.0
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.3
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


mroeschke · 2021-06-12T19:39:28Z

This looks to work on master. Could use a test

In [15]: df.agg(np.std)
Out[15]:
A    2.0
B    2.0
dtype: float64

jcrvz · 2021-12-05T16:06:30Z

It works by doing:
df.agg([lambda x: np.mean(x), lambda x: np.std(x, ddof=0)])
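
How callables in a list are dispatched by df.agg appears to have varied between pandas versions, so an alternative that does not rely on it may be useful. The sketch below is an assumption-labeled workaround, not the approach taken in pandas itself: it calls the DataFrame methods with an explicit ddof and assembles the table by hand. The name `summary` is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])

# Compute each statistic with the DataFrame method and state ddof explicitly,
# so the result does not depend on how agg handles a list of callables.
summary = pd.DataFrame({
    'mean': df.mean(),
    'std': df.std(ddof=0),
}).T  # transpose so statistics are rows and columns stay A, B

print(summary)
```

The resulting frame has the same layout as the expected output in the original report, with rows `mean` and `std`.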

sappersapper · 2021-12-09T03:10:12Z

I encountered a related inconsistency:

import pandas as pd
import numpy as np

df = pd.DataFrame({'id': ['a', 'a', 'b', 'b'], 'value': [1, 3, 5, 7]})

df.groupby('id').agg(np.var)  # calc with ddof=1

id	value
a	2
b	2

df.groupby('id').apply(np.var)  # calc with ddof=0

id	value
a	1
b	1
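
One way to sidestep the ambiguity in the groupby case is to call the groupby `var` method and pass ddof explicitly instead of handing np.var to agg or apply. This is a minimal sketch on the same data; `sample_var` and `population_var` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'id': ['a', 'a', 'b', 'b'], 'value': [1, 3, 5, 7]})

grouped = df.groupby('id')['value']

# The groupby var method defaults to ddof=1.
sample_var = grouped.var()
# Passing ddof=0 requests the population variance explicitly.
population_var = grouped.var(ddof=0)

print(sample_var)
print(population_var)
```

With ddof stated at the call site, the result no longer depends on which of agg or apply would have been used.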

gfyoung added the Numeric Operations Arithmetic, Comparison, and Logical operations label Dec 15, 2017

jreback mentioned this issue Oct 9, 2020

BUG: aggfunc use different default arguments in pivot_table #36508

Closed


mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Numeric Operations Arithmetic, Comparison, and Logical operations labels Jun 12, 2021

steliospetrakis02 mentioned this issue Apr 3, 2023

Add a test for agg std #52371

Merged


mroeschke closed this as completed in #52371 Apr 4, 2023
