
groupby.agg changes the default parameter of np.std #13344

Closed
ayhanfuat opened this issue Jun 1, 2016 · 5 comments
Labels: Numeric Operations (Arithmetic, Comparison, and Logical operations), Usage Question

Comments

@ayhanfuat

ayhanfuat commented Jun 1, 2016

Code Sample, a copy-pastable example if possible

As opposed to pandas, the default value for ddof in standard deviation calculations is 0 in numpy. However, when I use groupby.agg(np.std) I get the result for ddof=1.

import numpy as np
import pandas as pd
prng = np.random.RandomState(0)
df = pd.DataFrame({"A": prng.choice(list("abcdef"), 15), 
                   "B": prng.randint(0, 10, 15)})
df.groupby('A')['B'].agg(np.std)

Out[54]: 
A
a    3.511885
b         NaN
c    2.828427
d    3.593976
e    1.527525
f    1.414214
Name: B, dtype: float64

Expected Output

There is a note in the docs stating that the default axis is changed, but the change in ddof is quite unexpected to me. So my expected output is this:

df.groupby('A')['B'].apply(np.std)

Out[61]: 
A
a    2.867442
b    0.000000
c    2.000000
d    3.112475
e    1.247219
f    1.000000
Name: B, dtype: float64
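
For readers comparing the two outputs above: the difference comes down to the divisor, N - ddof, under the square root. A minimal sketch on made-up values (the lists below are hypothetical and are not taken from the issue's data) shows both defaults side by side, including why a single-element group yields NaN in one case and 0 in the other.

```python
import numpy as np
import pandas as pd

# Hypothetical two-element group, chosen so the arithmetic is easy to check.
values = [1, 3]

# NumPy default (ddof=0): divide the summed squared deviations by N.
population_std = np.std(values)

# pandas default (ddof=1): divide by N - 1 instead.
sample_std = pd.Series(values).std()

# Hypothetical single-element group: N - 1 is 0, so the ddof=1 estimate
# is undefined (NaN), while the ddof=0 estimate is exactly 0.
single_sample_std = pd.Series([5]).std()
single_population_std = np.std([5])

print(population_std, sample_std, single_sample_std, single_population_std)
```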

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_GB

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: 4.1.2
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None

@jreback
Contributor

jreback commented Jun 1, 2016

This is as expected (and has been this way for as long as I can remember).

numpy functions are translated to pandas ones. If you really want to use the numpy version, then wrap it in a lambda:

.groupby(...).agg(lambda x: np.std(x))
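
A self-contained sketch of this workaround, reusing the frame from the original report. The string alias "std" stands in for the pandas implementation here; passing np.std itself is deliberately avoided, since how it is handled is exactly the behavior under discussion and may depend on the pandas version.

```python
import numpy as np
import pandas as pd

prng = np.random.RandomState(0)
df = pd.DataFrame({"A": prng.choice(list("abcdef"), 15),
                   "B": prng.randint(0, 10, 15)})
grouped = df.groupby("A")["B"]

# The string alias dispatches to pandas' own std, whose default is ddof=1.
pandas_default = grouped.agg("std")

# The lambda is opaque to pandas, so np.std runs with its own default, ddof=0.
numpy_default = grouped.agg(lambda x: np.std(x))

print(pd.DataFrame({"pandas_default": pandas_default,
                    "numpy_default": numpy_default}))
```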

jreback closed this as completed Jun 1, 2016
jreback added the Usage Question and Numeric Operations (Arithmetic, Comparison, and Logical operations) labels Jun 1, 2016
jreback added this to the No action milestone Jun 1, 2016
@jreback
Contributor

jreback commented Jun 1, 2016

You can also explicitly pass in ddof=0 if you want.
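
The comment does not say which call should receive the keyword. Two spellings that should give the population estimate, sketched on the frame from the original report, are the groupby std method with ddof=0 and the "std" string alias with the keyword forwarded through agg. Treat the choice of these two forms as an assumption about what was meant.

```python
import numpy as np
import pandas as pd

prng = np.random.RandomState(0)
df = pd.DataFrame({"A": prng.choice(list("abcdef"), 15),
                   "B": prng.randint(0, 10, 15)})
grouped = df.groupby("A")["B"]

# Ask pandas' own std for the population estimate directly.
explicit_method = grouped.std(ddof=0)

# Same request through agg: extra keywords are forwarded to the named method.
explicit_agg = grouped.agg("std", ddof=0)

print(explicit_method)
```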

@ayhanfuat
Author

I thought about that possibility, but since it is not translated in df.groupby('A')['B'].apply(np.std), I thought it could be a bug. Thanks.

@EricCousineau-TRI

you can also explicitly pass in ddof=0 if you want.

While some approaches use lambda x: np.std(x), you can also use functools.partial(np.std) to defeat the func is np.std check (which seems to be what triggers this family of behaviors); it also preserves the name.

FWIW, I came here by way of this StackOverflow post: https://stackoverflow.com/a/47699378/7829525
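
A minimal sketch of the functools.partial variant, again on the frame from the original report. It only checks that the wrapped callable is a different object from np.std and that the result matches the ddof=0 estimate; the remark about name preservation is not exercised here.

```python
import functools

import numpy as np
import pandas as pd

prng = np.random.RandomState(0)
df = pd.DataFrame({"A": prng.choice(list("abcdef"), 15),
                   "B": prng.randint(0, 10, 15)})
grouped = df.groupby("A")["B"]

# partial() returns a new callable, so an identity check against np.std fails.
std_partial = functools.partial(np.std)
is_same_object = std_partial is np.std

# The wrapped function therefore runs with NumPy's default, ddof=0.
via_partial = grouped.agg(std_partial)

# The default can also be pinned explicitly in the partial.
via_partial_explicit = grouped.agg(functools.partial(np.std, ddof=0))

print(via_partial)
```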

@stevetracvc

stevetracvc commented Aug 12, 2022

I'm posting here in case someone else runs into the same issue. I want both mean and standard deviation, but I can't pass in ddof to the std function when using more than one aggregate function. So then, if I try to do something like:

df.agg(["mean", lambda x: np.std(x, ddof=0)])

That throws ValueError: cannot combine transform and aggregation operations (the same happens if I try to use partial).

So instead I tried to do

df.agg([lambda x: np.mean(x), lambda x: np.std(x, ddof=0)])

But this does not actually work correctly, because it tries to transform each cell instead of performing an actual aggregation.

So this then turns into

df.agg(["mean", lambda x: np.std(x, axis=0, ddof=0)])
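
For anyone who would rather not depend on how agg treats a list that mixes strings and lambdas, here is an alternative sketch that calls each reduction directly and assembles the table by hand. The frame, the key Series, and the output labels are all hypothetical. The second half shows a grouped variant using named aggregation.

```python
import numpy as np
import pandas as pd

# Hypothetical all-numeric frame, used only for illustration.
df = pd.DataFrame({"x": [1.0, 3.0, 5.0], "y": [2.0, 2.0, 8.0]})

# Call each reduction directly, then stack the results into one table
# with one row per statistic and one column per original column.
summary = pd.DataFrame({"mean": df.mean(), "std": df.std(ddof=0)}).T

# Grouped variant: named aggregation lets each output column have its
# own function, so ddof=0 can be set for the standard deviation alone.
keys = pd.Series(["a", "a", "b"], name="key")  # hypothetical group labels
grouped_summary = df.groupby(keys)["x"].agg(
    mean="mean",
    std=lambda s: s.std(ddof=0),
)

print(summary)
print(grouped_summary)
```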
