Different behavior of numpy functions when used as a Python lambda function #31782

Closed
suddensleep opened this issue Feb 7, 2020 · 1 comment
Labels
Numeric Operations (Arithmetic, Comparison, and Logical operations)

@suddensleep

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np

df = pd.DataFrame.from_dict({"a": [1, 1, 1, 1], "b": [1, 2, 3, 4]})
df
    a   b
0   1   1
1   1   2
2   1   3
3   1   4

gdf = df.groupby("a")

gdf['b'].agg([("std1", lambda x: np.std(x))])  # lambda calls NumPy's np.std: ddof = 0
        std1
a          
1  1.118034

gdf['b'].agg([("std2", np.std)])  # np.std is translated to the pandas std: ddof = 1
       std2
a          
1  1.290994

Problem description

To most Python/pandas users, np.std and lambda x: np.std(x) should be interchangeable, as should any other function and its lambda equivalent. However, when np.std is passed directly (the second example), pandas implicitly translates it to its own std implementation, which defaults to ddof = 1, whereas the lambda calls NumPy's np.std, which defaults to ddof = 0. The two calls therefore use different default values of the ddof parameter, which can lead to real mismatches when refactoring code.
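To make the size of the mismatch concrete, here is a minimal sketch that reproduces both numbers from the code sample using NumPy alone. The variable names `population_std` and `sample_std` are mine, chosen for illustration; the data is column b from the example above.

```python
import numpy as np

# Column "b" from the example above.
b = np.array([1, 2, 3, 4])

# NumPy's default is ddof=0: divide the sum of squared deviations by N.
population_std = np.std(b)

# ddof=1 divides by N - 1 instead, which matches the pandas default.
sample_std = np.std(b, ddof=1)

print(round(float(population_std), 6))  # 1.118034
print(round(float(sample_std), 6))      # 1.290994
```

The sum of squared deviations from the mean (2.5) is 5, so the two results are sqrt(5/4) and sqrt(5/3) respectively.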

Given the discussion on closed Issue #13344, I gather that this is intended behavior, but it is nevertheless confusing. We may want to consider allowing for comparable output, at least as a default; if users want to override that behavior, they can explicitly provide a value for ddof.

Expected Output

Both aggregations return the ddof = 1 result when no parameters are passed to the lambda:

       std2
a          
1  1.290994
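Until the defaults agree, one way to sidestep the ambiguity is to pass ddof explicitly, so the result no longer depends on whether pandas translates the function. This is only a sketch of a possible workaround, not a proposed change to pandas; the column names `std_ddof0` and `std_ddof1` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1, 1], "b": [1, 2, 3, 4]})
gdf = df.groupby("a")

# Spelling out ddof inside each lambda removes any reliance on defaults.
result = gdf["b"].agg(
    [
        ("std_ddof0", lambda x: np.std(x, ddof=0)),
        ("std_ddof1", lambda x: np.std(x, ddof=1)),
    ]
)
print(result)
```

With ddof stated in the call, a later refactor between the lambda form and a named function cannot silently change the statistic being computed.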

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.8.1.final.0
python-bits : 64
OS : Darwin
OS-release : 19.2.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.1.0.post20200119
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@jreback
Contributor

jreback commented Feb 7, 2020

As you have read the linked issue, you should understand that a) this is the default and b) this is impossible to change.

Happy to take even better documentation.

Closing as won't fix.

@jreback jreback closed this as completed Feb 7, 2020
@jreback jreback added this to the No action milestone Feb 7, 2020
@jreback jreback added the Numeric Operations Arithmetic, Comparison, and Logical operations label Feb 7, 2020