BUG: Groupby mean differs depending on where np.inf value was introduced
#50367
Comments
Thanks for the report; I'm not able to reproduce on main - I get NaN in both cases. What version of NumPy are you using? |
@rhshadrach I have |
I can't reproduce either, on pandas 1.5.2 or on main:
In [6]: pdf
Out[6]:
x y z
0 inf 1.0 alice
1 1.0 1.0 bob
2 2.0 1.0 alice
3 3.0 1.0 bob
4 4.0 1.0 alice
5 5.0 1.0 bob
In [7]: pdf.groupby("z").x.mean()
Out[7]:
z
alice NaN
bob 3.0
Name: x, dtype: float32 |
OK, I had a typo in my initial example. The problem depends on whether the inf is at index 0 or in the last position. Here is the new reproducer:
import numpy as np
import pandas as pd

d = {"x": [1, 2, 3, 4, 5, np.inf],  # notice inf at the end
     "y": [1, 1, 1, 1, 1, 1],
     "z": ["alice", "bob", "alice", "bob",
           "alice", "bob"]}
pdf = pd.DataFrame(d).astype({"x": "float32", "y": "float32", "z": "string"})
pdf.groupby("z").x.mean()
"""returns
z
alice 3.0
bob inf
Name: x, dtype: float32
"""
d2 = {"x": [np.inf, 1, 2, 3, 4, 5],  # inf at the beginning
      "y": [1, 1, 1, 1, 1, 1],
      "z": ["alice", "bob", "alice", "bob",
            "alice", "bob"]}
pdf2 = pd.DataFrame(d2).astype({"x": "float32", "y": "float32", "z": "string"})
pdf2.groupby("z").x.mean()
"""returns
z
alice NaN
bob 3.0
Name: x, dtype: float32
"""
I can reproduce with both numpy |
ok this reproduces, thanks |
Should I update the main comment, or leave it as is? |
@ncclementi - I'd recommend correcting the OP. |
take |
Typo in the OP: while creating the DataFrame called pdf2, d is passed instead of d2. It should be:
pdf2 = pd.DataFrame(d2).astype({"x": "float32", "y": "float32", "z": "string"}) |
Good catch, I updated the OP. Just to be clear, the bug is still present with the correction. |
Simpler reproducer
Further investigations and PRs to fix are welcome! |
I came across a similar issue, but from a different starting point.
>>> import pandas as pd
>>> import numpy as np
>>> data = pd.DataFrame({"group": [1, 2, 2, 3, 3, 3], "values": [np.inf, np.inf, np.inf, np.inf, np.inf, 13.0]})
>>> data
group values
0 1 inf
1 2 inf
2 2 inf
3 3 inf
4 3 inf
5 3 13.0
>>> data.groupby("group").agg("sum")
values
group
1 inf
2 NaN
3 NaN
>>> data.groupby("group").agg(np.sum)
values
group
1 inf
2 NaN
3 NaN
>>> data.groupby("group").agg(lambda x: np.sum(x))
values
group
1 inf
2 inf
3 inf
>>> data.groupby("group").agg(lambda x: x.sum())
values
group
1 inf
2 inf
3 inf
Versions:
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python : 3.8.13.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Mon Aug 22 20:17:10 PDT 2022; root:xnu-8020.140.49~2/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.5.2
numpy : 1.24.1
pytz : 2022.7
dateutil : 2.8.2
setuptools : 65.5.0
pip : 22.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.8.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None |
In groupby.pyx, at line 1076, we are summing np.inf and np.nan.
At the 5th iteration we can see that
That's the reason for the discrepancy. @rhshadrach |
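To make the order dependence concrete, here is a minimal pure-Python sketch. It assumes the grouped sum behind mean uses Kahan (compensated) summation; that is an assumption about the Cython code, not a copy of it, and `kahan_sum` is a hypothetical helper name. Once the running total becomes inf, the compensation term evaluates inf - inf, which is NaN, and that NaN poisons every later addition. If inf is the last value in the group, no later addition happens and the result stays inf.

```python
import math


def kahan_sum(values):
    """Compensated (Kahan) summation over an iterable of floats."""
    total = 0.0
    comp = 0.0  # running estimate of the rounding error lost so far
    for v in values:
        y = v - comp
        t = total + y
        # Once t is inf, (t - total) - y ends in inf - inf, which is nan.
        comp = (t - total) - y
        total = t
    return total


# Values of group "alice" in pdf2 (inf first) and group "bob" in pdf (inf last).
inf_first = kahan_sum([math.inf, 2.0, 4.0])
inf_last = kahan_sum([2.0, 4.0, math.inf])
print(inf_first, inf_last)  # nan inf
```

The same sketch gives NaN for a group holding two inf values, which is consistent with the agg("sum") output reported above for groups 2 and 3.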
@rhshadrach I have created a PR. Please take a look. If you can confirm that it is correct, I'll go ahead and write test cases for it. |
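For readers following along, one possible shape of a fix can be sketched in pure Python. This is not necessarily what the PR does; it only illustrates the idea of discarding the compensation term whenever it turns into NaN, under the assumption that the grouped sum uses Kahan summation. `guarded_kahan_sum` is a hypothetical name.

```python
import math


def guarded_kahan_sum(values):
    """Kahan summation that resets the compensation term when it is NaN."""
    total = 0.0
    comp = 0.0
    for v in values:
        y = v - comp
        t = total + y
        comp = (t - total) - y
        if math.isnan(comp):
            # inf - inf produced NaN; drop the correction instead of
            # letting it leak into the next addition.
            comp = 0.0
        total = t
    return total


cases = ([math.inf, 2.0, 4.0], [2.0, 4.0, math.inf], [math.inf, math.inf])
results = [guarded_kahan_sum(xs) for xs in cases]
print(results)  # [inf, inf, inf]
```

With this guard the result no longer depends on where inf sits in the group, while a genuine NaN input or inf + (-inf) still yields NaN because the total itself becomes NaN.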
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
EDIT: I edited the reproducible example because the initial one had some issues. The comments saying they couldn't reproduce are based on the initial post, which contained errors.
Reproducible Example (latest)
Issue Description
Regardless of how inf values are inserted in the dataframe, the groupby aggregation should be consistent.
Expected Behavior
It is not clear to me whether it should return NaN or Inf, but it should be consistent.
Installed Versions
pandas=1.5.2
numpy=1.23.5 / tried 1.24.0 too