
BUG: Groupby mean differs depending on where np.inf value was introduced #50367


Closed
2 of 3 tasks
ncclementi opened this issue Dec 20, 2022 · 14 comments · Fixed by #52964
Labels: Bug, Groupby, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)

@ncclementi

ncclementi commented Dec 20, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

EDIT: I edited the reproducible example because the initial one had some errors. The replies below that say they could not reproduce the bug are based on that initial version.

Reproducible Example (latest)

import numpy as np
import pandas as pd

# Case 1: `inf` is the last value of column x
d = {
    "x": [1, 2, 3, 4, 5, np.inf],  # notice inf at the end
    "y": [1, 1, 1, 1, 1, 1],
    "z": ["alice", "bob", "alice", "bob", "alice", "bob"],
}

pdf = pd.DataFrame(d).astype({"x": "float32", "y": "float32", "z": "string"})

pdf.groupby("z").x.mean()
"""returns
z
alice    3.0
bob      inf
Name: x, dtype: float32
"""

# Case 2: `inf` is the first value of column x
d2 = {
    "x": [np.inf, 1, 2, 3, 4, 5],  # inf at the beginning
    "y": [1, 1, 1, 1, 1, 1],
    "z": ["alice", "bob", "alice", "bob", "alice", "bob"],
}

pdf2 = pd.DataFrame(d2).astype({"x": "float32", "y": "float32", "z": "string"})

pdf2.groupby("z").x.mean()
"""returns
z
alice    NaN
bob      3.0
Name: x, dtype: float32
"""

Issue Description

Regardless of where inf values appear in the DataFrame, the groupby aggregation should be consistent.

Expected Behavior

It is not clear to me whether it should return NaN or inf, but it should be consistent.
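As a point of reference, here is a minimal sketch of what a plain left-to-right mean gives for the two affected groups under ordinary IEEE 754 addition. It uses only the standard library; the helper name `naive_mean` and the variable names are hypothetical, and the sketch says nothing about how pandas computes the result internally. It only illustrates that uncompensated floating-point addition returns inf wherever the inf sits.

```python
import math


def naive_mean(values):
    """Plain left-to-right accumulation, with no compensation term."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)


# Values of column x that fall into the affected group in each case.
bob_inf_last = [2.0, 4.0, math.inf]     # group "bob" in `pdf`
alice_inf_first = [math.inf, 2.0, 4.0]  # group "alice" in `pdf2`

print(naive_mean(bob_inf_last))     # inf
print(naive_mean(alice_inf_first))  # inf
```

Under this reading, inf would be the position-independent answer, although the choice between NaN and inf remains a design decision for the maintainers.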

Installed Versions

pandas=1.5.2
numpy=1.23.5 / tried 1.24.0 too

@rhshadrach
Member

rhshadrach commented Dec 20, 2022

Thanks for the report; I'm not able to reproduce on main - I get NaN in both cases. What version of NumPy are you using?

rhshadrach added the Groupby, Missing-data, and Needs Info labels Dec 20, 2022
@ncclementi
Author

@rhshadrach I have numpy=1.23.5

@MarcoGorelli
Member

I can't reproduce either, neither on pandas 1.5.2 nor on main:

In [6]: pdf
Out[6]:
     x    y      z
0  inf  1.0  alice
1  1.0  1.0    bob
2  2.0  1.0  alice
3  3.0  1.0    bob
4  4.0  1.0  alice
5  5.0  1.0    bob

In [7]: pdf.groupby("z").x.mean()
Out[7]:
z
alice    NaN
bob      3.0
Name: x, dtype: float32

ncclementi changed the title from "BUG: Groupby mean differs depending on how np.inf value was introduced" to "BUG: Groupby mean differs depending on where np.inf value was introduced" Dec 21, 2022
@ncclementi
Author

ncclementi commented Dec 21, 2022

OK, I had a typo in my initial example. The result depends on whether the inf is at index 0 or at the last index. Here is the new reproducer:

d ={"x": [1, 2, 3, 4, 5, np.inf], #notice inf at the end
    "y": [1, 1 ,1, 1, 1, 1],
    "z": ["alice", "bob", "alice", "bob", 
          "alice", "bob",]}
          
pdf = pd.DataFrame(d).astype({"x": "float32", "y": "float32","z": "string"})

pdf.groupby("z").x.mean()
"""returns
z
alice    3.0
bob      inf
Name: x, dtype: float32
""" 

d2 ={"x": [ np.inf, 1, 2, 3, 4, 5], #inf at the beginning
    "y": [1, 1 ,1, 1, 1, 1],
    "z": ["alice", "bob", "alice", "bob", 
          "alice", "bob",]}
          
pdf2 = pd.DataFrame(d2).astype({"x": "float32", "y": "float32","z": "string"})

pdf2.groupby("z").x.mean()
"""returns
z
alice    NaN
bob      3.0
Name: x, dtype: float32
"""
   

I can reproduce with both numpy 1.23.5 and 1.24.0

@MarcoGorelli
Member

ok this reproduces, thanks

@ncclementi
Author

Should I update the main comment, or leave it as is?

rhshadrach removed the Needs Info and Needs Triage labels Dec 21, 2022
@rhshadrach
Member

@ncclementi - I'd recommend correcting the OP.

@parthi-siva
Contributor

take

@parthi-siva
Contributor

Typo in the OP

While creating the dataframe called pdf2, instead of passing d2 - d is passed

pdf2 = pd.DataFrame(d2).astype({"x": "float32", "y": "float32","z": "string"})

@ncclementi
Author

> Typo in the OP
>
> While creating the dataframe called pdf2, instead of passing d2 - d is passed
>
> pdf2 = pd.DataFrame(d2).astype({"x": "float32", "y": "float32","z": "string"})

Good catch, I updated the OP. Just to be clear, the bug is still present with the correction.

@rhshadrach
Member

Simpler reproducer

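import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "z": [0, 1, 0, 1, 0, 1],
        "x1": [np.inf, 1, 2, 3, 4, 5],  # inf is the first value of group 0
        "x2": [1, 2, 3, 4, 5, np.inf],  # inf is the last value of group 1
    }
)
print(df.groupby("z").mean())
#     x1   x2
# z
# 0  NaN  3.0
# 1  3.0  inf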

Further investigations and PRs to fix are welcome!

@b4rdos

b4rdos commented Jan 13, 2023

I came across a similar issue, but from a different starting point.

>>> import pandas as pd
>>> import numpy as np
>>> data = pd.DataFrame({"group": [1, 2, 2, 3, 3, 3], "values": [np.inf, np.inf, np.inf, np.inf, np.inf, 13.0]})
>>> data
   group  values
0      1     inf
1      2     inf
2      2     inf
3      3     inf
4      3     inf
5      3    13.0
>>> data.groupby("group").agg("sum")
       values
group
1         inf
2         NaN
3         NaN
>>> data.groupby("group").agg(np.sum)
       values
group
1         inf
2         NaN
3         NaN
>>> data.groupby("group").agg(lambda x: np.sum(x))
       values
group
1         inf
2         inf
3         inf
>>> data.groupby("group").agg(lambda x: x.sum())
       values
group
1         inf
2         inf
3         inf

Versions:

>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit           : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python           : 3.8.13.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 21.6.0
Version          : Darwin Kernel Version 21.6.0: Mon Aug 22 20:17:10 PDT 2022; root:xnu-8020.140.49~2/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_GB.UTF-8
LOCALE           : en_GB.UTF-8

pandas           : 1.5.2
numpy            : 1.24.1
pytz             : 2022.7
dateutil         : 2.8.2
setuptools       : 65.5.0
pip              : 22.3
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : 8.8.0
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : None

@parthi-siva
Contributor

In groupby.pyx, at line 1076, we end up summing np.inf and np.nan:

if not isna_entry:
    nobs[lab, j] += 1
    y = val - compensation[lab, j]  # line 1075
    t = sumx[lab, j] + y  # line 1076
    compensation[lab, j] = t - sumx[lab, j] - y
    sumx[lab, j] = t

At the 5th iteration we can see that:

sumx[lab, j]        0.0
y                   inf
sumx[lab, j] + y    inf
******************************************
sumx[lab, j]        0.0
y                   1.0
sumx[lab, j] + y    1.0
******************************************
sumx[lab, j]        0.0
y                   1.0
sumx[lab, j] + y    1.0
******************************************
sumx[lab, j]        0.0
y                   2.0
sumx[lab, j] + y    2.0
******************************************
sumx[lab, j]        inf
y                   nan
sumx[lab, j] + y    nan                  <--------------------------- Summing inf and nan resulting in nan
******************************************
sumx[lab, j]        1.0
y                   3.0
sumx[lab, j] + y    4.0
******************************************
sumx[lab, j]        1.0
y                   3.0
sumx[lab, j] + y    4.0
******************************************
sumx[lab, j]        2.0
y                   4.0
sumx[lab, j] + y    6.0
******************************************
sumx[lab, j]        nan
y                   nan
sumx[lab, j] + y    nan
******************************************
sumx[lab, j]        4.0
y                   5.0
sumx[lab, j] + y    9.0
******************************************
sumx[lab, j]        4.0
y                   5.0
sumx[lab, j] + y    9.0
******************************************
sumx[lab, j]        6.0
y                   inf
sumx[lab, j] + y    inf
******************************************

That's the reason for the discrepancy. @rhshadrach
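The trace above can be reproduced outside pandas with a minimal pure-Python sketch of the same compensated (Kahan) summation loop. The function name `kahan_sum` and the `guard` flag are hypothetical and exist only for illustration. The key step is that adding inf makes the compensation term `inf - sumx - inf = nan`, and that NaN is then subtracted from every later value. The guard shown here (resetting a NaN compensation to zero) is one possible remedy, shown as an assumption rather than as the change made in the linked PR.

```python
import math


def kahan_sum(values, guard=False):
    """Compensated summation mirroring the loop quoted from groupby.pyx."""
    total = 0.0
    compensation = 0.0
    for val in values:
        y = val - compensation
        t = total + y
        # When val is inf this evaluates inf - total - inf, which is nan.
        compensation = t - total - y
        if guard and math.isnan(compensation):
            # Hypothetical guard: drop the unusable compensation term.
            compensation = 0.0
        total = t
    return total


inf_first = [math.inf, 2.0, 4.0]  # x1 values for group 0
inf_last = [2.0, 4.0, math.inf]   # x2 values for group 1

print(kahan_sum(inf_first) / 3)              # nan
print(kahan_sum(inf_last) / 3)               # inf
print(kahan_sum(inf_first, guard=True) / 3)  # inf
```

With inf as the last value, the compensation also becomes NaN, but nothing is added afterwards, so the sum stays inf. That is why only the position of inf changes the result.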

@parthi-siva
Contributor

parthi-siva commented Apr 27, 2023

@rhshadrach I have created a PR; please take a look. If you can confirm that the approach is correct, I'll go ahead and write test cases for it.

5 participants