BUG: Groupby mean differs depending on where `np.inf` value was introduced #50367

ncclementi · 2022-12-20T22:27:56Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

EDIT: I updated the reproducible example because the initial one had some errors. The comments below that report being unable to reproduce refer to that initial version.

Reproducible Example (latest)

import numpy as np
import pandas as pd

# When `inf` is at the end of the column
d = {
    "x": [1, 2, 3, 4, 5, np.inf],  # notice inf at the end
    "y": [1, 1, 1, 1, 1, 1],
    "z": ["alice", "bob", "alice", "bob", "alice", "bob"],
}

pdf = pd.DataFrame(d).astype({"x": "float32", "y": "float32", "z": "string"})

pdf.groupby("z").x.mean()
"""returns
z
alice    3.0
bob      inf
Name: x, dtype: float32
"""

# When `inf` is at the beginning of the column
d2 = {
    "x": [np.inf, 1, 2, 3, 4, 5],  # inf at the beginning
    "y": [1, 1, 1, 1, 1, 1],
    "z": ["alice", "bob", "alice", "bob", "alice", "bob"],
}

pdf2 = pd.DataFrame(d2).astype({"x": "float32", "y": "float32", "z": "string"})

pdf2.groupby("z").x.mean()
"""returns
z
alice    NaN
bob      3.0
Name: x, dtype: float32
"""

Issue Description

Regardless of where inf values appear in the DataFrame, the groupby aggregation should be consistent.

Expected Behavior

It is not clear to me whether it should return NaN or Inf, but it should be consistent.
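As a point of comparison, here is a minimal sketch in plain Python (no pandas) of what an uncompensated left-to-right mean gives for the group that contains inf. The helper name `naive_mean` is made up for illustration. Under ordinary IEEE 754 addition, the position of inf in the group does not change the result, which suggests the position dependence comes from how pandas accumulates the sum rather than from the data itself.

```python
import itertools
import math


def naive_mean(values):
    """Plain left-to-right sum divided by the count (no compensation term)."""
    return sum(values) / len(values)


# The group that contains inf in the examples above.
group = [math.inf, 2.0, 4.0]

# Collect the mean for every possible ordering of the group.
results = {naive_mean(p) for p in itertools.permutations(group)}
print(results)  # {inf}
```

Every ordering yields inf here, so inf looks like the natural candidate for the consistent answer, although that is a judgment call for the maintainers.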

Installed Versions

pandas=1.5.2
numpy=1.23.5 / tried 1.24.0 too


rhshadrach · 2022-12-20T23:00:34Z

Thanks for the report; I'm not able to reproduce on main - I get NaN in both cases. What version of NumPy are you using?

ncclementi · 2022-12-21T15:01:08Z

@rhshadrach I have numpy=1.23.5

MarcoGorelli · 2022-12-21T15:07:14Z

I can't reproduce either, neither on pandas 1.5.2 nor on main:

In [6]: pdf
Out[6]:
     x    y      z
0  inf  1.0  alice
1  1.0  1.0    bob
2  2.0  1.0  alice
3  3.0  1.0    bob
4  4.0  1.0  alice
5  5.0  1.0    bob

In [7]: pdf.groupby("z").x.mean()
Out[7]:
z
alice    NaN
bob      3.0
Name: x, dtype: float32

ncclementi · 2022-12-21T15:21:22Z

OK, I had a typo in my initial example. The problem depends on whether the inf is at index 0 or at the last index. Here is the new reproducer:

d = {
    "x": [1, 2, 3, 4, 5, np.inf],  # notice inf at the end
    "y": [1, 1, 1, 1, 1, 1],
    "z": ["alice", "bob", "alice", "bob", "alice", "bob"],
}

pdf = pd.DataFrame(d).astype({"x": "float32", "y": "float32", "z": "string"})

pdf.groupby("z").x.mean()
"""returns
z
alice    3.0
bob      inf
Name: x, dtype: float32
"""

d2 = {
    "x": [np.inf, 1, 2, 3, 4, 5],  # inf at the beginning
    "y": [1, 1, 1, 1, 1, 1],
    "z": ["alice", "bob", "alice", "bob", "alice", "bob"],
}

pdf2 = pd.DataFrame(d2).astype({"x": "float32", "y": "float32", "z": "string"})

pdf2.groupby("z").x.mean()
"""returns
z
alice    NaN
bob      3.0
Name: x, dtype: float32
"""

I can reproduce with both numpy 1.23.5 and 1.24.0

MarcoGorelli · 2022-12-21T15:22:25Z

ok this reproduces, thanks

ncclementi · 2022-12-21T15:35:28Z

Should I update the main comment, or leave it as is?

rhshadrach · 2022-12-21T15:38:30Z

@ncclementi - I'd recommend correcting the OP.

parthi-siva · 2022-12-31T03:53:07Z

take

parthi-siva · 2022-12-31T04:04:40Z

Typo in the OP

When creating the DataFrame called pdf2, d is passed instead of d2. It should be:

pdf2 = pd.DataFrame(d2).astype({"x": "float32", "y": "float32", "z": "string"})

ncclementi · 2022-12-31T16:41:29Z

> Typo in the OP
>
> When creating the DataFrame called pdf2, d is passed instead of d2. It should be:
>
> pdf2 = pd.DataFrame(d2).astype({"x": "float32", "y": "float32", "z": "string"})

Good catch, I updated the OP. Just to be clear, the bug is still present with the correction.

rhshadrach · 2023-01-03T22:11:55Z

Simpler reproducer

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "z": [0, 1, 0, 1, 0, 1],
        "x1": [np.inf, 1, 2, 3, 4, 5],
        "x2": [1, 2, 3, 4, 5, np.inf],
    }
)
print(df.groupby("z").mean())
#     x1   x2
# z
# 0  NaN  3.0
# 1  3.0  inf

Further investigations and PRs to fix are welcome!

b4rdos · 2023-01-13T09:03:39Z

I came across a similar issue but from a different starting point.

>>> import pandas as pd
>>> import numpy as np
>>> data = pd.DataFrame({"group": [1, 2, 2, 3, 3, 3], "values": [np.inf, np.inf, np.inf, np.inf, np.inf, 13.0]})
>>> data
   group  values
0      1     inf
1      2     inf
2      2     inf
3      3     inf
4      3     inf
5      3    13.0
>>> data.groupby("group").agg("sum")
       values
group
1         inf
2         NaN
3         NaN
>>> data.groupby("group").agg(np.sum)
       values
group
1         inf
2         NaN
3         NaN
>>> data.groupby("group").agg(lambda x: np.sum(x))
       values
group
1         inf
2         inf
3         inf
>>> data.groupby("group").agg(lambda x: x.sum())
       values
group
1         inf
2         inf
3         inf
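The pattern above (a lone inf survives, but inf followed by any other value becomes NaN) is what one would expect if the built-in "sum" path uses compensated (Kahan) summation while the lambda path uses a plain sum. That is an assumption about the internals, not something shown in this comment. The sketch below is plain Python with a made-up helper name, `kahan_sum`, and only illustrates the arithmetic.

```python
import math


def kahan_sum(values):
    """Compensated summation; illustrative only, not the pandas kernel."""
    total = 0.0
    compensation = 0.0
    for val in values:
        y = val - compensation
        t = total + y
        # Once val is inf, this becomes inf - total - inf, which is nan.
        compensation = t - total - y
        total = t
    return total


groups = {
    1: [math.inf],
    2: [math.inf, math.inf],
    3: [math.inf, math.inf, 13.0],
}

for key, vals in groups.items():
    # Compare the compensated sum with Python's plain built-in sum.
    print(key, kahan_sum(vals), sum(vals))
```

Group 1 stays inf because no further value is added after the compensation term turns into NaN, while groups 2 and 3 pick up that NaN on the next addition.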

Versions:

>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit           : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python           : 3.8.13.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 21.6.0
Version          : Darwin Kernel Version 21.6.0: Mon Aug 22 20:17:10 PDT 2022; root:xnu-8020.140.49~2/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_GB.UTF-8
LOCALE           : en_GB.UTF-8

pandas           : 1.5.2
numpy            : 1.24.1
pytz             : 2022.7
dateutil         : 2.8.2
setuptools       : 65.5.0
pip              : 22.3
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : 8.8.0
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : None

parthi-siva · 2023-04-27T09:59:27Z

In groupby.pyx, at line 1076, we end up summing np.inf and np.nan:

            if not isna_entry:
                nobs[lab, j] += 1
                y = val - compensation[lab, j]                # line 1075
                t = sumx[lab, j] + y                          # line 1076
                compensation[lab, j] = t - sumx[lab, j] - y
                sumx[lab, j] = t
At the 5th iteration we can see that:

sumx[lab, j]        0.0
y                   inf
sumx[lab, j] + y    inf
******************************************
sumx[lab, j]        0.0
y                   1.0
sumx[lab, j] + y    1.0
******************************************
sumx[lab, j]        0.0
y                   1.0
sumx[lab, j] + y    1.0
******************************************
sumx[lab, j]        0.0
y                   2.0
sumx[lab, j] + y    2.0
******************************************
sumx[lab, j]        inf
y                   nan
sumx[lab, j] + y    nan                  <--------------------------- Summing inf and nan resulting in nan
******************************************
sumx[lab, j]        1.0
y                   3.0
sumx[lab, j] + y    4.0
******************************************
sumx[lab, j]        1.0
y                   3.0
sumx[lab, j] + y    4.0
******************************************
sumx[lab, j]        2.0
y                   4.0
sumx[lab, j] + y    6.0
******************************************
sumx[lab, j]        nan
y                   nan
sumx[lab, j] + y    nan
******************************************
sumx[lab, j]        4.0
y                   5.0
sumx[lab, j] + y    9.0
******************************************
sumx[lab, j]        4.0
y                   5.0
sumx[lab, j] + y    9.0
******************************************
sumx[lab, j]        6.0
y                   inf
sumx[lab, j] + y    inf
******************************************

That's the reason for the discrepancy. @rhshadrach
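The trace above can be reproduced without pandas. Below is a minimal pure-Python sketch of the same Kahan loop, applied to one group at a time. The function name `kahan_mean` and the `guard` flag are hypothetical and exist only for illustration; the guard shows one possible mitigation and is not necessarily the change made in the PR.

```python
import math


def kahan_mean(values, guard=False):
    """Mean via Kahan (compensated) summation, mirroring the quoted loop.

    guard=True resets a NaN compensation term to 0.0. This is a
    hypothetical mitigation for illustration only.
    """
    total = 0.0
    compensation = 0.0
    for val in values:
        y = val - compensation
        t = total + y
        # With val = inf this is inf - total - inf, which evaluates to nan.
        compensation = t - total - y
        if guard and math.isnan(compensation):
            compensation = 0.0
        total = t
    return total / len(values)


inf_first = [math.inf, 2.0, 4.0]  # group "alice" when inf is at index 0
inf_last = [2.0, 4.0, math.inf]   # group "bob" when inf is at the last index

print(kahan_mean(inf_first))              # nan
print(kahan_mean(inf_last))               # inf
print(kahan_mean(inf_first, guard=True))  # inf
```

When inf comes first, the NaN compensation term poisons every later addition. When inf comes last, the compensation term also becomes NaN, but nothing is added afterwards, so the sum stays inf.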

parthi-siva · 2023-04-27T15:37:21Z

@rhshadrach I have created a PR. Please take a look. If you can confirm that it is correct, I'll go ahead and write test cases for it.

ncclementi added the Bug and Needs Triage labels Dec 20, 2022

ncclementi mentioned this issue Dec 20, 2022

Groupby mean inconsistent depending on where inf is inserted in data dask/dask#9778

Open

rhshadrach added the Groupby, Missing-data, and Needs Info labels Dec 20, 2022

ncclementi changed the title ~~BUG: Groupby mean differs depending on how np.inf value was introduced~~ BUG: Groupby mean differs depending on where np.inf value was introduced Dec 21, 2022

rhshadrach removed the Needs Info and Needs Triage labels Dec 21, 2022

github-actions bot assigned parthi-siva Dec 31, 2022

parthi-siva mentioned this issue Apr 27, 2023

BUG: Fix np.inf + np.nan sum issue on groupby mean #52964

Merged

3 tasks

rhshadrach added this to the 2.1 milestone May 17, 2023

mroeschke closed this as completed in #52964 May 23, 2023

rhshadrach mentioned this issue Jun 12, 2023

BUG: inconsistent treatment of numpy.Inf between groupby.sum() and groupby.apply(lambda: _grp: _grp.sum()) #53606

Closed

3 tasks
