pandas groupby sum min_count misbehaves #23889

Open
ginward opened this issue Nov 24, 2018 · 16 comments
Labels
Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reduction Operations sum, mean, min, max, etc.

Comments

@ginward

ginward commented Nov 24, 2018

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

d = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
d['col2'] = d['col2'].astype(str)
d['col1'] = np.nan
d = d.groupby(lambda x: x, axis=1).sum(min_count=1)

Problem description

My expectation is that, with min_count=1, pandas will return NaN when all values being summed are NaN. Instead, it returns 0. Any idea why?

Actual output:

 col1  col2
0   0.0   3.0
1   0.0   4.0


Expected Output

   col1  col2
0   NaN   3.0
1   NaN   4.0

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None
python: 2.7.15.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.8.2
pip: 10.0.1
setuptools: 39.1.0
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 5.8.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Nov 25, 2018

Thanks for the report. Investigation and PRs always welcome

@WillAyd WillAyd added Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Nov 25, 2018
@kaushikb11

Hey, could you explain the problem in detail, and describe the cases where having NaN as output would be helpful? We could just add an extra condition to make it work, but I can't think of a good application where this would be used.

@ginward
Author

ginward commented Feb 5, 2019

I think returning NaN is how the min_count parameter is supposed to behave according to the documentation, right?

min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
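
A minimal illustration of the documented Series-level behavior (example data is hypothetical):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

print(s.sum())             # 0.0 -- default min_count=0: an all-NA series sums to 0
print(s.sum(min_count=1))  # nan -- fewer than 1 non-NA value is present
print(pd.Series([], dtype=float).sum(min_count=1))  # nan -- empty series too
```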

@wbijster

wbijster commented May 15, 2019

FYI, I have the same issue. I use groupby to sum data and I want to retain the NaNs if there is no data in a group but have a sum if the group does contain data, even if there are some NaNs. I use this workaround now:

def sumgroup(s):
    return s.sum(min_count=1)

dftest['new'] = dftest.groupby(level=['one', 'two'])['data'].apply(sumgroup) # pd.__version__ == 0.24.2

@jtorcasso
Contributor

I find the current behavior of returning 0 when all values are NaN really strange. Why would the default be to return 0 when all the data is missing? It seems arbitrary (you could pick any number as the default, not just 0) and counterintuitive.

Suppose I have data containing the income of individuals over time. The individuals are grouped into families, and I want the total income of each family. The current behavior assigns a value of 0 to families where every individual has missing income information. Because it's quite possible that some families genuinely have no income (all individuals with income of 0), seeing a bunch of 0s in the resulting family income series wouldn't look strange. You'd then interpret all the zeros as real 0s and proceed with the analysis without knowing your data is corrupted.

Also, to echo @ginward, what, then, is the purpose of the min_count argument?
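
On pandas versions where groupby honors min_count, the family-income pitfall described above can be sketched like this (the data and column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: family 'x' has only missing incomes, family 'y' has real data.
income = pd.DataFrame({
    'family': ['x', 'x', 'y'],
    'income': [np.nan, np.nan, 50000.0],
})

totals = income.groupby('family')['income'].sum()
# totals['x'] is 0.0 -- indistinguishable from a genuine zero income

safe = income.groupby('family')['income'].sum(min_count=1)
# safe['x'] is NaN -- flags the all-missing group
```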

@jtorcasso
Contributor

Also, the documentation for pandas.DataFrame.sum says that an all-NaN series now sums to 0 by default, but this does not happen when you don't use a groupby:

>>> import pandas as pd
>>> pd.__version__
'0.25.1'
>>> df = pd.DataFrame(columns=['a'])
>>> df
Empty DataFrame
Columns: [a]
Index: []
>>> df.sum()
a    0.0
dtype: float64
>>> df.sum(min_count=1)
a   NaN
dtype: float64

@jtorcasso
Contributor

UPDATE: It seems like my original critique no longer applies at all, even for groupby. I guess the documentation should be updated?

>>> import numpy as np
>>> import pandas as pd
>>> pd.__version__
'0.25.1'
>>> df = pd.DataFrame([[1, np.nan],[1,np.nan], [2,4]], columns=['a', 'b'])
>>> df.groupby(by='a')['b'].sum()
a
1    0.0
2    4.0
Name: b, dtype: float64
>>> df.groupby(by='a')['b'].sum(min_count=1)
a
1    NaN
2    4.0
Name: b, dtype: float64

@jtorcasso
Contributor

Documentation should omit this part:
"""
New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
"""

@TomAugspurger
Contributor

Also, the documentation for pandas.dataframe.sum says that the default for all NaN series is to give 0 now, but this does not happen when you don't use a groupby:

How does your example show that? The output of Series([]).sum() and Series([np.nan]).sum() are 0, as documented.

@jtorcasso
Contributor

I misinterpreted the documentation. Because the previous bug (which at least applies to version '0.24.2') was for groupby sum to return 0 regardless of min_count, I interpreted "default" to mean it would return 0 if all values were NaN, which I found really strange.

So I think all is well then in version '0.25.1'. Should this issue be closed then?

@AngelaO

AngelaO commented Jan 22, 2020


For anyone else who ends up here because of problems using groupby sum and min_count with decimal values @wbijster 's workaround also works if you are summing over a column containing decimals 🎉

For my purposes I actually wanted the groupby sum over decimal values to be NaN if any NaNs were present in the series so I used

def sumgroup(s):
    return s.sum(skipna=False)

(Using pandas 0.25.3 btw)

@nborrmann

The workaround described by @AngelaO does work, but it's much slower than .groupby().sum(). I would really appreciate it if sum() on GroupBy objects honored the min_count parameter.
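
For pandas versions where groupby sum ignores min_count, one faster vectorized alternative (a sketch, assuming numeric data; the frame and column names are hypothetical) is to compute the grouped sum and the grouped non-NA count separately, then mask groups with too few valid values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'data': [np.nan, np.nan, 1.0]})
g = df.groupby('key')['data']

summed = g.sum()                    # all-NaN group 'a' collapses to 0.0
counts = g.count()                  # number of non-NA values per group
result = summed.where(counts >= 1)  # NaN wherever the count falls short
```

This stays in cythonized groupby code paths rather than calling a Python function once per group.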

@mroeschke mroeschke added Numeric Operations Arithmetic, Comparison, and Logical operations and removed Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels May 13, 2020
@mroeschke mroeschke removed the Numeric Operations Arithmetic, Comparison, and Logical operations label Jun 23, 2021
@ikramersh

It appears that the min_count parameter works for pandas v1.3.4

@sv1990

sv1990 commented Jan 10, 2022

The min_count parameter still doesn't work when used as

import numpy as np
import pandas as pd

d = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
d['col2'] = d['col2'].astype(str)
d['col1'] = np.nan
d = d.groupby(lambda x: x, axis=1).agg({'col1': 'sum'}, min_count=1)

@jbrockmendel jbrockmendel added the Reduction Operations sum, mean, min, max, etc. label Mar 28, 2023
@rhshadrach
Member

The OP issue seems to be resolved on main, but using agg as in the comment above still returns 0.0. I believe this is because min_count is not being passed through. Likely related to #47701, #44813, #39169
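
Until agg forwards min_count, one workaround (slower, since it runs Python-level per group, but correct; the data here is hypothetical) is to spell the reduction out as a lambda:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [np.nan, np.nan, 2.0]})

# min_count passed as a keyword next to a string aggregation is not applied,
# so bind it inside a lambda per column instead:
result = df.groupby('key').agg(total=('val', lambda s: s.sum(min_count=1)))
```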

@ghost

ghost commented Dec 7, 2023

Did this get resolved?
