pandas groupby sum min_count misbehaves #23889

Open
ginward opened this issue Nov 24, 2018 · 16 comments
Labels
Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reduction Operations sum, mean, min, max, etc.

Comments

@ginward

ginward commented Nov 24, 2018

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

d = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
d['col2'] = d['col2'].astype(str)
d['col1'] = np.nan
d = d.groupby(lambda x: x, axis=1).sum(min_count=1)

Problem description

My expectation is that, with min_count=1, pandas will return NaN when all values being summed are NaN. Instead, it returns 0. Any idea why?

Actual output:

 col1  col2
0   0.0   3.0
1   0.0   4.0


Expected Output

   col1  col2
0   NaN   3.0
1   NaN   4.0

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None
python: 2.7.15.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.8.2
pip: 10.0.1
setuptools: 39.1.0
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 5.8.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Nov 25, 2018

Thanks for the report. Investigation and PRs always welcome

@WillAyd WillAyd added Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Nov 25, 2018
@kaushikb11

Hey, could you explain the problem in detail, and describe the cases where having NaN as output would be helpful? We could just add an extra condition to make it work, but I can't think of a good application where this would be used.

@ginward
Author

ginward commented Feb 5, 2019

I think returning NaN is how the min_count parameter is supposed to behave according to the documentation, right?

min_count : int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
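
A minimal illustration of the documented Series-level behavior (example data is hypothetical):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

print(s.sum())             # 0.0 -- default min_count=0: an all-NA series sums to 0
print(s.sum(min_count=1))  # nan -- fewer than 1 non-NA value is present
print(pd.Series([], dtype=float).sum(min_count=1))  # nan -- empty series too
```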

@wbijster

wbijster commented May 15, 2019

FYI, I have the same issue. I use groupby to sum data and I want to retain the NaNs if there is no data in a group but have a sum if the group does contain data, even if there are some NaNs. I use this workaround now:

def sumgroup(s):
    return s.sum(min_count=1)

dftest['new'] = dftest.groupby(level=['one', 'two'])['data'].apply(sumgroup) # pd.__version__ == 0.24.2

@jtorcasso
Contributor

I find the current behavior of returning 0 when all values are NaN really strange. Why would the default be to return 0 when all the data is missing? It seems arbitrary (you could pick any number as the default, not just 0) and counterintuitive.

Suppose I have data containing the income of individuals over time. The individuals are grouped into families, and I want the total income of each family. The current behavior assigns a value of 0 to families where every individual has missing income information. Because it's quite possible that some families genuinely have no income (all individuals with income of 0), seeing a bunch of 0s in the resulting family income series wouldn't look strange. You'd then interpret all the zeros as real 0s and proceed with the analysis without knowing your data is corrupted.

Also, to echo @ginward, what, then, is the purpose of the min_count argument?
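
On pandas versions where groupby honors min_count, the family-income pitfall described above can be sketched like this (the data and column names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data: family 'x' has only missing incomes, family 'y' has real data.
income = pd.DataFrame({
    'family': ['x', 'x', 'y'],
    'income': [np.nan, np.nan, 50000.0],
})

totals = income.groupby('family')['income'].sum()
# totals['x'] is 0.0 -- indistinguishable from a genuine zero income

safe = income.groupby('family')['income'].sum(min_count=1)
# safe['x'] is NaN -- flags the all-missing group
```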

@jtorcasso
Contributor

Also, the documentation for pandas.DataFrame.sum says that an all-NaN series now sums to 0 by default, but this does not happen when you don't use a groupby:

>>> import pandas as pd
>>> pd.__version__
'0.25.1'
>>> df = pd.DataFrame(columns=['a'])
>>> df
Empty DataFrame
Columns: [a]
Index: []
>>> df.sum()
a    0.0
dtype: float64
>>> df.sum(min_count=1)
a   NaN
dtype: float64

@jtorcasso
Contributor

UPDATE: It seems like my original critique no longer applies at all, even for groupby. I guess the documentation should be updated?

>>> import numpy as np
>>> import pandas as pd
>>> pd.__version__
'0.25.1'
>>> df = pd.DataFrame([[1, np.nan],[1,np.nan], [2,4]], columns=['a', 'b'])
>>> df.groupby(by='a')['b'].sum()
a
1    0.0
2    4.0
Name: b, dtype: float64
>>> df.groupby(by='a')['b'].sum(min_count=1)
a
1    NaN
2    4.0
Name: b, dtype: float64

@jtorcasso
Contributor

Documentation should omit this part:
"""
New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
"""

@TomAugspurger
Contributor

Also, the documentation for pandas.dataframe.sum says that the default for all NaN series is to give 0 now, but this does not happen when you don't use a groupby:

How does your example show that? The output of Series([]).sum() and Series([np.nan]).sum() are 0, as documented.

@jtorcasso
Contributor

I misinterpreted the documentation. Because the previous bug (which at least applies to version '0.24.2') was for groupby sum to return 0 regardless of min_count, I interpreted "default" to mean it would return 0 if all values were NaN, which I found really strange.

So I think all is well then in version '0.25.1'. Should this issue be closed then?

@AngelaO

AngelaO commented Jan 22, 2020


For anyone else who ends up here because of problems using groupby sum and min_count with decimal values @wbijster 's workaround also works if you are summing over a column containing decimals 🎉

For my purposes I actually wanted the groupby sum over decimal values to be NaN if any NaNs were present in the series so I used

def sumgroup(s):
    return s.sum(skipna=False)

(Using pandas 0.25.3 btw)

@nborrmann

The workaround described by @AngelaO does work, but it's much slower than .groupby().sum(). I would really appreciate it if sum() on GroupBy objects honored the min_count parameter.
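
For pandas versions where groupby sum ignores min_count, one faster vectorized alternative (a sketch, assuming numeric data; the frame and column names are hypothetical) is to compute the grouped sum and the grouped non-NA count separately, then mask groups with too few valid values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'data': [np.nan, np.nan, 1.0]})
g = df.groupby('key')['data']

summed = g.sum()                    # all-NaN group 'a' collapses to 0.0
counts = g.count()                  # number of non-NA values per group
result = summed.where(counts >= 1)  # NaN wherever the count falls short
```

This stays in cythonized groupby code paths rather than calling a Python function once per group.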

@mroeschke mroeschke added Numeric Operations Arithmetic, Comparison, and Logical operations and removed Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels May 13, 2020
@mroeschke mroeschke removed the Numeric Operations Arithmetic, Comparison, and Logical operations label Jun 23, 2021
@ikramersh

It appears that the min_count parameter works for pandas v1.3.4

@sv1990

sv1990 commented Jan 10, 2022

The min_count parameter still doesn't work when used as

import numpy as np
import pandas as pd

d = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
d['col2'] = d['col2'].astype(str)
d['col1'] = np.nan
d = d.groupby(lambda x: x, axis=1).agg({'col1': 'sum'}, min_count=1)

@jbrockmendel jbrockmendel added the Reduction Operations sum, mean, min, max, etc. label Mar 28, 2023
@rhshadrach
Member

The OP issue seems to be resolved on main, but using agg as in the comment above still returns 0.0. I believe this is because min_count is not being passed through. Likely related to #47701, #44813, #39169
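
Until agg forwards min_count, one workaround (slower, since it runs Python-level per group, but correct; the data here is hypothetical) is to spell the reduction out as a lambda:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [np.nan, np.nan, 2.0]})

# min_count passed as a keyword next to a string aggregation is not applied,
# so bind it inside a lambda per column instead:
result = df.groupby('key').agg(total=('val', lambda s: s.sum(min_count=1)))
```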

@ghost

ghost commented Dec 7, 2023

Did this get resolved?
