pandas groupby sum min_count misbehaves #23889
Comments
Thanks for the report. Investigation and PRs always welcome.
Hey, could you explain the problem in detail, and in what cases having NaN as the output would be helpful? We could just add an extra condition to make it work; I just can't see a good application where this could be used.
I think returning NaN is how the min_count parameter is supposed to behave according to the documentation, right?
FYI, I have the same issue. I use groupby to sum data, and I want to retain the NaNs if there is no data in a group, but get a sum if the group does contain data, even if there are some NaNs. I use this workaround now:
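(@wbijster's original snippet did not survive the page extraction; the column names below are invented, but a workaround matching this description is to push min_count through apply so Series.sum handles each group:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"family": ["a", "a", "b", "b"],
                   "income": [np.nan, np.nan, 1.0, np.nan]})

# Series.sum honours min_count, so applying it per group keeps the
# all-NaN group as NaN while still summing groups that have some data.
result = df.groupby("family")["income"].apply(lambda s: s.sum(min_count=1))
```

Group "a" (all NaN) stays NaN, while group "b" sums its one valid value.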
I find the current behavior of returning 0 when all values are NaN really strange. Why would the default be to return 0 when all the data is missing? It seems arbitrary (you could pick any number as the default, not just 0) and counterintuitive. Suppose I have data on the income of individuals over time. The individuals are grouped into families and I want the total income of each family. The current behavior assigns a value of 0 to families where every individual has missing income information. Because it's entirely possible that some families truly have no income (all individuals with income of 0), seeing a bunch of 0s in the resulting family income series wouldn't look strange at all. You'd then interpret all the zeros as real 0s and proceed with the analysis without knowing your data is corrupted. Also, to echo @ginward: what, then, is the purpose of the min_count argument?
Also, the documentation for pandas.DataFrame.sum says that the default for an all-NaN series is now to return 0, but this does not happen when you don't use a groupby:
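(The snippet this comment referred to was lost in extraction; the non-groupby behaviour being discussed can be checked directly with a small sketch like this:)

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

# With the default min_count=0, summing an all-NaN series gives 0.0 ...
total_default = s.sum()
# ... while min_count=1 requires at least one valid value, giving NaN.
total_min1 = s.sum(min_count=1)
```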
UPDATE: It seems like my original critique no longer applies at all, even for groupby. I guess the documentation should be updated?
Documentation should omit this part:
How does your example show that? The output of
I misinterpreted the documentation. Because the previous bug (which applies at least to version 0.24.2) was to return 0 regardless of min_count (for groupby sum), I interpreted the default to mean it would return 0 if all values were NaN, which I found really strange. So all is well in version 0.25.1. Should this issue be closed, then?
For anyone else who ends up here because of problems using groupby sum and min_count with decimal values: @wbijster's workaround also works if you are summing over a column containing decimals 🎉 For my purposes I actually wanted the groupby sum over decimal values to be NaN if any NaNs were present in the series, so I used:
(Using pandas 0.25.3, btw)
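(@AngelaO's snippet was likewise dropped by the scraper; one way to get NaN whenever a group contains any NaN — an illustrative sketch, not necessarily her exact code — is to require every value in the group to be valid via min_count:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b", "b"],
                   "val": [1.0, np.nan, 2.0, 3.0]})

# Setting min_count to the group size means a single NaN poisons the sum:
# group "a" has 2 rows but only 1 valid value, so its sum becomes NaN.
result = df.groupby("key")["val"].apply(lambda s: s.sum(min_count=len(s)))
```

An equivalent alternative is `s.sum(skipna=False)`, which also propagates any NaN in the group.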
The workaround described by @AngelaO does work, but it's much slower than
It appears that the min_count parameter works in pandas v1.3.4.
import numpy as np
import pandas as pd

d = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
d['col2'] = d['col2'].astype(str)
d['col1'] = np.nan
d = d.groupby(lambda x: x, axis=1).agg({'col1': 'sum'}, min_count=1)
Did this get resolved? |
Code Sample, a copy-pastable example if possible
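(The original copy-pastable sample was lost when the page was scraped; a minimal reproducer consistent with the report below — column and group names are invented — would be:)

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"],
                   "val": [np.nan, np.nan, 1.0]})

# With min_count=1 the all-NaN group "a" should come back as NaN;
# on pandas 0.23.4 the groupby path reportedly returned 0.0 instead.
# Recent pandas versions return NaN as documented.
result = df.groupby("key")["val"].sum(min_count=1)
```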
Problem description
My hope is that, by using min_count=1, pandas will return NaN when all columns being summed up are NaN. However, now it is returning 0 instead of NaN. Any idea why?
Actual output:
Expected Output
Output of pd.show_versions():

INSTALLED VERSIONS
commit: None
python: 2.7.15.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.23.4
pytest: 3.8.2
pip: 10.0.1
setuptools: 39.1.0
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 5.8.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None