Inconsistent behavior when groupby pandas Categorical variables #31422

Closed
DigitalPig opened this issue Jan 29, 2020 · 13 comments · Fixed by #35280
Labels: Bug, Categorical (Categorical Data Type), Groupby, Numeric Operations (Arithmetic, Comparison, and Logical operations)
Milestone: Contributions Welcome

Comments

@DigitalPig

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
test = pd.DataFrame(
    {
        'col1': pd.Series(pd.Categorical(["a", "b", "c", "a"], categories=["a", "b", "c", "d"])),
        'col2': np.random.randint(0, 100, size=4),
        'col3': [12, 32, 23, 22]
    }
)

test['col2'] = pd.cut(test['col2'], [0,50,100])

# Group by two variables
print(test.groupby(['col1', 'col2']).sum())

# Group by a single variable
print(test.groupby(['col1']).sum())

This gives the result:

                col3
col1 col2
a    (0, 50]    22.0
     (50, 100]  12.0
b    (0, 50]     NaN
     (50, 100]  32.0
c    (0, 50]    23.0
     (50, 100]   NaN
d    (0, 50]     NaN
     (50, 100]   NaN

      col3
col1
a       34
b       32
c       23
d        0

Problem description

Clearly there is no record for category 'd'. When you group by one column, category 'd' results in a sum of 0. When you group by two columns, those non-existent rows result in NaN. The two behaviors are inconsistent.

Expected Output

Both results for the 'd' rows should be the same. I think NaN makes more sense.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.2.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.3
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 44.0.0.post20200102
Cython : None
pytest : 5.3.2
hypothesis : None
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : 3.6.1
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

Cross-reference: dask/dask#5838

@TomAugspurger
Contributor

TomAugspurger commented Jan 29, 2020

Thanks. We use 0 as the sum of empty / all NA, so 0 is the correct result here, not NaN.
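
(To make that convention concrete, here is a minimal illustration; the Series contents below are invented for demonstration and are not from the thread:)

import numpy as np
import pandas as pd

# The sum of an empty Series is 0 by convention ...
print(pd.Series([], dtype=float).sum())   # 0.0
# ... and NaN values are skipped by default (skipna=True), so an all-NA Series also sums to 0.
print(pd.Series([np.nan, np.nan]).sum())  # 0.0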

@TomAugspurger added the Categorical, Groupby, and Numeric Operations labels on Jan 29, 2020
@TomAugspurger added this to the Contributions Welcome milestone on Jan 29, 2020
@yuvrajsingh86

I think if the category does not exist, the value should be null. There is a possibility that a combination of negative and positive values can sum to 0. Keeping the value as null avoids any accidental use of these variables.

@sirajali

sirajali commented Jan 29, 2020

I understand that for the existing categories, the aggregation treats NaNs as 0s when taking the sum. But for the category that does not exist, the aggregation is essentially taking the sum of NaNs. Does it make sense to use 0 instead of NaN here?

import numpy as np

np.sum([np.nan, np.nan])

The above statement gives nan as the result, not 0.

@dsaxton
Copy link
Member

dsaxton commented Jan 29, 2020

@sirajali Here the sum is empty, which is different from summing NaN values:

import numpy as np

np.sum([])                                                                                                                     
# 0.0

@DigitalPig
Author

I have a different opinion here.

First, I think everyone agrees that there is an inconsistency in the result of sum after groupby between grouping by one column and grouping by two columns, and that these need to be aligned.

The point we disagree on is whether those values should be 0 or null. I found that the default behavior of pandas for an all-null column has changed since 0.22. I am not sure of the reason behind it, but one of my colleagues showed me the related changes in numpy.nansum.

Let's slice this issue a little bit further. First, pandas handles an empty level in a categorical column by treating it as having null attached to it (see here):

All values of the Categorical are either in categories or np.nan.

So when you do the group-by, you are essentially reducing all the np.nan values together and summing them. What should the expected return value be for a list/pandas Series that is all null?

A sum of 0 means something different than null. 0 can mean +1 + (-1), or 3 + 2 + (-5), or any of infinitely many combinations of a series that cancel each other out. 0 gives the user the impression that there is some information and that the aggregated result is 0. On the other hand, NaN means pandas does not know what the final result should be. 15/0 should give null because we don't have a clear answer to that question.

If the aggregation is count and all-null gives 0, I can understand that, because there is no record; everybody agrees on that. However, a sum of 0 after groupby gives me the impression that there is data there, but it happens to cancel itself out after aggregation. This is misleading and counterintuitive.

In addition, I want to bring up that if you look at numpy.sum, it returns null for an array that is all null:

np.sum(np.array([np.nan, np.nan, np.nan]))

gives nan as the result, for numpy version 1.17.3.

BTW: the other bug I found is that sum after groupby does not check what parameters it receives. If you take the original example and run:

test.groupby(['col1']).sum(sdfsdf=1)

there is no parameter named sdfsdf for sum. Not only does it not error out, it also shows null in the final results now:

      col3
col1
a     34.0
b     32.0
c     23.0
d      NaN

@TomAugspurger
Contributor

Let's slice this issue a little bit further. First, pandas handles an empty level in a categorical column by treating it as having null attached to it (see here):

You aren't doing a sum on categoricals here. You're grouping by a categorical, which determines the output index. Some of the groups are unobserved (length-0) and so their sum should be 0, matching pd.Series([], dtype=int).sum().

In addition, I want to bring up that if you look at numpy.sum, it returns null for an array that is all null:

Pandas' behavior is equivalent to nansum, not sum.
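
(A small comparison to make the nansum analogy concrete; the arrays below are illustrative only:)

import numpy as np

print(np.sum([np.nan, np.nan]))     # nan -- np.sum propagates NaN
print(np.nansum([np.nan, np.nan]))  # 0.0 -- np.nansum skips NaN, like pandas' default skipna=True
print(np.nansum([]))                # 0.0 -- empty input, analogous to an unobserved group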

@DigitalPig
Author

DigitalPig commented Jan 29, 2020

First, I am not arguing against the alignment between pandas.groupby.sum and pd.Series([], dtype=int).sum() or np.nansum. My argument is that np.sum makes more sense than np.nansum here.

Some of the groups are unobserved (length-0) and so their sum should be 0:

I don't think so. Suppose I am monitoring a group of patients' daily calorie intake by day. There is one day on which I did not record it for any of them. If I want to find out the total calorie intake on that day for all my patients and I get 0 calories, does that mean no one ate a thing that day?

You either do not show that day in your final results, or you report null for that day. 0 gives a different meaning here.

@DigitalPig
Author

DigitalPig commented Jan 29, 2020

If you check R, data.table basically gives the expected result:

library(data.table)

emp.data <- data.frame(
  emp_id = c(1:5),
  emp_name = c("Rick", "Rick", "Ryan", "Ryan", "Ryan"),
  salary = c(NA, NA, 611.0, 729.0, 843.25),
  stringsAsFactors = TRUE
)
emp.data['emp_name'] = factor(c("Rick", "Rick", "Ryan", "Ryan", "Ryan"), levels = c("Rick", "Ryan", "Dan"))
emp.data.dt <- data.table(emp.data)

emp.data.dt[,sum(salary), by=emp_name]

Result:

   emp_name      V1
1:     Rick      NA
2:     Ryan 2183.25

For a category whose values are all unobserved (NA), the result is null. For a category that is not in the table at all ("Dan"), it is not shown in the final result.

@jorisvandenbossche
Member

You either do not show that day in your final results, or you report null for that day. 0 gives a different meaning here.

For not showing that day, you can use the observed=True keyword in groupby; for reporting null, you can use the min_count=1 keyword in sum.
No default will be good in all use cases, which is why there are keywords to get the behaviour you want.
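
(Applied to the test frame from the original report, a sketch of how those two keywords behave; neither call is meant as a definitive recommendation:)

# Drop unobserved category combinations from the result entirely:
test.groupby(['col1', 'col2'], observed=True).sum()

# Keep all categories, but report NaN instead of 0 for groups with no valid values:
test.groupby(['col1']).sum(min_count=1)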

The point we disagree on is whether those values should be 0 or null. I found that the default behavior of pandas for an all-null column has changed since 0.22.
...
My argument is that np.sum makes more sense than np.nansum here.

@DigitalPig you have found the relevant release notes on this from two years ago. As you can see, back then we made a decision to choose 0 as the default result for the sum of an all-NaN or empty series (while introducing a keyword with which you can get NaN instead).
Unfortunately, I don't think this is something that is up for discussion again. To be clear, it's not that I want to shut down all discussion, but it was a big change (with a lot of discussion) back then, and IMO it is simply not realistic to change this again.
(if you want to see some of this discussion, look at #9422, #18678 and linked issues / mailing list discussions)

Rather, I think we should focus the discussion on other aspects:

  • The original issue you brought up, the inconsistency between grouping by a single key vs. two keys (I think we all agree this is an inconsistency we should solve): how can we tackle it?
  • Can this be better explained in the docs: why it works this way, and how to handle it when the default is not the behaviour you want?
  • Do all sum implementations correctly follow this and expose the min_count keyword? (e.g. pandas groupby sum min_count misbehaves #23889)
  • The bug you bring up about the ignored nonsense keyword is indeed a good catch. Can you open a separate issue for that?

@charlesdong1991
Member

charlesdong1991 commented Jan 31, 2020

@DigitalPig May I ask which version of pandas you used when running test.groupby(['col1']).sum(sdfsdf=1)? Somehow, I am not able to reproduce it (at least not on master).

@DigitalPig
Author

I am on 0.25.3. You can check the details section of the issue for further information.

What version of pandas are you using at the moment?

@smithto1
Member

take
