Inconsistent behavior when groupby pandas Categorical variables #31422
Comments
Thanks. We use 0 as the sum of empty / all-NA, so 0 is the correct result here, not NaN.
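For reference, a quick sketch of that default behavior (the `min_count` keyword was added in pandas 0.22 to opt back into NaN):

```python
import numpy as np
import pandas as pd

print(pd.Series([], dtype=float).sum())              # 0.0: the empty sum defaults to 0
print(pd.Series([np.nan, np.nan]).sum())             # 0.0: NaNs are skipped, leaving an empty sum
print(pd.Series([np.nan, np.nan]).sum(min_count=1))  # nan: require at least one valid value
```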
I think if the category does not exist, the value should be null. There is a possibility that a combination of negative and positive values sums to 0. Keeping the value null avoids any accidental use of these variables.
I understand that for the existing categories, the aggregation treats NaNs as 0s when taking the sum. But for the category that does not exist, the aggregation is essentially taking the sum of NaNs. Does it make sense to use 0 instead of NaN here?

```python
import numpy as np

np.sum([np.nan, np.nan])
# nan
```

The above statement gives `nan`.
@sirajali Here the sum is empty, which is different from summing NaNs:

```python
import numpy as np

np.sum([])
# 0.0
```
I have a different opinion here. First, I think everyone agrees that there is an inconsistency in groupby-and-sum between grouping by one column and by two columns, and that those need to be aligned. The point we disagree on is whether the missing groups should be 0 or null. I found that the default pandas behavior for an all-null column has changed since 0.22. I am not sure of the reason behind that, but one of my colleagues showed me the related changes around `numpy.nansum`. Let's slice this issue a little further.

When you do a group-by, you essentially reduce all the values in each group to a single number. 0 as the sum means something different from null: 0 can be +1 + (-1), or 3 + 2 + (-5), or any of infinitely many combinations of a series that cancel each other out. 0 gives the user the impression that there is some information and the aggregated result happens to be 0. NaN, on the other hand, means pandas does not know what the final result should be, in the same way 15/0 should give null because we don't have a clear answer to that question.

In addition, I want to bring up that `np.sum(np.array([np.nan, np.nan, np.nan]))` gives `nan`, not 0.

BTW: the other bug I found is that the sum after groupby does not validate the parameters it receives. If you run the original example and then `test.groupby(['col1']).sum(sdfsdf=1)`, there is no parameter named `sdfsdf`, yet no error is raised.
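For concreteness, the numpy contrast being referenced (plain `sum` propagates NaN, while `nansum` treats an all-NaN input like an empty one):

```python
import numpy as np

print(np.sum(np.array([np.nan, np.nan, np.nan])))     # nan: plain sum propagates NaN
print(np.nansum(np.array([np.nan, np.nan, np.nan])))  # 0.0: nansum skips NaNs; all-NaN reduces to 0
print(np.nansum(np.array([])))                        # 0.0: the empty sum is 0
```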
You aren't doing a sum on categoricals here. You're grouping by a categorical, which determines the output index. Some of the groups are unobserved (length 0), and so their sum should be 0, matching `np.sum([])` above.

Pandas' behavior is equivalent to `nansum`, not `sum`.
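To illustrate the length-0 group point on a toy frame (column names are illustrative; the 0 result reflects the 0.25.x defaults):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key': pd.Categorical(['a', 'a'], categories=['a', 'b']),
    'val': [1.0, 2.0],
})

# Category 'b' has no rows, so its aggregate is the sum of an empty group
print(df.groupby('key').sum())  # the 'b' row shows 0.0 on 0.25.x
print(np.sum([]))               # 0.0
```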
First, I am not arguing about the alignment between `sum` and `nansum`. As for the claim that an unobserved group's sum should be 0: I don't think so. Suppose I am monitoring a group of patients' daily calorie intake. There is one day on which I did not record intake for any of them. If I then ask for the total calorie intake of all my patients on that day and get 0 calories, does that mean nobody ate anything that day? You either do not show that day in your final results, or you report null for it. 0 carries a different meaning here.
If you check R data.table:

```r
library(data.table)

emp.data <- data.frame(
  emp_id = c(1:5),
  emp_name = c("Rick", "Rick", "Ryan", "Ryan", "Ryan"),
  salary = c(NA, NA, 611.0, 729.0, 843.25),
  stringsAsFactors = TRUE
)
emp.data['emp_name'] = factor(c("Rick", "Rick", "Ryan", "Ryan", "Ryan"),
                              levels = c("Rick", "Ryan", "Dan"))
emp.data.dt <- data.table(emp.data)
emp.data.dt[, sum(salary), by = emp_name]
```

Result:

```
   emp_name      V1
1:     Rick      NA
2:     Ryan 2183.25
```

For a category whose observed values are all NA ("Rick"), the result is null. A category that does not appear in the table at all ("Dan") is not shown in the final result.
For not showing that day, you can use the `observed=True` keyword of `groupby`.
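A sketch of that keyword on a toy frame (column names are illustrative; `observed` has been available on `groupby` since pandas 0.23):

```python
import pandas as pd

df = pd.DataFrame({
    'day': pd.Categorical(['mon', 'tue'], categories=['mon', 'tue', 'wed']),
    'calories': [2100.0, 1900.0],
})

print(df.groupby('day').sum())                 # includes a row for the unobserved 'wed'
print(df.groupby('day', observed=True).sum())  # drops unobserved categories from the result
```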
@DigitalPig you have found the relevant release notes on this from two years ago. As you can see, back then we made a decision to choose 0 as the default result for the sum of an all-NaN or empty series (while introducing a keyword, `min_count`, with which you can get NaN instead). Rather, I think we should focus the discussion on other aspects:
@DigitalPig May I ask which version of pandas you used when running your example?
I am on 0.25.3 (see the `pd.show_versions()` output below). What version of pandas do you use at the moment?
take |
…grouping by multiple Categoricals. Updates to tests to reflect this expected output
Code Sample, a copy-pastable example if possible
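A minimal, copy-pastable sketch that reproduces the reported behavior on pandas 0.25.x. The frame name `test` follows the call quoted in the comments above, but the column names and values are assumptions:

```python
import pandas as pd

test = pd.DataFrame({
    'col1': pd.Categorical(['a', 'a', 'b', 'c'],
                           categories=['a', 'b', 'c', 'd']),
    'col2': pd.Categorical(['x', 'y', 'x', 'y'],
                           categories=['x', 'y']),
    'value': [1.0, 2.0, 3.0, 4.0],
})

# One grouping column: the unobserved category 'd' gets a sum of 0
print(test.groupby(['col1']).sum())

# Two grouping columns: unobserved (col1, col2) combinations get NaN
print(test.groupby(['col1', 'col2']).sum())
```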
This gives the result noted in the comments above: 0 for the unobserved category 'd' with one grouping column, NaN with two.
Problem description
Clearly there is no record for category 'd'. When you group by one column, category 'd' results in a sum of 0. When you group by two columns, the non-existent rows result in NaN. These two behaviors are inconsistent.
Expected Output
Both results for the `d` rows should be the same. I think NaN makes more sense.

Output of `pd.show_versions()`
INSTALLED VERSIONS
commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.2.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.3
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 44.0.0.post20200102
Cython : None
pytest : 5.3.2
hypothesis : None
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : 3.6.1
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
Cross ref to dask/dask#5838