Series groupby does not include zero or nan counts for all categorical labels, unlike DataFrame groupby #17605
Comments
The desired output is the Series one. You definitely don't want the Cartesian product of your groupby cols. The DataFrame one is wrong, or at least I can't see why you would want that as the default behaviour. You will explode memory.
And to be more constructive, I would imagine that in your use case you want some sort of reindex first, using something like this: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.MultiIndex.from_product.html
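A minimal sketch of that reindex approach (the frame and column names here are illustrative, not from the issue):

```python
import pandas as pd

# Hypothetical frame: two categorical groupers where only some
# label pairs are actually observed.
df = pd.DataFrame({
    "cat_a": pd.Categorical(["x", "x", "y"], categories=["x", "y", "z"]),
    "cat_b": pd.Categorical(["p", "q", "p"], categories=["p", "q"]),
    "value": [1, 2, 3],
})

counts = df.groupby(["cat_a", "cat_b"])["value"].count()

# Reindex onto the full Cartesian product of the category labels,
# filling the unobserved pairs with 0.
full = pd.MultiIndex.from_product(
    [df["cat_a"].cat.categories, df["cat_b"].cat.categories],
    names=["cat_a", "cat_b"],
)
counts = counts.reindex(full, fill_value=0)
```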
This issue was created before the introduction of the `observed` keyword. On first sight this is true for most aggregations, but:

```python
pdf = pd.DataFrame({
    "category_1": pd.Categorical(list("AABBCC"), categories=list("ABCDEF")),
    "category_2": pd.Categorical(list("ABC") * 2, categories=list("ABCDEF")),
    "value": [0.1] * 6
})
pdf.groupby(["category_1", "category_2"])["value"].sum()    # All categories present
pdf.groupby(["category_1", "category_2"])["value"].mean()   # All categories present
pdf.groupby(["category_1", "category_2"])["value"].min()    # All categories present
pdf.groupby(["category_1", "category_2"])["value"].count()  # Only observed present!
```

So I do think this is a bug that's still present.
A severe warning should be issued in the what's new when this gets pushed, as it will cause memory blow-ups for anyone not using the non-default observed=True on sparsely observed high-arity categoricals.
@cottrell would welcome a memory benchmark in asv to see for sure
@jreback is there a way of doing that without killing the test framework? I don't think it is a test-worthy case, really. I mean simply that if you have 20k rows indexed by three cols with arity 10k x 10k x 10k, you will get a cube ravelled to 1e12 rows with the default settings. Setting observed=True gives < 20k rows. The new default is fine; it is probably best that folks learn to turn off the Cartesian expansion. But it could hit people if they upgrade old code.
@cottrell you can show a similar effect in a much smaller df, e.g. I don't see why 10 x 10 x 10 wouldn't work.
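A sketch at that smaller scale (the data and names are illustrative, not from the thread):

```python
import numpy as np
import pandas as pd

# 20 observed rows, three categorical groupers with arity 10 each.
rng = np.random.default_rng(0)
cats = [str(i) for i in range(10)]
df = pd.DataFrame({
    "a": pd.Categorical(rng.choice(cats, 20), categories=cats),
    "b": pd.Categorical(rng.choice(cats, 20), categories=cats),
    "c": pd.Categorical(rng.choice(cats, 20), categories=cats),
    "x": rng.random(20),
})

len(df.groupby(["a", "b", "c"], observed=False).sum())  # 1000: the full 10 x 10 x 10 cube
len(df.groupby(["a", "b", "c"], observed=True).sum())   # at most 20: only observed groups
```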
I don't think we are talking about the same thing. A reasonable test to block this default change would have been any test that fails due to the explosion of dimensions when observed=False. The test would need to run and try to produce an array too large to compute. If the test was runnable with observed=False, then it would have been an invalid test. As the new default is in, there is nothing to block any more and this kind of test has no value in the current state. That is probably the good state now anyway, since everything must be explicit in high-arity cases. Below is the example above in the two cases, for reference.
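A sketch of that comparison, reusing the `pdf` frame from the earlier comment:

```python
# Default (observed=False): the full 6 x 6 product of category labels.
pdf.groupby(["category_1", "category_2"], observed=False)["value"].count()

# observed=True: only the label pairs that actually occur in the data.
pdf.groupby(["category_1", "category_2"], observed=True)["value"].count()
```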
Steps to reproduce
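A minimal sketch consistent with the cells [3] and [4] referenced below, using hypothetical data (the issue's frame has a "treatment" categorical and a "type" categorical):

```python
import pandas as pd

df = pd.DataFrame({
    "treatment": pd.Categorical(["C", "C", "T", "T"], categories=["C", "T"]),
    "type": pd.Categorical(["A", "A", "A", "B"], categories=["A", "B", "C"]),
    "value": [1, 2, 3, 4],
})

# Cell [3]: DataFrame groupby -- all 6 (treatment, type) pairs appear.
df.groupby(["treatment", "type"]).count()

# Cell [4]: Series groupby -- empty groups are missing from the result.
df.groupby(["treatment", "type"])["value"].count()
```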
Problem description
When performing a groupby on categorical columns, categories with empty groups should be present in the output. That is, the multi-index of the object returned by `count()` should contain the Cartesian product of all the labels of the first categorical column (`"treatment"` in the example above) and the second categorical column (`"type"`) by which the grouping was performed.

The behavior in cell [3] above is correct. But in cell [4], after obtaining a `pandas.core.groupby.SeriesGroupBy` object, the series returned by the `count()` method does not have entries for all levels of the `"type"` categorical.

Expected Output
The output from cell [4] should have length 6 and include values for the index entries `(C, B)` and `(T, C)`.

Workaround
Perform column access after calling `count()`:
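A sketch of the workaround, assuming the frame from the reproduction sketch above: aggregate on the DataFrame groupby first, then select the column.

```python
# The DataFrame groupby path keeps all category combinations, so
# select the column after count() instead of before it.
df.groupby(["treatment", "type"]).count()["value"]
```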
Output of pd.show_versions()