Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
BUG: regression when applying groupby aggregation on categorical columns #31359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: regression when applying groupby aggregation on categorical columns #31359
Changes from 31 commits
7e461a1
1314059
8bcb313
24c3ede
dea38f2
cd9e7ac
e5e912b
d0cddb3
4e1abde
0b917c6
e9cac5d
e03357e
a1df393
8743c47
1e10d71
905b3a5
c5d670b
916d9b2
2208fc2
86a254c
c36d97b
3f8ea8f
a6ad1a2
c7daa46
c4ebfa9
bbad886
a6a498e
2a3f5a2
ceef95e
ed91cc1
9c7af0f
ca35648
1b826bb
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will this return? Not a categorical?
I am not fully sure that is correct as well. Eg if you do a first-like aggregation
lambda x: x[0]
, you would expect this to work...(but of course any heuristic will never be correct in all cases .. for full control in cases like this we will need an additional keyword)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, it will be object dtype. That's the API breaking change. But, I'm OK with it because
df.groupby([1, 1]).agg(lambda x: 'a').A.dtype
is object.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@charlesdong1991 it looks like this change caused the regression in #32194
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugly! But this ensures that
Series[IntDtype].resample().sum()
has int dtype. Assuming we want the restriction "only try casting back to EA when the scalars results are instances of the EA dtypes type, then I think this is the least worst option.I briefly tried avoiding the cast from Int64 -> float with NaN, but that wasn't feasible today. Our Cython reductions like
group_add
only handle floats IIUC.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO comment suggesting this get cleaned up, pointing back to this thread?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going to open an issue and will point here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the expected dtype. I think the old test was incorrect, as
mean
should be float.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, no it tries to cast back originally (which i suppose is dubious), but yeah we should also return float on nullable mean ops (thought we did genreally)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree that should be float.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another test change. The expected result here is less clear I think. You could argue for either categorical or not...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an API change from 0.25.x. Dunno what to do here :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On 0.25.x, we have the following
We're potentially changing the values of the result by returning a CategoricalDtype there. I don't think that behavior is correct, so I'm in favor of making this breaking change... I think...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems exactly the bug we were fixing initially. For groupby it was a regression, so fine to do it as a fix for resample I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see now that I commented about in two different places (also in the whatsnew), with a contradicting message ...
I don't know what is best for now in this PR, but in general: if we are going to take the rule of "only trying to cast back if the scalars are of the dtype.type", then I think we should try to cast to categorical (meaning, we should see the string "a" as a valid scalar of a categorical dtype with string categories). The specific case of strings is a bit difficult of course, as also the categories' dtype is object and can hold anything.
Now, long term, I am thinking we should maybe not use this "scalar of dtype.type" rule (we can say that to explain it, but not as actual check in the code). But we could rather dictate that
_from_sequence
should be strict. And then it's_from_sequence
that can hold the logic to know if it's a proper scalar (eg to know that pd.NA is also fine, or to know that "a" is a proper scalar for a categorical with categories that include "a", or knowing that a python int is also fine in addition to np.int64 scalar for an IntDtype, etc).For the specific case of categorical, it could then fail if it's getting scalars (eg "c" in the example above) which cannot be a scalar of its dtype (since the categories are part of the dtype, it can know "c" is not a valid scalar in that case).
You will still get variable behaviour depending on what the agg function returns, eg
.agg(lambda x: 'a')
you get back categorical and.agg(lambda x: 'c')
not, while both are simple functions that return a string. But with any rule such small differences will be unavoidable for the default behaviour (hence having a keyword to have full control might be interesting to investigate)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the long comment here :) That should probably be moved to the follow-up issue.
About what is best for this PR, I am not fully sure. I suppose what you have now is fine. The
isinstance(result[0], dtype.type)
check just doesn't really work for categoricals ..There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, missed this earlier. I'm writing up the followup issue now, but will address this there.