-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Add DataFrameGroupBy.value_counts #44267
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add DataFrameGroupBy.value_counts #44267
Conversation
johnzangwill
commented
Nov 1, 2021
- closes ENH: DataFrameGroupby.value_counts #43564
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
- whatsnew entry
Hello @johnzangwill! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-12-19 00:21:03 UTC |
The seemingly innocuous #44755 seems to have become stuck again. I would merge this PR now. I just needed #44755 to cover @rhshadrach's pathological case with as_index=False, It raises, and is covered in the tests, and I don't think that it matters. |
Agree this case is not of sufficient severity to be of concern. But from #44267 (comment), can we check this upfront rather than go through the entire computation just to fail. I think this would just amount to checking if there are duplicate column labels in the case |
@rhshadrach It is more complicated than just duplicates on the incoming. Your example had a column labelled "level_1" that happened to clash with a new column from a non-column grouper. The question is: do you want it to work and create duplicates, or raise? |
Ah, very true. Of course, this is already an issue with other ops, e.g.
raises as well. I'd suggest raising an issue on this, and only specifically detecting duplicate input columns here. |
keys = [] if name in in_axis_names else [self._selected_obj] | ||
else: | ||
keys = [ | ||
# Can't use .values because the column label needs to be preserved |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you not just do column selection? e.g. []
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Column selection would fail if there are duplicate column labels since these are groupers and must be 1-dimensional (otherwise grouper construction will raise)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok is there a test with duplicates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_column_name_clashes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to use positional, to avoid problems with duplicate column labels, and Series, to preserve the label in the grouper.
The test is test_column_name_clashes
In the as_index=False
case, this currently detects failure for all inputs. Once I have reset_index(allow_duplicates=True)
available then I will make another PR to allow duplicate input column labels through (but not level_n clashes, which I think deserve to fail!)
# We are guaranteed to have the first N levels be the | ||
# user-requested grouping. | ||
levels = list(range(len(self.grouper.groupings), result.index.nlevels)) | ||
indexed_group_size = result.groupby( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you not use gb here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what's meant by gb (self?), but the idea here is to use the result which is typically smaller than the obj in the groupby. By transforming, the index of indexed_group_size
and result
are the same, meaning there doesn't need to be alignment and this speeds up the division on L1712 below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gb is the groupby for size above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, when grouping by ['a', 'b']
in a DataFrame that has columns ['a', 'b', 'c']
, gb will be grouping by all three columns whereas here we only want to group by ['a', 'b'].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each groupby has different options. These ones have the default as_index=True
to get Series
, whereas self
might not have done. Perhaps one could optimize this with a test, but it is pretty fast anyway, and complicated enough!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 questions @johnzangwill
@rhshadrach you all ok here?
looks like a doc-test failure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
All green |
thanks @johnzangwill very nice! |