ENH: Add DataFrameGroupBy.value_counts #44267
can you not just do column selection? e.g.
[]
?
Column selection would fail if there are duplicate column labels: these columns are used as groupers and must be 1-dimensional (otherwise grouper construction will raise).
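To illustrate the point about duplicate labels, here is a minimal sketch (the frame and variable names are made up for illustration, not taken from the PR). Selecting by label returns a 2-dimensional DataFrame when the label is duplicated, while positional selection returns a 1-dimensional Series.

```python
import pandas as pd

# Hypothetical frame with a duplicated column label "a".
df = pd.DataFrame([[1, 2, 3], [1, 5, 6]], columns=["a", "a", "b"])

# Label-based selection returns every column called "a": a 2-D DataFrame.
by_label = df["a"]

# Positional selection returns exactly one column: a 1-D Series named "a".
by_position = df.iloc[:, 0]

print(type(by_label).__name__, by_label.shape)
print(type(by_position).__name__, by_position.name)
```

Only the positional result is 1-dimensional, which is what a grouper needs.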
ok is there a test with duplicates?
test_column_name_clashes
I have to use positional selection, to avoid problems with duplicate column labels, and a Series, to preserve the label in the grouper.
The test is test_column_name_clashes.
In the as_index=False case, this currently detects failure for all inputs. Once I have reset_index(allow_duplicates=True) available, I will make another PR to allow duplicate input column labels through (but not level_n clashes, which I think deserve to fail!).
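As a rough sketch of why a Series is used as the grouper (illustrative data and names, not the PR's code): a positionally selected Series carries its column label into the result index, whereas a bare array with the same values does not. The frame here has no duplicate labels, to keep the example small.

```python
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})

as_series = df.iloc[:, 0]            # positional selection, keeps the name "a"
as_array = df.iloc[:, 0].to_numpy()  # same values, but no label attached

named = df.groupby(as_series).size()
unnamed = df.groupby(as_array).size()

print(named.index.name)    # label preserved from the Series
print(unnamed.index.name)  # no label available from the array
```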
can you not use gb here?
Not sure what's meant by gb (self?), but the idea here is to use the result, which is typically smaller than the obj in the groupby. By transforming, the indexes of indexed_group_size and result are the same, so no alignment is needed, and this speeds up the division on L1712 below.
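The transform idea can be sketched roughly as follows. This is an assumption-laden illustration with made-up data, not the PR's implementation; only the names result and indexed_group_size come from the discussion.

```python
import pandas as pd

# Hypothetical input for illustration.
df = pd.DataFrame({"a": ["x", "x", "x", "y"], "b": [1, 1, 2, 3]})

# Counts for every (a, b) combination; typically smaller than df itself.
result = df.groupby(["a", "b"]).size()

# Sum the counts within each "a" group and broadcast the sum back onto
# result's own index, so both Series are indexed identically.
indexed_group_size = result.groupby(level="a").transform("sum")

# The indexes are equal, so the division needs no alignment step.
proportions = result / indexed_group_size
print(proportions)
```

Because transform returns an object with the same index as its input, the division operates on two identically indexed Series.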
gb is the groupby for size above
Ah, when grouping by ['a', 'b'] in a DataFrame that has columns ['a', 'b', 'c'], gb will be grouping by all three columns, whereas here we only want to group by ['a', 'b'].
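A small sketch of the difference between the two groupings (the data is invented for illustration): grouping by all three columns counts each distinct row combination, while grouping by ['a', 'b'] alone gives the size of each user-specified group.

```python
import pandas as pd

# Hypothetical frame with columns a, b, c.
df = pd.DataFrame({"a": [1, 1, 1, 2], "b": [1, 1, 1, 2], "c": [5, 5, 6, 7]})

# Grouping by all three columns: one count per distinct (a, b, c) row.
per_combination = df.groupby(["a", "b", "c"]).size()

# Grouping by the requested keys only: one count per (a, b) group.
per_group = df.groupby(["a", "b"]).size()

print(per_combination)
print(per_group)
```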
Each groupby has different options. These ones have the default as_index=True to get a Series, whereas self might not have done. Perhaps one could optimize this with a test, but it is pretty fast anyway, and complicated enough!
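For reference, a minimal sketch of how as_index changes the shape of a size() result (illustrative data only): the default as_index=True yields a Series indexed by the group keys, while as_index=False yields a DataFrame with the keys as ordinary columns.

```python
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})

# Default as_index=True: a Series indexed by the group key.
as_series = df.groupby("a").size()

# as_index=False: a DataFrame with the key and the count as columns.
as_frame = df.groupby("a", as_index=False).size()

print(type(as_series).__name__)
print(type(as_frame).__name__, list(as_frame.columns))
```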