ENH: Add DataFrameGroupBy.value_counts #44267

Merged: 140 commits, Dec 19, 2021

Conversation

@johnzangwill (Contributor)

pep8speaks commented Nov 1, 2021

Hello @johnzangwill! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-12-19 00:21:03 UTC

@johnzangwill johnzangwill changed the title Add DataFrameGroupBy.value_counts ENH: Add DataFrameGroupBy.value_counts Nov 3, 2021
@johnzangwill (Contributor, Author)
lgtm pending #44755

The seemingly innocuous #44755 has become stuck again, so I would merge this PR now. I only needed #44755 to cover @rhshadrach's pathological case with as_index=False. That case raises, is covered by the tests, and I don't think it matters.

@rhshadrach (Member)

Agree this case is not of sufficient severity to be of concern. But from #44267 (comment), can we check this upfront rather than going through the entire computation just to fail? I think this would just amount to checking for duplicate column labels in the as_index=False case.
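The upfront check being suggested could be as simple as inspecting the column labels before doing any grouping work (a sketch of the idea, not the code that was merged):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])

# Index.duplicated marks repeated labels; .any() tells us whether a
# cheap early raise is warranted before the full computation runs.
has_dupes = df.columns.duplicated().any()
```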

@johnzangwill (Contributor, Author)

@rhshadrach It is more complicated than just duplicates on the incoming data. Your example had a column labelled "level_1" that happened to clash with a new column from a non-column grouper. The question is: do you want it to work and create duplicates, or to raise?
Also, if the incoming data already has actual duplicates, then it seems bad to be unable to process it. But I could always fix that later in #44755, if and when it ever gets merged...

@rhshadrach (Member)

Ah, very true. Of course, this is already an issue with other ops, e.g.

df = pd.DataFrame({'level_1': [1, 1, 2]})
df.groupby(['level_1', [0, 0, 1]], as_index=False).size()

raises as well. I'd suggest raising an issue on this, and only specifically detecting duplicate input columns here.
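The snippet above, made runnable: the unnamed list grouper is assigned the default name level_1 when the index is reset, which collides with the real level_1 column. In the pandas versions I'd expect, this surfaces as a ValueError, though the exact message may vary:

```python
import pandas as pd

df = pd.DataFrame({'level_1': [1, 1, 2]})
try:
    # the unnamed [0, 0, 1] grouper becomes "level_1" on reset_index,
    # clashing with the existing column of the same name
    df.groupby(['level_1', [0, 0, 1]], as_index=False).size()
except ValueError as err:
    print(f"raises: {err}")
```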

    keys = [] if name in in_axis_names else [self._selected_obj]
else:
    keys = [
        # Can't use .values because the column label needs to be preserved
Contributor:
can you not just do column selection? e.g. []

Contributor:
?

@rhshadrach (Member) commented Dec 18, 2021:
Column selection would fail if there are duplicate column labels, since these are groupers and must be 1-dimensional (otherwise grouper construction will raise).
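A quick illustration of the point (illustrative only, not the PR's code): with a duplicated label, label-based selection is 2-D, while positional selection stays 1-D and keeps the label.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "a"])

# Label selection with a duplicated label returns a 2-D DataFrame,
# which cannot serve as a 1-dimensional grouper...
picked = df["a"]

# ...whereas positional selection always yields a Series, and the
# column label survives as the Series name.
col0 = df.iloc[:, 0]
```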

Contributor:

ok is there a test with duplicates?

Member:

test_column_name_clashes

@johnzangwill (Contributor, Author) commented Dec 18, 2021:

I have to use positional selection, to avoid problems with duplicate column labels, and a Series, to preserve the label in the grouper.
The test is test_column_name_clashes.
In the as_index=False case, this currently detects failure for all inputs. Once reset_index(allow_duplicates=True) is available, I will make another PR to allow duplicate input column labels through (but not level_n clashes, which I think deserve to fail!)
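For reference, the reset_index(allow_duplicates=True) escape hatch mentioned here did land later (in pandas 1.5, so after this PR's timeframe); a sketch of what it permits:

```python
import pandas as pd

# a result whose index level shares its name with an existing column
df = pd.DataFrame({"a": [1, 2]}, index=pd.Index([10, 20], name="a"))

# plain reset_index refuses to insert a second "a" column
try:
    df.reset_index()
except ValueError as err:
    print(f"raises: {err}")

# with allow_duplicates=True (pandas >= 1.5) the duplicate is permitted
out = df.reset_index(allow_duplicates=True)
```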

# We are guaranteed to have the first N levels be the
# user-requested grouping.
levels = list(range(len(self.grouper.groupings), result.index.nlevels))
indexed_group_size = result.groupby(
Contributor:

can you not use gb here?

Member:

Not sure what's meant by gb (self?), but the idea here is to use result, which is typically smaller than the obj in the groupby. By transforming, indexed_group_size and result share the same index, so no alignment is needed, and this speeds up the division on L1712 below.
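The idea in miniature: transform returns a result indexed exactly like its input, so the subsequent division is purely elementwise, with no index alignment step (a toy illustration, not the PR's code):

```python
import pandas as pd

counts = pd.Series(
    [2, 1, 3],
    index=pd.MultiIndex.from_tuples(
        [("a", "x"), ("a", "y"), ("b", "x")], names=["key", "value"]
    ),
)

# transform("sum") broadcasts each group total back onto the original
# index, so `totals` shares `counts`' index exactly
totals = counts.groupby(level="key").transform("sum")

# elementwise division; no alignment required since the indexes match
proportions = counts / totals
```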

Contributor:

gb is the groupby for size above

Member:

Ah, when grouping by ['a', 'b'] in a DataFrame that has columns ['a', 'b', 'c'], gb will be grouping by all three columns whereas here we only want to group by ['a', 'b'].

@johnzangwill (Contributor, Author) commented Dec 18, 2021:

Each groupby has different options. These ones use the default as_index=True to get a Series, whereas self might not have. Perhaps one could optimize this with a check, but it is pretty fast anyway, and complicated enough!
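The as_index distinction in question, in isolation (why the internal groupby is built fresh with the default rather than reusing self's options):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 4]})

# as_index=True (the default) keys the result by the group labels,
# so size() comes back as a Series -- what the internals want here
ser = df.groupby("a").size()

# as_index=False moves the keys into columns, giving a DataFrame
frame = df.groupby("a", as_index=False).size()
```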

@jreback jreback added this to the 1.4 milestone Dec 17, 2021
@jreback (Contributor) left a comment:

2 questions @johnzangwill

@rhshadrach you all ok here?

@jreback (Contributor)
jreback commented Dec 17, 2021

> =================================== FAILURES ===================================
_____ [doctest] pandas.core.groupby.generic.DataFrameGroupBy.value_counts ______
1633         >>> df
1634             gender 	education 	country
1635         0 	male 	low 	    US
1636         1 	male 	medium 	    FR
1637         2 	female 	high 	    US
1638         3 	male 	low 	    FR
1639         4 	female 	high 	    FR
1640         5 	male 	low 	    FR
1641 
1642         >>> df.groupby('gender').value_counts()
Differences (unified diff with -expected +actual):
    @@ -5,3 +5,3 @@
                        US         1
             medium     FR         1
    -dtype: float64
    +dtype: int64

/home/runner/work/pandas/pandas/pandas/core/groupby/generic.py:1642: DocTestFailure

looks like a doc-test failure
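The diff is just the counts dtype: plain value_counts returns int64 counts, while normalize=True returns float64 proportions, so the doctest's expected dtype line needed updating. A minimal check (assuming pandas >= 1.4, where this method landed):

```python
import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "male"],
                   "country": ["US", "FR", "US"]})

counts = df.groupby("gender").value_counts()               # int64 counts
props = df.groupby("gender").value_counts(normalize=True)  # float64 shares
```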

@rhshadrach (Member) left a comment:

lgtm

@johnzangwill (Contributor, Author)

johnzangwill commented Dec 18, 2021

All green

@johnzangwill johnzangwill requested a review from jreback December 18, 2021 11:10
@jreback jreback merged commit 539545b into pandas-dev:master Dec 19, 2021
@jreback (Contributor)

jreback commented Dec 19, 2021

thanks @johnzangwill very nice!

@johnzangwill johnzangwill deleted the DataFrameGroupBy.value_counts branch December 20, 2021 08:05
Successfully merging this pull request may close these issues.

ENH: DataFrameGroupby.value_counts
5 participants