ENH: Add DataFrameGroupBy.value_counts #44267

Merged: 140 commits, Dec 19, 2021

Conversation

@johnzangwill (Contributor)

pep8speaks commented Nov 1, 2021

Hello @johnzangwill! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-12-19 00:21:03 UTC

@johnzangwill johnzangwill changed the title Add DataFrameGroupBy.value_counts ENH: Add DataFrameGroupBy.value_counts Nov 3, 2021
@johnzangwill (Contributor, Author)
lgtm pending #44755

The seemingly innocuous #44755 has become stuck again, so I would merge this PR now. I only needed #44755 to cover @rhshadrach's pathological case with as_index=False. That case raises, is covered by the tests, and I don't think it matters.

@rhshadrach (Member)

Agree this case is not of sufficient severity to be of concern. But from #44267 (comment), can we check this upfront rather than going through the entire computation just to fail? I think this would just amount to checking for duplicate column labels in the as_index=False case.
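The upfront check being suggested could be as simple as inspecting the column labels before doing any grouping work (a sketch of the idea, not the code that was merged):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "b"])

# Index.duplicated marks repeated labels; .any() tells us whether a
# cheap early raise is warranted before the full computation runs.
has_dupes = df.columns.duplicated().any()
```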

@johnzangwill (Contributor, Author)

@rhshadrach It is more complicated than just duplicates on the incoming data. Your example had a column labelled "level_1" that happened to clash with a new column from a non-column grouper. The question is: do you want it to work and create duplicates, or to raise?
Also, if the incoming data already has actual duplicates, then it seems bad to be unable to process it. But I could always fix that later in #44755, if and when it ever gets merged...

@rhshadrach (Member)

Ah, very true. Of course, this is already an issue with other ops, e.g.

df = pd.DataFrame({'level_1': [1, 1, 2]})
df.groupby(['level_1', [0, 0, 1]], as_index=False).size()

raises as well. I'd suggest raising an issue on this, and only specifically detecting duplicate input columns here.
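The snippet above, made runnable: the unnamed list grouper is assigned the default name level_1 when the index is reset, which collides with the real level_1 column. In the pandas versions I'd expect, this surfaces as a ValueError, though the exact message may vary:

```python
import pandas as pd

df = pd.DataFrame({'level_1': [1, 1, 2]})
try:
    # the unnamed [0, 0, 1] grouper becomes "level_1" on reset_index,
    # clashing with the existing column of the same name
    df.groupby(['level_1', [0, 0, 1]], as_index=False).size()
except ValueError as err:
    print(f"raises: {err}")
```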

    keys = [] if name in in_axis_names else [self._selected_obj]
else:
    keys = [
        # Can't use .values because the column label needs to be preserved
Contributor:
can you not just do column selection? e.g. []

Contributor:
?

@rhshadrach (Member) commented Dec 18, 2021:
Column selection would fail if there are duplicate column labels, since these are groupers and must be 1-dimensional (otherwise grouper construction will raise).
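A quick illustration of the point (illustrative only, not the PR's code): with a duplicated label, label-based selection is 2-D, while positional selection stays 1-D and keeps the label.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "a"])

# Label selection with a duplicated label returns a 2-D DataFrame,
# which cannot serve as a 1-dimensional grouper...
picked = df["a"]

# ...whereas positional selection always yields a Series, and the
# column label survives as the Series name.
col0 = df.iloc[:, 0]
```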

Contributor:

ok is there a test with duplicates?

Member:

test_column_name_clashes

@johnzangwill (Contributor, Author) commented Dec 18, 2021:

I have to use positional selection, to avoid problems with duplicate column labels, and a Series, to preserve the label in the grouper.
The test is test_column_name_clashes.
In the as_index=False case, this currently detects failure for all inputs. Once reset_index(allow_duplicates=True) is available, I will make another PR to allow duplicate input column labels through (but not level_n clashes, which I think deserve to fail!)
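For reference, the reset_index(allow_duplicates=True) escape hatch mentioned here did land later (in pandas 1.5, so after this PR's timeframe); a sketch of what it permits:

```python
import pandas as pd

# a result whose index level shares its name with an existing column
df = pd.DataFrame({"a": [1, 2]}, index=pd.Index([10, 20], name="a"))

# plain reset_index refuses to insert a second "a" column
try:
    df.reset_index()
except ValueError as err:
    print(f"raises: {err}")

# with allow_duplicates=True (pandas >= 1.5) the duplicate is permitted
out = df.reset_index(allow_duplicates=True)
```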

# We are guaranteed to have the first N levels be the
# user-requested grouping.
levels = list(range(len(self.grouper.groupings), result.index.nlevels))
indexed_group_size = result.groupby(
Contributor:

can you not use gb here?

Member:

Not sure what's meant by gb (self?), but the idea here is to use result, which is typically smaller than the obj in the groupby. By transforming, indexed_group_size and result share the same index, so no alignment is needed, and this speeds up the division on L1712 below.
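The idea in miniature: transform returns a result indexed exactly like its input, so the subsequent division is purely elementwise, with no index alignment step (a toy illustration, not the PR's code):

```python
import pandas as pd

counts = pd.Series(
    [2, 1, 3],
    index=pd.MultiIndex.from_tuples(
        [("a", "x"), ("a", "y"), ("b", "x")], names=["key", "value"]
    ),
)

# transform("sum") broadcasts each group total back onto the original
# index, so `totals` shares `counts`' index exactly
totals = counts.groupby(level="key").transform("sum")

# elementwise division; no alignment required since the indexes match
proportions = counts / totals
```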

Contributor:

gb is the groupby for size above

Member:

Ah, when grouping by ['a', 'b'] in a DataFrame that has columns ['a', 'b', 'c'], gb will be grouping by all three columns whereas here we only want to group by ['a', 'b'].

@johnzangwill (Contributor, Author) commented Dec 18, 2021:

Each groupby has different options. These ones use the default as_index=True to get a Series, whereas self might not have. Perhaps one could optimize this with a check, but it is pretty fast anyway, and complicated enough!
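The as_index distinction in question, in isolation (why the internal groupby is built fresh with the default rather than reusing self's options):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 4]})

# as_index=True (the default) keys the result by the group labels,
# so size() comes back as a Series -- what the internals want here
ser = df.groupby("a").size()

# as_index=False moves the keys into columns, giving a DataFrame
frame = df.groupby("a", as_index=False).size()
```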

@jreback jreback added this to the 1.4 milestone Dec 17, 2021
@jreback (Contributor) left a comment:

2 questions @johnzangwill

@rhshadrach you all ok here?

@jreback (Contributor)
jreback commented Dec 17, 2021

> =================================== FAILURES ===================================
_____ [doctest] pandas.core.groupby.generic.DataFrameGroupBy.value_counts ______
1633         >>> df
1634             gender 	education 	country
1635         0 	male 	low 	    US
1636         1 	male 	medium 	    FR
1637         2 	female 	high 	    US
1638         3 	male 	low 	    FR
1639         4 	female 	high 	    FR
1640         5 	male 	low 	    FR
1641 
1642         >>> df.groupby('gender').value_counts()
Differences (unified diff with -expected +actual):
    @@ -5,3 +5,3 @@
                        US         1
             medium     FR         1
    -dtype: float64
    +dtype: int64

/home/runner/work/pandas/pandas/pandas/core/groupby/generic.py:1642: DocTestFailure

looks like a doc-test failure
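The diff is just the counts dtype: plain value_counts returns int64 counts, while normalize=True returns float64 proportions, so the doctest's expected dtype line needed updating. A minimal check (assuming pandas >= 1.4, where this method landed):

```python
import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "male"],
                   "country": ["US", "FR", "US"]})

counts = df.groupby("gender").value_counts()               # int64 counts
props = df.groupby("gender").value_counts(normalize=True)  # float64 shares
```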

@rhshadrach (Member) left a comment:

lgtm

@johnzangwill (Contributor, Author)

johnzangwill commented Dec 18, 2021

All green

@johnzangwill johnzangwill requested a review from jreback December 18, 2021 11:10
@jreback jreback merged commit 539545b into pandas-dev:master Dec 19, 2021
@jreback (Contributor)

jreback commented Dec 19, 2021

thanks @johnzangwill very nice!

@johnzangwill johnzangwill deleted the DataFrameGroupBy.value_counts branch December 20, 2021 08:05
Successfully merging this pull request may close these issues.

ENH: DataFrameGroupby.value_counts
5 participants