Skip to content

BUG: DataFrameGroupBy.value_counts includes non-observed categories of non-grouping columns #46798

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Aug 7, 2022

Conversation

LucasG0
Copy link
Contributor

@LucasG0 LucasG0 commented Apr 18, 2022

@LucasG0 LucasG0 force-pushed the value_counts_categorical branch from 95c141c to 9b9ffc4 Compare April 18, 2022 16:53
@LucasG0 LucasG0 changed the title DataFrameGroupBy.value_counts includes non-observed categories of non-grouping columns BUG: DataFrameGroupBy.value_counts includes non-observed categories of non-grouping columns Apr 18, 2022
@LucasG0 LucasG0 force-pushed the value_counts_categorical branch from 9b9ffc4 to db359a2 Compare April 18, 2022 21:11
@LucasG0
Copy link
Contributor Author

LucasG0 commented May 17, 2022

This PR has not been reviewed for 1 month. @rhshadrach as you were involved in the issue description, would you like to have a look, or should I wait/ping for the review from a specific maintainer ?

@rhshadrach rhshadrach added Groupby Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Categorical Categorical Data Type Bug labels May 17, 2022
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the ping @LucasG0; looks good. Some requests/questions.

@jreback jreback added this to the 1.5 milestone May 19, 2022
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment.

@LucasG0 LucasG0 force-pushed the value_counts_categorical branch from db359a2 to ddf50d4 Compare May 20, 2022 18:24
@LucasG0 LucasG0 force-pushed the value_counts_categorical branch from ddf50d4 to b535f77 Compare May 20, 2022 18:42
@LucasG0 LucasG0 force-pushed the value_counts_categorical branch from 04c1654 to 0683e17 Compare May 23, 2022 09:57
@LucasG0
Copy link
Contributor Author

LucasG0 commented May 31, 2022

@jreback @rhshadrach do you have any other comments to add on this PR or should we consider merging it (after rebase) ?

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, I'm not sure why the one test is still changing though.

@@ -634,7 +634,8 @@ Categorical
^^^^^^^^^^^
- Bug in :meth:`Categorical.view` not accepting integer dtypes (:issue:`25464`)
- Bug in :meth:`CategoricalIndex.union` when the index's categories are integer-dtype and the index contains ``NaN`` values incorrectly raising instead of casting to ``float64`` (:issue:`45362`)
-
- Bug in :meth:`.DataFrameGroupBy.value_counts` includes non-observed categories of non-grouping columns regardless of ``observed`` (:issue:`46357`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok since this is changing non-trivially, can you make a sub-section (ok here is fine, where you show the prior and the fixed behavior)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a sub-section in notable fixes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sub-section looks good; need to remove this line.

@LucasG0 LucasG0 force-pushed the value_counts_categorical branch 2 times, most recently from 55cde73 to fc8ea62 Compare June 6, 2022 18:01
@LucasG0
Copy link
Contributor Author

LucasG0 commented Jun 6, 2022

Unsuccessful job seems to be related to the flakyness of #45651

@simonjayhawkins
Copy link
Member

@jreback @rhshadrach DataFrameGroupBy.value_counts was added in pandas-1.4.0

Firstly. do we need the extensive release note. These are normally reserved for bug fixes that have a significant change in behavior that might be considered a breaking change by users.

Also because DataFrameGroupBy.value_counts is a new feature we could consider backporting to 1.4.x.

@jreback
Copy link
Contributor

jreback commented Jun 10, 2022

no backport in this
and i think does need a big note as this is a significant change that's not obvious

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs two fixes from the whatsnew (some from me, one from Simon); otherwise looks good.

@@ -634,7 +634,8 @@ Categorical
^^^^^^^^^^^
- Bug in :meth:`Categorical.view` not accepting integer dtypes (:issue:`25464`)
- Bug in :meth:`CategoricalIndex.union` when the index's categories are integer-dtype and the index contains ``NaN`` values incorrectly raising instead of casting to ``float64`` (:issue:`45362`)
-
- Bug in :meth:`.DataFrameGroupBy.value_counts` includes non-observed categories of non-grouping columns regardless of ``observed`` (:issue:`46357`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sub-section looks good; need to remove this line.

@LucasG0
Copy link
Contributor Author

LucasG0 commented Jun 13, 2022

@rhshadrach I fixed the whatsnew

@LucasG0
Copy link
Contributor Author

LucasG0 commented Jun 13, 2022

Job failure is due to the unavailability of Github Actions

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jul 15, 2022
@rhshadrach
Copy link
Member

Apologies @LucasG0 - this fell off my radar. Can you resolve conflicts.

@simonjayhawkins - are you good here?

@LucasG0 LucasG0 force-pushed the value_counts_categorical branch from 4cbbcea to fa3869c Compare July 15, 2022 21:37
@LucasG0
Copy link
Contributor Author

LucasG0 commented Jul 15, 2022

No worries @rhshadrach, conflicts resolved.

@rhshadrach rhshadrach removed the Stale label Jul 17, 2022
@rhshadrach
Copy link
Member

@simonjayhawkins - friendly ping

@simonjayhawkins
Copy link
Member

@simonjayhawkins - friendly ping

My comments were related to whether this should have been a targeted backport or not.

But I'm not going to block.

@rhshadrach
Copy link
Member

@jreback - friendly ping

@jreback jreback merged commit d924d0b into pandas-dev:main Aug 7, 2022
@jreback
Copy link
Contributor

jreback commented Aug 7, 2022

thanks @LucasG0 this is very nice! and thanks @rhshadrach for all the work on this as well.

noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Categorical Categorical Data Type Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: value_counts(observed=True) inconsistent between SeriesGroupBy and DataFrameGroupBy
4 participants