-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: SeriesGroupBy.value_counts - index name missing in categorical columns #45625
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: SeriesGroupBy.value_counts - index name missing in categorical columns #45625
Conversation
NumberPiOso
commented
Jan 25, 2022
- closes BUG: SeriesGroupBy.value_counts - index name missing when applied on categorical column #44324
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
- whatsnew entry
Value counts tend to preserve index names pandas-dev#45625 Change test test_sorting_with_different_categoricals to comply to this change
def apply_series_value_counts(): | ||
return self.apply( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you take this opportunity to refactor and inline this logic like
if (bins is not None and np.iterable(bins) or is_categorical_dtype(val.dtype):
# comments
s = self.apply(...)
s.index.names = ...
Also will s always have a MultiIndex here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I run the test code with the following change
if is_categorical_dtype(val.dtype) or (bins and np.iterable(bins)):
# scalar bins cannot be done at top level
# in a backward compatible way
# GH38672
s = self.apply(
Series.value_counts,
normalize=normalize,
sort=sort,
ascending=ascending,
bins=bins,
)
print(s.index)
s.index.names = names
I get these results
df = pd.DataFrame(
{
"gender": ["female"],
"country": ["US"],
}
)
df["gender"] = df["gender"].astype("category")
result2 = df.groupby("country")["gender"].value_counts()
MultiIndex([('US', 'female')],
names=['country', None])
And I think the fact that we are inside a group by should always ensure that we are always using MultiIndex.
Running the following code calls a different method. Specifically, pandas.core.algorithms.value_counts
.
pandas/core/groupby/generic.py
Outdated
if is_categorical_dtype(val.dtype) or (bins and not np.iterable(bins)): | ||
# scalar bins cannot be done at top level | ||
# in a backward compatible way | ||
# GH38672 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you clarify that GH#38672 refers to the categorical case
pandas/core/groupby/generic.py
Outdated
# GH38672 | ||
return apply_series_value_counts() | ||
s.index.names = names | ||
return s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: s -> ser
pandas/core/groupby/generic.py
Outdated
return self.apply( | ||
names = self.grouper.names + [self.obj.name] | ||
|
||
if is_categorical_dtype(val.dtype) or (bins and np.iterable(bins)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this way of refactoring the conditionals very Pythonic, let me know what you think. @mroeschke
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks okay but looks like tests are still failing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, apparently I introduced the error there
> if is_categorical_dtype(val.dtype) or (bins and not np.iterable(bins)):
> ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Thanks @NumberPiOso |
…olumns (pandas-dev#45625) * BUG: SeriesGroupBy.value_counts index name missing Issue pandas-dev#44324 * TST: Change test to correct categorical naming Value counts tend to preserve index names pandas-dev#45625 Change test test_sorting_with_different_categoricals to comply to this change * REF: Refactor conditionals in value_counts() * RFT: correct mistake introduced via RFT In line with 44324 * RFT: Change variable names and comment pandas-dev#38672 * BUG: Update conditional to is None to consider series
…olumns (pandas-dev#45625) * BUG: SeriesGroupBy.value_counts index name missing Issue pandas-dev#44324 * TST: Change test to correct categorical naming Value counts tend to preserve index names pandas-dev#45625 Change test test_sorting_with_different_categoricals to comply to this change * REF: Refactor conditionals in value_counts() * RFT: correct mistake introduced via RFT In line with 44324 * RFT: Change variable names and comment pandas-dev#38672 * BUG: Update conditional to is None to consider series