Skip to content

BUG: SeriesGroupBy.value_counts - index name missing in categorical columns #45625

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Feb 5, 2022

Conversation

NumberPiOso
Copy link
Contributor

def apply_series_value_counts():
return self.apply(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you take this opportunity to refactor and inline this logic like

if (bins is not None and np.iterable(bins) or is_categorical_dtype(val.dtype):
    # comments
    s = self.apply(...)
    s.index.names = ...

Also will s always have a MultiIndex here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I run the test code with the following change

    if is_categorical_dtype(val.dtype) or (bins and np.iterable(bins)):
        # scalar bins cannot be done at top level
        # in a backward compatible way
        # GH38672
        s = self.apply(
            Series.value_counts,
            normalize=normalize,
            sort=sort,
            ascending=ascending,
            bins=bins,
        )
        print(s.index)
        s.index.names = names

I get these results

df = pd.DataFrame(
    {
        "gender": ["female"],
        "country": ["US"],
    }
)
df["gender"] = df["gender"].astype("category")
result2 = df.groupby("country")["gender"].value_counts()
MultiIndex([('US', 'female')],
           names=['country', None])

And I think the fact that we are inside a group by should always ensure that we are always using MultiIndex.

Running the following code calls a different method. Specifically, pandas.core.algorithms.value_counts.

@mroeschke mroeschke added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Apply Apply, Aggregate, Transform, Map labels Jan 30, 2022
if is_categorical_dtype(val.dtype) or (bins and not np.iterable(bins)):
# scalar bins cannot be done at top level
# in a backward compatible way
# GH38672
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you clarify that GH#38672 refers to the categorical case

# GH38672
return apply_series_value_counts()
s.index.names = names
return s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: s -> ser

return self.apply(
names = self.grouper.names + [self.obj.name]

if is_categorical_dtype(val.dtype) or (bins and np.iterable(bins)):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this way of refactoring the conditionals very Pythonic, let me know what you think. @mroeschke

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks okay but looks like tests are still failing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, apparently I introduced the error there

>       if is_categorical_dtype(val.dtype) or (bins and not np.iterable(bins)):
>      ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

@NumberPiOso NumberPiOso requested a review from mroeschke February 3, 2022 15:12
@rhshadrach rhshadrach added Bug Groupby Series Series data structure Categorical Categorical Data Type and removed Apply Apply, Aggregate, Transform, Map labels Feb 5, 2022
@rhshadrach rhshadrach added this to the 1.5 milestone Feb 5, 2022
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mroeschke mroeschke merged commit c80b145 into pandas-dev:main Feb 5, 2022
@mroeschke
Copy link
Member

Thanks @NumberPiOso

@NumberPiOso NumberPiOso deleted the index-value-count-cat-col branch February 6, 2022 22:02
phofl pushed a commit to phofl/pandas that referenced this pull request Feb 14, 2022
…olumns (pandas-dev#45625)

* BUG: SeriesGroupBy.value_counts index name missing

Issue  pandas-dev#44324

* TST: Change test to correct categorical naming

Value counts tend to preserve index names pandas-dev#45625
Change test test_sorting_with_different_categoricals to comply
to this change

* REF: Refactor conditionals in value_counts()

* RFT: correct mistake introduced via RFT

In line with 44324

* RFT: Change variable names and comment pandas-dev#38672

* BUG: Update conditional to is None to consider series
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
…olumns (pandas-dev#45625)

* BUG: SeriesGroupBy.value_counts index name missing

Issue  pandas-dev#44324

* TST: Change test to correct categorical naming

Value counts tend to preserve index names pandas-dev#45625
Change test test_sorting_with_different_categoricals to comply
to this change

* REF: Refactor conditionals in value_counts()

* RFT: correct mistake introduced via RFT

In line with 44324

* RFT: Change variable names and comment pandas-dev#38672

* BUG: Update conditional to is None to consider series
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Categorical Categorical Data Type Groupby Series Series data structure
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: SeriesGroupBy.value_counts - index name missing when applied on categorical column
4 participants