BUG: DataFrameGroupBy.nunique fails with categorical, multiple groupings, and as_index=False #52848

Closed
rhshadrach opened this issue Apr 22, 2023 · 7 comments · Fixed by #55738
Labels
Bug Categorical Categorical Data Type Groupby Reduction Operations sum, mean, min, max, etc.

Comments

@rhshadrach
Member

from pandas import DataFrame

df = DataFrame({"a1": [0, 0, 1], "a2": [2, 3, 3], "b": [4, 5, 6]})
df = df.astype({"a1": "category", "a2": "category"})
gb = df.groupby(by=["a1", "a2"], as_index=False, observed=False)
gb.nunique()

raises ValueError: Length of values (3) does not match length of index (4)
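
The two lengths in the error message appear to line up with the number of observed key combinations (3) and the number of all category combinations (4), since observed=False keeps unobserved combinations. A minimal sketch that checks those counts, assuming the default as_index=True path, which does not raise; the names keys, n_observed, and n_all are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({"a1": [0, 0, 1], "a2": [2, 3, 3], "b": [4, 5, 6]})
df = df.astype({"a1": "category", "a2": "category"})

keys = ["a1", "a2"]

# Only key combinations that actually occur in the data.
n_observed = len(df.groupby(keys, observed=True).nunique())

# Every combination of the categories of a1 and a2, observed or not.
n_all = len(df.groupby(keys, observed=False).nunique())

print(n_observed, n_all)  # 3 4
```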

@rhshadrach rhshadrach added Bug Groupby Categorical Categorical Data Type Reduction Operations sum, mean, min, max, etc. labels Apr 22, 2023
@szczekulskij

This looks a bit hard, but I'll give it a try. If I can't solve it within a week, I'll unassign myself.

@szczekulskij

take

@szczekulskij

szczekulskij commented Apr 24, 2023

Commenting out lines 1911-1913 solves the issue. File: pandas/core/groupby/generic.py

        # if not self.as_index:
        #     res_df.index = default_index(len(res_df))
        #     res_df = self._insert_inaxis_grouper(res_df)

I'm going to work on fixing it now.
My initial idea, temporarily setting as_index, didn't work.

@szczekulskij

szczekulskij commented Apr 24, 2023

Weirdly, it's the same function causing the issue as in #52397, which I'm also working on.

The function is self._insert_inaxis_grouper. In both issues we need to either skip the call to this function or fix the way it works.

@szczekulskij

Is it okay if I force as_index = True using with com.temp_setattr(self, "as_index", True): ?

With as_index=True forced, the output would always look like this:

#        b
# a1 a2   
# 0  2   1
#    3   1
# 1  2   0
#    3   1

I assume we want the behaviour to differ depending on as_index, but wanted to double check

@rhshadrach

@rhshadrach
Member Author

I assume we want the behaviour to differ depending on as_index, but wanted to double check

Correct. Running the OP example without categorical data produces:

   a1  a2  b
0   0   2  1
1   0   3  1
2   1   3  1
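
Until the as_index=False path is fixed, one possible workaround is to run the as_index=True computation and then move the group keys back into columns with reset_index(). This is only a sketch under that assumption, not the change made in pandas itself; with observed=False it keeps the unobserved (1, 2) combination, so the result has four rows rather than the three shown above for non-categorical data:

```python
import pandas as pd

df = pd.DataFrame({"a1": [0, 0, 1], "a2": [2, 3, 3], "b": [4, 5, 6]})
df = df.astype({"a1": "category", "a2": "category"})

# Group with the default as_index=True, which does not raise,
# then turn the (a1, a2) index levels back into ordinary columns.
result = df.groupby(["a1", "a2"], observed=False).nunique().reset_index()

print(result)
```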

@szczekulskij
Copy link

Hey, I'm sorry, but I'll need to drop this issue. Pandas is a bit too much for me right now. I'll come back to this in the future.
