
BUG: groupby(..., dropna=False).indices with single group key does not include nan group #35646

Closed
3 tasks done
mroeschke opened this issue Aug 10, 2020 · 2 comments · Fixed by #36842

@mroeschke
Member

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

In [9]: data = {'group':['g1', 'g1', 'g1', np.nan, 'g1', 'g1', 'g2', 'g2', 'g2', 'g2', np.nan],
   ...:                     'A':[3, 1, 8, 2, 6, -1, 0, 13, -4, 0, 1],
   ...:                     'B':[5, 2, 3, 7, 11, -1, 4,-1, 1, 0, 2]}
   ...: df = pd.DataFrame(data)
   ...: df.groupby('group',dropna=True).indices
Out[9]: {'g1': array([0, 1, 2, 4, 5]), 'g2': array([6, 7, 8, 9])}

In [10]: df.groupby('group',dropna=False).indices
Out[10]: {'g1': array([0, 1, 2, 4, 5]), 'g2': array([6, 7, 8, 9])}

In [11]: pd.__version__
Out[11]: '1.2.0.dev0+67.gaefae55e1'

Problem description

The grouping codes and indices for a single group-by key are determined here:

values = Categorical(self.grouper)

Categorical does not support nan as a category label; a missing value is represented only by the sentinel code -1.
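To illustrate the point, here is a minimal sketch of how `pd.Categorical` treats a missing value. The sample values are made up for the illustration and are not taken from the pandas source:

```python
import numpy as np
import pandas as pd

# Illustrative values only: one missing entry among two labels.
values = pd.Categorical(["g1", np.nan, "g2", "g1"])

# nan never becomes a category; it is only marked by the code -1.
categories = list(values.categories)
codes = values.codes.tolist()

print(categories)  # ['g1', 'g2']
print(codes)       # [0, -1, 1, 0]
```

Because the nan rows have no category of their own, anything built from the categories alone has no key to attach those rows to.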

This works correctly if multiple group keys are passed
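As a rough way to check the multi-key behaviour, the sketch below groups by the original key plus a constant second column. The `const` column is hypothetical, added only to force the multi-key code path, and the exact key representation may vary between pandas versions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": ["g1", "g1", "g1", np.nan, "g1", "g1", "g2", "g2", "g2", "g2", np.nan],
    "A": [3, 1, 8, 2, 6, -1, 0, 13, -4, 0, 1],
})
# Hypothetical constant second key; it does not change the grouping itself.
df["const"] = 0

indices = df.groupby(["group", "const"], dropna=False).indices

# Collect the row positions of every group whose first key component is nan.
nan_rows = [
    sorted(rows.tolist())
    for key, rows in indices.items()
    if pd.isna(key[0])
]
print(nan_rows)
```

With two keys, each dictionary key is a tuple, so the nan group is identified by checking the first tuple element with `pd.isna`.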

Once this issue is addressed, #35542 will be fixed

Expected Output

In [10]: df.groupby('group',dropna=False).indices
Out[10]: {'g1': array([0, 1, 2, 4, 5]), 'g2': array([6, 7, 8, 9]), np.nan: array([3, 10])}

mroeschke added the Bug, Needs Triage, and Groupby labels and removed the Needs Triage label on Aug 10, 2020

simonjayhawkins added this to the Contributions Welcome milestone on Aug 15, 2020

@phofl
Member

phofl commented Sep 6, 2020

@mroeschke

I looked into this a bit and have a question about the preferred solution: is it OK to replace the nan with a unique string/integer before calling values = Categorical(self.grouper) and change it back afterwards? Only in the case of `dropna=False`, of course.

@mroeschke
Member Author

I don't think that would be an ideal solution.

I think a better solution would be to refactor this code path to use the logic already used for multiple group keys, since I don't think there are plans to support Categorical(..., dropna=False).
