ENH: should pandas.core.arrays.Categorical have a dropna=False option? #35162

Closed
arw2019 opened this issue Jul 7, 2020 · 6 comments · Fixed by #36125
Labels: Categorical, Enhancement

Comments

@arw2019
Member

arw2019 commented Jul 7, 2020

I ran into this while working on #35078. Here's a simple reproducer:

In [1]: from pandas.core.arrays import Categorical
In [2]: Categorical([1, 2, 3, 4, 4, 4, None, None])
Out[2]:
[1, 2, 3, 4, 4, 4, NaN, NaN]
Categories (4, int64): [1, 2, 3, 4]

Do people think it's worth having a dropna argument for Categorical, so that one could do:

In [3]: Categorical([1, 2, 3, 4, 4, 4, None, None], dropna=False)
Out[3]:
[1, 2, 3, 4, 4, 4, NaN, NaN]
Categories (5, int64): [1, 2, 3, 4, NaN]

If we set dropna=True by default, presumably there shouldn't be backward compatibility issues.

I can see arguments either way! In case we do want to add this, I'm happy to work on it (and the solution in #35078 will be a lot cleaner).

arw2019 added the Enhancement and Needs Triage labels on Jul 7, 2020
arw2019 changed the title from "QST: should pandas.core.arrays.Categorical have a dropna=" to "ENH: should pandas.core.arrays.Categorical have a dropna=False option?" on Jul 7, 2020
@TomAugspurger
Contributor

I think having NA / NaN keys in the categories is probably a bad idea since it leads to so many downstream complexities.
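For readers who still want missing values to show up as their own group, one possible alternative to NaN in the categories is an explicit sentinel category. This is only a sketch: the label "missing" and the variable names are made up for illustration and are not something proposed in this thread.

```python
import pandas as pd

# Same data as the reproducer above; None becomes a missing value.
cat = pd.Categorical([1, 2, 3, 4, 4, 4, None, None])

# Hypothetical sentinel label: register it as a category first,
# then fill the missing entries with it.
labelled = cat.add_categories(["missing"]).fillna("missing")

print(labelled.categories.tolist())
print(labelled.codes.tolist())
```

The categories stay free of NaN, and the former missing entries get an ordinary, non-negative code.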

@jreback
Contributor

jreback commented Jul 7, 2020

agreed, this is a bad idea
iirc we actually started warning on this?

@jorisvandenbossche
Member

Indeed, if I remember correctly, the categories initially could contain a missing value as well. But that way you can have missing data in two ways: missing indicated by the codes (-1) or missing indicated by the category value. And we couldn't think of a good reason to allow both, so we opted for only missing data in the codes, and not in the categories.
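A minimal sketch of the current behaviour described here, using the public pd.Categorical constructor on the data from the original post: missing values are encoded as code -1 and never appear among the categories.

```python
import pandas as pd

cat = pd.Categorical([1, 2, 3, 4, 4, 4, None, None])

# Each value is stored as an integer code pointing into `categories`;
# missing values get the reserved code -1.
print(cat.codes.tolist())

# The categories themselves contain no NaN entry.
print(cat.categories.tolist())

# Missingness is still recoverable from the codes.
print(cat.isna().sum())
```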

I didn't look at the linked issue yet, but is there a use case in groupby where having missing categories would be useful?

@arw2019
Member Author

arw2019 commented Jul 9, 2020

Ok! Sounds like it's a bad idea.

@jorisvandenbossche In the linked PR I'm fixing a problem with SeriesGroupBy transform where I end up needing the NaN category:

In [1]: import pandas as pd
In [2]: df = pd.DataFrame({"A": [0, 0, 1, None], "B": [1, 2, 3, None]})
In [3]: gb = df.groupby("A", dropna=False)
In [6]: gb['B'].transform(len)

and in the source code we get the categories by calling Categorical on the grouper. It's a small addition in that PR - but I thought I'd check in in case it made sense to work up something more generic for Categorical.
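To make the mismatch concrete, here is a small sketch (assuming pandas 1.1 or later, where groupby accepts dropna) that compares the group keys kept by groupby with the categories produced by calling Categorical on the same column. It uses size() rather than transform, and the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 0, 1, None], "B": [1, 2, 3, None]})

# With dropna=False the NaN key forms its own group ...
kept = df.groupby("A", dropna=False).size()
# ... while the default (dropna=True) discards it.
dropped = df.groupby("A").size()

# Categorical on the same column has no NaN category;
# the missing row is only visible as code -1.
keys = pd.Categorical(df["A"])

print(len(kept), len(dropped))
print(keys.categories.tolist(), keys.codes.tolist())
```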

As an aside, #28927 seems related, although it looks like things basically work there.

@jreback I don't think we have warnings when someone calls Categorical that we're dropping NaNs (though I might have missed them). Happy to work on that if we'd like to add it.

@TomAugspurger
Contributor

@arw2019 what needs to be done here? Is the original post accurate for what we want?

TomAugspurger added the Categorical label and removed the Needs Triage label on Sep 4, 2020
@arw2019
Member Author

arw2019 commented Sep 4, 2020

> @arw2019 what needs to be done here? Is the original post accurate for what we want?

I think if we add a note in the docs that's good enough. I'll submit a PR
