ENH: support pd.NA in "category" dtype #47982

Open
1 of 3 tasks
devmcp opened this issue Aug 5, 2022 · 4 comments
Labels
Categorical, Enhancement, Missing-data

Comments

@devmcp

devmcp commented Aug 5, 2022

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I would like to be able to use pd.NA for missing data in a column of dtype "category".

Currently this:

pd.DataFrame({"A": ["one", "two", pd.NA]}).astype("category")

converts the pd.NA to np.nan.
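To make the report concrete, here is a minimal sketch that inspects the resulting categorical column. It assumes a recent pandas release; the variable names are illustrative only. It shows that the missing entry is not stored as a category at all but as the sentinel code -1, which is why the value read back need not be pd.NA.

```python
import pandas as pd

# The snippet from the report: build the frame, then cast to "category".
df = pd.DataFrame({"A": ["one", "two", pd.NA]}).astype("category")
col = df["A"]

# Categories hold only the observed labels; the missing entry is not one of them.
categories = list(col.cat.categories)

# Each row is stored as an integer code into the categories; -1 marks a missing row.
codes = col.cat.codes.tolist()

# Scalar read back from the missing row. Whether this is np.nan or pd.NA is
# exactly what the issue is about, so it is printed rather than assumed.
missing = col.iloc[2]

print(categories)
print(codes)
print(repr(missing), "is pd.NA:", missing is pd.NA)
```

Since the missing row is only a code, the scalar that comes back is decided by the categorical itself rather than by the value originally passed in.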

Feature Description

I think there should be a "category" dtype that supports pd.NA.

Alternative Solutions

I don't think there is a current workaround.

Additional Context

No response

@devmcp added the Enhancement and Needs Triage labels on Aug 5, 2022
@phofl
Member

phofl commented Aug 5, 2022

Hi, thanks for your report. We already have this, through string-dtype:

pd.DataFrame({"A": ["one", "two", pd.NA]}, dtype="string").astype("category")
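A short sketch of what this two-step route produces, assuming a recent pandas release in which an Index can hold the nullable string dtype. The variable names are illustrative. The point is that the categories keep the dtype of the underlying data, here the string dtype, rather than falling back to a generic one.

```python
import pandas as pd

# Two-step route from the comment above: nullable string dtype first, then category.
df = pd.DataFrame({"A": ["one", "two", pd.NA]}, dtype="string").astype("category")
col = df["A"]

# The categories carry the dtype of the data they were built from.
categories_dtype = col.cat.categories.dtype
is_string_backed = isinstance(categories_dtype, pd.StringDtype)

# The missing row is still reported as missing.
missing_mask = col.isna().tolist()

print(categories_dtype, is_string_backed)
print(missing_mask)
```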

@mzeitlin11
Member

Related to #29962

@mzeitlin11 added the Missing-data, Duplicate Report, Categorical, and Closing Candidate labels and removed the Needs Triage label on Aug 5, 2022
@devmcp
Author

devmcp commented Aug 8, 2022

Thanks for your reply @phofl - I didn't quite appreciate how the category "type" sat on top of the actual type of the data. Thanks for linking @mzeitlin11 - good to see I'm not the only one who didn't find this entirely intuitive at first pass.

@devmcp devmcp closed this as completed Aug 8, 2022
@WillAyd
Member

WillAyd commented Jun 25, 2024

This is also a rather interesting problem for discussion under the scope of #58988

Going to reopen to track this. Having to go the .astype("category") route is pretty inefficient, but I think the alternative brings up some pitfalls:

>>> pd.DataFrame({"A": ["one", "two", pd.NA]}, dtype="category")
     A
0  one
1  two
2  NaN
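For anyone following along, the two routes can be compared side by side with a small sketch like the one below. The `summarize` helper is hypothetical and written only for this comparison. The reported categories dtype and missing scalar depend on the pandas version, so they are printed rather than asserted.

```python
import pandas as pd

data = {"A": ["one", "two", pd.NA]}


def summarize(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: collect the details of column "A" that differ between routes."""
    col = frame["A"]
    return {
        "categories_dtype": str(col.cat.categories.dtype),
        "codes": col.cat.codes.tolist(),
        "missing_scalar": repr(col.iloc[2]),
    }


# One step: ask for "category" directly at construction time.
one_step = summarize(pd.DataFrame(data, dtype="category"))

# Two steps: nullable string dtype first, then cast to "category".
two_step = summarize(pd.DataFrame(data, dtype="string").astype("category"))

# Both routes encode the rows the same way; they can differ in the categories dtype.
same_codes = one_step["codes"] == two_step["codes"]

print(one_step)
print(two_step)
print(same_codes)
```

The one-step spelling has no place to say which dtype the categories should use, which is one way to read the pitfall mentioned above.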

@WillAyd WillAyd reopened this Jun 25, 2024
@mroeschke removed the Duplicate Report and Closing Candidate labels on Aug 25, 2024