API: return "correct" missing value scalar from Categorical? #29962

Open
Tracked by #44823
jorisvandenbossche opened this issue Dec 2, 2019 · 5 comments
Labels
Bug · Categorical (Categorical Data Type) · Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)

Comments

@jorisvandenbossche
Member

From #27929 (comment). In __getitem__ or Categorical.min(..), we always return np.nan as the scalar missing value, regardless of the dtype of the categories:

In [7]: cat = pd.Categorical([pd.Timestamp("2012"), None], ordered=True)

In [8]: cat  
Out[8]: 
[2012-01-01, NaT]
Categories (1, datetime64[ns]): [2012-01-01]

In [9]: cat[1]   
Out[9]: nan

In [10]: cat.min(skipna=False) 
Out[10]: nan

In the above, could this return pd.NaT instead?
(A similar issue will come up once categoricals can hold EAs that use the new NA scalar.)

However, CategoricalDtype.na_value currently also returns np.nan (and it should stay consistent with what we return in the cases above):

In [13]: cat.dtype.na_value 
Out[13]: nan

We can of course let the CategoricalDtype.na_value be dependent on the na_value of the dtype of the categories. But I am not fully sure we want such values-dependent behaviour?
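
As a minimal sketch of that last option: the helper `categories_na_value` below is hypothetical (it is not a pandas API) and only illustrates how the missing-value scalar could be derived from the dtype of the categories. It assumes NumPy datetime-like categories should map to pd.NaT, extension dtypes to their own na_value, and everything else to np.nan.

```python
import numpy as np
import pandas as pd


def categories_na_value(cat: pd.Categorical):
    """Hypothetical helper: pick the missing scalar from the categories' dtype."""
    cat_dtype = cat.categories.dtype
    # Extension dtypes (e.g. tz-aware datetimes, "Int64") expose their own na_value.
    if isinstance(cat_dtype, pd.api.extensions.ExtensionDtype):
        return cat_dtype.na_value
    # NumPy datetime64 ("M") and timedelta64 ("m") categories use NaT.
    if cat_dtype.kind in "mM":
        return pd.NaT
    # Fallback for other NumPy dtypes (assumption: keep today's np.nan).
    return np.nan


cat = pd.Categorical([pd.Timestamp("2012"), None], ordered=True)
print(categories_na_value(cat))
```

Whether CategoricalDtype.na_value itself should behave this way is exactly the open question here; the sketch only shows that the information is available on the dtype instance.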

@TomAugspurger
Contributor

I think returning the NA value for the dtype (NaT in that case) makes the most sense.

> But I am not fully sure we want such values-dependent behaviour?

Maybe a meaningless question, but: does this count as values-dependent, or dtype-dependent?

@jorisvandenbossche
Member Author

> Maybe a meaningless question, but: does this count as values-dependent, or dtype-dependent?

To be fully correct, I think it is "dtype-instance-dependent" behaviour (so in the sense that it depends on one of the properties of the dtype instance, not on the dtype class)
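
A small illustration of the distinction, as a sketch only (the variable names are made up for this example): two CategoricalDtype instances share one dtype class but carry different categories, so anything derived from the categories differs per instance.

```python
import pandas as pd

# Two instances of the same dtype class with different categories.
dt_dtype = pd.CategoricalDtype(pd.to_datetime(["2012-01-01"]))
int_dtype = pd.CategoricalDtype([1, 2, 3])

# The class is identical ...
same_class = type(dt_dtype) is type(int_dtype)
# ... but the dtype of the categories is a property of each instance.
kinds = (dt_dtype.categories.dtype.kind, int_dtype.categories.dtype.kind)

print(same_class, kinds)
```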

@jreback
Contributor

jreback commented Dec 2, 2019

I agree with @TomAugspurger here, this depends strictly on the dtype of the values. We have an almost identical pattern with Interval, where we have a sub-dtype. Maybe we could formalize this a bit, e.g. by making a more explicit ContainerDtype. Not sure it's worth the effort, but IIRC we discussed this before when talking about working with sub-dtypes.
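
To show the parallel, here is a rough sketch of a shared accessor. `inner_dtype` is a hypothetical function, not part of pandas, and no ContainerDtype exists; the sketch only relies on the existing `IntervalDtype.subtype` and `CategoricalDtype.categories` attributes.

```python
import pandas as pd


def inner_dtype(dtype):
    """Hypothetical accessor for the 'sub-dtype' of a container-like dtype."""
    if isinstance(dtype, pd.IntervalDtype):
        return dtype.subtype
    if isinstance(dtype, pd.CategoricalDtype):
        # categories is None when the dtype was created without categories.
        return None if dtype.categories is None else dtype.categories.dtype
    # Non-container dtypes are returned unchanged.
    return dtype


print(inner_dtype(pd.IntervalDtype("int64")))
print(inner_dtype(pd.CategoricalDtype(pd.to_datetime(["2012-01-01"]))))
```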

@jbrockmendel
Member

I think that 1) we should do this and 2) a deprecation cycle for this would be a hassle, so maybe do it directly in 2.0?

@randolf-scholz
Contributor

randolf-scholz commented Jun 13, 2022

@jorisvandenbossche I honestly think this is the wrong thing to do. Having pd.NA as a universal missing indicator should be preferred instead. The reason is that type-dependent NaN values can carry semantics other than that of a missing-value indicator:

  • float('nan') can naturally arise from a computation.
  • NaT can naturally arise from a bad format specification (e.g. pandas.to_datetime(..., errors="coerce"))

Having both pd.NA and NaT allows us to clearly distinguish these semantically different cases.
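
Both cases are easy to reproduce. The snippet below is only an illustration with made-up inputs: neither value started out as missing data, yet both end up as the type-specific sentinel.

```python
import math

import pandas as pd

# NaN produced by a computation rather than by missing input.
computed = float("inf") - float("inf")

# NaT produced by a value that fails to parse under errors="coerce".
parsed = pd.to_datetime(pd.Series(["2012-01-01", "not a date"]), errors="coerce")

print(math.isnan(computed))
print(parsed.isna().tolist())
```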

@jbrockmendel added and removed the "PDEP missing values" label (Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint) on Oct 26, 2023