API: return "correct" missing value scalar from Categorical? #29962

jorisvandenbossche · 2019-12-02T10:46:52Z

From #27929 (comment). In __getitem__ or Categorical.min(..), we always return np.nan as scalar missing value, regardless of the dtype:

In [7]: cat = pd.Categorical([pd.Timestamp("2012"), None], ordered=True)

In [8]: cat  
Out[8]: 
[2012-01-01, NaT]
Categories (1, datetime64[ns]): [2012-01-01]

In [9]: cat[1]   
Out[9]: nan

In [10]: cat.min(skipna=False) 
Out[10]: nan

In the above, this could also be pd.NaT instead?
(similar issue will come up once we can use the EAs that use the new NA scalar in categoricals)

However, CategoricalDtype.na_value now also returns np.nan (which should be consistent with what we return in the cases above):

In [13]: cat.dtype.na_value 
Out[13]: nan

We can of course let the CategoricalDtype.na_value be dependent on the na_value of the dtype of the categories. But I am not fully sure we want such values-dependent behaviour?
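To make the proposal concrete, here is a minimal sketch of what "dependent on the na_value of the dtype of the categories" could look like. The helper name `categories_na_value` is hypothetical and not part of pandas; it only illustrates the idea and is not how pandas implements anything.

```python
import numpy as np
import pandas as pd


def categories_na_value(cat: pd.Categorical):
    """Hypothetical helper: choose the missing-value scalar from the categories' dtype."""
    dtype = cat.categories.dtype
    if isinstance(dtype, np.dtype):
        # NumPy datetime64 ("M") and timedelta64 ("m") categories map to NaT;
        # every other NumPy dtype falls back to NaN.
        return pd.NaT if dtype.kind in "mM" else np.nan
    # Extension dtypes expose their own na_value attribute.
    return dtype.na_value


dt_cat = pd.Categorical([pd.Timestamp("2012"), None], ordered=True)
int_cat = pd.Categorical([1, None])

dt_na = categories_na_value(dt_cat)    # NaT for datetime categories
int_na = categories_na_value(int_cat)  # NaN for integer categories
```

With something like this, `cat[1]` and `cat.min(skipna=False)` could return `dt_na` for the example above instead of `nan`.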


TomAugspurger · 2019-12-02T12:03:08Z

I think returning the NA value for the dtype (NaT in that case) makes the most sense.

> But I am not fully sure we want such values-dependent behaviour?

Maybe a meaningless question, but: does this count as values-dependent, or dtype-dependent?

jorisvandenbossche · 2019-12-02T12:06:22Z

> Maybe a meaningless question, but: does this count as values-dependent, or dtype-dependent?

To be fully correct, I think it is "dtype-instance-dependent" behaviour (so in the sense that it depends on one of the properties of the dtype instance, not on the dtype class)
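A small sketch of the distinction, using only public pandas API: two `CategoricalDtype` instances share the same class but carry categories of different dtypes, so any behaviour keyed on `categories.dtype` varies per instance rather than per class. The variable names are illustrative.

```python
import pandas as pd

# Two instances of the same dtype class with different categories.
dt_dtype = pd.CategoricalDtype(pd.to_datetime(["2012-01-01"]))
int_dtype = pd.CategoricalDtype([1, 2, 3])

# Same class ...
same_class = type(dt_dtype) is type(int_dtype)

# ... but the dtype of the categories differs between the instances.
dt_kind = dt_dtype.categories.dtype.kind    # "M" (datetime64)
int_kind = int_dtype.categories.dtype.kind  # "i" (integer)
```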

jreback · 2019-12-02T12:40:15Z

I agree with @TomAugspurger here; this depends strictly on the dtype of the values. We have an almost identical pattern with Interval, where we have a sub-dtype. Maybe we could formalize this a bit, e.g. by making a more explicit ContainerDtype. Not sure it's worth the effort, but IIRC we discussed this before when talking about working with sub-dtypes.

jbrockmendel · 2021-12-17T20:44:10Z

I think that 1) we should do this and 2) a deprecation cycle for this would be a hassle, so maybe do it directly in 2.0?

randolf-scholz · 2022-06-13T09:57:53Z

@jorisvandenbossche I honestly think this is the wrong thing to do. Having pd.NA as a universal missing-value indicator should be preferred instead. The reason is that type-dependent NaN-like values can carry semantics other than that of a missing-value indicator:

float('nan') can naturally arise from a computation.
NaT can naturally arise from a bad format specification (e.g. pandas.to_datetime(..., errors="coerce"))

Having both pd.NA and NaT allows us to clearly distinguish these semantically different cases.
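The two cases above can be reproduced with a short sketch. Neither value below comes from missing input, yet `pd.isna` reports both as missing, which is the ambiguity being described. The input strings are made up for illustration.

```python
import pandas as pd

# NaN arising from a computation: inf - inf is undefined, not "missing".
computed = float("inf") - float("inf")

# NaT arising from a parse failure: the second string is not a valid date,
# so errors="coerce" turns it into NaT.
parsed = pd.to_datetime(pd.Series(["2012-01-01", "not a date"]), errors="coerce")

# Both are indistinguishable from missing values as far as pd.isna is concerned.
computed_is_na = pd.isna(computed)
parsed_is_na = pd.isna(parsed.iloc[1])
```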

jorisvandenbossche added the Missing-data, API Design, and Categorical labels Dec 2, 2019

jorisvandenbossche mentioned this issue Dec 2, 2019

API/DEPR: Change default skipna behaviour + deprecate numeric_only in Categorical.min and max #27929

Merged

TomAugspurger mentioned this issue Dec 12, 2019

BUG: min/max on empty categorical fails #30227

Merged

4 tasks

mroeschke added the Bug label Jun 28, 2020

mroeschke removed the API Design label Jul 23, 2021

jbrockmendel mentioned this issue Dec 13, 2021

ENH: Should pandas.CategoricalDtype use pandas.NA as the sentinel value instead of float('nan') ? #43836

Closed

This was referenced Dec 17, 2021

API: Breaking Changes in 3.0 (without deprecations) #44823

Open

RLS: 1.4 #41957

Closed

simonjayhawkins mentioned this issue Jun 11, 2022

BUG: drop_duplicates on categorical not preserving NA #44405

Closed

3 tasks

mzeitlin11 mentioned this issue Aug 5, 2022

ENH: support pd.NA in "category" dtype #47982

Open

3 tasks

jbrockmendel mentioned this issue Jan 12, 2023

Proper support of nullable dtypes as the Categorical dtype #50711

Open

jbrockmendel added and removed the PDEP missing values label Oct 26, 2023
