Missing Values and Categoricals - inconsistent dtypes #23242

Dr-Irv · 2018-10-19T21:45:58Z

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: from pandas.api.types import CategoricalDtype

In [4]: s1 = pd.Series([np.nan, np.nan]).astype('category')

In [5]: s1
Out[5]:
0   NaN
1   NaN
dtype: category
Categories (0, float64): []

In [6]: s2 = pd.Series([np.nan, np.nan]).astype(CategoricalDtype([]))

In [7]: s2
Out[7]:
0    NaN
1    NaN
dtype: category
Categories (0, object): []

In [8]: pd.api.types.union_categoricals([s1,s2])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-8e364c994bd7> in <module>
----> 1 pd.api.types.union_categoricals([s1,s2])

C:\Anaconda3\lib\site-packages\pandas\core\dtypes\concat.py in union_categoricals(to_union, sort_categories, ignore_order)
    361     if not all(is_dtype_equal(other.categories.dtype, first.categories.dtype)
    362                for other in to_union[1:]):
--> 363         raise TypeError("dtype of categories must be the same")
    364
    365     ordered = False

TypeError: dtype of categories must be the same

Problem description

In the above, if you convert an all-NaN Series using astype('category'), the dtype of the resulting categories is float64, whereas if you pass CategoricalDtype([]), the dtype of the categories is object.

There are a couple of issues that I don't know how to deal with:

- If you have categories of a certain underlying dtype, there is no way to change that dtype (e.g., in this example, I would want to change the underlying dtype of the categories backing s1 to object).
- You can't specify the dtype of the underlying categories in the CategoricalDtype constructor.

Now, you might ask, why does this matter? Suppose I have data that I know to be categorical, with missing values, and I want to use union_categoricals() to merge the categories of two Series that are both category dtype, each constructed using astype('category'). If one Series had all missing values, its categories have dtype float, while the other, which had strings and missing values, ends up with categories of dtype object. In that case I can't call union_categoricals() on them.

I know there are various workarounds for this, but I still think there should be some way to manage the underlying dtype of the categories of a CategoricalDtype.

Alternatively, maybe union_categoricals() should be smart enough that, when you take the union of categoricals and one of them has no categories, it ignores the dtype of that one's categories when doing the union.
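
A minimal sketch of what that alternative could look like from user code, assuming a wrapper around the existing union_categoricals(). The name union_ignoring_empty is hypothetical and not part of pandas; it only recasts inputs that have zero categories and leaves everything else to union_categoricals().

```python
import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals


def union_ignoring_empty(to_union):
    """Hypothetical helper, not part of pandas.

    Categoricals with zero categories are recast to the categories dtype of
    the first input that has categories, then passed to union_categoricals.
    """
    cats = [x.array if isinstance(x, pd.Series) else x for x in to_union]
    # dtype of the first input that actually has categories, if any
    target = next(
        (c.categories.dtype for c in cats if len(c.categories) > 0), None
    )
    if target is not None:
        cats = [
            c.set_categories(pd.Index([], dtype=target))
            if len(c.categories) == 0
            else c
            for c in cats
        ]
    return union_categoricals(cats)


s_empty = pd.Series([np.nan, np.nan]).astype("category")  # no categories
s_str = pd.Series(["a", np.nan, "b"]).astype("category")  # string categories
merged = union_ignoring_empty([s_empty, s_str])
print(merged.categories)
```

If every input has zero categories, the sketch passes them through unchanged, so a dtype mismatch among empty inputs would still raise.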

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.8.1
pip: 10.0.1
setuptools: 40.4.3
Cython: 0.28.5
numpy: 1.15.2
scipy: 1.1.0
pyarrow: None
xarray: 0.10.9
IPython: 7.0.1
sphinx: 1.8.1
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.1
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: 0.9.2
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


TomAugspurger · 2018-10-22T21:04:07Z

> You can't specify the dtype of the underlying categories in the CategoricalDtype constructor

You can, by passing an Index with the dtype you want:

In [24]: pd.api.types.CategoricalDtype(categories=pd.Index([], dtype=int)).categories
Out[24]: Int64Index([], dtype='int64')

CategoricalDtype.categories is just an index. Would you want to accept a dtype parameter in CategoricalDtype that's passed through?

> are both category dtype, and each Series was constructed using astype('category')

I think that's the root issue. .astype('category') is going to use inference, which can fail. If you want full control you'll have to be explicit.
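
A minimal sketch of the explicit route, assuming the categories are known up front and are given as an Index with a stated dtype. The variable names are illustrative only; the calls are the existing CategoricalDtype and union_categoricals().

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype, union_categoricals

# Spell out the categories as an Index with an explicit dtype, so nothing
# is inferred from the (possibly all-NaN) values.
empty_obj = CategoricalDtype(categories=pd.Index([], dtype=object))
strings = CategoricalDtype(categories=pd.Index(["a", "b"], dtype=object))

s1 = pd.Series([np.nan, np.nan]).astype(empty_obj)
s2 = pd.Series(["a", np.nan, "b"], dtype=object).astype(strings)

# Both sets of categories are object dtype, so the dtype check passes.
merged = union_categoricals([s1, s2])
print(merged.categories)
```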

Dr-Irv · 2018-10-22T21:45:09Z

> Would you want to accept a dtype parameter in CategoricalDtype that's passed through?

Yes, I think that would help.

The other thing that would help is if union_categoricals would accept the union of two categories where the dtype was different, and one of the categories was empty. Then the result could have the dtype of the category that had items in it. The reason I need this is that I'm reading a large file in chunks, and I know which columns are category columns, and want to keep doing union_categoricals as new categories are discovered, and if a chunk was all missing values, have the types correctly inferred. (See my comment here: #14177 (comment))
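
A rough sketch of that chunked pattern, assuming the column holds strings and using in-memory lists as stand-ins for file chunks. The helper to_object_categorical is hypothetical; it forces object-dtype categories so that an all-missing chunk does not fall back to float categories.

```python
import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals


def to_object_categorical(values):
    """Hypothetical helper: categories are always an object-dtype Index."""
    values = pd.Series(values, dtype=object)
    cats = pd.Index(values.dropna().unique(), dtype=object)
    return pd.Categorical(values, categories=cats)


# Stand-ins for chunks of one categorical column; the second is all missing.
chunks = [["a", np.nan], [np.nan, np.nan], ["b", "a"]]

running = None
for chunk in chunks:
    cat = to_object_categorical(chunk)
    # Grow the running union as new categories are discovered.
    running = cat if running is None else union_categoricals([running, cat])

print(running.categories)
```

Note that union_categoricals() also concatenates the values, so for a large file one might keep only running.categories between chunks.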

TomAugspurger · 2018-10-23T14:53:39Z

I'd prefer to avoid special casing empty / all-NaN columns. I think adding a `dtype` keyword to the CategoricalDtype constructor would be fine, with a default of float for backwards compatibility.


Dr-Irv · 2018-10-23T15:53:46Z

> I think adding a dtype keyword to the CategoricalDtype constructor would be fine, with a default of float for backwards compatibility.

I think the default would have to be to infer, since if the values you pass are not all NaN, the dtype is inferred from the type of the passed categories. Only if all the values are NaN does it default to float.

sinhrks added the Dtype Conversions and Categorical labels Oct 23, 2018

mroeschke added the Bug label Jun 28, 2020

mroeschke added Enhancement and removed Bug labels Jun 23, 2021
