Missing Values and Categoricals - inconsistent dtypes #23242

Open
Dr-Irv opened this issue Oct 19, 2018 · 4 comments
Labels
Categorical (Categorical Data Type), Dtype Conversions (Unexpected or buggy dtype conversions), Enhancement

Comments

@Dr-Irv
Contributor

Dr-Irv commented Oct 19, 2018

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: from pandas.api.types import CategoricalDtype

In [4]: s1 = pd.Series([np.nan, np.nan]).astype('category')

In [5]: s1
Out[5]:
0   NaN
1   NaN
dtype: category
Categories (0, float64): []

In [6]: s2 = pd.Series([np.nan, np.nan]).astype(CategoricalDtype([]))

In [7]: s2
Out[7]:
0    NaN
1    NaN
dtype: category
Categories (0, object): []

In [8]: pd.api.types.union_categoricals([s1,s2])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-8e364c994bd7> in <module>
----> 1 pd.api.types.union_categoricals([s1,s2])

C:\Anaconda3\lib\site-packages\pandas\core\dtypes\concat.py in union_categoricals(to_union, sort_categories, ignore_order)
    361     if not all(is_dtype_equal(other.categories.dtype, first.categories.dtype)
    362                for other in to_union[1:]):
--> 363         raise TypeError("dtype of categories must be the same")
    364
    365     ordered = False

TypeError: dtype of categories must be the same

Problem description

In the above, if you convert a Series using astype('category'), and the Series has all NaN values, the underlying dtype is float, while if you pass CategoricalDtype([]), the underlying dtype is object.

There are a couple of issues that I don't know how to deal with:

  1. If you have categories of a certain underlying dtype, there is no way to change that dtype (e.g., in this example, I would want to change the underlying dtype of the categories backing s1 to be object)
  2. You can't specify the dtype of the underlying categories in the CategoricalDtype constructor

Now, you might ask, why does this matter? Suppose I have data that I know to be categorical, with missing values, and I want to use union_categoricals() to merge the categories of two different Series that both have category dtype and were each constructed using astype('category'). If one Series had all missing values, its categories have underlying dtype float. If the second one had strings and missing values, its categories end up with dtype O. In that case I can't do union_categoricals() on them.

I know there are various workarounds for this, but I still think there should be some way to manage the underlying dtype of the categories of a CategoricalDtype.

Alternatively, maybe union_categoricals() could be smarter: when one of the categoricals in the union has no categories, it could ignore that one's categories dtype when doing the union.
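A minimal sketch of that alternative, written as a user-side workaround rather than a change to pandas. The helper name `union_ignoring_empty` is hypothetical and not part of the pandas API. It assumes every input is a Series with category dtype, and it rebuilds any input that has zero categories so that its empty categories take the dtype of the first input that does have categories.

```python
import numpy as np
import pandas as pd
from pandas.api.types import union_categoricals


def union_ignoring_empty(series_list):
    """Union categorical Series, ignoring the categories dtype of any
    input that has no categories. Hypothetical helper, not pandas API."""
    non_empty = [s for s in series_list if len(s.cat.categories) > 0]
    # Fall back to object when every input is empty (an arbitrary choice).
    target = non_empty[0].cat.categories.dtype if non_empty else np.dtype(object)

    aligned = []
    for s in series_list:
        if len(s.cat.categories) == 0:
            # All codes are -1 (missing), so they stay valid for an
            # empty categories index of any dtype.
            empty_cats = pd.Index([], dtype=target)
            aligned.append(
                pd.Categorical.from_codes(s.cat.codes.to_numpy(), categories=empty_cats)
            )
        else:
            aligned.append(s)
    return union_categoricals(aligned)


s_empty = pd.Series([np.nan, np.nan]).astype("category")
s_str = pd.Series(["a", np.nan, "b"]).astype("category")

try:
    union_categoricals([s_empty, s_str])
except TypeError as exc:
    print("plain union failed:", exc)

result = union_ignoring_empty([s_empty, s_str])
print(result.categories)
```

If several non-empty inputs have categories of different dtypes, the final `union_categoricals` call still raises, which seems like the right behavior for a helper that only special-cases empty inputs.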

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 3.8.1
pip: 10.0.1
setuptools: 40.4.3
Cython: 0.28.5
numpy: 1.15.2
scipy: 1.1.0
pyarrow: None
xarray: 0.10.9
IPython: 7.0.1
sphinx: 1.8.1
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.1
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.11
pymysql: 0.9.2
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

> You can't specify the dtype of the underlying categories in the CategoricalDtype constructor

You can with

In [24]: pd.api.types.CategoricalDtype(categories=pd.Index([], dtype=int)).categories
Out[24]: Int64Index([], dtype='int64')

CategoricalDtype.categories is just an index. Would you want to accept a dtype parameter in CategoricalDtype that's passed through?

> are both category dtype, and each Series was constructed using astype('category')

I think that's the root issue. .astype('category') is going to use inference, which can fail. If you want full control you'll have to be explicit.
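As a rough illustration of being explicit, the sketch below gives both Series a CategoricalDtype whose categories are an object-dtype Index, so nothing is left to inference. The variable names and sample values are made up for this example.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype, union_categoricals

# Explicit categories: an Index carries its own dtype, so the empty
# case is object rather than the inferred float64.
empty_obj = CategoricalDtype(categories=pd.Index([], dtype=object))
ab_obj = CategoricalDtype(categories=pd.Index(["a", "b"], dtype=object))

all_missing = pd.Series([np.nan, np.nan]).astype(empty_obj)
with_strings = pd.Series(["a", np.nan, "b"], dtype=object).astype(ab_obj)

# Both categories indexes are object dtype, so the union is accepted.
combined = union_categoricals([all_missing, with_strings])
print(combined.categories)
```

The cost is that the caller has to know the intended categories dtype up front, which is the trade-off for avoiding inference.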

@Dr-Irv
Contributor Author

Dr-Irv commented Oct 22, 2018

> Would you want to accept a dtype parameter in CategoricalDtype that's passed through?

Yes, I think that would help.

The other thing that would help is if union_categoricals would accept the union of two categoricals whose categories have different dtypes when one of them has no categories. Then the result could have the dtype of the one that has categories in it. The reason I need this is that I'm reading a large file in chunks. I know which columns are category columns, and I want to keep doing union_categoricals as new categories are discovered and still have the types correctly inferred if a chunk was all missing values. (See my comment here: #14177 (comment))
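One possible way to handle the chunked case with pandas as it stands is sketched below. It assumes the category column holds strings, reads it as object, and builds each chunk's categorical with explicit object-dtype categories so that an all-missing chunk never gets float categories. The CSV content, the column name `color`, and the helper `to_object_categorical` are made up for illustration.

```python
import io

import pandas as pd
from pandas.api.types import union_categoricals

# Made-up data: the second chunk (rows 3 and 4) is entirely missing.
csv_text = "id,color\n1,red\n2,\n3,\n4,\n5,blue\n6,red\n"


def to_object_categorical(values):
    """Build a Categorical whose categories are always object dtype,
    even when every value is missing. Hypothetical helper."""
    cats = pd.Index(values.dropna().unique(), dtype=object)
    return pd.Categorical(values, categories=cats)


pieces = []
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2, dtype={"color": object})
for chunk in reader:
    pieces.append(to_object_categorical(chunk["color"]))

# Every piece has object categories, so the dtype check passes.
combined = union_categoricals(pieces)
print(combined.categories)
```

Collecting the pieces and calling `union_categoricals` once keeps the sketch short. Calling it after every chunk, as described above, would work the same way because each piece already has object-dtype categories.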

@sinhrks added the Dtype Conversions (Unexpected or buggy dtype conversions) and Categorical (Categorical Data Type) labels Oct 23, 2018
@TomAugspurger
Contributor

TomAugspurger commented Oct 23, 2018 via email

@Dr-Irv
Contributor Author

Dr-Irv commented Oct 23, 2018

> I think adding a dtype keyword to the CategoricalDtype constructor would be fine, with a default of float for backwards compatibility.

I think the default would have to be 'infer': if you pass no NaNs, the dtype is inferred from the type of the passed categories, and if all the values are NaN, it defaults to float.

4 participants