Merge error on Categorical Interval columns #28668

harisbal · 2019-09-28T20:28:30Z

Merging on Categorical columns whose categories are intervals fails.
For instance, the following raises `TypeError: data type not understood`:

import numpy as np
import pandas as pd

bins = np.arange(0, 91, 30)
df1 = pd.DataFrame(np.array([[1, 22], [2, 35], [3, 82]]),
                   columns=['Id', 'Dist']).set_index('Id')

df1['DistGroup'] = pd.cut(df1['Dist'], bins)

idx = pd.IntervalIndex.from_breaks(bins)
df2 = pd.DataFrame(np.array(['g1', 'g2', 'g3']), columns=['GroupId'], index=idx)
df2.index.name = 'DistGroup'

res = pd.merge(df1, df2, left_on='DistGroup', right_index=True).reset_index()

Expected Output

	Dist	DistGroup	GroupId
0	22	(0, 30]	g1
1	35	(30, 60]	g2
2	82	(60, 90]	g3


Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.1
numpy : 1.16.5
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.2
setuptools : 41.0.1
Cython : 0.29.13
pytest : 5.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.3.3
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.7.6.1 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.8.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.3.3
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.7
tables : 3.5.2
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None


dsaxton · 2019-09-29T17:07:54Z

Here's another example that shows it's actually the order of the merge that triggers the error (apparently because you can cast from interval to categorical but not vice versa at the moment):

import numpy as np
import pandas as pd

intvl_arr = pd.Series([pd.Interval(0, 1), pd.Interval(1, 2)], dtype="interval")
cat_arr = intvl_arr.astype("category")

df1 = pd.DataFrame({"a": intvl_arr})
df2 = pd.DataFrame({"a": cat_arr})

pd.merge(df1, df2, how="inner", on="a")
# No error
pd.merge(df2, df1, how="inner", on="a")
# TypeError: data type not understood

I think a code path would need to be added for casting from categorical to an interval dtype in categorical.py around here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/categorical.py#L502. @TomAugspurger Do you have any thoughts as to the best way to do this?
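The order dependence described in this comment can be checked with a small script. This is a minimal sketch, assuming a pandas release that already includes the fix from #28762 (1.0 or later); on 0.25.1 the second merge is expected to raise the `TypeError` instead. The frame names `intvl_df` and `cat_df` are illustrative only.

```python
import pandas as pd

# Same data twice: once with interval dtype, once cast to category.
intvl = pd.Series([pd.Interval(0, 1), pd.Interval(1, 2)], dtype="interval")
intvl_df = pd.DataFrame({"a": intvl})
cat_df = pd.DataFrame({"a": intvl.astype("category")})

# Interval on the left: worked even before the fix.
intvl_first = pd.merge(intvl_df, cat_df, how="inner", on="a")

# Categorical on the left: this is the order that failed on 0.25.1.
cat_first = pd.merge(cat_df, intvl_df, how="inner", on="a")

print(len(intvl_first), len(cat_first))
```

If both merges return the same keys, the merge no longer depends on which side holds the categorical column.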

jschendel · 2019-09-30T02:17:14Z

This issue is a bit larger than Categorical[Interval] and actually applies to all Categorical[ExtensionDtype], e.g. the same error will occur for periods.
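To illustrate the period case, here is a minimal sketch of the round trip from a period column to category and back. It assumes a pandas release that includes the fix from #28762; on 0.25.1 the final `astype` is expected to raise the same `TypeError`. The variable names are illustrative only.

```python
import pandas as pd

# A period-dtype Series, cast to category and then back to its original dtype.
ser = pd.Series(pd.period_range("2019-01", periods=3, freq="M"))
as_cat = ser.astype("category")

# Categorical -> extension dtype: the cast that failed before the fix.
restored = as_cat.astype(ser.dtype)

print(restored.dtype)
```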

The offending line is the return statement of Categorical.astype where we use the numpy constructor, which doesn't understand pandas extension dtypes:

pandas/pandas/core/arrays/categorical.py

Line 526 in c4489cb

return np.array(self, dtype=dtype, copy=copy)

The workaround would be to add a check for extension dtypes and return a pd.array instead, which should be able to generically handle any pandas extension dtype. Something like:

from pandas.core.construction import array
from pandas.core.dtypes.common import is_extension_array_dtype

# At the end of Categorical.astype:
if is_extension_array_dtype(dtype):
    return array(self, dtype=dtype, copy=copy)
return np.array(self, dtype=dtype, copy=copy)
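The same branching can be tried outside pandas internals. The following is a rough sketch using only public API; `cast_categorical` is a hypothetical helper written for illustration, not the code that was merged, and it leaves out the `copy` argument for brevity.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_extension_array_dtype


def cast_categorical(cat, dtype):
    """Hypothetical helper mirroring the branch proposed for Categorical.astype."""
    if is_extension_array_dtype(dtype):
        # pd.array dispatches to the extension array's own constructor.
        return pd.array(cat, dtype=dtype)
    # Plain NumPy dtypes can go through the NumPy constructor as before.
    return np.array(cat, dtype=dtype)


intervals = pd.arrays.IntervalArray.from_breaks([0, 30, 60, 90])
cat = pd.Categorical(intervals)

# Extension dtype target: takes the pd.array branch.
as_interval = cast_categorical(cat, intervals.dtype)

# NumPy dtype target: takes the np.array branch.
as_object = cast_categorical(pd.Categorical(["a", "b"]), object)
```

Checking `is_extension_array_dtype` first keeps the existing NumPy path untouched for ordinary dtypes, so only the extension dtype targets change behavior.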

jschendel added Bug Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Sep 30, 2019

dsaxton mentioned this issue Oct 2, 2019

BUG: Allow cast from cat to extension dtype #28762

Merged

4 tasks

jreback added this to the 1.0 milestone Oct 5, 2019

jreback closed this as completed in #28762 Oct 8, 2019

MarcoGorelli mentioned this issue Oct 8, 2019

Can't convert Categorical to 'datetime64[ns, MET]', but can do it with Series #28448

Closed
