
Merge error on Categorical Interval columns #28668


Closed
harisbal opened this issue Sep 28, 2019 · 2 comments · Fixed by #28762
Labels
Bug Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode

@harisbal
Contributor

Merging fails when the key is a Categorical column whose categories are intervals.
For instance, the following raises TypeError: data type not understood

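import numpy as np
import pandas as pd

# Bin edges 0, 30, 60, 90
bins = np.arange(0, 91, 30)
df1 = pd.DataFrame(np.array([[1, 22], [2, 35], [3, 82]]),
                   columns=['Id', 'Dist']).set_index('Id')

# pd.cut returns a Categorical column whose categories are intervals
df1['DistGroup'] = pd.cut(df1['Dist'], bins)

# Lookup table indexed by the same intervals (plain IntervalIndex, not Categorical)
idx = pd.IntervalIndex.from_breaks(bins)
df2 = pd.DataFrame(np.array(['g1', 'g2', 'g3']), columns=['GroupId'], index=idx)
df2.index.name = 'DistGroup'

# Categorical[Interval] column on the left, IntervalIndex on the right
res = pd.merge(df1, df2, left_on='DistGroup', right_index=True).reset_index()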

Expected Output

   Dist DistGroup GroupId
0    22   (0, 30]      g1
1    35  (30, 60]      g2
2    82  (60, 90]      g3

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.1
numpy : 1.16.5
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.2
setuptools : 41.0.1
Cython : 0.29.13
pytest : 5.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.3.3
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.7.6.1 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.8.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.3.3
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.7
tables : 3.5.2
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

@dsaxton
Member

dsaxton commented Sep 29, 2019

Here's another example that shows it's actually the order of the merge that triggers the error (apparently because you can cast from interval to categorical but not vice versa at the moment):

import numpy as np
import pandas as pd

intvl_arr = pd.Series([pd.Interval(0, 1), pd.Interval(1, 2)], dtype="interval")
cat_arr = intvl_arr.astype("category")

df1 = pd.DataFrame({"a": intvl_arr})
df2 = pd.DataFrame({"a": cat_arr})

pd.merge(df1, df2, how="inner", on="a")
# No error
pd.merge(df2, df1, how="inner", on="a")
# TypeError: data type not understood
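To check which merge order is affected on a given pandas installation, the example above can be wrapped in a small helper. This is only a sketch: `try_merge` and the frame names are hypothetical, and the outcome for the categorical-left order depends on whether the installed version includes the fix.

```python
import pandas as pd

intervals = pd.Series([pd.Interval(0, 1), pd.Interval(1, 2)], dtype="interval")
interval_df = pd.DataFrame({"a": intervals})
categorical_df = pd.DataFrame({"a": intervals.astype("category")})


def try_merge(left, right):
    """Hypothetical helper: return (result, error); exactly one is None."""
    try:
        return pd.merge(left, right, how="inner", on="a"), None
    except TypeError as exc:
        return None, exc


# Try both orders and report which one, if any, raises.
outcomes = {
    "interval_left": try_merge(interval_df, categorical_df),
    "categorical_left": try_merge(categorical_df, interval_df),
}
for order, (result, error) in outcomes.items():
    status = "ok" if error is None else f"TypeError: {error}"
    print(f"{order}: {status}")
```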

I think a code path would need to be added for casting from categorical to an interval dtype in categorical.py around here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/arrays/categorical.py#L502. @TomAugspurger Do you have any thoughts as to the best way to do this?

@jschendel jschendel added Bug Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Sep 30, 2019
@jschendel
Member

jschendel commented Sep 30, 2019

This issue is a bit larger than Categorical[Interval] and actually applies to all Categorical[ExtensionDtype], e.g. the same error will occur for periods.

The offending line is the return statement of Categorical.astype where we use the numpy constructor, which doesn't understand pandas extension dtypes:

return np.array(self, dtype=dtype, copy=copy)
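A quick way to see why that line fails, as a sketch: numpy has to resolve the `dtype` argument to a numpy dtype, and it cannot do that for pandas extension dtypes. The two dtypes below are just illustrative picks (interval and period); the exact error message differs between numpy versions, so only the exception type is relied on here.

```python
import numpy as np
import pandas as pd

# Illustrative extension dtypes; any pandas-only dtype should behave the same.
ext_dtypes = {
    "interval": pd.IntervalDtype("int64"),
    "period": pd.PeriodDtype("D"),
}

errors = {}
for name, ext_dtype in ext_dtypes.items():
    try:
        # np.array(..., dtype=ext_dtype) needs this same conversion to succeed.
        np.dtype(ext_dtype)
    except TypeError as exc:
        errors[name] = str(exc)

for name, message in errors.items():
    print(f"{name}: TypeError: {message}")
```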

The workaround would be to add a check for extension dtypes and return a pd.array instead, which should be able to generically handle any pandas extension dtype. Something like:

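from pandas.api.types import is_extension_array_dtype
from pandas.core.construction import array


# At the end of Categorical.astype: send extension dtypes to pd.array,
# because np.array cannot interpret them.
if is_extension_array_dtype(dtype):
    return array(self, dtype=dtype, copy=copy)
return np.array(self, dtype=dtype, copy=copy)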
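The same branch can be tried outside of pandas internals with a standalone function. This is a minimal sketch, not the actual patch: `cast_categorical` is a hypothetical name, and it uses the public `pd.array` and `pandas.api.types.is_extension_array_dtype` in place of the internal imports.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_extension_array_dtype


def cast_categorical(values, dtype, copy=True):
    """Hypothetical standalone version of the proposed astype branch."""
    if is_extension_array_dtype(dtype):
        # pandas extension dtypes (interval, period, ...) go through pd.array
        return pd.array(values, dtype=dtype, copy=copy)
    # everything else keeps using the numpy constructor
    return np.array(values, dtype=dtype, copy=copy)


cat = pd.Categorical([pd.Interval(0, 1), pd.Interval(1, 2)])

# Extension dtype target: cast back to the dtype of the categories.
as_interval = cast_categorical(cat, cat.categories.dtype)
# Plain numpy target: the original code path is unchanged.
as_object = cast_categorical(cat, object)

print(type(as_interval).__name__, as_object.dtype)
```

Casting to `cat.categories.dtype` mirrors what a merge needs when it has to align a categorical key with a non-categorical one.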
