Cannot merge on equal categorical dtypes #22501
Comments
@kmax12: That indeed looks buggy to me! Investigation and patch are welcome! One quick thing: mind posting what you get as output on the last merge?
@gfyoung: Updated with the stack trace of the error on the last merge. I will investigate and see if I can figure it out.
After some investigation, I noticed that pandas/pandas/core/indexes/base.py Line 3884 in bf67634
Based on this observation, I was able to create a more minimal example to reproduce the same error as above that uses just one set of categories.
and the error is
The last line in python is pandas/pandas/core/indexes/base.py Line 4131 in bf67634
sv is
At this point, I'm not exactly sure what to fix, or even how to work around it. Any suggestions on how to proceed, @gfyoung?
This looks to have been fixed in the 0.24.0 release. Using the second more minimal example:

In [1]: import pandas as pd
   ...: from pandas.api.types import CategoricalDtype
   ...: pd.__version__
Out[1]: '0.24.0'

In [2]: cat_type1 = CategoricalDtype(categories=["a", "b", "c"], ordered=False)

In [3]: df1 = pd.DataFrame({
   ...:     'foo': pd.Series(["a", "b"]).astype(cat_type1),
   ...:     'left': [1, 2],
   ...: }).set_index("foo")

In [4]: df2 = pd.DataFrame({
   ...:     'foo': pd.Series(["a", "b", "c"]).astype(cat_type1),
   ...:     'right': [3, 2, 1],
   ...: }).set_index("foo")

In [5]: df1.merge(df2, left_index=True, right_index=True)
Out[5]:
     left  right
foo
a       1      3
b       2      2

I've also confirmed that the original example produces the expected output on 0.24.0, and that this is all still working on master. I'm not sure which commit was responsible for the fix, but I don't immediately see any tests that recreate the issue. At this point I think we just need a test for this issue to ensure that there's not a regression.
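A regression test along the lines suggested in this comment might look roughly like the sketch below. It reuses the minimal example from this thread. The test name is hypothetical and this is not necessarily the test that ended up in the pandas test suite; it only checks the values shown in the 0.24.0 output.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def test_merge_on_equal_categorical_index():
    # Hypothetical test name; data taken from the minimal example in this thread.
    cat_type = CategoricalDtype(categories=["a", "b", "c"], ordered=False)

    df1 = pd.DataFrame({
        "foo": pd.Series(["a", "b"]).astype(cat_type),
        "left": [1, 2],
    }).set_index("foo")

    df2 = pd.DataFrame({
        "foo": pd.Series(["a", "b", "c"]).astype(cat_type),
        "right": [3, 2, 1],
    }).set_index("foo")

    # On 0.23.4 this merge raised; on 0.24.0 and later it should succeed.
    result = df1.merge(df2, left_index=True, right_index=True)

    # Only the two shared keys survive the default inner join.
    assert list(result.index) == ["a", "b"]
    assert result["left"].tolist() == [1, 2]
    assert result["right"].tolist() == [3, 2]
```

Run with pytest, this should fail on an affected version and pass once the fix is in place.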
Code Sample, a copy-pastable example if possible
this is the stacktrace
Problem description
Merging on two equal unordered categorical dtypes fails depending on the order of records in the dataframe.
One workaround I found is to cast the index to object and then back to categorical.
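A minimal sketch of that workaround, assuming both frames are indexed by the same categorical dtype. The helper name `recast_index` is made up for illustration, and the data is the minimal example from the comments rather than the original code sample.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_type = CategoricalDtype(categories=["a", "b", "c"], ordered=False)

df1 = pd.DataFrame({
    "foo": pd.Series(["a", "b"]).astype(cat_type),
    "left": [1, 2],
}).set_index("foo")

df2 = pd.DataFrame({
    "foo": pd.Series(["a", "b", "c"]).astype(cat_type),
    "right": [3, 2, 1],
}).set_index("foo")


def recast_index(df, dtype):
    """Hypothetical helper: round-trip the index through object dtype."""
    out = df.copy()
    # Cast to plain objects first, then rebuild the categorical index.
    out.index = out.index.astype(object).astype(dtype)
    return out


merged = recast_index(df1, cat_type).merge(
    recast_index(df2, cat_type), left_index=True, right_index=True
)
print(merged)
```

On versions where the merge already works, the round trip is unnecessary but should be harmless.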
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.23.4
pytest: 3.5.1
pip: 18.0
setuptools: 40.0.0
Cython: None
numpy: 1.15.0
scipy: 1.1.0
pyarrow: 0.10.0
xarray: None
IPython: 5.8.0
sphinx: 1.6.4
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: 0.1.5
fastparquet: None
pandas_gbq: None
pandas_datareader: None