BUG: merge_asof raises when grouping on multiple columns with a categorical #16454

Closed
adbull opened this issue May 23, 2017 · 4 comments · Fixed by #30653
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

Comments

@adbull
Contributor

adbull commented May 23, 2017

Code Sample, a copy-pastable example if possible

>>> import pandas as pd
>>> x = pd.DataFrame(dict(x=[0],y=[0],z=pd.Categorical([0])))
>>> pd.merge_asof(x, x, on='x', by=['y', 'z'])

Traceback (most recent call last):
  File "bug.py", line 10, in <module>
    pd.merge_asof(x, x, on='x', by=['y', 'z'])
  File "~/anaconda3/envs/pantheon/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 486, in merge_asof
    return op.get_result()
  File "~/anaconda3/envs/pantheon/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 1019, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "~/anaconda3/envs/pantheon/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 734, in _get_join_info
    right_indexer) = self._get_join_indexers()
  File "~/anaconda3/envs/pantheon/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 1269, in _get_join_indexers
    left_by_values = flip(left_by_values)
  File "~/anaconda3/envs/pantheon/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 1231, in flip
    return np.array(lzip(*xs), labeled_dtypes)
  File "~/anaconda3/envs/pantheon/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py", line 62, in __repr__
    return str(self)
  File "~/anaconda3/envs/pantheon/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py", line 41, in __str__
    return self.__unicode__()
SystemError: PyEval_EvalFrameEx returned a result with an error set

Problem description

merge_asof takes a by argument that defines the groups within which rows are matched. When by is a single column, or several non-categorical columns, the merge succeeds. When by lists several columns and at least one of them is categorical, merge_asof raises a SystemError.

Expected Output

   x  y  z
0  0  0  0

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.10-100.fc24.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: C
LOCALE: None.None

pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 4.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.0
tables: None
numexpr: 2.6.2
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

cc @chrisaycock should this be supported? If not we can raise a better error message.

@TomAugspurger TomAugspurger added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label May 23, 2017
@TomAugspurger TomAugspurger added this to the Next Major Release milestone May 23, 2017
@jreback
Contributor

jreback commented May 23, 2017

this could work, but is not implemented atm because of how the multiple grouping is done.
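The traceback in the report ends in flip, which passes the by columns' dtypes to np.array as a labeled (structured) dtype. A minimal sketch of that step in isolation, assuming NumPy cannot interpret a pandas CategoricalDtype as a field type; the variable names mirror the traceback, and the exact exception type and message may differ between versions:

```python
import numpy as np
import pandas as pd

# Mimic what flip() builds for by=['y', 'z'] when 'z' is categorical.
labels = ["y", "z"]
dtypes = [np.dtype("int64"), pd.CategoricalDtype()]
labeled_dtypes = list(zip(labels, dtypes))

try:
    # NumPy has to turn every field type into a NumPy dtype here.
    np.dtype(labeled_dtypes)
    failed = False
except (TypeError, SystemError) as exc:
    failed = True
    print(type(exc).__name__, exc)
```

With only NumPy dtypes in the list, the same call builds an ordinary structured dtype, which is consistent with the multi-column case working for non-categorical columns.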

@chrish42
Contributor

I also ran into this:

import pandas as pd


def convert_to_cat(df):
    return df.assign(cat1=pd.Categorical(df.cat1),
                     cat2=pd.Categorical(df.cat2))


left = pd.DataFrame({'time': [1, 2, 3, 6, 7],
                     'cat1': ['a', 'a', 'b', 'b', 'b'],
                     'cat2': ['x', 'y', 'x', 'y', 'x'],
                     'left': [0, 1, 2, 3, 4]})
right = pd.DataFrame({'time': [1, 5, 10],
                      'cat1': ['a', 'b', 'b'],
                      'cat2': ['x', 'y', 'x'],
                      'right': [0, 1, 2]})

left_cat = convert_to_cat(left)
right_cat = convert_to_cat(right)


# This works: multiple by= columns, with object dtype.
result = pd.merge_asof(left, right, on='time', by=['cat1', 'cat2'])

# This also works: one by= column, with category dtype.
result_1cat = pd.merge_asof(left_cat, right_cat, on='time', by='cat1')

# This raises SystemError: multiple by= columns, with category dtype.
result_2cats = pd.merge_asof(left_cat, right_cat, on='time', by=['cat1', 'cat2'])

Here is the backtrace produced by the last line, if that helps:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
TypeError: data type not understood

The above exception was the direct cause of the following exception:

SystemError                               Traceback (most recent call last)
<ipython-input-29-0d159f08de94> in <module>
     26 
     27 # This
---> 28 result_2cats = pd.merge_asof(left_cat, right_cat, on='time', by=['cat1', 'cat2'])

~/.local/share/virtualenvs/pandas-bugs-QyHl3rh2/lib/python3.7/site-packages/pandas/core/reshape/merge.py in merge_asof(left, right, on, left_on, right_on, left_index, right_index, by, left_by, right_by, suffixes, tolerance, allow_exact_matches, direction)
    460                     allow_exact_matches=allow_exact_matches,
    461                     direction=direction)
--> 462     return op.get_result()
    463 
    464 

~/.local/share/virtualenvs/pandas-bugs-QyHl3rh2/lib/python3.7/site-packages/pandas/core/reshape/merge.py in get_result(self)
   1254 
   1255     def get_result(self):
-> 1256         join_index, left_indexer, right_indexer = self._get_join_info()
   1257 
   1258         # this is a bit kludgy

~/.local/share/virtualenvs/pandas-bugs-QyHl3rh2/lib/python3.7/site-packages/pandas/core/reshape/merge.py in _get_join_info(self)
    754         else:
    755             (left_indexer,
--> 756              right_indexer) = self._get_join_indexers()
    757 
    758             if self.right_index:

~/.local/share/virtualenvs/pandas-bugs-QyHl3rh2/lib/python3.7/site-packages/pandas/core/reshape/merge.py in _get_join_indexers(self)
   1502                 right_by_values = right_by_values[0]
   1503             else:
-> 1504                 left_by_values = flip(left_by_values)
   1505                 right_by_values = flip(right_by_values)
   1506 

~/.local/share/virtualenvs/pandas-bugs-QyHl3rh2/lib/python3.7/site-packages/pandas/core/reshape/merge.py in flip(xs)
   1455             dtypes = [x.dtype for x in xs]
   1456             labeled_dtypes = list(zip(labels, dtypes))
-> 1457             return np.array(lzip(*xs), labeled_dtypes)
   1458 
   1459         # values to compare

~/.local/share/virtualenvs/pandas-bugs-QyHl3rh2/lib/python3.7/site-packages/pandas/core/dtypes/dtypes.py in __repr__(self)
    393     def __repr__(self):
    394         tpl = u'CategoricalDtype(categories={}ordered={})'
--> 395         if self.categories is None:
    396             data = u"None, "
    397         else:

SystemError: PyEval_EvalFrameEx returned a result with an error set

This is on Pandas 0.24.2.
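Since the object-dtype case above works, one possible workaround on affected versions is to cast the categorical by columns to object, merge, and cast back. This is only a sketch: merge_asof_via_object is a hypothetical helper, not a pandas function, and it assumes the categorical by columns have the same names in both frames.

```python
import pandas as pd


def merge_asof_via_object(left, right, on, by):
    """Hypothetical helper (not part of pandas): merge with categorical
    `by` columns temporarily cast to object dtype."""
    # Remember the categorical dtypes of the left-hand `by` columns.
    cat_dtypes = {
        col: left[col].dtype
        for col in by
        if isinstance(left[col].dtype, pd.CategoricalDtype)
    }
    to_object = {col: object for col in cat_dtypes}
    merged = pd.merge_asof(
        left.astype(to_object), right.astype(to_object), on=on, by=by
    )
    # Restore the original categorical dtypes on the result.
    return merged.astype(cat_dtypes)


left = pd.DataFrame({'time': [1, 2, 3, 6, 7],
                     'cat1': pd.Categorical(['a', 'a', 'b', 'b', 'b']),
                     'cat2': pd.Categorical(['x', 'y', 'x', 'y', 'x']),
                     'left': [0, 1, 2, 3, 4]})
right = pd.DataFrame({'time': [1, 5, 10],
                      'cat1': pd.Categorical(['a', 'b', 'b']),
                      'cat2': pd.Categorical(['x', 'y', 'x']),
                      'right': [0, 1, 2]})

result = merge_asof_via_object(left, right, on='time', by=['cat1', 'cat2'])
```

The by values in the result come from the left frame, so casting back to the left frame's categorical dtypes should not introduce missing values.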

@mroeschke
Member

This looks to work on master. Could use a regression test.

In [162]: >>> import pandas as pd
     ...: >>> x = pd.DataFrame(dict(x=[0],y=[0],z=pd.Categorical([0])))
     ...: >>> pd.merge_asof(x, x, on='x', by=['y', 'z'])
Out[162]:
   x  y  z
0  0  0  0

In [163]: pd.__version__
Out[163]: '0.26.0.dev0+555.gf7d162b18'
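A regression test along these lines could pin the behaviour down. This is a sketch based on the original report, assuming a pandas version where the merge succeeds; the test name is made up here and is not taken from the pandas test suite:

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def test_merge_asof_multiple_by_with_categorical():
    # GH 16454: several `by` columns, one of them categorical.
    df = pd.DataFrame({"x": [0], "y": [0], "z": pd.Categorical([0])})

    result = pd.merge_asof(df, df, on="x", by=["y", "z"])

    # Expected output from the report: the single input row, unchanged.
    expected = pd.DataFrame({"x": [0], "y": [0], "z": pd.Categorical([0])})
    assert_frame_equal(result, expected)
```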

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Oct 14, 2019