Skip to content

BUG: fix in categorical merges #32079

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Feb 27, 2020
Merged

BUG: fix in categorical merges #32079

merged 13 commits into from
Feb 27, 2020

Conversation

MarcoGorelli
Copy link
Member

@MarcoGorelli MarcoGorelli commented Feb 18, 2020

test adapted (at this point kind of loosely) from #28296

@MarcoGorelli MarcoGorelli changed the title BUG: fix in categorical merges Wip BUG: fix in categorical merges Feb 18, 2020
@MarcoGorelli MarcoGorelli changed the title Wip BUG: fix in categorical merges BUG: fix in categorical merges Feb 19, 2020
@MarcoGorelli MarcoGorelli requested a review from WillAyd February 19, 2020 16:17
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think OK generally but @TomAugspurger for any more thoughts.

Does this affect other index types like DTI as well?

@@ -781,6 +782,10 @@ def _delegate_method(self, name: str, *args, **kwargs):
return res
return CategoricalIndex(res, name=self.name)

def _wrap_joined_index(self, joined, other):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you annotate this? Should do that for any new development

@WillAyd WillAyd added Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Feb 20, 2020
@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Feb 20, 2020

Does this affect other index types like DTI as well?

It doesn't affect DTI, as DTI inherits from DateTimedeltaMixin:

class DatetimeIndex(DatetimeTimedeltaMixin):

and DateTimedeltaMixin already has its own _wrap_joined_index method defined

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm ex @jbrockmendel question

@MarcoGorelli
Copy link
Member Author

@jbrockmendel have added a test which covers the int16_t case:

In [1]: from pandas import DataFrame, CategoricalIndex                                                                                                                                                                                                        

In [2]: n_categories = 5                                                                                                                                                                                                                                      

In [3]: import numpy as np                                                                                                                                                                                                                                    

In [4]:     left_index = CategoricalIndex([0] + list(range(n_categories))) 
   ...:     df1 = DataFrame(range(n_categories + 1), columns=["value"], index=left_index) 
   ...:     df2 = DataFrame( 
   ...:         [[6]], 
   ...:         columns=["value"], 
   ...:         index=CategoricalIndex([0], categories=np.arange(n_categories)), 
   ...:     )                                                                                                                                                                                                                                                 

In [5]: df1.index._ndarray_values                                                                                                                                                                                                                             
Out[5]: array([0, 0, 1, 2, 3, 4], dtype=int8)

In [6]: n_categories = 128                                                                                                                                                                                                                                    

In [7]:     left_index = CategoricalIndex([0] + list(range(n_categories))) 
   ...:     df1 = DataFrame(range(n_categories + 1), columns=["value"], index=left_index) 
   ...:     df2 = DataFrame( 
   ...:         [[6]], 
   ...:         columns=["value"], 
   ...:         index=CategoricalIndex([0], categories=np.arange(n_categories)), 
   ...:     )                                                                                                                                                                                                                                                 

In [8]: df1.index._ndarray_values                                                                                                                                                                                                                             
Out[8]: 
array([  0,   0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,
        12,  13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,
        25,  26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,
        38,  39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,
        51,  52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,
        64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,
        77,  78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,
        90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102,
       103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115,
       116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127],
      dtype=int16)

The current test fails on master (for both cases, we get No matching signature found) but passes here:

(pandas-dev) m.gorelli@ws-1808:~/pandas$ git fetch upstream master
From https://github.com/pandas-dev/pandas
 * branch                master     -> FETCH_HEAD
(pandas-dev) m.gorelli@ws-1808:~/pandas$ git diff upstream/master --name-only
doc/source/whatsnew/v1.1.0.rst
pandas/_libs/join.pyx
pandas/core/indexes/category.py
pandas/tests/reshape/merge/test_merge.py
(pandas-dev) m.gorelli@ws-1808:~/pandas$ git checkout upstream/master -- pandas/core/indexes/category.py pandas/_libs/join.pyx
(pandas-dev) m.gorelli@ws-1808:~/pandas$ python setup.py build_ext --inplace -j 8
Compiling pandas/_libs/join.pyx because it changed.
[1/1] Cythonizing pandas/_libs/join.pyx
running build_ext
building 'pandas._libs.join' extension
gcc -pthread -B /home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DNPY_NO_DEPRECATED_API=0 -Ipandas/_libs/src/klib -I/home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/numpy/core/include -I/home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/include/python3.7m -c pandas/_libs/join.c -o build/temp.linux-x86_64-3.7/pandas/_libs/join.o
gcc -pthread -shared -B /home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/compiler_compat -L/home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/lib -Wl,-rpath=/home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.7/pandas/_libs/join.o -o /home/SERILOCAL/m.gorelli/pandas/pandas/_libs/join.cpython-37m-x86_64-linux-gnu.so
(pandas-dev) m.gorelli@ws-1808:~/pandas$ pytest pandas/tests/reshape/merge/test_merge.py::test_categorical_non_unique_monotonic
==================================================================================================================== test session starts =====================================================================================================================
platform linux -- Python 3.7.6, pytest-5.3.4, py-1.8.0, pluggy-0.13.0
rootdir: /home/SERILOCAL/m.gorelli/pandas, inifile: setup.cfg
plugins: forked-1.1.2, hypothesis-5.3.0, asyncio-0.10.0, cov-2.8.1, xdist-1.31.0
collected 2 items                                                                                                                                                                                                                                            

pandas/tests/reshape/merge/test_merge.py FF                                                                                                                                                                                                            [100%]

========================================================================================================================== FAILURES ==========================================================================================================================
__________________________________________________________________________________________________________ test_categorical_non_unique_monotonic[5] __________________________________________________________________________________________________________

n_categories = 5

    @pytest.mark.parametrize("n_categories", [5, 128])
    def test_categorical_non_unique_monotonic(n_categories):
        # GH 28189
        left_index = CategoricalIndex([0] + list(range(n_categories)))
        df1 = DataFrame(range(n_categories + 1), columns=["value"], index=left_index)
        df2 = DataFrame(
            [[6]],
            columns=["value"],
            index=CategoricalIndex([0], categories=np.arange(n_categories)),
        )
    
>       result = merge(df1, df2, how="left", left_index=True, right_index=True)

pandas/tests/reshape/merge/test_merge.py:2179: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/core/reshape/merge.py:88: in merge
    return op.get_result()
pandas/core/reshape/merge.py:641: in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
pandas/core/reshape/merge.py:848: in _get_join_info
    right_ax, how=self.how, return_indexers=True, sort=self.sort
pandas/core/indexes/base.py:3507: in join
    other, how=how, return_indexers=return_indexers
pandas/core/indexes/base.py:3816: in _join_monotonic
    join_index, lidx, ridx = self._left_indexer(sv, ov)
pandas/core/indexes/base.py:244: in _left_indexer
    return libjoin.left_join_indexer(left, right)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   def left_join_indexer(ndarray[join_t] left, ndarray[join_t] right):
E   TypeError: No matching signature found

pandas/_libs/join.pyx:301: TypeError
_________________________________________________________________________________________________________ test_categorical_non_unique_monotonic[128] _________________________________________________________________________________________________________

n_categories = 128

    @pytest.mark.parametrize("n_categories", [5, 128])
    def test_categorical_non_unique_monotonic(n_categories):
        # GH 28189
        left_index = CategoricalIndex([0] + list(range(n_categories)))
        df1 = DataFrame(range(n_categories + 1), columns=["value"], index=left_index)
        df2 = DataFrame(
            [[6]],
            columns=["value"],
            index=CategoricalIndex([0], categories=np.arange(n_categories)),
        )
    
>       result = merge(df1, df2, how="left", left_index=True, right_index=True)

pandas/tests/reshape/merge/test_merge.py:2179: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/core/reshape/merge.py:88: in merge
    return op.get_result()
pandas/core/reshape/merge.py:641: in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
pandas/core/reshape/merge.py:848: in _get_join_info
    right_ax, how=self.how, return_indexers=True, sort=self.sort
pandas/core/indexes/base.py:3507: in join
    other, how=how, return_indexers=return_indexers
pandas/core/indexes/base.py:3816: in _join_monotonic
    join_index, lidx, ridx = self._left_indexer(sv, ov)
pandas/core/indexes/base.py:244: in _left_indexer
    return libjoin.left_join_indexer(left, right)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   def left_join_indexer(ndarray[join_t] left, ndarray[join_t] right):
E   TypeError: No matching signature found

pandas/_libs/join.pyx:301: TypeError
===================================================================================================================== 2 failed in 1.68s ======================================================================================================================
(pandas-dev) m.gorelli@ws-1808:~/pandas$ git checkout HEAD -- pandas/core/indexes/category.py pandas/_libs/join.pyx
(pandas-dev) m.gorelli@ws-1808:~/pandas$ python setup.py build_ext --inplace -j 8
Compiling pandas/_libs/join.pyx because it changed.
[1/1] Cythonizing pandas/_libs/join.pyx
running build_ext
building 'pandas._libs.join' extension
gcc -pthread -B /home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DNPY_NO_DEPRECATED_API=0 -Ipandas/_libs/src/klib -I/home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/numpy/core/include -I/home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/include/python3.7m -c pandas/_libs/join.c -o build/temp.linux-x86_64-3.7/pandas/_libs/join.o
gcc -pthread -shared -B /home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/compiler_compat -L/home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/lib -Wl,-rpath=/home/SERILOCAL/m.gorelli/miniconda3/envs/pandas-dev/lib -Wl,--no-as-needed -Wl,--sysroot=/ build/temp.linux-x86_64-3.7/pandas/_libs/join.o -o /home/SERILOCAL/m.gorelli/pandas/pandas/_libs/join.cpython-37m-x86_64-linux-gnu.so
(pandas-dev) m.gorelli@ws-1808:~/pandas$ pytest pandas/tests/reshape/merge/test_merge.py::test_categorical_non_unique_monotonic
==================================================================================================================== test session starts =====================================================================================================================
platform linux -- Python 3.7.6, pytest-5.3.4, py-1.8.0, pluggy-0.13.0
rootdir: /home/SERILOCAL/m.gorelli/pandas, inifile: setup.cfg
plugins: forked-1.1.2, hypothesis-5.3.0, asyncio-0.10.0, cov-2.8.1, xdist-1.31.0
collected 2 items                                                                                                                                                                                                                                            

pandas/tests/reshape/merge/test_merge.py ..                                                                                                                                                                                                            [100%]

===================================================================================================================== 2 passed in 0.31s ======================================================================================================================
(pandas-dev) m.gorelli@ws-1808:~/pandas$ 

@jbrockmendel
Copy link
Member

comment requested, otherwise LGTM

@WillAyd WillAyd added this to the 1.1 milestone Feb 27, 2020
@WillAyd WillAyd merged commit ea1d8fa into pandas-dev:master Feb 27, 2020
@WillAyd
Copy link
Member

WillAyd commented Feb 27, 2020

Thanks @MarcoGorelli !

roberthdevries pushed a commit to roberthdevries/pandas that referenced this pull request Mar 2, 2020
@MarcoGorelli MarcoGorelli deleted the 28189 branch October 10, 2020 14:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Merge on CategoricalIndex fails if left_index=True & right_index=True, but not if on={index}
3 participants