Commit 94c0d81

jreback authored and TomAugspurger committed
Bug in pd.merge() when merge/join with multiple categorical columns (pandas-dev#16786)
closes pandas-dev#16767 (cherry picked from commit 5e776fb)
1 parent 853ca48 commit 94c0d81

3 files changed: +31 -6 lines changed

doc/source/whatsnew/v0.20.3.txt

+3 -2
@@ -55,8 +55,8 @@ Indexing
 I/O
 ^^^
 
-- Bug in :func:`read_csv`` in which files weren't opened as binary files by the C engine on Windows, causing EOF characters mid-field, which would fail (:issue:`16039`, :issue:`16559`, :issue:`16675`)
-- Bug in :func:`read_hdf`` in which reading a ``Series`` saved to an HDF file in 'fixed' format fails when an explicit ``mode='r'`` argument is supplied (:issue:`16583`)
+- Bug in :func:`read_csv` in which files weren't opened as binary files by the C engine on Windows, causing EOF characters mid-field, which would fail (:issue:`16039`, :issue:`16559`, :issue:`16675`)
+- Bug in :func:`read_hdf` in which reading a ``Series`` saved to an HDF file in 'fixed' format fails when an explicit ``mode='r'`` argument is supplied (:issue:`16583`)
 
 Plotting
 ^^^^^^^^
@@ -79,6 +79,7 @@ Reshaping
 ^^^^^^^^^
 
 - Bug in joining on a ``MultiIndex`` with a ``category`` dtype for a level (:issue:`16627`).
+- Bug in :func:`merge` when merging/joining with multiple categorical columns (:issue:`16767`)
 
 
 Numeric

pandas/core/reshape/merge.py

+5 -4
@@ -1384,13 +1384,14 @@ def _factorize_keys(lk, rk, sort=True):
         lk = lk.values
         rk = rk.values
 
-    # if we exactly match in categories, allow us to use codes
+    # if we exactly match in categories, allow us to factorize on codes
     if (is_categorical_dtype(lk) and
             is_categorical_dtype(rk) and
             lk.is_dtype_equal(rk)):
-        return lk.codes, rk.codes, len(lk.categories)
-
-    if is_int_or_datetime_dtype(lk) and is_int_or_datetime_dtype(rk):
+        klass = libhashtable.Int64Factorizer
+        lk = _ensure_int64(lk.codes)
+        rk = _ensure_int64(rk.codes)
+    elif is_int_or_datetime_dtype(lk) and is_int_or_datetime_dtype(rk):
         klass = libhashtable.Int64Factorizer
         lk = _ensure_int64(com._values_from_object(lk))
         rk = _ensure_int64(com._values_from_object(rk))
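
The idea behind the hunk above is to factorize both sides on their category codes once the two categoricals are known to share a dtype. It can be sketched with public pandas API. This is only an illustration: `pd.factorize` stands in for the internal `libhashtable.Int64Factorizer`, and the names `cats`, `left`, `right`, `llab`, `rlab` and `count` are made up for the example and do not come from the commit.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: two categoricals with identical categories.
cats = ['x', 'y', 'z']
left = pd.Categorical(['x', 'z', 'z'], categories=cats)
right = pd.Categorical(['z', 'y'], categories=cats)

# Roughly the precondition the patch checks: both sides share one dtype,
# so equal codes mean equal values.
assert left.dtype == right.dtype

# Codes are stored in a small integer dtype (int8 here); widen to int64
# before factorizing, as the patch does with _ensure_int64.
lcodes = left.codes.astype('int64')
rcodes = right.codes.astype('int64')

# Assumption: pd.factorize is used as a public stand-in for the internal
# Int64Factorizer. Factorizing both sides together gives shared labels.
labels, uniques = pd.factorize(np.concatenate([lcodes, rcodes]))
llab, rlab = labels[:len(lcodes)], labels[len(lcodes):]
count = len(uniques)

print(llab.tolist(), rlab.tolist(), count)  # [0, 1, 1] [1, 2] 3
```

After this step both sides carry int64 labels drawn from one shared label space, the same shape of result the integer and datetime branch already produces.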

pandas/tests/reshape/test_merge.py

+23
@@ -1356,6 +1356,29 @@ def test_dtype_on_merged_different(self, change, how, left, right):
                           index=['X', 'Y', 'Z'])
         assert_series_equal(result, expected)
 
+    def test_self_join_multiple_categories(self):
+        # GH 16767
+        # non-duplicates should work with multiple categories
+        m = 5
+        df = pd.DataFrame({
+            'a': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] * m,
+            'b': ['t', 'w', 'x', 'y', 'z'] * 2 * m,
+            'c': [letter
+                  for each in ['m', 'n', 'u', 'p', 'o']
+                  for letter in [each] * 2 * m],
+            'd': [letter
+                  for each in ['aa', 'bb', 'cc', 'dd', 'ee',
+                               'ff', 'gg', 'hh', 'ii', 'jj']
+                  for letter in [each] * m]})
+
+        # change them all to categorical variables
+        df = df.apply(lambda x: x.astype('category'))
+
+        # self-join should equal ourselves
+        result = pd.merge(df, df, on=list(df.columns))
+
+        assert_frame_equal(result, df)
+
 
 @pytest.fixture
 def left_df():
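
For checking the behaviour outside the test suite, a smaller standalone version of the self-join test might look like the sketch below. It assumes a pandas version with the fix applied and uses only public API (`pd.merge`, `pd.testing.assert_frame_equal`). The frame contents are hypothetical and much smaller than the data in the test.

```python
import pandas as pd

# Hypothetical small frame whose rows are unique across the key columns.
df = pd.DataFrame({
    'a': ['a', 'b', 'c', 'd'],
    'b': ['t', 'w', 't', 'w'],
})

# Convert every column to a categorical, as the test does.
df = df.apply(lambda x: x.astype('category'))

# Joining the frame to itself on all columns should return the same rows.
result = pd.merge(df, df, on=list(df.columns))

# Raises AssertionError if the self-join does not reproduce the frame.
pd.testing.assert_frame_equal(result, df)
```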
