Skip to content

BUG: merging an Integer EA rasises #23262

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Dec 19, 2018
Merged
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -1041,6 +1041,7 @@ update the ``ExtensionDtype._metadata`` tuple to match the signature of your
- :meth:`Series.unstack` and :meth:`DataFrame.unstack` no longer convert extension arrays to object-dtype ndarrays. Each column in the output ``DataFrame`` will now have the same dtype as the input (:issue:`23077`).
- Bug when grouping :meth:`Dataframe.groupby()` and aggregating on ``ExtensionArray`` it was not returning the actual ``ExtensionArray`` dtype (:issue:`23227`).
- A default repr for :class:`ExtensionArray` is now provided (:issue:`23601`).
- Bug in :func:`pandas.merge` when merging on an Integer extension array (:issue:`23020`)

.. _whatsnew_0240.api.incompatibilities:

Expand Down
13 changes: 13 additions & 0 deletions pandas/core/arrays/period.py
Original file line number Diff line number Diff line change
Expand Up @@ -814,6 +814,19 @@ def _check_timedeltalike_freq_compat(self, other):
def _values_for_argsort(self):
return self._data

def _values_for_factorize(self):
return np.asarray(self), np.nan

def factorize(self, na_sentinel=-1):
from pandas.core.algorithms import factorize

arr, na_value = self._values_for_factorize()

labels, uniques = factorize(arr, na_sentinel=na_sentinel)

uniques = period_array(uniques)
return labels, uniques


PeriodArray._add_comparison_ops()

Expand Down
2 changes: 1 addition & 1 deletion pandas/core/arrays/sparse.py
Original file line number Diff line number Diff line change
Expand Up @@ -942,7 +942,7 @@ def unique(self):

def _values_for_factorize(self):
# Still override this for hash_pandas_object
return np.asarray(self), self.fill_value
return np.asarray(self, object), self.fill_value

def factorize(self, na_sentinel=-1):
# Currently, ExtensionArray.factorize -> Tuple[ndarray, EA]
Expand Down
13 changes: 10 additions & 3 deletions pandas/core/reshape/merge.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,9 +17,10 @@
from pandas.core.dtypes.common import (
ensure_float64, ensure_int64, ensure_object, is_array_like, is_bool,
is_bool_dtype, is_categorical_dtype, is_datetime64_dtype,
is_datetime64tz_dtype, is_datetimelike, is_dtype_equal, is_float_dtype,
is_int64_dtype, is_integer, is_integer_dtype, is_list_like, is_number,
is_numeric_dtype, needs_i8_conversion)
is_datetime64tz_dtype, is_datetimelike, is_dtype_equal,
is_extension_array_dtype, is_float_dtype, is_int64_dtype, is_integer,
is_integer_dtype, is_list_like, is_number, is_numeric_dtype,
needs_i8_conversion)
from pandas.core.dtypes.missing import isnull, na_value_for_dtype

from pandas import Categorical, DataFrame, Index, MultiIndex, Series, Timedelta
Expand Down Expand Up @@ -1607,6 +1608,12 @@ def _factorize_keys(lk, rk, sort=True):

lk = ensure_int64(lk.codes)
rk = ensure_int64(rk)
elif (is_extension_array_dtype(lk) and
is_extension_array_dtype(rk) and
lk.dtype == rk.dtype):
klass = libhashtable.Factorizer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is actually more complicated here

elif is_extension_array_dtype(lk) and is_extension_array_dtype(rk) and lk.dtype == rk.dtype:
     klass = ...
     lk, _ = lk._values_for_factorize()
     rk, _ = rk._values_for_factorize()

and actually you can do this before any of the other if statements. That way if the _values_for_factorize returns an int64 array then it will just work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@makbigc did you try @jreback's suggesting of factorizing the EAs before getting into the main if block?

lk, _ = lk._values_for_factorize()
rk, _ = rk._values_for_factorize()
elif is_integer_dtype(lk) and is_integer_dtype(rk):
# GH#23917 TODO: needs tests for case where lk is integer-dtype
# and rk is datetime-dtype
Expand Down
15 changes: 15 additions & 0 deletions pandas/tests/extension/base/reshaping.py
Original file line number Diff line number Diff line change
Expand Up @@ -173,6 +173,21 @@ def test_merge(self, data, na_value):
dtype=data.dtype)})
self.assert_frame_equal(res, exp[['ext', 'int1', 'key', 'int2']])

def test_merge_on_extension_array(self, data):
# GH 23020
df1 = pd.DataFrame({'ext': [1, 2, 3],
'key': data[:3]})

res = pd.merge(df1, df1, on='key')

exp = pd.DataFrame(
{'key': data[:3],
'ext_x': [1, 2, 3],
'ext_y': [1, 2, 3]})

self.assert_frame_equal(res, exp[['ext_x', 'key', 'ext_y']],
check_dtype=True)

@pytest.mark.parametrize("columns", [
["A", "B"],
pd.MultiIndex.from_tuples([('A', 'a'), ('A', 'b')],
Expand Down
10 changes: 10 additions & 0 deletions pandas/tests/reshape/merge/test_merge.py
Original file line number Diff line number Diff line change
Expand Up @@ -1326,6 +1326,16 @@ def test_merging_with_bool_or_int_cateorical_column(self, category_column,
CDT(categories, ordered=ordered))
assert_frame_equal(expected, result)

def test_merge_on_int_array(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to pandas/tests/extension/base/reshape (and use the fixtures)

# GH 23020
df = pd.DataFrame({'A': pd.Series([1, 2, np.nan], dtype='Int64'),
'B': 1})
result = pd.merge(df, df, on='A')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think can remove this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this does test merging on an NA key, which the generic test doesn't. So maybe leave it?

expected = pd.DataFrame({'A': pd.Series([1, 2, np.nan], dtype='Int64'),
'B_x': 1,
'B_y': 1})
assert_frame_equal(result, expected, check_dtype=True)


@pytest.fixture
def left_df():
Expand Down