Skip to content

BUG: Index.union() inconsistent with non-unique Indexes #36299

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 56 commits into from
Mar 4, 2021
Merged
Show file tree
Hide file tree
Changes from 6 commits
Commits
Show all changes
56 commits
Select commit Hold shift + click to select a range
b8f27e2
Fix issues in Index.union with duplicate index values
phofl Sep 11, 2020
070abad
Run black pandas
phofl Sep 11, 2020
7a90556
Parametrize tests
phofl Sep 14, 2020
dbcd0a7
Fix small bug and Pep8 issues
phofl Sep 14, 2020
7a9aec7
Change import order
phofl Sep 14, 2020
01d55c6
Fix black bug
phofl Sep 14, 2020
8463078
Implement reviews
phofl Sep 15, 2020
cddd7f9
Move code away from try except
phofl Sep 15, 2020
3c6b079
Catch r_reindexer None
phofl Sep 15, 2020
90056b7
Resort imports...
phofl Sep 15, 2020
46b7c6c
Simplify resorting
phofl Oct 3, 2020
9f55905
Rename variable
phofl Oct 3, 2020
870c0ac
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Oct 3, 2020
97fe91f
Rename variable
phofl Oct 3, 2020
e724ade
Move resorting to algos
phofl Oct 25, 2020
920f9ff
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Oct 25, 2020
c36ccff
Add test and improve doc
phofl Oct 25, 2020
396c70f
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Oct 25, 2020
1062778
Fix pattern issues
phofl Oct 25, 2020
cf06418
Add check
phofl Oct 25, 2020
6f04408
Rename function
phofl Nov 22, 2020
167d695
Adjust whatsnew
phofl Nov 22, 2020
0b3b548
Change gh reference
phofl Nov 22, 2020
73d9ab3
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Nov 22, 2020
561f835
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Nov 29, 2020
59dbdf6
Remove pd
phofl Nov 29, 2020
51131be
Adress review comments
phofl Nov 30, 2020
04817b4
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Dec 21, 2020
a1887c7
Move whatsnew
phofl Dec 21, 2020
fe41a6f
Fix gh reference
phofl Dec 21, 2020
b63a732
Remove comment and fix test
phofl Dec 22, 2020
fa49dfe
Add test for concat issue
phofl Dec 22, 2020
2a6f6ed
Add whatsnew
phofl Dec 22, 2020
d80949c
Remove concat test
phofl Jan 4, 2021
48e041a
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Jan 4, 2021
b57f00f
Add dropna
phofl Jan 5, 2021
aa4533a
Fix join func
phofl Jan 5, 2021
8fd90a3
Fix bug
phofl Jan 5, 2021
091942a
Fix bug
phofl Jan 5, 2021
446eb50
Refactor code and add tests
phofl Jan 6, 2021
6b8fa64
Run Black
phofl Jan 6, 2021
484f4f8
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Jan 6, 2021
5209bf0
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Jan 6, 2021
6cabad8
Adress review
phofl Jan 9, 2021
b80fbdd
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Jan 9, 2021
60bceec
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Jan 10, 2021
2547f65
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Jan 22, 2021
25885d4
Merge master and adjust condition
phofl Jan 22, 2021
fcc4635
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Feb 12, 2021
ccaa0c1
Remove unused import
phofl Feb 13, 2021
f4ee466
Reformat code
phofl Feb 21, 2021
20e62e6
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Feb 21, 2021
e92ab7a
Add comments and refactor code
phofl Feb 22, 2021
7ffa07a
Adress review
phofl Mar 2, 2021
76ded89
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Mar 2, 2021
0af939d
Fix merge introduced missing object
phofl Mar 2, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -272,6 +272,7 @@ Indexing

- Bug in :meth:`PeriodIndex.get_loc` incorrectly raising ``ValueError`` on non-datelike strings instead of ``KeyError``, causing similar errors in :meth:`Series.__geitem__`, :meth:`Series.__contains__`, and :meth:`Series.loc.__getitem__` (:issue:`34240`)
- Bug in :meth:`Index.sort_values` where, when empty values were passed, the method would break by trying to compare missing values instead of pushing them to the end of the sort order. (:issue:`35584`)
- Bug in :meth:`Index.union` dropped duplicate Index values when Index was not monotonic or ``sort`` was set to ``False`` (:issue:`36289`, :issue:`31326`)
-

Missing
Expand Down
65 changes: 33 additions & 32 deletions pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -2622,41 +2622,42 @@ def _union(self, other, sort):
lvals = self._values
rvals = other._values

if sort is None and self.is_monotonic and other.is_monotonic:
try:
# if sort is None and self.is_monotonic and other.is_monotonic:
try:
if sort is None and self.is_monotonic and other.is_monotonic:
result = self._outer_indexer(lvals, rvals)[0]
except TypeError:
# incomparable objects
result = list(lvals)

# worth making this faster? a very unusual case
value_set = set(lvals)
result.extend([x for x in rvals if x not in value_set])
result = Index(result)._values # do type inference here
else:
# find indexes of things in "other" that are not in "self"
if self.is_unique:
indexer = self.get_indexer(other)
indexer = (indexer == -1).nonzero()[0]
else:
indexer = algos.unique1d(self.get_indexer_non_unique(other)[1])

if len(indexer) > 0:
other_diff = algos.take_nd(rvals, indexer, allow_fill=False)
result = concat_compat((lvals, other_diff))

else:
result = lvals

if sort is None:
try:
result = algos.safe_sort(result)
except TypeError as err:
warnings.warn(
f"{err}, sort order is undefined for incomparable objects",
RuntimeWarning,
stacklevel=3,
# We calculate the sorted union and resort it afterwards
new_index_sorted = self._outer_indexer(np.sort(lvals), np.sort(rvals))
l_reindexer = self.unique().reindex(new_index_sorted[0])[1]
if l_reindexer is None:
result = lvals
else:
l_result = self.unique().take(
np.sort(l_reindexer[l_reindexer != -1])
)
r_reindexer = other.unique().reindex(new_index_sorted[0])[1]
r_result = other.unique().take(
np.sort(r_reindexer[new_index_sorted[1] == -1])
)
result = np.append(l_result, r_result)
except TypeError:
# incomparable objects
result = list(lvals)

# worth making this faster? a very unusual case
value_set = set(lvals)
result.extend([x for x in rvals if x not in value_set])
result = Index(result)._values # do type inference here
if sort is None and (not self.is_monotonic or not other.is_monotonic):
try:
result = algos.safe_sort(result)
except TypeError as err:
warnings.warn(
f"{err}, sort order is undefined for incomparable objects",
RuntimeWarning,
stacklevel=3,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potential separate PR: i think there are 4 places in this file where we do this safe_sort call, and we are not consistent in a) issuing this warning or b) catching the TypeError. Could be refactored into a small helper method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will look into this in the coming days


# for subclasses
return self._wrap_setop_result(other, result)
Expand Down
45 changes: 44 additions & 1 deletion pandas/tests/indexes/test_setops.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,15 @@
from pandas.core.dtypes.common import is_dtype_equal

import pandas as pd
from pandas import Float64Index, Int64Index, RangeIndex, UInt64Index
from pandas import (
CategoricalIndex,
DatetimeIndex,
Float64Index,
Int64Index,
RangeIndex,
TimedeltaIndex,
UInt64Index,
)
import pandas._testing as tm
from pandas.api.types import pandas_dtype

Expand Down Expand Up @@ -95,3 +103,38 @@ def test_union_dtypes(left, right, expected):
b = pd.Index([], dtype=right)
result = (a | b).dtype
assert result == expected


def test_union_duplicate_index_subsets_of_each_other():
# GH: 31326
a = pd.Index([1, 2, 2, 3])
b = pd.Index([3, 3, 4])
expected = pd.Index([1, 2, 2, 3, 3, 4])

result = a.union(b)
tm.assert_index_equal(result, expected)
result = a.union(b, sort=False)
tm.assert_index_equal(result, expected)


@pytest.mark.parametrize(
"func",
[
(Int64Index),
(Float64Index),
(DatetimeIndex),
(CategoricalIndex),
(TimedeltaIndex),
],
)
def test_union_with_duplicate_index(func):
# GH: 36289
idx1 = func([1, 0, 0])
idx2 = func([0, 1])
expected = func([0, 0, 1])

result = idx1.union(idx2)
tm.assert_index_equal(result, expected)

result = idx2.union(idx1)
tm.assert_index_equal(result, expected)