Skip to content

BUG: Index.union() inconsistent with non-unique Indexes #36299

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 56 commits into from
Mar 4, 2021
Merged
Show file tree
Hide file tree
Changes from 52 commits
Commits
Show all changes
56 commits
Select commit Hold shift + click to select a range
b8f27e2
Fix issues in Index.union with duplicate index values
phofl Sep 11, 2020
070abad
Run black pandas
phofl Sep 11, 2020
7a90556
Parametrize tests
phofl Sep 14, 2020
dbcd0a7
Fix small bug and Pep8 issues
phofl Sep 14, 2020
7a9aec7
Change import order
phofl Sep 14, 2020
01d55c6
Fix black bug
phofl Sep 14, 2020
8463078
Implement reviews
phofl Sep 15, 2020
cddd7f9
Move code away from try except
phofl Sep 15, 2020
3c6b079
Catch r_reindexer None
phofl Sep 15, 2020
90056b7
Resort imports...
phofl Sep 15, 2020
46b7c6c
Simplify resorting
phofl Oct 3, 2020
9f55905
Rename variable
phofl Oct 3, 2020
870c0ac
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Oct 3, 2020
97fe91f
Rename variable
phofl Oct 3, 2020
e724ade
Move resorting to algos
phofl Oct 25, 2020
920f9ff
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Oct 25, 2020
c36ccff
Add test and improve doc
phofl Oct 25, 2020
396c70f
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Oct 25, 2020
1062778
Fix pattern issues
phofl Oct 25, 2020
cf06418
Add check
phofl Oct 25, 2020
6f04408
Rename function
phofl Nov 22, 2020
167d695
Adjust whatsnew
phofl Nov 22, 2020
0b3b548
Change gh reference
phofl Nov 22, 2020
73d9ab3
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Nov 22, 2020
561f835
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Nov 29, 2020
59dbdf6
Remove pd
phofl Nov 29, 2020
51131be
Adress review comments
phofl Nov 30, 2020
04817b4
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Dec 21, 2020
a1887c7
Move whatsnew
phofl Dec 21, 2020
fe41a6f
Fix gh reference
phofl Dec 21, 2020
b63a732
Remove comment and fix test
phofl Dec 22, 2020
fa49dfe
Add test for concat issue
phofl Dec 22, 2020
2a6f6ed
Add whatsnew
phofl Dec 22, 2020
d80949c
Remove concat test
phofl Jan 4, 2021
48e041a
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Jan 4, 2021
b57f00f
Add dropna
phofl Jan 5, 2021
aa4533a
Fix join func
phofl Jan 5, 2021
8fd90a3
Fix bug
phofl Jan 5, 2021
091942a
Fix bug
phofl Jan 5, 2021
446eb50
Refactor code and add tests
phofl Jan 6, 2021
6b8fa64
Run Black
phofl Jan 6, 2021
484f4f8
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Jan 6, 2021
5209bf0
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Jan 6, 2021
6cabad8
Adress review
phofl Jan 9, 2021
b80fbdd
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Jan 9, 2021
60bceec
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Jan 10, 2021
2547f65
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Jan 22, 2021
25885d4
Merge master and adjust condition
phofl Jan 22, 2021
fcc4635
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Feb 12, 2021
ccaa0c1
Remove unused import
phofl Feb 13, 2021
f4ee466
Reformat code
phofl Feb 21, 2021
20e62e6
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Feb 21, 2021
e92ab7a
Add comments and refactor code
phofl Feb 22, 2021
7ffa07a
Adress review
phofl Mar 2, 2021
76ded89
Merge branch 'master' of https://github.com/pandas-dev/pandas into 36289
phofl Mar 2, 2021
0af939d
Fix merge introduced missing object
phofl Mar 2, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v1.3.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -345,6 +345,8 @@ Interval

Indexing
^^^^^^^^

- Bug in :meth:`Index.union` dropping duplicate ``Index`` values when ``Index`` was not monotonic or ``sort`` was set to ``False`` (:issue:`36289`, :issue:`31326`)
- Bug in :meth:`CategoricalIndex.get_indexer` failing to raise ``InvalidIndexError`` when non-unique (:issue:`38372`)
- Bug in inserting many new columns into a :class:`DataFrame` causing incorrect subsequent indexing behavior (:issue:`38380`)
- Bug in :meth:`DataFrame.__setitem__` raising ``ValueError`` when setting multiple values to duplicate columns (:issue:`15695`)
Expand Down
26 changes: 26 additions & 0 deletions pandas/core/algorithms.py
Original file line number Diff line number Diff line change
Expand Up @@ -2277,3 +2277,29 @@ def _sort_tuples(values: np.ndarray):
arrays, _ = to_arrays(values, None)
indexer = lexsort_indexer(arrays, orders=True)
return values[indexer]


def union_with_duplicates(lvals, rvals) -> np.ndarray:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type, these are ndarrays?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, typed

"""
Extracts the union from lvals and rvals with respect to duplicates and nans in
both arrays.

Parameters
----------
lvals: np.ndarray
left values which is ordered in front.
rvals: np.ndarray
right values ordered after lvals.

Returns
-------
np.ndarray containing the unsorted union of both arrays
"""
indexer = []
l_count = value_counts(lvals, dropna=False)
r_count = value_counts(rvals, dropna=False)
l_count, r_count = l_count._align_series(r_count, fill_value=0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you call .align directly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes of course, thx

unique_array = unique(np.append(lvals, rvals))
for i, value in enumerate(unique_array):
indexer += [i] * int(max(l_count[value], r_count[value]))
return unique_array.take(indexer)
40 changes: 25 additions & 15 deletions pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -2935,32 +2935,42 @@ def _union(self, other, sort):
lvals = self._values
rvals = other._values

if sort is None and self.is_monotonic and other.is_monotonic:
if (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here on when these branches are taken (similar to what you did below).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

sort is None
and self.is_monotonic
and other.is_monotonic
and not (self.has_duplicates and other.has_duplicates)
):
try:
result = self._outer_indexer(lvals, rvals)[0]
return self._outer_indexer(lvals, rvals)[0]
except (TypeError, IncompatibleFrequency):
# incomparable objects
result = list(lvals)

# worth making this faster? a very unusual case
value_set = set(lvals)
result.extend([x for x in rvals if x not in value_set])
result = Index(result)._values # do type inference here
else:
# find indexes of things in "other" that are not in "self"
if self.is_unique:
indexer = self.get_indexer(other)
missing = (indexer == -1).nonzero()[0]
else:
missing = algos.unique1d(self.get_indexer_non_unique(other)[1])
return Index(result)._values # do type inference here

if len(missing) > 0:
other_diff = algos.take_nd(rvals, missing, allow_fill=False)
result = concat_compat((lvals, other_diff))
elif not other.is_unique and not self.is_unique:
result = algos.union_with_duplicates(lvals, rvals)
return _maybe_try_sort(result, sort)

else:
result = lvals
# Either other or self is not unique
# find indexes of things in "other" that are not in "self"
if self.is_unique:
indexer = self.get_indexer(other)
missing = (indexer == -1).nonzero()[0]
else:
missing = algos.unique1d(self.get_indexer_non_unique(other)[1])

if len(missing) > 0:
other_diff = algos.take_nd(rvals, missing, allow_fill=False)
result = concat_compat((lvals, other_diff))
else:
result = lvals

if not self.is_monotonic or not other.is_monotonic:
result = _maybe_try_sort(result, sort)

return result
Expand Down
75 changes: 75 additions & 0 deletions pandas/tests/indexes/test_setops.py
Original file line number Diff line number Diff line change
Expand Up @@ -501,6 +501,81 @@ def check_intersection_commutative(left, right):
assert idx.intersection(idx_non_unique).is_unique


@pytest.mark.parametrize(
"cls",
[
Int64Index,
Float64Index,
DatetimeIndex,
CategoricalIndex,
TimedeltaIndex,
lambda x: Index(x, dtype=object),
],
)
def test_union_duplicate_index_subsets_of_each_other(cls):
# GH#31326
a = cls([1, 2, 2, 3])
b = cls([3, 3, 4])
expected = cls([1, 2, 2, 3, 3, 4])
if cls is CategoricalIndex:
expected = Index([1, 2, 2, 3, 3, 4])
result = a.union(b)
tm.assert_index_equal(result, expected)
result = a.union(b, sort=False)
tm.assert_index_equal(result, expected)


@pytest.mark.parametrize(
"cls",
[
Int64Index,
Float64Index,
DatetimeIndex,
CategoricalIndex,
TimedeltaIndex,
lambda x: Index(x, dtype=object),
],
)
def test_union_with_duplicate_index(cls):
# GH#36289
a = cls([1, 0, 0])
b = cls([0, 1])
expected = cls([0, 0, 1])

result = a.union(b)
tm.assert_index_equal(result, expected)

result = a.union(b)
tm.assert_index_equal(result, expected)


def test_union_duplicate_index_different_dtypes():
# GH#36289
a = Index([1, 2, 2, 3])
b = Index(["1", "0", "0"])
expected = Index([1, 2, 2, 3, "1", "0", "0"])
result = a.union(b, sort=False)
tm.assert_index_equal(result, expected)


def test_union_same_value_duplicated_in_both():
# GH#36289
a = Index([0, 0, 1])
b = Index([0, 0, 1, 2])
result = a.union(b)
expected = Index([0, 0, 1, 2])
tm.assert_index_equal(result, expected)


def test_union_nan_in_both():
# GH#36289
a = Index([np.nan, 1, 2, 2])
b = Index([np.nan, 1, 1, 2])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it matter if the duplicated entry is nan?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it should not, parametrized to test

result = a.union(b, sort=False)
expected = Index([np.nan, 1.0, 1.0, 2.0, 2.0])
tm.assert_index_equal(result, expected)


class TestSetOpsUnsorted:
# These may eventually belong in a dtype-specific test_setops, or
# parametrized over a more general fixture
Expand Down
9 changes: 9 additions & 0 deletions pandas/tests/test_algos.py
Original file line number Diff line number Diff line change
Expand Up @@ -2416,3 +2416,12 @@ def test_diff_low_precision_int(self, dtype):
result = algos.diff(arr, 1)
expected = np.array([np.nan, 1, 0, -1, 0], dtype="float32")
tm.assert_numpy_array_equal(result, expected)


def test_union_with_duplicates():
# GH#36289
lvals = np.array([3, 1, 3, 4])
rvals = np.array([2, 3, 1, 1])
result = algos.union_with_duplicates(lvals, rvals)
expected = np.array([3, 3, 1, 1, 4, 2])
tm.assert_numpy_array_equal(result, expected)