-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Index.union() inconsistent with non-unique Indexes #36299
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you try to simplify
� Conflicts: � doc/source/whatsnew/v1.2.0.rst
� Conflicts: � pandas/core/indexes/base.py
not sure of status here, @phofl if you can merge master |
@jreback merged master, last commit is for unused import. This should fix inconsistencies with duplicates in Index.union |
� Conflicts: � pandas/tests/indexes/test_setops.py
pandas/core/algorithms.py
Outdated
@@ -2277,3 +2277,29 @@ def _sort_tuples(values: np.ndarray): | |||
arrays, _ = to_arrays(values, None) | |||
indexer = lexsort_indexer(arrays, orders=True) | |||
return values[indexer] | |||
|
|||
|
|||
def union_with_duplicates(lvals, rvals) -> np.ndarray: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type, these are ndarrays?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, typed
pandas/core/algorithms.py
Outdated
indexer = [] | ||
l_count = value_counts(lvals, dropna=False) | ||
r_count = value_counts(rvals, dropna=False) | ||
l_count, r_count = l_count._align_series(r_count, fill_value=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you call .align directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes of course, thx
@@ -2935,32 +2935,42 @@ def _union(self, other, sort): | |||
lvals = self._values | |||
rvals = other._values | |||
|
|||
if sort is None and self.is_monotonic and other.is_monotonic: | |||
if ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment here on when these branches are taken (similar to what you did below).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pandas/core/algorithms.py
Outdated
|
||
Parameters | ||
---------- | ||
lvals: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
types here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we duplicating them if signature is typed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
afaik yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
pandas/tests/indexes/test_setops.py
Outdated
def test_union_nan_in_both(): | ||
# GH#36289 | ||
a = Index([np.nan, 1, 2, 2]) | ||
b = Index([np.nan, 1, 1, 2]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it matter if the duplicated entry is nan?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No it should not, parametrized to test
merge master as well |
Done |
thanks @phofl this took quite some effort, keep em coming! |
seriously, way to go |
Thx, feels good to get it done |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
The else-block was pretty buggy in case of non unique indexes (was dependent on order of indexes and if one was subset of the other). The function in the try block worked perfectly, so I figured it would be way safer to let that function compute the new index and sort it based on the input indexes. If there is a faster way to achieve the resorting, I would be happy to change the code block.
We could delete the second sort block with the RunTimeWarning too, but then we would no longer get a Warning when incomparable objects should be sorted.