Skip to content

BUG: Index.union() inconsistent with non-unique Indexes #36299

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 56 commits into from
Mar 4, 2021

Conversation

phofl
Copy link
Member

@phofl phofl commented Sep 11, 2020

The else-block was pretty buggy in case of non unique indexes (was dependent on order of indexes and if one was subset of the other). The function in the try block worked perfectly, so I figured it would be way safer to let that function compute the new index and sort it based on the input indexes. If there is a faster way to achieve the resorting, I would be happy to change the code block.

We could delete the second sort block with the RunTimeWarning too, but then we would no longer get a Warning when incomparable objects should be sorted.

@rhshadrach rhshadrach added Index Related to the Index class or subclasses Bug labels Sep 13, 2020
@pep8speaks
Copy link

pep8speaks commented Sep 14, 2020

Hello @phofl! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-03-02 20:43:35 UTC

@dsaxton dsaxton changed the title [BUG]: Index.union() inconsisten with non-unique Indexes [BUG]: Index.union() inconsistent with non-unique Indexes Sep 16, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you try to simplify

@jreback
Copy link
Contributor

jreback commented Feb 11, 2021

not sure of status here, @phofl if you can merge master

@phofl
Copy link
Member Author

phofl commented Feb 13, 2021

@jreback merged master, last commit is for unused import.

This should fix inconsistencies with duplicates in Index.union

@jreback jreback added this to the 1.3 milestone Feb 15, 2021
@@ -2277,3 +2277,29 @@ def _sort_tuples(values: np.ndarray):
arrays, _ = to_arrays(values, None)
indexer = lexsort_indexer(arrays, orders=True)
return values[indexer]


def union_with_duplicates(lvals, rvals) -> np.ndarray:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type, these are ndarrays?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, typed

indexer = []
l_count = value_counts(lvals, dropna=False)
r_count = value_counts(rvals, dropna=False)
l_count, r_count = l_count._align_series(r_count, fill_value=0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you call .align directly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes of course, thx

@@ -2935,32 +2935,42 @@ def _union(self, other, sort):
lvals = self._values
rvals = other._values

if sort is None and self.is_monotonic and other.is_monotonic:
if (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment here on when these branches are taken (similar to what you did below).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


Parameters
----------
lvals:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

types here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we duplicating them if signature is typed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

afaik yes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

def test_union_nan_in_both():
# GH#36289
a = Index([np.nan, 1, 2, 2])
b = Index([np.nan, 1, 1, 2])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it matter if the duplicated entry is nan?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it should not, parametrized to test

@jreback
Copy link
Contributor

jreback commented Feb 27, 2021

merge master as well

@phofl
Copy link
Member Author

phofl commented Mar 2, 2021

Done

@jreback jreback merged commit 9eaf63f into pandas-dev:master Mar 4, 2021
@jreback
Copy link
Contributor

jreback commented Mar 4, 2021

thanks @phofl this took quite some effort, keep em coming!

@jbrockmendel
Copy link
Member

thanks @phofl this took quite some effort, keep em coming!

seriously, way to go

@phofl
Copy link
Member Author

phofl commented Mar 4, 2021

Thx, feels good to get it done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Index Related to the Index class or subclasses
Projects
None yet
5 participants