BUG/PERF: algos.union_with_duplicates losing EA dtypes #48900


Merged: 11 commits into pandas-dev:main from union-with-duplicates, Oct 6, 2022

Conversation

lukemanley (Member):

ASV added:

       before           after         ratio
     [d76b9f2c]       [6760a19a]
     <main>           <union-with-duplicates>
-      18.6±0.2ms      5.53±0.07ms     0.30  index_object.UnionWithDuplicates.time_union_with_duplicates
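
For context, a minimal sketch of what an ASV benchmark exercising this path might look like. This is an illustration only; the actual UnionWithDuplicates class in asv_bench/benchmarks/index_object.py may set up its data differently, and the sizes below are made up.

import numpy as np
import pandas as pd


class UnionWithDuplicates:
    def setup(self):
        # Two overlapping MultiIndexes that both contain duplicate entries,
        # so union() has to go through the duplicate-aware code path.
        codes = np.random.randint(0, 1_000, size=100_000)
        self.left = pd.MultiIndex.from_arrays([codes, codes])
        self.right = pd.MultiIndex.from_arrays([codes[::-1], codes[::-1]])

    def time_union_with_duplicates(self):
        self.left.union(self.right)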

@lukemanley added the Performance, MultiIndex, and Index labels on Oct 1, 2022
for i, value in enumerate(unique_array):
    indexer += [i] * int(max(l_count.at[value], r_count.at[value]))
return unique_array.take(indexer)

repeats = final_count.reindex(unique_array).values  # type: ignore[attr-defined]
Member:
Looks good. Can you refactor this to return the indexer and operate on the MultiIndex in _union()? That avoids losing the dtype.

Member Author (lukemanley):

Refactored to preserve EA dtypes. The indexer applies to the combined unique values, not the original MultiIndexes, so I don't think we need to return it.
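
A rough standalone sketch of the approach being discussed, for readers following along. The names and exact steps are illustrative, not the actual pandas internals; the real function works on the lower-level values/indexes handed to it by _union.

import numpy as np
import pandas as pd


def union_with_duplicates_sketch(left: pd.Index, right: pd.Index) -> pd.Index:
    # Count occurrences of each value on both sides (NAs included).
    l_count = left.value_counts(dropna=False)
    r_count = right.value_counts(dropna=False)
    l_count, r_count = l_count.align(r_count, fill_value=0)

    # The union keeps each value max(left_count, right_count) times.
    final_count = np.maximum(l_count, r_count).astype("int64")

    # Build the combined uniques with dtype-preserving Index operations,
    # then repeat those uniques in place rather than materializing a plain
    # ndarray, so extension dtypes survive.
    unique_vals = left.append(right).unique()
    repeats = final_count.reindex(unique_vals).to_numpy()
    return unique_vals.repeat(repeats)

For example, with left = pd.Index([1, 1, 2], dtype="Int64") and right = pd.Index([1, 3], dtype="Int64"), this returns Index([1, 1, 2, 3], dtype='Int64') rather than falling back to object or int64.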

@lukemanley changed the title from "PERF: algos.union_with_duplicates" to "BUG/PERF: algos.union_with_duplicates losing EA dtypes" on Oct 2, 2022
for i, value in enumerate(unique_array):
    indexer += [i] * int(max(l_count.at[value], r_count.at[value]))
return unique_array.take(indexer)

final_count = np.maximum(l_count, r_count).astype("int", copy=False)
Member:

You've got two MultiIndex checks now. I'd rather handle this at the _union level. Is there any performance benefit to doing it like this?

Additionally, I think you can use unique_vals.take(repeats); that's more in line with what we do elsewhere.

Member Author (lukemanley):

I think we need the MultiIndex checks within union_with_duplicates, otherwise we lose dtypes there. I'm not sure how this could be done only at the _union level. Let me know if you have any suggestions.

The repeat logic is not quite a simple take:

np.repeat(unique_vals, repeats)

is equivalent to:

indexer = np.arange(len(unique_vals))
indexer = np.repeat(indexer, repeats)
unique_vals.take(indexer)
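
A quick standalone check of that equivalence, with made-up values:

import numpy as np

unique_vals = np.array([10, 20, 30])
repeats = np.array([2, 1, 3])

direct = np.repeat(unique_vals, repeats)

indexer = np.repeat(np.arange(len(unique_vals)), repeats)
via_take = unique_vals.take(indexer)

# Both produce [10, 10, 20, 30, 30, 30].
assert (direct == via_take).all()

Either spelling gives the same elementwise result; the question above is which spelling fits the surrounding pandas code best.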

Member:

You could simply return final_count and handle the rest at the _union level. Alternatively, you could just pass in self and other and handle everything at the lower level. Or add another argument, unique_vals, to the function.

Member Author (lukemanley):

Thx, makes sense. I pushed the type checking down into union_with_duplicates.
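
To illustrate why the MultiIndex needs special handling inside union_with_duplicates: the extension dtype lives on the MultiIndex levels, and once the index is materialized into its flat values (an object ndarray of tuples) that information is gone. A small demonstration, not the actual internal code path, assuming MultiIndex.append preserves level dtypes (which is what the MultiIndex branch relies on):

import numpy as np
import pandas as pd

arr = pd.array([1, 1, 2], dtype="Int64")
left = pd.MultiIndex.from_arrays([arr, arr])
right = pd.MultiIndex.from_arrays([arr[::-1], arr[::-1]])

# Flattening a MultiIndex yields object-dtype tuples; the Int64 level
# dtype cannot be recovered from these values alone.
flat = np.asarray(left)
print(flat.dtype)  # object

# Operating on the MultiIndex itself keeps the level dtypes intact.
combined = left.append(right).unique()
print(combined.levels[0].dtype)  # Int64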

else:
    unique_vals = unique(concat_compat([lvals, rvals]))
    unique_vals = ensure_wrapped_if_datetimelike(unique_vals)
repeats = final_count.reindex(unique_vals).values  # type: ignore[attr-defined]
Member:

Could you add the mypy error as a comment?

Member Author (lukemanley):

I fixed/removed the mypy ignore.

@@ -578,6 +578,30 @@ def test_union_keep_ea_dtype(any_numeric_ea_dtype, val):
    tm.assert_index_equal(result, expected)


def test_union_with_duplicates_keep_ea_dtype(any_numeric_ea_dtype):
    # GH48900
    mi1 = MultiIndex.from_arrays(
Member:

What happens if there are duplicate NAs?

Member Author (lukemanley):

Works the same as non-NA. I added the duplicate-NA case to the test.
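
A sketch of what such a test could look like inside pandas' test suite (the merged test may use different values; the arrays below, including the duplicated NA, are illustrative, and any_numeric_ea_dtype is the existing fixture already used in the diff above):

import pandas as pd
from pandas import MultiIndex


def test_union_with_duplicates_keep_ea_dtype_sketch(any_numeric_ea_dtype):
    # GH48900: union of MultiIndexes with duplicates (including a duplicated
    # NA) should keep the extension dtype on every level.
    arr1 = pd.array([4, pd.NA, pd.NA, 1], dtype=any_numeric_ea_dtype)
    arr2 = pd.array([1, pd.NA, 3], dtype=any_numeric_ea_dtype)
    mi1 = MultiIndex.from_arrays([arr1, arr1])
    mi2 = MultiIndex.from_arrays([arr2, arr2])

    result = mi1.union(mi2)

    # Levels keep the nullable dtype ...
    assert all(dtype == any_numeric_ea_dtype for dtype in result.dtypes)
    # ... and the duplicated NA survives: each value appears
    # max(left_count, right_count) times -> 1, 3, 4, NA, NA.
    assert len(result) == 5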

mroeschke (Member) left a comment:

LGTM, merge when ready @phofl

@mroeschke added this to the 1.6 milestone on Oct 4, 2022
phofl (Member) left a comment:

lgtm, could you add the PR number to the whatsnew entry in the MultiIndex section for union losing extension array dtype?

@phofl merged commit c34da50 into pandas-dev:main on Oct 6, 2022
phofl (Member) commented on Oct 6, 2022:

thx @lukemanley

@mroeschke modified the milestones: 1.6 → 2.0 on Oct 13, 2022
@lukemanley deleted the union-with-duplicates branch on October 26, 2022
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022