BUG/PERF: algos.union_with_duplicates losing EA dtypes #48900


Merged: 11 commits into pandas-dev:main from union-with-duplicates, Oct 6, 2022

Conversation

lukemanley (Member):

ASV added:

       before           after         ratio
     [d76b9f2c]       [6760a19a]
     <main>           <union-with-duplicates>
-      18.6±0.2ms      5.53±0.07ms     0.30  index_object.UnionWithDuplicates.time_union_with_duplicates
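
For context, a minimal sketch of what an ASV benchmark exercising this path might look like. This is an illustration only; the actual UnionWithDuplicates class in asv_bench/benchmarks/index_object.py may set up its data differently, and the sizes below are made up.

import numpy as np
import pandas as pd


class UnionWithDuplicates:
    def setup(self):
        # Two overlapping MultiIndexes that both contain duplicate entries,
        # so union() has to go through the duplicate-aware code path.
        codes = np.random.randint(0, 1_000, size=100_000)
        self.left = pd.MultiIndex.from_arrays([codes, codes])
        self.right = pd.MultiIndex.from_arrays([codes[::-1], codes[::-1]])

    def time_union_with_duplicates(self):
        self.left.union(self.right)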

@lukemanley added the Performance, MultiIndex, and Index labels on Oct 1, 2022
for i, value in enumerate(unique_array):
    indexer += [i] * int(max(l_count.at[value], r_count.at[value]))
return unique_array.take(indexer)

repeats = final_count.reindex(unique_array).values  # type: ignore[attr-defined]
Member:
Looks good. Can you refactor this to return the indexer and operate on the MultiIndex in _union()? That avoids losing the dtype.

Member Author (lukemanley):

Refactored to preserve EA dtypes. The indexer applies to the combined unique values, not the original MultiIndexes, so I don't think we need to return it.
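
A rough standalone sketch of the approach being discussed, for readers following along. The names and exact steps are illustrative, not the actual pandas internals; the real function works on the lower-level values/indexes handed to it by _union.

import numpy as np
import pandas as pd


def union_with_duplicates_sketch(left: pd.Index, right: pd.Index) -> pd.Index:
    # Count occurrences of each value on both sides (NAs included).
    l_count = left.value_counts(dropna=False)
    r_count = right.value_counts(dropna=False)
    l_count, r_count = l_count.align(r_count, fill_value=0)

    # The union keeps each value max(left_count, right_count) times.
    final_count = np.maximum(l_count, r_count).astype("int64")

    # Build the combined uniques with dtype-preserving Index operations,
    # then repeat those uniques in place rather than materializing a plain
    # ndarray, so extension dtypes survive.
    unique_vals = left.append(right).unique()
    repeats = final_count.reindex(unique_vals).to_numpy()
    return unique_vals.repeat(repeats)

For example, with left = pd.Index([1, 1, 2], dtype="Int64") and right = pd.Index([1, 3], dtype="Int64"), this returns Index([1, 1, 2, 3], dtype='Int64') rather than falling back to object or int64.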

@lukemanley changed the title from "PERF: algos.union_with_duplicates" to "BUG/PERF: algos.union_with_duplicates losing EA dtypes" on Oct 2, 2022
for i, value in enumerate(unique_array):
    indexer += [i] * int(max(l_count.at[value], r_count.at[value]))
return unique_array.take(indexer)

final_count = np.maximum(l_count, r_count).astype("int", copy=False)
Member:

You've got two MultiIndex checks now. I'd rather handle this at the _union level. Is there any performance benefit to doing it like this?

Additionally, I think you can use unique_vals.take(repeats); that's more in line with what we do elsewhere.

Member Author (lukemanley):

I think we need the MultiIndex checks within union_with_duplicates, otherwise we lose dtypes there. I'm not sure how this could be done only at the _union level. Let me know if you have any suggestions.

The repeat logic is not quite a simple take:

np.repeat(unique_vals, repeats)

is equivalent to:

indexer = np.arange(len(unique_vals))
indexer = np.repeat(indexer, repeats)
unique_vals.take(indexer)
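
A quick standalone check of that equivalence, with made-up values:

import numpy as np

unique_vals = np.array([10, 20, 30])
repeats = np.array([2, 1, 3])

direct = np.repeat(unique_vals, repeats)

indexer = np.repeat(np.arange(len(unique_vals)), repeats)
via_take = unique_vals.take(indexer)

# Both produce [10, 10, 20, 30, 30, 30].
assert (direct == via_take).all()

Either spelling gives the same elementwise result; the question above is which spelling fits the surrounding pandas code best.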

Member:

You could simply return final_count and handle the rest at the _union level. Alternatively, you could just pass in self and other and handle everything at the lower level. Or add another argument, unique_vals, to the function.

Member Author (lukemanley):

Thx, makes sense. I pushed the type checking down into union_with_duplicates.
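
To illustrate why the MultiIndex needs special handling inside union_with_duplicates: the extension dtype lives on the MultiIndex levels, and once the index is materialized into its flat values (an object ndarray of tuples) that information is gone. A small demonstration, not the actual internal code path, assuming MultiIndex.append preserves level dtypes (which is what the MultiIndex branch relies on):

import numpy as np
import pandas as pd

arr = pd.array([1, 1, 2], dtype="Int64")
left = pd.MultiIndex.from_arrays([arr, arr])
right = pd.MultiIndex.from_arrays([arr[::-1], arr[::-1]])

# Flattening a MultiIndex yields object-dtype tuples; the Int64 level
# dtype cannot be recovered from these values alone.
flat = np.asarray(left)
print(flat.dtype)  # object

# Operating on the MultiIndex itself keeps the level dtypes intact.
combined = left.append(right).unique()
print(combined.levels[0].dtype)  # Int64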

else:
    unique_vals = unique(concat_compat([lvals, rvals]))
    unique_vals = ensure_wrapped_if_datetimelike(unique_vals)
repeats = final_count.reindex(unique_vals).values  # type: ignore[attr-defined]
Member:

Could you add the mypy error as a comment?

Member Author (lukemanley):

I fixed/removed the mypy ignore.

@@ -578,6 +578,30 @@ def test_union_keep_ea_dtype(any_numeric_ea_dtype, val):
    tm.assert_index_equal(result, expected)


def test_union_with_duplicates_keep_ea_dtype(any_numeric_ea_dtype):
    # GH48900
    mi1 = MultiIndex.from_arrays(
Member:

What happens if there are duplicate NAs?

Member Author (lukemanley):

Works the same as non-NA. I added the duplicate-NA case to the test.
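
A sketch of what such a test could look like inside pandas' test suite (the merged test may use different values; the arrays below, including the duplicated NA, are illustrative, and any_numeric_ea_dtype is the existing fixture already used in the diff above):

import pandas as pd
from pandas import MultiIndex


def test_union_with_duplicates_keep_ea_dtype_sketch(any_numeric_ea_dtype):
    # GH48900: union of MultiIndexes with duplicates (including a duplicated
    # NA) should keep the extension dtype on every level.
    arr1 = pd.array([4, pd.NA, pd.NA, 1], dtype=any_numeric_ea_dtype)
    arr2 = pd.array([1, pd.NA, 3], dtype=any_numeric_ea_dtype)
    mi1 = MultiIndex.from_arrays([arr1, arr1])
    mi2 = MultiIndex.from_arrays([arr2, arr2])

    result = mi1.union(mi2)

    # Levels keep the nullable dtype ...
    assert all(dtype == any_numeric_ea_dtype for dtype in result.dtypes)
    # ... and the duplicated NA survives: each value appears
    # max(left_count, right_count) times -> 1, 3, 4, NA, NA.
    assert len(result) == 5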

mroeschke (Member) left a comment:

LGTM, merge when ready @phofl

@mroeschke added this to the 1.6 milestone on Oct 4, 2022
phofl (Member) left a comment:

lgtm, could you add the PR number to the whatsnew entry in the MultiIndex section for union losing extension array dtype?

@phofl merged commit c34da50 into pandas-dev:main on Oct 6, 2022
phofl (Member) commented on Oct 6, 2022:

thx @lukemanley

@mroeschke modified the milestones: 1.6 → 2.0 on Oct 13, 2022
@lukemanley deleted the union-with-duplicates branch on October 26, 2022
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022