Skip to content

BUG: MultiIndex.union dropping duplicates from result #38977

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
May 26, 2021

Conversation

phofl
Copy link
Member

@phofl phofl commented Jan 5, 2021

This more or less sits on top of #36299. The current base-_union implementation works only for sorted indexes correctly if indexes containing duplicates. Hence I've only added tests for sorted indexes,

@github-actions
Copy link
Contributor

github-actions bot commented Feb 5, 2021

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Feb 5, 2021
phofl added 3 commits April 10, 2021 00:18
…ltiindex_union

� Conflicts:
�	pandas/_libs/lib.pyx
�	pandas/tests/indexes/multi/test_setops.py
�	pandas/tests/libs/test_lib.py
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Any performance implications?

@phofl
Copy link
Member Author

phofl commented Apr 11, 2021

Thx, code itself is good, but this depends on #40862, after this is merged we can remove the sort_values in the tests. Will run asvs afterwards to check for performance implications

@mroeschke mroeschke removed the Stale label Apr 11, 2021
@mroeschke mroeschke added this to the 1.3 milestone Apr 11, 2021
@jreback
Copy link
Contributor

jreback commented Apr 13, 2021

wait for #40862 right?

@phofl
Copy link
Member Author

phofl commented Apr 13, 2021

Yes

@@ -714,6 +714,7 @@ MultiIndex
- Bug in :meth:`MultiIndex.intersection` duplicating ``NaN`` in result (:issue:`38623`)
- Bug in :meth:`MultiIndex.equals` incorrectly returning ``True`` when :class:`MultiIndex` containing ``NaN`` even when they are differently ordered (:issue:`38439`)
- Bug in :meth:`MultiIndex.intersection` always returning empty when intersecting with :class:`CategoricalIndex` (:issue:`38653`)
- Bug in :meth:`MultiIndex.union` dropping duplicates from result (:issue:`38977`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put this on the original whatsnew note

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, will adjust here after the other is merged and ping then

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending merge conflicts

phofl added 2 commits April 14, 2021 20:17
…ltiindex_union

� Conflicts:
�	doc/source/whatsnew/v1.3.0.rst
�	pandas/tests/indexes/test_setops.py
@phofl
Copy link
Member Author

phofl commented Apr 14, 2021

Unfortunately this looks like a pretty high performance penalty

       before           after         ratio
     [84d9c5ed]       [f77120cc]
     <multiindex_union~1^2>       <multiindex_union>
+        77.4±2ms          159±4ms     2.05  multiindex_object.SetOperations.time_operation('non_monotonic', 'int', 'union')
+      33.4±0.5ms         57.9±1ms     1.73  multiindex_object.GetLoc.time_large_get_loc_warm
+         102±1ms          152±2ms     1.49  multiindex_object.SetOperations.time_operation('non_monotonic', 'string', 'union')
+      99.4±0.9ms          144±1ms     1.45  multiindex_object.SetOperations.time_operation('monotonic', 'string', 'union')
+         520±4ms         640±20ms     1.23  multiindex_object.SetOperations.time_operation('non_monotonic', 'datetime', 'union')

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO I think this is okay given that it fixes a bug. cc @jreback if this is okay despite the performance penalty.

@jreback
Copy link
Contributor

jreback commented Apr 30, 2021

where is the perf diff coming from now?

@phofl
Copy link
Member Author

phofl commented Apr 30, 2021

The Index union for duplicates is not that fast

@jreback
Copy link
Contributor

jreback commented Apr 30, 2021

ok this is fine if this slowdown only occurs when we have duplicates (which i think is the case)

@phofl
Copy link
Member Author

phofl commented Apr 30, 2021

I'll check the benchmarks

@phofl
Copy link
Member Author

phofl commented May 1, 2021

Unfortunately the slowdown is for also for non monotonic indexes. We could do it with only dispatching to Index.union in case of duplicates or nans, but this is makes it more complicated unfortunately.

       before           after         ratio
     [0158382a]       [692272cb]
                      <multiindex_union>
+         685±5ms         887±20ms     1.30  multiindex_object.SetOperations.time_operation('non_monotonic', 'datetime', 'intersection')

@simonjayhawkins
Copy link
Member

@phofl can you merge master

@phofl
Copy link
Member Author

phofl commented May 25, 2021

Done

@jreback jreback merged commit 282a0fb into pandas-dev:master May 26, 2021
@jreback
Copy link
Contributor

jreback commented May 26, 2021

thanks @phofl

@phofl phofl deleted the multiindex_union branch May 27, 2021 10:26
TLouf pushed a commit to TLouf/pandas that referenced this pull request Jun 1, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants