-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: MultiIndex intersection with sort=False does not preserve order #31312
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: MultiIndex intersection with sort=False does not preserve order #31312
Conversation
Very nice! @jeffzi could you pls also create an issue and reference this PR to the issue? |
7a066e6
to
8320d41
Compare
I benchmarked the intersection of 2 arrays of tuples: master Vs this PR Vs suggestion by @jreback to take inspiration of difference. I also looked at the implementation of @jreback If that's ok with you, I can commit the "mixed" version below. iimport numpy as np
import pandas as pd
from pandas.testing import assert_index_equal
def master(self, other):
return set(self._ndarray_values) & set(other._ndarray_values)
def pr(self, other):
lvals = self._ndarray_values
rvals = other._ndarray_values
other_uniq = set(rvals)
seen = set()
return [x for x in lvals if x in other_uniq and not (x in seen or seen.add(x))]
def indexer(self, other):
lvals = self._ndarray_values
if self.is_monotonic and other.is_monotonic:
rvals = other._ndarray_values
return self._inner_indexer(lvals, rvals)[0]
other_uniq = other.unique()
indexer = other_uniq.get_indexer(lvals)
indexer = indexer.take((indexer != -1).nonzero()[0])
return other_uniq.take(indexer)._ndarray_values
def mixed(self, other):
lvals = self._ndarray_values
rvals = other._ndarray_values
if self.is_monotonic and other.is_monotonic:
return self._inner_indexer(lvals, rvals)[0]
other_uniq = set(rvals)
seen = set()
return [x for x in lvals if x in other_uniq and not (x in seen or seen.add(x))]
size = 100000
left = pd.MultiIndex.from_arrays([np.arange(1, size), np.arange(1, size)])
right = pd.MultiIndex.from_arrays([np.arange(1, size//10)[::-1], np.arange(1, size//10)[::-1]])
right_monotonic = pd.MultiIndex.from_arrays([np.arange(1, size//10), np.arange(1, size//10)])
for r in [right, right_monotonic]:
print(f"\nBoth monotonic: {r.is_monotonic}\n---------------")
expected = pd.MultiIndex.from_tuples(pr(left, r))
actual_indexer = pd.MultiIndex.from_tuples(indexer(left, r))
assert_index_equal(expected, actual_indexer)
actual_mixed = pd.MultiIndex.from_tuples(mixed(left, r))
assert_index_equal(expected, actual_mixed)
print("master implementation:")
%timeit -n 10 master(left, r)
print("pr implementation:")
%timeit -n 10 pr(left, r)
print("Suggested indexer implementation:")
%timeit -n 10 indexer(left, r)
print("Mixed implementation:")
%timeit -n 10 mixed(left, r)
|
83518d1
to
5273652
Compare
632f23e
to
6b8f2d7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. can you add a whatsnew note in bugfixes for MI. ping on green.
pls merge master as well |
74f5fde
to
7fb060d
Compare
@jreback All green :) |
lgtm. @jbrockmendel |
Couple of comments/questions, no deal-breakers. looks nice |
Thanks @jeffzi great first PR |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
The intersection of 2
MultiIndex
withsort=False
does not preserve the order, whereasIndex. intersection()
does. This behavior does not seem to be intentional since the tests fordifference
andsymmetric_difference
are testing for order preservation.For example:
Created on 2020-01-25 by the reprexpy package
I verified that my PR does not decrease performances.
I also modified the test for
union
. The implementation was preserving the order withsort=False
but the tests were not verifying it.