-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
[PERF] Get rid of MultiIndex conversion in IntervalIndex.intersection #26225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/indexes/interval.py
Outdated
try: | ||
lindexer = self.left.get_indexer(other.left) | ||
rindexer = self.right.get_indexer(other.right) | ||
except Exception: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would prefer to check for duplicates rather than catch an exception
Codecov Report
@@ Coverage Diff @@
## master #26225 +/- ##
==========================================
- Coverage 91.97% 91.96% -0.02%
==========================================
Files 175 175
Lines 52379 52398 +19
==========================================
+ Hits 48178 48190 +12
- Misses 4201 4208 +7
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26225 +/- ##
==========================================
- Coverage 91.87% 91.87% -0.01%
==========================================
Files 174 174
Lines 50661 50700 +39
==========================================
+ Hits 46547 46582 +35
- Misses 4114 4118 +4
Continue to review full report at Codecov.
|
Thanks, will look at this in the next day or so. |
pandas/core/indexes/interval.py
Outdated
def intersection(self, other, sort=False): | ||
other = self._as_like_interval_index(other) | ||
|
||
# GH 19016: ensure set op will not return a prohibited dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I it worth factoring out the this check from here and _setop to a helper routine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May I add the decorator for the check when more setops in IntervalIndex's own form is implemented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you can do it here would be great; this is pretty common code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few things after poking around at this a bit.
It looks like this isn't behaving as expected for nested intervals:
In [1]: import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+472.ge9ddf02b5'
In [2]: ii1 = pd.IntervalIndex.from_tuples([(1, 2), (1, 3), (1, 4)])
In [3]: ii2 = pd.IntervalIndex.from_tuples([(1, 2), (1, 3)])
In [4]: ii1.intersection(ii2)
Out[4]:
IntervalIndex([],
closed='right',
dtype='interval[int64]')
Haven't looked into this enough to know exactly where it's failing.
It also looks like this fixes an issue that was present in the previous implementation of intersection
(probably a bug with MultiIndex.intersection
), specifically in relation to how duplicates are handled:
In [5]: ii1 = pd.IntervalIndex.from_tuples([(1, 2), (1, 2), (2, 3), (3, 4)])
In [6]: ii2 = pd.IntervalIndex.from_tuples([(1, 2), (2, 3)])
In [7]: ii1.intersection(ii2)
Out[7]:
IntervalIndex([(1, 2], (1, 2], (2, 3]],
closed='right',
dtype='interval[int64]')
On master this only returns one instance of (1, 2]
but returning both dupes is consistent with other index types, so I think the new behavior is correct:
In [8]: idx1 = pd.Index(list('aabc'))
In [9]: idx2 = pd.Index(list('ab'))
In [10]: idx1.intersection(idx2)
Out[10]: Index(['a', 'a', 'b'], dtype='object')
It'd be nice to include both the failing and newly passing examples above as test cases. I wouldn't be opposed to creating a test_setops.py
file and moving all the setops tests there, as a similar file exists for other index types.
pandas/core/indexes/interval.py
Outdated
taken = self.take(indexer) | ||
else: | ||
# duplicates | ||
lmiss = other.left.get_indexer_non_unique(self.left)[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is really complex can u split to a searate function (each case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_intersection_unique
and _intersection_non_unique
works in an opposite way.
_intersecton_unique
gathers the matches on both hand sides. _intersection_non_unique
eliminates the unmatched on the left hand side first and then the unmatched on the right hand side.
doc/source/whatsnew/v0.25.0.rst
Outdated
<<<<<<< HEAD | ||
|
||
======= | ||
>>>>>>> Remove duplicate line in whatsnew note |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like git merge
conflict cruft
return request.param | ||
|
||
|
||
class TestIntervalIndex: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you just copied and pasted, but I think is a great opportunity to move from an OOP class and go functional (and smaller tests) instead per pytest
idiom.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gfyoung Would you give me more guidelines or examples? I don't know what you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely: pytest
prefers that you define test functions directly instead of via test classes (less overhead). Notice that in a lot of your tests, you don't really use the self
parameter. And here's why: you have create_index
as an instance method, but it's really a static method (you don't use self
anywhere there).
So the first thing you do is move create_index
outside (just like your name
fixture0 and drop the self
parameter from the function definition:
def create_index(closed="right"):
return IntervalIndex.from_breaks(range(11), closed=closed)
Now let's take a look at test_union
. That's a massive test! There's actually about five tests in that one test alone. But we'll get to that step later on (separate comment). Let's focus first making this test "functional". Essentially, what that means is simply moving the test out of your class, and dropping the self
parameter from the definition.
@pytest.mark.parametrize(...)
def test_union(closed, sort):
index = create_index(closed=closed)
... # rest of test as before
This is a functional test because it's just a test function, no separate wrapping by a class definition (it doesn't have a self
parameter). Notice that I leverage the create_index
function that we just moved out of our test class right above.
BTW, for more info on pytest
to get a better feel for how it works:
https://docs.pytest.org/en/latest/
https://docs.pytest.org/en/latest/example/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gfyoung Thanks for your detail reply.
pandas/core/indexes/interval.py
Outdated
def intersection(self, other, sort=False): | ||
other = self._as_like_interval_index(other) | ||
|
||
# GH 19016: ensure set op will not return a prohibited dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you can do it here would be great; this is pretty common code
pandas/core/indexes/interval.py
Outdated
lmiss = other.left.get_indexer_non_unique(self.left)[1] | ||
rmiss = other.right.get_indexer_non_unique(self.right)[1] | ||
|
||
import functools |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imports at the top of the module
|
||
class TestIntervalIndex: | ||
|
||
@pytest.mark.parametrize("sort", [None, False]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could move this to a fixture as well (or we may already have this defined in a higher level conftest)
Did you intend to close this? |
Yes, the present approach doesn't work. And I work in another approach which compares the left hand side and the corresponding right hand side. But the order is Thanks all for the review. I will come back when I have a better idea. |
The function
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only a comment about typing, but if @jschendel is ok can do that as a followup
Can you merge master? Looks like merging the Looks good overall! Just one small corner case when duplicate NA values are present: In [1]: import pandas as pd; import numpy as np
In [2]: ii = pd.IntervalIndex([np.nan, np.nan])
In [3]: ii
Out[3]:
IntervalIndex([nan, nan],
closed='right',
dtype='interval[float64]')
In [4]: ii.intersection(ii)
Out[4]:
IntervalIndex([],
closed='right',
dtype='interval[float64]') |
It seems the failure of test_constructor_list_frames isn't due to our change in this PR. |
That's fixed on master, if you merge master and repush.
…On Tue, May 28, 2019 at 2:01 AM Mak Sze Chun ***@***.***> wrote:
It seems the failure of test_constructor_list_frames isn't due to our
change in this PR.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#26225?email_source=notifications&email_token=AAKAOIV5E2CZHTBQVZDKYGDPXTKDXA5CNFSM4HI4KVK2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWLFBOI#issuecomment-496390329>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAKAOIXDETYIIC7LKHWE2WLPXTKDXANCNFSM4HI4KVKQ>
.
|
------- | ||
taken : IntervalIndex | ||
""" | ||
mask = np.zeros(len(self), dtype=bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be an issue with this approach when dupes are present in self
and other
. For other index types, such a scenario can result in more dupes being present in the intersection than in self
. This behavior looks a bit buggy and inconsistent though, so I'm not sure if we actually want IntervalIndex
to be consistent with it.
Some examples of the buggy and inconsistent behavior with Index
:
In [2]: idx2 = pd.Index(list('aa'))
...: idx3 = pd.Index(list('aaa'))
...: idx3b = pd.Index(list('baaa'))
In [3]: idx2.intersection(idx3)
Out[3]: Index(['a', 'a', 'a', 'a'], dtype='object')
In [4]: idx3.intersection(idx3)
Out[4]: Index(['a', 'a', 'a'], dtype='object')
In [5]: idx2.intersection(idx3)
Out[5]: Index(['a', 'a', 'a', 'a'], dtype='object')
In [6]: idx2.intersection(idx3b)
Out[6]: Index(['a', 'a', 'a'], dtype='object')
It seems strange that [3]
has more dupes present than in either original index but [4]
does not. Similarly, it seems like [5]
and [6]
should be identical, as the presence of a non-intersecting element shouldn't impact the number of dupes returned.
@jreback : Do you know what the expected behavior for intersection
with dupes should be? Or if there are any dependencies on the behavior of intersection
that would dictate this?
If we treat indexes like multisets, then the intersection should contain the minimum multiplicity of dupes, e.g. idx2.intersection(idx3)
and idx3.intersection(idx2)
should both have length 2, so you maintain the property of the intersection being a subset of the original indexes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah this is weird as these are set ops
what happens (meaning how much breakage) if
- raise if left and right are not unique
- uniquify left and right
prob need to do this for all set ops
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't had time to extensively test this out but I made the two changes you suggested in indexes/base.py
for intersection
and both resulted in some breakage. Aside some from breakage in the index set ops tests, there was also some breakage in tests/reshape/test_merge.py
.
@jschendel comments here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM aside from the behavior with dupes, which is inconsistent across all implementations of intersection
. I'm fine merging this as-is and then resolving the dupe behavior globally at a later point in time once the expected behavior is determined.
@jreback : feel free to merge if the above sounds good to you. I can investigate the dupe behavior across the different set ops/indexes and create an issue at some point during this coming week.
@jreback all tests passed. |
thanks @makbigc very nice.
yeah can we create an issue to resolve this, might involve some work. |
not sure why the CI didin't fail here, but I think this PR made the benchmarks fail, see #26703: https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=12407&view=logs @makbigc can you take a look please? |
…rsection (pandas-dev#26225)" This reverts commit 5d2d6b4.
git diff upstream/master -u -- "*.py" | flake8 --diff