[PERF] Get rid of MultiIndex conversion in IntervalIndex.intersection #26225

makbigc · 2019-04-27T14:32:30Z

xref PERF: Avoid MultiIndex conversion for IntervalIndex methods #24813
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

       before           after         ratio
     [64104ec3]       [e9ddf02b]
     <master>         <GH24813-inter>
-     1.76±0.01ms        233±0.5μs     0.13  index_object.IntervalIndexMethod.time_intersection(1000)
-      58.0±0.4ms       3.94±0.2ms     0.07  index_object.IntervalIndexMethod.time_intersection(100000)

jreback · 2019-04-27T14:47:09Z

pandas/core/indexes/interval.py

+        try:
+            lindexer = self.left.get_indexer(other.left)
+            rindexer = self.right.get_indexer(other.right)
+        except Exception:


would prefer to check for duplicates rather than catch an exception

codecov · 2019-04-27T15:09:34Z

Codecov Report

Merging #26225 into master will decrease coverage by 0.01%.
The diff coverage is 85%.

@@            Coverage Diff             @@
##           master   #26225      +/-   ##
==========================================
- Coverage   91.97%   91.96%   -0.02%     
==========================================
  Files         175      175              
  Lines       52379    52398      +19     
==========================================
+ Hits        48178    48190      +12     
- Misses       4201     4208       +7

Flag	Coverage Δ
#multiple	`90.52% <85%> (-0.01%)`	⬇️
#single	`40.69% <10%> (-0.16%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/interval.py	`94.87% <85%> (-0.36%)`	⬇️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/frame.py	`96.9% <0%> (-0.12%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 64104ec...e9ddf02. Read the comment docs.

codecov · 2019-04-27T15:09:34Z

Codecov Report

Merging #26225 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26225      +/-   ##
==========================================
- Coverage   91.87%   91.87%   -0.01%     
==========================================
  Files         174      174              
  Lines       50661    50700      +39     
==========================================
+ Hits        46547    46582      +35     
- Misses       4114     4118       +4

Flag	Coverage Δ
#multiple	`90.41% <100%> (ø)`	⬆️
#single	`41.77% <29.31%> (-0.1%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/interval.py	`96.42% <100%> (+0.22%)`	⬆️
pandas/core/indexes/base.py	`96.72% <100%> (ø)`	⬆️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/frame.py	`97% <0%> (-0.12%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 891a419...ea2550a. Read the comment docs.

jschendel · 2019-04-27T16:08:46Z

Thanks, will look at this in the next day or so.

pandas/core/indexes/interval.py

jreback · 2019-04-28T15:33:56Z

pandas/core/indexes/interval.py

+    def intersection(self, other, sort=False):
+        other = self._as_like_interval_index(other)
+
+        # GH 19016: ensure set op will not return a prohibited dtype


I it worth factoring out the this check from here and _setop to a helper routine

May I add the decorator for the check when more setops in IntervalIndex's own form is implemented?

if you can do it here would be great; this is pretty common code

jschendel

A few things after poking around at this a bit.

It looks like this isn't behaving as expected for nested intervals:

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+472.ge9ddf02b5'

In [2]: ii1 = pd.IntervalIndex.from_tuples([(1, 2), (1, 3), (1, 4)])

In [3]: ii2 = pd.IntervalIndex.from_tuples([(1, 2), (1, 3)])

In [4]: ii1.intersection(ii2)
Out[4]: 
IntervalIndex([],
              closed='right',
              dtype='interval[int64]')

Haven't looked into this enough to know exactly where it's failing.

It also looks like this fixes an issue that was present in the previous implementation of intersection (probably a bug with MultiIndex.intersection), specifically in relation to how duplicates are handled:

In [5]: ii1 = pd.IntervalIndex.from_tuples([(1, 2), (1, 2), (2, 3), (3, 4)])

In [6]: ii2 = pd.IntervalIndex.from_tuples([(1, 2), (2, 3)])

In [7]: ii1.intersection(ii2)
Out[7]: 
IntervalIndex([(1, 2], (1, 2], (2, 3]],
              closed='right',
              dtype='interval[int64]')

On master this only returns one instance of (1, 2] but returning both dupes is consistent with other index types, so I think the new behavior is correct:

In [8]: idx1 = pd.Index(list('aabc'))

In [9]: idx2 = pd.Index(list('ab'))

In [10]: idx1.intersection(idx2)
Out[10]: Index(['a', 'a', 'b'], dtype='object')

It'd be nice to include both the failing and newly passing examples above as test cases. I wouldn't be opposed to creating a test_setops.py file and moving all the setops tests there, as a similar file exists for other index types.

jreback · 2019-05-01T12:14:58Z

pandas/core/indexes/interval.py

+            taken = self.take(indexer)
+        else:
+            # duplicates
+            lmiss = other.left.get_indexer_non_unique(self.left)[1]


this is really complex can u split to a searate function (each case)

_intersection_unique and _intersection_non_unique works in an opposite way.

_intersecton_unique gathers the matches on both hand sides. _intersection_non_unique eliminates the unmatched on the left hand side first and then the unmatched on the right hand side.

gfyoung · 2019-05-02T03:54:53Z

doc/source/whatsnew/v0.25.0.rst

+<<<<<<< HEAD
+
+=======
+>>>>>>> Remove duplicate line in whatsnew note


Looks like git merge conflict cruft

gfyoung · 2019-05-02T03:56:00Z

pandas/tests/indexes/interval/test_setops.py

+    return request.param
+
+
+class TestIntervalIndex:


I know you just copied and pasted, but I think is a great opportunity to move from an OOP class and go functional (and smaller tests) instead per pytest idiom.

@gfyoung Would you give me more guidelines or examples? I don't know what you want.

Absolutely: pytest prefers that you define test functions directly instead of via test classes (less overhead). Notice that in a lot of your tests, you don't really use the self parameter. And here's why: you have create_index as an instance method, but it's really a static method (you don't use self anywhere there).

So the first thing you do is move create_index outside (just like your name fixture0 and drop the self parameter from the function definition:

def create_index(closed="right"): return IntervalIndex.from_breaks(range(11), closed=closed)

Now let's take a look at test_union. That's a massive test! There's actually about five tests in that one test alone. But we'll get to that step later on (separate comment). Let's focus first making this test "functional". Essentially, what that means is simply moving the test out of your class, and dropping the self parameter from the definition.

@pytest.mark.parametrize(...) def test_union(closed, sort): index = create_index(closed=closed) ... # rest of test as before

This is a functional test because it's just a test function, no separate wrapping by a class definition (it doesn't have a self parameter). Notice that I leverage the create_index function that we just moved out of our test class right above.

BTW, for more info on pytest to get a better feel for how it works:

https://docs.pytest.org/en/latest/
https://docs.pytest.org/en/latest/example/

@gfyoung Thanks for your detail reply.

pandas/core/indexes/interval.py

jreback · 2019-05-05T21:48:52Z

pandas/core/indexes/interval.py

+    def intersection(self, other, sort=False):
+        other = self._as_like_interval_index(other)
+
+        # GH 19016: ensure set op will not return a prohibited dtype


if you can do it here would be great; this is pretty common code

pandas/core/indexes/interval.py

jreback · 2019-05-05T21:49:22Z

pandas/core/indexes/interval.py

+        lmiss = other.left.get_indexer_non_unique(self.left)[1]
+        rmiss = other.right.get_indexer_non_unique(self.right)[1]
+
+        import functools


imports at the top of the module

jreback · 2019-05-05T21:50:15Z

pandas/tests/indexes/interval/test_setops.py

+
+class TestIntervalIndex:
+
+    @pytest.mark.parametrize("sort", [None, False])


could move this to a fixture as well (or we may already have this defined in a higher level conftest)

TomAugspurger · 2019-05-24T12:11:48Z

Did you intend to close this?

makbigc · 2019-05-24T14:23:22Z

Yes, the present approach doesn't work. And I work in another approach which compares the left hand side and the corresponding right hand side. But the order is n**2. The performance is worsen.

Thanks all for the review. I will come back when I have a better idea.

makbigc · 2019-05-25T16:50:18Z

The function get_indexer_non_unique is used to screen out avoiding the worst case of order n**2.

All benchmarks:

       before           after         ratio
     [ae409049]       [402b09cf]
     <master>         <GH24813-inter>
-     1.76±0.02ms          234±3μs     0.13  index_object.IntervalIndexMethod.time_intersection(1000)
-        58.7±2ms      3.82±0.05ms     0.07  index_object.IntervalIndexMethod.time_intersection(100000)
-     1.61±0.01ms          504±2μs     0.31  index_object.IntervalIndexMethod.time_intersection_duplicate(1000)
-        61.4±2ms       34.5±0.7ms     0.56  index_object.IntervalIndexMethod.time_intersection_duplicate(100000)

jreback

only a comment about typing, but if @jschendel is ok can do that as a followup

pandas/core/indexes/interval.py

jschendel · 2019-05-28T04:34:04Z

Can you merge master? Looks like merging the IntervalIndex.is_unique PR caused a conflict.

Looks good overall! Just one small corner case when duplicate NA values are present:

In [1]: import pandas as pd; import numpy as np

In [2]: ii = pd.IntervalIndex([np.nan, np.nan])

In [3]: ii
Out[3]: 
IntervalIndex([nan, nan],
              closed='right',
              dtype='interval[float64]')

In [4]: ii.intersection(ii)
Out[4]: 
IntervalIndex([],
              closed='right',
              dtype='interval[float64]')

makbigc · 2019-05-28T07:01:07Z

It seems the failure of test_constructor_list_frames isn't due to our change in this PR.

TomAugspurger · 2019-05-28T20:38:17Z

That's fixed on master, if you merge master and repush.

…

On Tue, May 28, 2019 at 2:01 AM Mak Sze Chun ***@***.***> wrote: It seems the failure of test_constructor_list_frames isn't due to our change in this PR. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#26225?email_source=notifications&email_token=AAKAOIV5E2CZHTBQVZDKYGDPXTKDXA5CNFSM4HI4KVK2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWLFBOI#issuecomment-496390329>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAKAOIXDETYIIC7LKHWE2WLPXTKDXANCNFSM4HI4KVKQ> .

jreback · 2019-05-30T01:35:38Z

@jschendel

jschendel · 2019-05-30T06:09:54Z

pandas/core/indexes/interval.py

+        -------
+        taken : IntervalIndex
+        """
+        mask = np.zeros(len(self), dtype=bool)


There might be an issue with this approach when dupes are present in self and other. For other index types, such a scenario can result in more dupes being present in the intersection than in self. This behavior looks a bit buggy and inconsistent though, so I'm not sure if we actually want IntervalIndex to be consistent with it.

Some examples of the buggy and inconsistent behavior with Index:

In [2]: idx2 = pd.Index(list('aa')) ...: idx3 = pd.Index(list('aaa')) ...: idx3b = pd.Index(list('baaa')) In [3]: idx2.intersection(idx3) Out[3]: Index(['a', 'a', 'a', 'a'], dtype='object') In [4]: idx3.intersection(idx3) Out[4]: Index(['a', 'a', 'a'], dtype='object') In [5]: idx2.intersection(idx3) Out[5]: Index(['a', 'a', 'a', 'a'], dtype='object') In [6]: idx2.intersection(idx3b) Out[6]: Index(['a', 'a', 'a'], dtype='object')

It seems strange that [3] has more dupes present than in either original index but [4] does not. Similarly, it seems like [5] and [6] should be identical, as the presence of a non-intersecting element shouldn't impact the number of dupes returned.

@jreback : Do you know what the expected behavior for intersection with dupes should be? Or if there are any dependencies on the behavior of intersection that would dictate this?

If we treat indexes like multisets, then the intersection should contain the minimum multiplicity of dupes, e.g. idx2.intersection(idx3) and idx3.intersection(idx2) should both have length 2, so you maintain the property of the intersection being a subset of the original indexes.

yeah this is weird as these are set ops

what happens (meaning how much breakage) if

raise if left and right are not unique

uniquify left and right

prob need to do this for all set ops

Haven't had time to extensively test this out but I made the two changes you suggested in indexes/base.py for intersection and both resulted in some breakage. Aside some from breakage in the index set ops tests, there was also some breakage in tests/reshape/test_merge.py.

jreback · 2019-06-02T23:57:45Z

@jschendel comments here

jschendel

LGTM aside from the behavior with dupes, which is inconsistent across all implementations of intersection. I'm fine merging this as-is and then resolving the dupe behavior globally at a later point in time once the expected behavior is determined.

@jreback : feel free to merge if the above sounds good to you. I can investigate the dupe behavior across the different set ops/indexes and create an issue at some point during this coming week.

makbigc · 2019-06-06T08:53:10Z

@jreback all tests passed.

jreback · 2019-06-06T14:31:01Z

thanks @makbigc very nice.

LGTM aside from the behavior with dupes, which is inconsistent across all implementations of intersection. I'm fine merging this as-is and then resolving the dupe behavior globally at a later point in time once the expected behavior is determined.

yeah can we create an issue to resolve this, might involve some work.

datapythonista · 2019-06-07T12:24:49Z

not sure why the CI didin't fail here, but I think this PR made the benchmarks fail, see #26703: https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=12407&view=logs

@makbigc can you take a look please?

…rsection (pandas-dev#26225)" This reverts commit 5d2d6b4.

jreback requested changes Apr 27, 2019

View reviewed changes

jschendel added Interval Interval data type Performance Memory or execution speed performance labels Apr 27, 2019

jreback added this to the 0.25.0 milestone Apr 28, 2019

jschendel mentioned this pull request Apr 28, 2019

PERF: Avoid MultiIndex conversion for IntervalIndex methods #24813

Closed

9 tasks

jreback requested changes Apr 28, 2019

View reviewed changes

jschendel suggested changes Apr 30, 2019

View reviewed changes

makbigc force-pushed the GH24813-inter branch from e9ddf02 to cdf8209 Compare May 1, 2019 10:47

jreback reviewed May 1, 2019

View reviewed changes

makbigc force-pushed the GH24813-inter branch from cbd1616 to b212dd9 Compare May 1, 2019 14:49

gfyoung reviewed May 2, 2019

View reviewed changes

makbigc force-pushed the GH24813-inter branch from b212dd9 to 7059a0b Compare May 5, 2019 06:17

jreback requested changes May 5, 2019

View reviewed changes

makbigc added 13 commits May 10, 2019 16:04

Gid rid of MultiIndex conversion in IntervalIndex.intersection

a5a1272

Add benchmark for IntervalIndex.intersection

3cd095a

clear code

0486a4e

Add whatsnew note

09c89f1

Modity the case for duplicate index

841a0b7

Combine the set operation to find indexer into one

8b22623

Move setops tests to test_setops.py and add two tests

32d4005

Remove relundant line

d502fcb

Remove duplicate line in whatsnew note

8ec6366

Isort interval/test_setops.py

6000904

Split the intersection into two sub-functions

7cb7d2c

Functionalize some indexes

bcf36bb

Remove relundant lines in whatsnew

745c0bb

makbigc closed this May 24, 2019

makbigc added 2 commits May 26, 2019 00:00

2nd approach

ab67edd

Add a new benchmark

402b09c

makbigc reopened this May 25, 2019

makbigc added 2 commits May 26, 2019 01:06

Fix linting issue

b4f130d

Change the decorator name to SetopCheck

3ff4c64

jreback approved these changes May 26, 2019

View reviewed changes

pandas/core/indexes/interval.py Show resolved Hide resolved

makbigc added 2 commits May 28, 2019 13:46

Amend and add test for a more corner case

3db3130

Merge commit to resolve conflict

1f25adb

merge master

4a9cd29

jschendel reviewed May 30, 2019

View reviewed changes

jschendel approved these changes Jun 3, 2019

View reviewed changes

makbigc added 2 commits June 4, 2019 22:15

merge

1467e94

merge again

ea2550a

jreback merged commit 5d2d6b4 into pandas-dev:master Jun 6, 2019

datapythonista mentioned this pull request Jun 7, 2019

Broken benchmarks #26709

Closed

TomAugspurger added a commit to TomAugspurger/pandas that referenced this pull request Jun 7, 2019

Revert "[PERF] Get rid of MultiIndex conversion in IntervalIndex.inte…

e0fbcaf

…rsection (pandas-dev#26225)" This reverts commit 5d2d6b4.

makbigc mentioned this pull request Jun 15, 2019

Add typing annotation to IntervalIndex.intersection #26870

Merged

qwhelan mentioned this pull request Jul 20, 2019

PERF: speed up IntervalIndex._intersection_non_unique by ~50x #27489

Merged

5 tasks


		class TestIntervalIndex:

		@pytest.mark.parametrize("sort", [None, False])

Uh oh!

[PERF] Get rid of MultiIndex conversion in IntervalIndex.intersection #26225

[PERF] Get rid of MultiIndex conversion in IntervalIndex.intersection #26225

Uh oh!

Conversation

makbigc commented Apr 27, 2019 • edited by jreback Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

codecov bot commented Apr 27, 2019

Codecov Report

Uh oh!

codecov bot commented Apr 27, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jschendel commented Apr 27, 2019

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jschendel left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gfyoung May 2, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gfyoung May 4, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TomAugspurger commented May 24, 2019

Uh oh!

makbigc commented May 24, 2019

Uh oh!

makbigc commented May 25, 2019

Uh oh!

jreback left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jschendel commented May 28, 2019

Uh oh!

makbigc commented May 28, 2019

Uh oh!

TomAugspurger commented May 28, 2019 via email • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jreback commented May 30, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

makbigc commented Apr 27, 2019 •

edited by jreback

Loading

codecov bot commented Apr 27, 2019 •

edited

Loading

gfyoung May 2, 2019 •

edited

Loading

gfyoung May 4, 2019 •

edited

Loading

TomAugspurger commented May 28, 2019 via email •

edited

Loading

jschendel left a comment •

edited

Loading