Skip to content

[PERF] Get rid of MultiIndex conversion in IntervalIndex.intersection #26225

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 33 commits into from
Jun 6, 2019

Conversation

makbigc
Copy link
Contributor

@makbigc makbigc commented Apr 27, 2019

       before           after         ratio
     [64104ec3]       [e9ddf02b]
     <master>         <GH24813-inter>
-     1.76±0.01ms        233±0.5μs     0.13  index_object.IntervalIndexMethod.time_intersection(1000)
-      58.0±0.4ms       3.94±0.2ms     0.07  index_object.IntervalIndexMethod.time_intersection(100000)

try:
lindexer = self.left.get_indexer(other.left)
rindexer = self.right.get_indexer(other.right)
except Exception:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would prefer to check for duplicates rather than catch an exception

@codecov
Copy link

codecov bot commented Apr 27, 2019

Codecov Report

Merging #26225 into master will decrease coverage by 0.01%.
The diff coverage is 85%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26225      +/-   ##
==========================================
- Coverage   91.97%   91.96%   -0.02%     
==========================================
  Files         175      175              
  Lines       52379    52398      +19     
==========================================
+ Hits        48178    48190      +12     
- Misses       4201     4208       +7
Flag Coverage Δ
#multiple 90.52% <85%> (-0.01%) ⬇️
#single 40.69% <10%> (-0.16%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/interval.py 94.87% <85%> (-0.36%) ⬇️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 96.9% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 64104ec...e9ddf02. Read the comment docs.

@codecov
Copy link

codecov bot commented Apr 27, 2019

Codecov Report

Merging #26225 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #26225      +/-   ##
==========================================
- Coverage   91.87%   91.87%   -0.01%     
==========================================
  Files         174      174              
  Lines       50661    50700      +39     
==========================================
+ Hits        46547    46582      +35     
- Misses       4114     4118       +4
Flag Coverage Δ
#multiple 90.41% <100%> (ø) ⬆️
#single 41.77% <29.31%> (-0.1%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/interval.py 96.42% <100%> (+0.22%) ⬆️
pandas/core/indexes/base.py 96.72% <100%> (ø) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 891a419...ea2550a. Read the comment docs.

@jschendel jschendel added Interval Interval data type Performance Memory or execution speed performance labels Apr 27, 2019
@jschendel
Copy link
Member

Thanks, will look at this in the next day or so.

def intersection(self, other, sort=False):
other = self._as_like_interval_index(other)

# GH 19016: ensure set op will not return a prohibited dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I it worth factoring out the this check from here and _setop to a helper routine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I add the decorator for the check when more setops in IntervalIndex's own form is implemented?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you can do it here would be great; this is pretty common code

Copy link
Member

@jschendel jschendel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things after poking around at this a bit.

It looks like this isn't behaving as expected for nested intervals:

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.25.0.dev0+472.ge9ddf02b5'

In [2]: ii1 = pd.IntervalIndex.from_tuples([(1, 2), (1, 3), (1, 4)])

In [3]: ii2 = pd.IntervalIndex.from_tuples([(1, 2), (1, 3)])

In [4]: ii1.intersection(ii2)
Out[4]: 
IntervalIndex([],
              closed='right',
              dtype='interval[int64]')

Haven't looked into this enough to know exactly where it's failing.

It also looks like this fixes an issue that was present in the previous implementation of intersection (probably a bug with MultiIndex.intersection), specifically in relation to how duplicates are handled:

In [5]: ii1 = pd.IntervalIndex.from_tuples([(1, 2), (1, 2), (2, 3), (3, 4)])

In [6]: ii2 = pd.IntervalIndex.from_tuples([(1, 2), (2, 3)])

In [7]: ii1.intersection(ii2)
Out[7]: 
IntervalIndex([(1, 2], (1, 2], (2, 3]],
              closed='right',
              dtype='interval[int64]')

On master this only returns one instance of (1, 2] but returning both dupes is consistent with other index types, so I think the new behavior is correct:

In [8]: idx1 = pd.Index(list('aabc'))

In [9]: idx2 = pd.Index(list('ab'))

In [10]: idx1.intersection(idx2)
Out[10]: Index(['a', 'a', 'b'], dtype='object')

It'd be nice to include both the failing and newly passing examples above as test cases. I wouldn't be opposed to creating a test_setops.py file and moving all the setops tests there, as a similar file exists for other index types.

taken = self.take(indexer)
else:
# duplicates
lmiss = other.left.get_indexer_non_unique(self.left)[1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is really complex can u split to a searate function (each case)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_​intersection_unique and _intersection_non_unique works in an opposite way.

_intersecton_unique gathers the matches on both hand sides. _intersection_non_unique eliminates the unmatched on the left hand side first and then the unmatched on the right hand side.

<<<<<<< HEAD

=======
>>>>>>> Remove duplicate line in whatsnew note
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like git merge conflict cruft

return request.param


class TestIntervalIndex:
Copy link
Member

@gfyoung gfyoung May 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you just copied and pasted, but I think is a great opportunity to move from an OOP class and go functional (and smaller tests) instead per pytest idiom.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gfyoung Would you give me more guidelines or examples? I don't know what you want.

Copy link
Member

@gfyoung gfyoung May 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely: pytest prefers that you define test functions directly instead of via test classes (less overhead). Notice that in a lot of your tests, you don't really use the self parameter. And here's why: you have create_index as an instance method, but it's really a static method (you don't use self anywhere there).

So the first thing you do is move create_index outside (just like your name fixture0 and drop the self parameter from the function definition:

def create_index(closed="right"):
    return IntervalIndex.from_breaks(range(11), closed=closed)

Now let's take a look at test_union. That's a massive test! There's actually about five tests in that one test alone. But we'll get to that step later on (separate comment). Let's focus first making this test "functional". Essentially, what that means is simply moving the test out of your class, and dropping the self parameter from the definition.

@pytest.mark.parametrize(...)
def test_union(closed, sort):
    index = create_index(closed=closed)
    ...   # rest of test as before

This is a functional test because it's just a test function, no separate wrapping by a class definition (it doesn't have a self parameter). Notice that I leverage the create_index function that we just moved out of our test class right above.

BTW, for more info on pytest to get a better feel for how it works:

https://docs.pytest.org/en/latest/
https://docs.pytest.org/en/latest/example/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gfyoung Thanks for your detail reply.

def intersection(self, other, sort=False):
other = self._as_like_interval_index(other)

# GH 19016: ensure set op will not return a prohibited dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you can do it here would be great; this is pretty common code

lmiss = other.left.get_indexer_non_unique(self.left)[1]
rmiss = other.right.get_indexer_non_unique(self.right)[1]

import functools
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imports at the top of the module


class TestIntervalIndex:

@pytest.mark.parametrize("sort", [None, False])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could move this to a fixture as well (or we may already have this defined in a higher level conftest)

@makbigc makbigc closed this May 24, 2019
@TomAugspurger
Copy link
Contributor

Did you intend to close this?

@makbigc
Copy link
Contributor Author

makbigc commented May 24, 2019

Yes, the present approach doesn't work. And I work in another approach which compares the left hand side and the corresponding right hand side. But the order is n**2. The performance is worsen.

Thanks all for the review. I will come back when I have a better idea.

@makbigc makbigc reopened this May 25, 2019
@makbigc
Copy link
Contributor Author

makbigc commented May 25, 2019

The function get_indexer_non_unique is used to screen out avoiding the worst case of order n**2.

All benchmarks:

       before           after         ratio
     [ae409049]       [402b09cf]
     <master>         <GH24813-inter>
-     1.76±0.02ms          234±3μs     0.13  index_object.IntervalIndexMethod.time_intersection(1000)
-        58.7±2ms      3.82±0.05ms     0.07  index_object.IntervalIndexMethod.time_intersection(100000)
-     1.61±0.01ms          504±2μs     0.31  index_object.IntervalIndexMethod.time_intersection_duplicate(1000)
-        61.4±2ms       34.5±0.7ms     0.56  index_object.IntervalIndexMethod.time_intersection_duplicate(100000)

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only a comment about typing, but if @jschendel is ok can do that as a followup

@jschendel
Copy link
Member

Can you merge master? Looks like merging the IntervalIndex.is_unique PR caused a conflict.

Looks good overall! Just one small corner case when duplicate NA values are present:

In [1]: import pandas as pd; import numpy as np

In [2]: ii = pd.IntervalIndex([np.nan, np.nan])

In [3]: ii
Out[3]: 
IntervalIndex([nan, nan],
              closed='right',
              dtype='interval[float64]')

In [4]: ii.intersection(ii)
Out[4]: 
IntervalIndex([],
              closed='right',
              dtype='interval[float64]')

@makbigc
Copy link
Contributor Author

makbigc commented May 28, 2019

It seems the failure of test_constructor_list_frames isn't due to our change in this PR.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented May 28, 2019 via email

@jreback
Copy link
Contributor

jreback commented May 30, 2019

@jschendel

-------
taken : IntervalIndex
"""
mask = np.zeros(len(self), dtype=bool)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be an issue with this approach when dupes are present in self and other. For other index types, such a scenario can result in more dupes being present in the intersection than in self. This behavior looks a bit buggy and inconsistent though, so I'm not sure if we actually want IntervalIndex to be consistent with it.

Some examples of the buggy and inconsistent behavior with Index:

In [2]: idx2 = pd.Index(list('aa')) 
   ...: idx3 = pd.Index(list('aaa')) 
   ...: idx3b = pd.Index(list('baaa'))

In [3]: idx2.intersection(idx3)
Out[3]: Index(['a', 'a', 'a', 'a'], dtype='object')

In [4]: idx3.intersection(idx3)
Out[4]: Index(['a', 'a', 'a'], dtype='object')

In [5]: idx2.intersection(idx3)
Out[5]: Index(['a', 'a', 'a', 'a'], dtype='object')

In [6]: idx2.intersection(idx3b)
Out[6]: Index(['a', 'a', 'a'], dtype='object')

It seems strange that [3] has more dupes present than in either original index but [4] does not. Similarly, it seems like [5] and [6] should be identical, as the presence of a non-intersecting element shouldn't impact the number of dupes returned.

@jreback : Do you know what the expected behavior for intersection with dupes should be? Or if there are any dependencies on the behavior of intersection that would dictate this?

If we treat indexes like multisets, then the intersection should contain the minimum multiplicity of dupes, e.g. idx2.intersection(idx3) and idx3.intersection(idx2) should both have length 2, so you maintain the property of the intersection being a subset of the original indexes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this is weird as these are set ops

what happens (meaning how much breakage) if

  • raise if left and right are not unique
  • uniquify left and right

prob need to do this for all set ops

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't had time to extensively test this out but I made the two changes you suggested in indexes/base.py for intersection and both resulted in some breakage. Aside some from breakage in the index set ops tests, there was also some breakage in tests/reshape/test_merge.py.

@jreback
Copy link
Contributor

jreback commented Jun 2, 2019

@jschendel comments here

Copy link
Member

@jschendel jschendel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM aside from the behavior with dupes, which is inconsistent across all implementations of intersection. I'm fine merging this as-is and then resolving the dupe behavior globally at a later point in time once the expected behavior is determined.

@jreback : feel free to merge if the above sounds good to you. I can investigate the dupe behavior across the different set ops/indexes and create an issue at some point during this coming week.

@makbigc
Copy link
Contributor Author

makbigc commented Jun 6, 2019

@jreback all tests passed.

@jreback jreback merged commit 5d2d6b4 into pandas-dev:master Jun 6, 2019
@jreback
Copy link
Contributor

jreback commented Jun 6, 2019

thanks @makbigc very nice.

LGTM aside from the behavior with dupes, which is inconsistent across all implementations of intersection. I'm fine merging this as-is and then resolving the dupe behavior globally at a later point in time once the expected behavior is determined.

yeah can we create an issue to resolve this, might involve some work.

@datapythonista
Copy link
Member

not sure why the CI didin't fail here, but I think this PR made the benchmarks fail, see #26703: https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=12407&view=logs

@makbigc can you take a look please?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Interval Interval data type Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants