
Get rid of MultiIndex conversion in IntervalIndex.is_unique #25159


Closed · makbigc wants to merge 4 commits

Conversation

@makbigc makbigc (Contributor) commented Feb 5, 2019

This modification doesn't actually improve performance, but the code is more self-explanatory.

@codecov codecov bot commented Feb 5, 2019

Codecov Report

Merging #25159 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25159      +/-   ##
==========================================
- Coverage   92.37%   92.37%   -0.01%     
==========================================
  Files         166      166              
  Lines       52408    52408              
==========================================
- Hits        48412    48411       -1     
- Misses       3996     3997       +1
Flag                               Coverage        Δ
#multiple                          90.79% <100%>   (ø) ⬆️
#single                            42.86% <0%>     (ø) ⬆️

Impacted Files                     Coverage        Δ
pandas/core/indexes/interval.py    95.25% <100%>   (ø) ⬆️
pandas/util/testing.py             88.04% <0%>     (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e3b0950...9c85d79. Read the comment docs.

@codecov codecov bot commented Feb 5, 2019

Codecov Report

Merging #25159 into master will decrease coverage by 0.18%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25159      +/-   ##
==========================================
- Coverage   91.45%   91.27%   -0.19%     
==========================================
  Files         172      173       +1     
  Lines       52892    53002     +110     
==========================================
+ Hits        48373    48376       +3     
- Misses       4519     4626     +107
Flag                               Coverage        Δ
#multiple                          89.83% <100%>   (-0.19%) ⬇️
#single                            41.76% <0%>     (-0.06%) ⬇️

Impacted Files                     Coverage        Δ
pandas/core/indexes/interval.py    95.25% <100%>   (ø) ⬆️
pandas/compat/chainmap.py          61.9% <0%>      (-4.77%) ⬇️
pandas/core/dtypes/cast.py         88.16% <0%>     (-3.2%) ⬇️
pandas/compat/__init__.py          58.03% <0%>     (-0.72%) ⬇️
pandas/io/date_converters.py       100% <0%>      (ø) ⬆️
pandas/compat/numpy/function.py    87.91% <0%>     (ø) ⬆️
pandas/io/parsers.py               95.34% <0%>     (ø) ⬆️
pandas/compat/chainmap_impl.py     0% <0%>        (ø)
pandas/core/strings.py             98.59% <0%>     (ø) ⬆️
pandas/tseries/offsets.py          96.69% <0%>     (ø) ⬆️
... and 13 more
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6e0f9a9...b382b17. Read the comment docs.

@jreback jreback (Contributor) commented Feb 5, 2019

does this affect perf at all? do we have an asv for this?

@jschendel jschendel (Member) left a comment

I can't find any existing benchmarks for this in asv_bench/benchmarks, so you'll need to add benchmarks there and show a performance improvement.

@@ -463,7 +463,7 @@ def is_unique(self):
         """
         Return True if the IntervalIndex contains unique elements, else False
         """
-        return self._multiindex.is_unique
+        return len(self) == len(self.unique())
@jschendel jschendel (Member) commented:

I ran a few ad hoc %timeit's locally and it looks like this actually decreases performance (after removing any caching to ensure accurate timings). The issue here is that self.unique() falls back to operating on an object dtype array, so it basically has overhead similar to that of going through a MultiIndex conversion.

Will probably require writing a custom interval-specific implementation; not sure how much of the existing code will work without adding overhead.
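One possible shape for such an interval-specific check, sketched here with plain Python (left, right) endpoint pairs rather than pandas internals; `intervals_are_unique` is a hypothetical helper, not pandas API, and it assumes numeric endpoints:

```python
import math


def intervals_are_unique(lefts, rights):
    """Pairwise uniqueness check over (left, right) interval endpoints."""
    seen = set()
    for left, right in zip(lefts, rights):
        # Normalize NaN endpoints to a sentinel so NaN "equals" NaN for
        # dedup purposes; float('nan') != float('nan') under plain comparison.
        key = (
            "nan" if isinstance(left, float) and math.isnan(left) else left,
            "nan" if isinstance(right, float) and math.isnan(right) else right,
        )
        if key in seen:
            return False
        seen.add(key)
    return True


print(intervals_are_unique([0, 0, 0], [1, 2, 3]))  # True: pairs all differ
print(intervals_are_unique([0, 0], [1, 1]))        # False: duplicate pair
```

Because it hashes each pair once and short-circuits on the first duplicate, this avoids both the MultiIndex conversion and the object-dtype fallback.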

@jschendel jschendel added Performance Memory or execution speed performance Interval Interval data type labels Feb 6, 2019
@jreback jreback (Contributor) commented Mar 20, 2019

can you merge master and report on the asv's for this (and add if need be)

@makbigc makbigc (Contributor, Author) commented Mar 21, 2019

As jschendel said, this change slows down performance: IntervalArray.unique and algo.unique also convert to an array of Interval objects in the first place.


All benchmarks:

       before           after         ratio
     [33f91d8f]       [9c23aac9]
     <master>         <is_unique>
+         249±4ns          378±3ms 1519132.92  index_object.IntervalIndexMethod.time_is_unique

@jreback jreback (Contributor) commented Mar 22, 2019

@makbigc the asv's that you added look good. can you remove the actual perf change.

@@ -181,4 +181,16 @@ def time_get_loc(self):
         self.ind.get_loc(0)


+class IntervalIndexMethod(object):
+    # GH 24813
Contributor review comment:
is this where we asv on is_unique?

@makbigc makbigc (Contributor, Author) commented Mar 24, 2019

I wrote a new IntervalIndex.is_unique which did improve the performance.

All benchmarks:

       before           after         ratio
     [33f91d8f]       [b382b173]
     <v0.24.0~79>       <is_unique>
-         249±4ns        113±0.5ns     0.45  index_object.IntervalIndexMethod.time_is_unique

+        right = np.append(np.arange(1, N + 1), np.array(1))
+        self.intv = IntervalIndex.from_arrays(left, right)
+
+    def time_is_unique(self):
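A fuller sketch of the benchmark class shown in fragments above. The class name, GH reference, `right` line, and `time_is_unique` come from the diff; the imports, the value of `N`, and the `left` construction are assumptions filled in so the example runs:

```python
import numpy as np
from pandas import IntervalIndex


class IntervalIndexMethod(object):
    # GH 24813
    def setup(self):
        N = 10**3
        # Assumed: `left` mirrors `right`, appending a duplicate endpoint
        # pair so the index contains a repeated interval (0, 1).
        left = np.append(np.arange(N), np.array(0))
        right = np.append(np.arange(1, N + 1), np.array(1))
        self.intv = IntervalIndex.from_arrays(left, right)

    def time_is_unique(self):
        self.intv.is_unique
```

asv calls `setup` once per benchmark run, then times `time_is_unique` repeatedly, so the duplicate interval ensures `is_unique` cannot trivially short-circuit.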
Contributor review comment:
can you add another benchmark with N**3, e.g. a small case

@@ -463,7 +463,13 @@ def is_unique(self):
         """
         Return True if the IntervalIndex contains unique elements, else False
         """
-        return self._multiindex.is_unique
+        left = self.values.left
Contributor review comment:
why isn't the answer just: self.left.is_unique & self.right.is_unique

@jschendel jschendel (Member) commented Mar 24, 2019
That's a little too strict; you don't necessarily need left or right uniqueness, only pairwise uniqueness, as you can have duplicate endpoints on one side as long as the other side also isn't the same, e.g. [Interval(0, 1), Interval(0, 2), Interval(0, 3)] should be unique.
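The distinction can be sketched with plain tuples standing in for `Interval` objects:

```python
# (left, right) tuples stand in for Interval objects here.
intervals = [(0, 1), (0, 2), (0, 3)]

# Per-side uniqueness fails: the left endpoint 0 repeats.
lefts = [left for left, right in intervals]
left_is_unique = len(set(lefts)) == len(lefts)

# Pairwise uniqueness holds: each (left, right) pair is distinct.
pairwise_is_unique = len(set(intervals)) == len(intervals)

print(left_is_unique, pairwise_is_unique)  # False True
```

So `self.left.is_unique & self.right.is_unique` would wrongly report this index as non-unique.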

Contributor review comment:

I think you can simply construct a numpy array and then check it via np.unique, by first reshaping.

@makbigc makbigc (Contributor, Author) commented Apr 7, 2019

I tried the approach below:

    arr = np.array(list(zip(self.values.left, self.values.right)))
    np.unique(arr, axis=0)

But np.unique cannot filter out the duplicate np.nan values:

    In [2]: arr = np.array([1, 3, 3, np.nan, np.nan])
    In [3]: np.unique(arr)
    Out[3]: array([ 1.,  3., nan, nan])

I'm still looking for another way.
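For context, the root cause is that IEEE 754 NaN compares unequal to itself, so any comparison-based dedup treats each NaN as a new value; a quick illustration in plain Python:

```python
nan = float("nan")

# IEEE 754: NaN is unequal to everything, including itself.
print(nan == nan)  # False

# A Python set short-circuits on object identity, so the *same* NaN
# object dedups...
print(len({nan, nan}))  # 1

# ...but two distinct NaN objects are both kept.
print(len({float("nan"), float("nan")}))  # 2
```

This is why a sort-and-compare routine like np.unique (on the numpy version used at the time) leaves duplicate NaNs behind.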

Contributor review comment:
use pd.unique
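A sketch of why pd.unique works here: it dedups via a hash table that treats all NaNs as the same value, rather than by sorting and comparing neighbors:

```python
import numpy as np
import pandas as pd

arr = np.array([1, 3, 3, np.nan, np.nan])

# pd.unique hashes values, collapsing the duplicate NaN and preserving
# order of first appearance.
print(pd.unique(arr))  # [ 1.  3. nan]
```

This sidesteps the NaN comparison problem shown above without any extra normalization step.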

@jreback jreback (Contributor) commented Apr 5, 2019

can you merge master and update

@jreback jreback (Contributor) commented May 12, 2019

can you merge master

@makbigc makbigc (Contributor, Author) commented May 14, 2019

I broke something when merging. This PR is continued in #26391

@makbigc makbigc closed this May 14, 2019
Labels: Interval (Interval data type), Performance (Memory or execution speed performance)
3 participants