Get rid of MultiIndex conversion in IntervalIndex.is_unique #25159
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master   #25159      +/-   ##
==========================================
- Coverage   92.37%   92.37%   -0.01%
==========================================
  Files         166      166
  Lines       52408    52408
==========================================
- Hits        48412    48411       -1
- Misses       3996     3997       +1

Continue to review full report at Codecov.
Codecov Report
@@            Coverage Diff             @@
##           master   #25159      +/-   ##
==========================================
- Coverage   91.45%   91.27%   -0.19%
==========================================
  Files         172      173       +1
  Lines       52892    53002     +110
==========================================
+ Hits        48373    48376       +3
- Misses       4519     4626     +107

Continue to review full report at Codecov.
does this affect perf at all? do we have an asv for this?
I can't find any existing benchmarks for this in asv_bench/benchmarks, so will need to add benchmarks there and show a performance improvement.
pandas/core/indexes/interval.py
Outdated
@@ -463,7 +463,7 @@ def is_unique(self):
         """
         Return True if the IntervalIndex contains unique elements, else False
         """
-        return self._multiindex.is_unique
+        return len(self) == len(self.unique())
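For reference, a quick sanity check of the proposed expression on a hypothetical index (the example data is mine, not from the PR):

import pandas as pd

# A small IntervalIndex with one duplicated interval.
idx = pd.IntervalIndex.from_tuples([(0, 1), (1, 2), (0, 1)])

# The proposed check: unique iff dropping duplicates preserves length.
print(len(idx) == len(idx.unique()))  # False: (0, 1] appears twice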
I ran a few ad hoc %timeit's locally and it looks like this actually decreases performance (after removing any caching to ensure accurate timings). The issue here is that self.unique() falls back to operating on an object dtype array, so it basically has the same overhead issues as going through a MultiIndex conversion.
Will probably require writing a custom interval-specific implementation; not sure how much of the existing code will work without adding overhead.
can you merge master and report on the asv's for this (and add if need be)
As jschendel said, this change slowed down performance. IntervalArray.unique and algo.unique also convert to an array of Interval objects in the first place.
@makbigc the asv's that you added look good. can you remove the actual perf change.
@@ -181,4 +181,16 @@ def time_get_loc(self):
         self.ind.get_loc(0)


+class IntervalIndexMethod(object):
+    # GH 24813
is this where we asv on is_unique?
I wrote a new benchmark class:
        right = np.append(np.arange(1, N + 1), np.array(1))
        self.intv = IntervalIndex.from_arrays(left, right)

    def time_is_unique(self):
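For context, a minimal sketch of what the full benchmark class might look like, reconstructed around the excerpted lines (the left array, the choice of N, and the setup signature are my assumptions, not necessarily the PR's exact code):

import numpy as np
from pandas import IntervalIndex

class IntervalIndexMethod(object):
    # GH 24813: time IntervalIndex.is_unique on a large, non-unique index.
    def setup(self):
        N = 10**5  # assumed size; the PR may use a different N
        # Assumed left endpoints; the duplicated (0, 1] interval makes
        # the index non-unique.
        left = np.append(np.arange(N), np.array(0))
        right = np.append(np.arange(1, N + 1), np.array(1))
        self.intv = IntervalIndex.from_arrays(left, right)

    def time_is_unique(self):
        self.intv.is_unique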
can you add another benchmark with N**3, e.g. a small case
@@ -463,7 +463,13 @@ def is_unique(self):
         """
         Return True if the IntervalIndex contains unique elements, else False
         """
-        return self._multiindex.is_unique
+        left = self.values.left
why isn't the answer just: self.left.is_unique & self.right.is_unique
That's a little too strict; you don't necessarily need left or right uniqueness, only pairwise uniqueness, as you can have duplicate endpoints on one side as long as the other side isn't the same, e.g. [Interval(0, 1), Interval(0, 2), Interval(0, 3)] should be unique.
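A quick illustration of that point (hypothetical example of mine):

import pandas as pd

idx = pd.IntervalIndex.from_tuples([(0, 1), (0, 2), (0, 3)])
print(idx.left.is_unique)   # False: every left endpoint is 0
print(idx.is_unique)        # True: the (left, right) pairs are distinct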
I think you can simply construct a numpy array and check that via np.unique after reshaping
I have tried something along these lines:

arr = np.array(list(zip(self.values.left, self.values.right)))
np.unique(arr, axis=0)

But np.unique cannot filter out duplicate np.nan values:

In [2]: arr = np.array([1, 3, 3, np.nan, np.nan])

In [3]: np.unique(arr)
Out[3]: array([ 1.,  3., nan, nan])

I'm still looking for another way.
use pd.unique
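For illustration, the difference this makes on the NaN example above (my own snippet; exact reprs may vary by version):

import numpy as np
import pandas as pd

arr = np.array([1, 3, 3, np.nan, np.nan])
print(np.unique(arr))  # [ 1.  3. nan nan] -- NaN != NaN, so both survive
print(pd.unique(arr))  # [ 1.  3. nan]     -- pd.unique collapses the NaNs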
can you merge master and update
can you merge master
I broke something when merging. This PR is continued in #26391
Actually, this modification doesn't improve performance, but the code is more self-explanatory.
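Putting the thread's suggestions together, a hedged sketch of an interval-specific check (my own illustration, not necessarily what landed in #26391; the helper name interval_is_unique is hypothetical). pd.factorize maps all NaNs to the same sentinel code, which sidesteps the np.unique NaN problem while keeping the pairwise check vectorized:

import numpy as np
import pandas as pd

def interval_is_unique(idx):
    # Encode each endpoint array as integer codes; unlike np.unique on
    # floats, factorize assigns every NaN the same sentinel code (-1).
    left_codes, _ = pd.factorize(idx.left)
    right_codes, _ = pd.factorize(idx.right)
    # Unique iff the (left, right) code pairs contain no duplicate rows.
    pairs = np.stack([left_codes, right_codes], axis=1)
    return len(np.unique(pairs, axis=0)) == len(pairs)

idx = pd.IntervalIndex.from_tuples([(0, 1), (0, 2), (0, 3)])
print(interval_is_unique(idx))  # True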