[PERF] Get rid of MultiIndex conversion in IntervalIndex.is_unique #26391
Conversation
Codecov Report
```diff
@@ Coverage Diff @@
## master #26391 +/- ##
==========================================
- Coverage 91.69% 91.68% -0.01%
==========================================
Files 174 174
Lines 50749 50763 +14
==========================================
+ Hits 46534 46543 +9
- Misses 4215 4220 +5
```
Continue to review full report at Codecov.
Codecov Report
```diff
@@ Coverage Diff @@
## master #26391 +/- ##
==========================================
- Coverage 91.69% 91.69% -0.01%
==========================================
Files 174 174
Lines 50749 50754 +5
==========================================
+ Hits 46534 46537 +3
- Misses 4215 4217 +2
```
Continue to review full report at Codecov.
lgtm. @jschendel
Thanks, the approach looks good overall, just have some comments on the implementation.
pandas/core/indexes/interval.py (Outdated)
```python
return False
return True

if len(self) - len(self.dropna()) > 1:
```
I think `self.isna().sum() > 1` is a little more idiomatic and performant.
Doing single runs to avoid caching:
```python
In [2]: ii = pd.interval_range(0, 10**5)

In [3]: ii_nan = ii.insert(1, np.nan).insert(12345, np.nan)

In [4]: %timeit -r 1 -n 1 ii.isna().sum() > 1
435 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [5]: %timeit -r 1 -n 1 ii_nan.isna().sum() > 1
444 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [6]: ii = pd.interval_range(0, 10**5)

In [7]: ii_nan = ii.insert(1, np.nan).insert(12345, np.nan)

In [8]: %timeit -r 1 -n 1 len(ii) - len(ii.dropna()) > 1
677 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [9]: %timeit -r 1 -n 1 len(ii_nan) - len(ii_nan.dropna()) > 1
2.18 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
```
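As a small sanity check (not from the PR itself; the index sizes below are illustrative), the two NaN-counting expressions can be verified to agree:

```python
import numpy as np
import pandas as pd

# Illustrative check: both expressions agree on whether an IntervalIndex
# contains more than one missing value.
ii = pd.interval_range(0, 100)
ii_nan = ii.insert(1, np.nan).insert(50, np.nan)

results = [
    (idx.isna().sum() > 1, len(idx) - len(idx.dropna()) > 1)
    for idx in (ii, ii_nan)
]
```

Both approaches give the same boolean answer; the difference discussed above is purely one of speed.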
pandas/core/indexes/interval.py (Outdated)
```diff
@@ -461,7 +461,28 @@ def is_unique(self):
         """
         Return True if the IntervalIndex contains unique elements, else False
         """
-        return self._multiindex.is_unique
+        left = self.values.left
+        right = self.values.right
```
`self.values.left` should be equivalent to `self.left`, so I think we can get by without needing to define these, and just refer to them as `self.left`/`self.right` where needed.
pandas/core/indexes/interval.py
Outdated
left = self.values.left | ||
right = self.values.right | ||
|
||
def _is_unique(left, right): |
If my previous comment is correct, I don't think we need this to be a function anymore since it's only called once, so you can just put the function's logic at the end of the method.
Can you also test out the following variant of `_is_unique`:
```python
from collections import defaultdict

def _is_unique2(left, right):
    seen_pairs = defaultdict(bool)
    check_idx = np.where(left.duplicated(keep=False))[0]
    for idx in check_idx:
        pair = (left[idx], right[idx])
        if seen_pairs[pair]:
            return False
        seen_pairs[pair] = True
    return True
```
I did a sample run of this, and it appears to be a bit more efficient:
```python
In [3]: np.random.seed(123)
   ...: left = pd.Index(np.random.randint(5, size=10**5))
   ...: right = pd.Index(np.random.randint(10**5/4, size=10**5))

In [4]: %timeit _is_unique(left, right)
3.84 ms ± 34.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit _is_unique2(left, right)
1.13 ms ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
I haven't fully tested this in all scenarios though.
HEAD adopts `_is_unique2` and HEAD~3 adopts `_is_unique`. The performance is slightly worse but the code is more explanatory.
```
before       after        ratio
[4ec1fe97]   [202b2cfa]
<intv-is-unique~3>  <intv-is-unique>
  230±0.2ns  217±2ns  0.94  index_object.IntervalIndexMethod.time_is_unique(1000)
+ 277±2ns    316±5ns  1.14  index_object.IntervalIndexMethod.time_is_unique(100000)
```
are these asv's really short? maybe have a longer one and see how this scales
Yeah, I was wondering this too; `is_unique` is cached, so I wonder if the asv is just timing the cache lookup? Does anything special need to be done to handle things that are cached?
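To make the caching concern concrete (an illustrative sketch, not part of the PR or the benchmark suite), only the first access on a given index object pays the computation cost:

```python
import pandas as pd

# Illustrative sketch: IntervalIndex.is_unique is a cached property, so
# repeated accesses on the same object return the stored result.
ii = pd.interval_range(0, 10**5)

first = ii.is_unique    # full computation runs here and the result is cached
second = ii.is_unique   # served from the cache, no recomputation

# a benchmark that reuses one index object therefore mostly times the
# cache lookup; a fresh index must be built to time the real work
fresh = pd.interval_range(0, 10**5).is_unique
```

This is why nanosecond-scale asv results for a 10**5-element index are suspicious: the timed loop is likely hitting the cache after the first iteration.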
HEAD~6 adopts `_is_unique` while HEAD adopts `_is_unique2`.
```
before       after        ratio
[4ec1fe97]   [d3af9c91]
<intv-is-unique~6>  <intv-is-unique>
  223±1ns     207±1ns   0.93  index_object.IntervalIndexMethod.time_is_unique(1000)
+ 270±5ns     302±4ns   1.12  index_object.IntervalIndexMethod.time_is_unique(100000)
  1.50±0.01s  1.84±0s  ~1.22  index_object.IntervalIndexMethod.time_is_unique(10000000)
```
pandas/core/indexes/interval.py
Outdated
if left.is_unique or right.is_unique: | ||
return True | ||
|
||
seen_pairs = defaultdict(bool) |
can you just use a set?
ah, yes, that's a lot cleaner. I was originally treating the endpoints separately and needed the dict structure, but it should be unnecessary when dealing with tuples.
```python
return True

seen_pairs = set()
check_idx = np.where(left.duplicated(keep=False))[0]
```
actually, can't you just do this:
```python
pairs = [(left[idx], right[idx]) for idx in check_idx]
return len(set(pairs)) == len(pairs)
```
The present approach may be better: `False` is returned as soon as a duplicate is found, whereas comparing the lengths requires running over all potential duplicates.
@jreback Would you let me know if there is anything else to implement?
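For reference, the two variants being compared can be sketched side by side as standalone helpers (the helper names are mine, and `left`/`right` are assumed to be pandas `Index` objects; this is not the exact PR code):

```python
import numpy as np
import pandas as pd

def is_unique_early_exit(left, right):
    # returns False as soon as a duplicated (left, right) pair is seen
    seen_pairs = set()
    check_idx = np.where(left.duplicated(keep=False))[0]
    for idx in check_idx:
        pair = (left[idx], right[idx])
        if pair in seen_pairs:
            return False
        seen_pairs.add(pair)
    return True

def is_unique_len_compare(left, right):
    # materialises every candidate pair before comparing lengths
    check_idx = np.where(left.duplicated(keep=False))[0]
    pairs = [(left[idx], right[idx]) for idx in check_idx]
    return len(set(pairs)) == len(pairs)
```

Both agree on the result; the early-exit version can stop at the first duplicate, while the length comparison always walks every candidate.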
lgtm. @jschendel merge if you are ok with this.
thanks @makbigc
```shell
git diff upstream/master -u -- "*.py" | flake8 --diff
```
This PR follows #25159 (in which I broke something when merging).
The approach this time doesn't worsen the performance.
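Putting the review threads together, the final shape of the check can be sketched as a standalone function (a hedged reconstruction of the logic discussed above, not the actual pandas method; the function name is illustrative):

```python
import numpy as np
import pandas as pd

def interval_index_is_unique(idx):
    # more than one missing value is itself a duplicate
    if idx.isna().sum() > 1:
        return False
    # if either endpoint array is already unique, every (left, right)
    # pair must be unique too
    if idx.left.is_unique or idx.right.is_unique:
        return True
    # otherwise inspect only the positions whose left endpoint repeats,
    # returning early on the first duplicated (left, right) pair
    seen_pairs = set()
    check_idx = np.where(idx.left.duplicated(keep=False))[0]
    for i in check_idx:
        pair = (idx.left[i], idx.right[i])
        if pair in seen_pairs:
            return False
        seen_pairs.add(pair)
    return True
```

The key property is that no `MultiIndex` is ever constructed: the check works directly on the `left`/`right` endpoint arrays, which is what makes it faster than the old `self._multiindex.is_unique` path.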