[PERF] Get rid of MultiIndex conversion in IntervalIndex.is_unique #26391
Changes from 3 commits
4ec1fe9
51d6910
d11acd6
202b2cf
8e8384b
8dde393
d3af9c9
@@ -461,7 +461,28 @@ def is_unique(self):
         """
         Return True if the IntervalIndex contains unique elements, else False
         """
-        return self._multiindex.is_unique
+        left = self.values.left
+        right = self.values.right
+
+        def _is_unique(left, right):
If my previous comment is correct, I don't think we need this to be a function anymore since it's only called once, so you can just put the function's logic at the end of the method.

Can you also test out the following variant:

    from collections import defaultdict

    def _is_unique2(left, right):
        seen_pairs = defaultdict(bool)
        check_idx = np.where(left.duplicated(keep=False))[0]
        for idx in check_idx:
            pair = (left[idx], right[idx])
            if seen_pairs[pair]:
                return False
            seen_pairs[pair] = True
        return True

I did a sample run of this, and it appears to be a bit more efficient:

    In [3]: np.random.seed(123)
       ...: left = pd.Index(np.random.randint(5, size=10**5))
       ...: right = pd.Index(np.random.randint(10**5/4, size=10**5))

    In [4]: %timeit _is_unique(left, right)
    3.84 ms ± 34.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [5]: %timeit _is_unique2(left, right)
    1.13 ms ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I haven't fully tested this in all scenarios though.

HEAD adopts

are these asv's really short? maybe have a longer one and see how this scales

Yeah, I was wondering this too;

HEAD~6 adopts
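For readers who want to try the two variants discussed in this thread side by side, here is a minimal sketch. The function names `is_unique_groupwise` and `is_unique_seen_pairs` are hypothetical (they are not pandas API); the bodies follow the diff and the review comment, and the `MultiIndex` check is used only as a reference answer on small random inputs. It makes no claim about which variant is faster.

```python
from collections import defaultdict

import numpy as np
import pandas as pd


def is_unique_groupwise(left, right):
    """Variant from the diff: for each duplicated left value,
    check that the matching right values are all distinct."""
    duplicates = left[left.duplicated()].unique()
    for dup in duplicates:
        if not right[left == dup].is_unique:
            return False
    return True


def is_unique_seen_pairs(left, right):
    """Variant from the review comment: walk only the positions whose
    left value is duplicated and remember each (left, right) pair."""
    seen_pairs = defaultdict(bool)
    check_idx = np.where(left.duplicated(keep=False))[0]
    for idx in check_idx:
        pair = (left[idx], right[idx])
        if seen_pairs[pair]:
            return False
        seen_pairs[pair] = True
    return True


# Small, illustrative inputs (sizes chosen arbitrarily for this sketch).
rng = np.random.default_rng(123)
left = pd.Index(rng.integers(0, 5, size=100))
right = pd.Index(rng.integers(0, 25, size=100))

# Reference answer: the MultiIndex-based check this PR moves away from.
reference = pd.MultiIndex.from_arrays([left, right]).is_unique
print(is_unique_groupwise(left, right) == reference)
print(is_unique_seen_pairs(left, right) == reference)
```

To look at scaling, as asked above, the same functions can be timed with `%timeit` at several input sizes; results will depend on how many left values are duplicated.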
+            # left must have at least one common point
+            duplicates = left[left.duplicated()].unique()
+            for dup in duplicates:
+                # Check whether the Intervals having the same left endpoint
+                # also have the same right endpoint
+                if not right[left == dup].is_unique:
+                    return False
+            return True
+
+        if len(self) - len(self.dropna()) > 1:
I think Doing single runs to avoid caching:

    In [2]: ii = pd.interval_range(0, 10**5)

    In [3]: ii_nan = ii.insert(1, np.nan).insert(12345, np.nan)

    In [4]: %timeit -r 1 -n 1 ii.isna().sum() > 1
    435 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

    In [5]: %timeit -r 1 -n 1 ii_nan.isna().sum() > 1
    444 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

    In [6]: ii = pd.interval_range(0, 10**5)

    In [7]: ii_nan = ii.insert(1, np.nan).insert(12345, np.nan)

    In [8]: %timeit -r 1 -n 1 len(ii) - len(ii.dropna()) > 1
    677 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

    In [9]: %timeit -r 1 -n 1 len(ii_nan) - len(ii_nan.dropna()) > 1
    2.18 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
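The two expressions timed in this comment should give the same answer to the question "does the index hold more than one missing interval?". A small sketch, assuming a recent pandas and using hypothetical helper names, that checks the equivalence on a scaled-down version of the same data:

```python
import numpy as np
import pandas as pd


def more_than_one_nan_isna(idx):
    # Count missing entries directly from the boolean mask.
    return idx.isna().sum() > 1


def more_than_one_nan_dropna(idx):
    # Count missing entries as the length lost by dropping them
    # (the form used in the diff).
    return len(idx) - len(idx.dropna()) > 1


# Smaller than the 10**5 intervals used in the timings above.
ii = pd.interval_range(0, 1000)
ii_nan = ii.insert(1, np.nan).insert(123, np.nan)

for idx in (ii, ii_nan):
    print(more_than_one_nan_isna(idx), more_than_one_nan_dropna(idx))
```

Two or more missing entries are duplicates of each other, which is why the method can return `False` immediately in that case.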
+            return False
+
+        if left.is_unique and right.is_unique:
jschendel marked this conversation as resolved.
+            return True
+        elif not left.is_unique:
+            return _is_unique(left, right)
+        else:
+            return _is_unique(right, left)
 
     @cache_readonly
     @Appender(_interval_shared_docs['is_non_overlapping_monotonic']
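To make the control flow of the new `is_unique` easier to follow outside the class, the sketch below restates it as a free function. It is an illustration under assumptions, not the code merged in this PR: `interval_index_is_unique` and `_pairs_are_unique` are hypothetical names, the public `left`/`right` attributes stand in for `self.values.left`/`self.values.right`, and the missing-value count uses `isna().sum()` as timed in the review comment.

```python
import numpy as np
import pandas as pd


def _pairs_are_unique(first, second):
    # For every duplicated endpoint in `first`, the paired endpoints
    # in `second` must all differ, otherwise an interval is repeated.
    for dup in first[first.duplicated()].unique():
        if not second[first == dup].is_unique:
            return False
    return True


def interval_index_is_unique(ii):
    left, right = ii.left, ii.right

    # More than one missing interval means a repeated entry.
    if ii.isna().sum() > 1:
        return False

    if left.is_unique and right.is_unique:
        return True
    elif not left.is_unique:
        return _pairs_are_unique(left, right)
    else:
        return _pairs_are_unique(right, left)


distinct = pd.IntervalIndex.from_tuples([(0, 1), (0, 2), (1, 2)])
repeated = pd.IntervalIndex.from_tuples([(0, 1), (0, 1), (1, 2)])
print(interval_index_is_unique(distinct), distinct.is_unique)
print(interval_index_is_unique(repeated), repeated.is_unique)
```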