[PERF] Get rid of MultiIndex conversion in IntervalIndex.is_unique #26391
Conversation
Codecov Report
```diff
@@ Coverage Diff @@
## master #26391 +/- ##
==========================================
- Coverage 91.69% 91.68% -0.01%
==========================================
Files 174 174
Lines 50749 50763 +14
==========================================
+ Hits 46534 46543 +9
- Misses 4215 4220 +5
```
Continue to review full report at Codecov.
Codecov Report
```diff
@@ Coverage Diff @@
## master #26391 +/- ##
==========================================
- Coverage 91.69% 91.69% -0.01%
==========================================
Files 174 174
Lines 50749 50754 +5
==========================================
+ Hits 46534 46537 +3
- Misses 4215 4217 +2
```
Continue to review full report at Codecov.
lgtm. @jschendel
Thanks, the approach looks good overall, just have some comments on the implementation.
pandas/core/indexes/interval.py (Outdated)
```python
return False
return True

if len(self) - len(self.dropna()) > 1:
```
I think `self.isna().sum() > 1` is a little more idiomatic and performant.
Doing single runs to avoid caching:
```python
In [2]: ii = pd.interval_range(0, 10**5)

In [3]: ii_nan = ii.insert(1, np.nan).insert(12345, np.nan)

In [4]: %timeit -r 1 -n 1 ii.isna().sum() > 1
435 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [5]: %timeit -r 1 -n 1 ii_nan.isna().sum() > 1
444 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [6]: ii = pd.interval_range(0, 10**5)

In [7]: ii_nan = ii.insert(1, np.nan).insert(12345, np.nan)

In [8]: %timeit -r 1 -n 1 len(ii) - len(ii.dropna()) > 1
677 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [9]: %timeit -r 1 -n 1 len(ii_nan) - len(ii_nan.dropna()) > 1
2.18 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
```
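As a small sanity check (not from the PR itself; the index sizes below are illustrative), the two NaN-counting expressions can be verified to agree:

```python
import numpy as np
import pandas as pd

# Illustrative check: both expressions agree on whether an IntervalIndex
# contains more than one missing value.
ii = pd.interval_range(0, 100)
ii_nan = ii.insert(1, np.nan).insert(50, np.nan)

results = [
    (idx.isna().sum() > 1, len(idx) - len(idx.dropna()) > 1)
    for idx in (ii, ii_nan)
]
```

Both approaches give the same boolean answer; the difference discussed above is purely one of speed.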
pandas/core/indexes/interval.py (Outdated)
```diff
@@ -461,7 +461,28 @@ def is_unique(self):
         """
         Return True if the IntervalIndex contains unique elements, else False
         """
-        return self._multiindex.is_unique
+        left = self.values.left
+        right = self.values.right
```
`self.values.left` should be equivalent to `self.left`, so I think we can get by without needing to define these, and just refer to them as `self.left`/`self.right` where needed.
pandas/core/indexes/interval.py
Outdated
left = self.values.left | ||
right = self.values.right | ||
|
||
def _is_unique(left, right): |
If my previous comment is correct, I don't think we need this to be a function anymore since it's only called once, so you can just put the function's logic at the end of the method.
Can you also test out the following variant of `_is_unique`:
```python
from collections import defaultdict

def _is_unique2(left, right):
    seen_pairs = defaultdict(bool)
    check_idx = np.where(left.duplicated(keep=False))[0]
    for idx in check_idx:
        pair = (left[idx], right[idx])
        if seen_pairs[pair]:
            return False
        seen_pairs[pair] = True
    return True
```
I did a sample run of this, and it appears to be a bit more efficient:
```python
In [3]: np.random.seed(123)
   ...: left = pd.Index(np.random.randint(5, size=10**5))
   ...: right = pd.Index(np.random.randint(10**5/4, size=10**5))

In [4]: %timeit _is_unique(left, right)
3.84 ms ± 34.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit _is_unique2(left, right)
1.13 ms ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
I haven't fully tested this in all scenarios though.
HEAD adopts `_is_unique2` and HEAD~3 adopts `_is_unique`. The performance is slightly worse but the code is more explanatory.
```
before       after        ratio
[4ec1fe97]   [202b2cfa]
<intv-is-unique~3>  <intv-is-unique>
  230±0.2ns  217±2ns  0.94  index_object.IntervalIndexMethod.time_is_unique(1000)
+ 277±2ns    316±5ns  1.14  index_object.IntervalIndexMethod.time_is_unique(100000)
```
are these asv's really short? maybe have a longer one and see how this scales
Yeah, I was wondering this too; `is_unique` is cached, so I wonder if the asv is just timing the cache lookup? Does anything special need to be done to handle things that are cached?
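To make the caching concern concrete (an illustrative sketch, not part of the PR or the benchmark suite), only the first access on a given index object pays the computation cost:

```python
import pandas as pd

# Illustrative sketch: IntervalIndex.is_unique is a cached property, so
# repeated accesses on the same object return the stored result.
ii = pd.interval_range(0, 10**5)

first = ii.is_unique    # full computation runs here and the result is cached
second = ii.is_unique   # served from the cache, no recomputation

# a benchmark that reuses one index object therefore mostly times the
# cache lookup; a fresh index must be built to time the real work
fresh = pd.interval_range(0, 10**5).is_unique
```

This is why nanosecond-scale asv results for a 10**5-element index are suspicious: the timed loop is likely hitting the cache after the first iteration.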
HEAD~6 adopts `_is_unique` while HEAD adopts `_is_unique2`.
```
before       after        ratio
[4ec1fe97]   [d3af9c91]
<intv-is-unique~6>  <intv-is-unique>
  223±1ns     207±1ns   0.93  index_object.IntervalIndexMethod.time_is_unique(1000)
+ 270±5ns     302±4ns   1.12  index_object.IntervalIndexMethod.time_is_unique(100000)
  1.50±0.01s  1.84±0s  ~1.22  index_object.IntervalIndexMethod.time_is_unique(10000000)
```
pandas/core/indexes/interval.py
Outdated
if left.is_unique or right.is_unique: | ||
return True | ||
|
||
seen_pairs = defaultdict(bool) |
can you just use a set?
ah, yes, that's a lot cleaner. I was originally treating the endpoints separately and needed the dict structure, but it should be unnecessary when dealing with tuples.
```python
return True

seen_pairs = set()
check_idx = np.where(left.duplicated(keep=False))[0]
```
actually, can't you just do this:
```python
pairs = [(left[idx], right[idx]) for idx in check_idx]
return len(set(pairs)) == len(pairs)
```
The present approach may be better: `False` is returned as soon as a duplicate is found, whereas comparing the lengths requires running over all potential duplicates.
@jreback Would you let me know if there is anything else to implement?
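For reference, the two variants being compared can be sketched side by side as standalone helpers (the helper names are mine, and `left`/`right` are assumed to be pandas `Index` objects; this is not the exact PR code):

```python
import numpy as np
import pandas as pd

def is_unique_early_exit(left, right):
    # returns False as soon as a duplicated (left, right) pair is seen
    seen_pairs = set()
    check_idx = np.where(left.duplicated(keep=False))[0]
    for idx in check_idx:
        pair = (left[idx], right[idx])
        if pair in seen_pairs:
            return False
        seen_pairs.add(pair)
    return True

def is_unique_len_compare(left, right):
    # materialises every candidate pair before comparing lengths
    check_idx = np.where(left.duplicated(keep=False))[0]
    pairs = [(left[idx], right[idx]) for idx in check_idx]
    return len(set(pairs)) == len(pairs)
```

Both agree on the result; the early-exit version can stop at the first duplicate, while the length comparison always walks every candidate.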
lgtm. @jschendel merge if you are ok with this.
thanks @makbigc
```shell
git diff upstream/master -u -- "*.py" | flake8 --diff
```
This PR follows #25159 (in which I broke something when merging).
The approach this time doesn't worsen the performance.
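Putting the review threads together, the final shape of the check can be sketched as a standalone function (a hedged reconstruction of the logic discussed above, not the actual pandas method; the function name is illustrative):

```python
import numpy as np
import pandas as pd

def interval_index_is_unique(idx):
    # more than one missing value is itself a duplicate
    if idx.isna().sum() > 1:
        return False
    # if either endpoint array is already unique, every (left, right)
    # pair must be unique too
    if idx.left.is_unique or idx.right.is_unique:
        return True
    # otherwise inspect only the positions whose left endpoint repeats,
    # returning early on the first duplicated (left, right) pair
    seen_pairs = set()
    check_idx = np.where(idx.left.duplicated(keep=False))[0]
    for i in check_idx:
        pair = (idx.left[i], idx.right[i])
        if pair in seen_pairs:
            return False
        seen_pairs.add(pair)
    return True
```

The key property is that no `MultiIndex` is ever constructed: the check works directly on the `left`/`right` endpoint arrays, which is what makes it faster than the old `self._multiindex.is_unique` path.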