
Get rid of MultiIndex conversion in IntervalIndex.is_unique #25159


Closed · makbigc wants to merge 4 commits

Conversation

@makbigc makbigc (Contributor) commented Feb 5, 2019

This modification doesn't actually improve performance, but the code is more self-explanatory.

@codecov codecov bot commented Feb 5, 2019

Codecov Report

Merging #25159 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25159      +/-   ##
==========================================
- Coverage   92.37%   92.37%   -0.01%     
==========================================
  Files         166      166              
  Lines       52408    52408              
==========================================
- Hits        48412    48411       -1     
- Misses       3996     3997       +1
Flag                               Coverage        Δ
#multiple                          90.79% <100%>   (ø) ⬆️
#single                            42.86% <0%>     (ø) ⬆️

Impacted Files                     Coverage        Δ
pandas/core/indexes/interval.py    95.25% <100%>   (ø) ⬆️
pandas/util/testing.py             88.04% <0%>     (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e3b0950...9c85d79. Read the comment docs.

@codecov codecov bot commented Feb 5, 2019

Codecov Report

Merging #25159 into master will decrease coverage by 0.18%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25159      +/-   ##
==========================================
- Coverage   91.45%   91.27%   -0.19%     
==========================================
  Files         172      173       +1     
  Lines       52892    53002     +110     
==========================================
+ Hits        48373    48376       +3     
- Misses       4519     4626     +107
Flag                               Coverage        Δ
#multiple                          89.83% <100%>   (-0.19%) ⬇️
#single                            41.76% <0%>     (-0.06%) ⬇️

Impacted Files                     Coverage        Δ
pandas/core/indexes/interval.py    95.25% <100%>   (ø) ⬆️
pandas/compat/chainmap.py          61.9% <0%>      (-4.77%) ⬇️
pandas/core/dtypes/cast.py         88.16% <0%>     (-3.2%) ⬇️
pandas/compat/__init__.py          58.03% <0%>     (-0.72%) ⬇️
pandas/io/date_converters.py       100% <0%>      (ø) ⬆️
pandas/compat/numpy/function.py    87.91% <0%>     (ø) ⬆️
pandas/io/parsers.py               95.34% <0%>     (ø) ⬆️
pandas/compat/chainmap_impl.py     0% <0%>        (ø)
pandas/core/strings.py             98.59% <0%>     (ø) ⬆️
pandas/tseries/offsets.py          96.69% <0%>     (ø) ⬆️
... and 13 more
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6e0f9a9...b382b17. Read the comment docs.

@jreback jreback (Contributor) commented Feb 5, 2019

does this affect perf at all? do we have an asv for this?

@jschendel jschendel (Member) left a comment

I can't find any existing benchmarks for this in asv_bench/benchmarks, so you'll need to add benchmarks there and show a performance improvement.

@@ -463,7 +463,7 @@ def is_unique(self):
         """
         Return True if the IntervalIndex contains unique elements, else False
         """
-        return self._multiindex.is_unique
+        return len(self) == len(self.unique())
@jschendel jschendel (Member) commented:

I ran a few ad hoc %timeit's locally and it looks like this actually decreases performance (after removing any caching to ensure accurate timings). The issue here is that self.unique() falls back to operating on an object dtype array, so it basically has overhead similar to that of going through a MultiIndex conversion.

Will probably require writing a custom interval-specific implementation; not sure how much of the existing code will work without adding overhead.
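One possible shape for such an interval-specific check, sketched here with plain Python (left, right) endpoint pairs rather than pandas internals; `intervals_are_unique` is a hypothetical helper, not pandas API, and it assumes numeric endpoints:

```python
import math


def intervals_are_unique(lefts, rights):
    """Pairwise uniqueness check over (left, right) interval endpoints."""
    seen = set()
    for left, right in zip(lefts, rights):
        # Normalize NaN endpoints to a sentinel so NaN "equals" NaN for
        # dedup purposes; float('nan') != float('nan') under plain comparison.
        key = (
            "nan" if isinstance(left, float) and math.isnan(left) else left,
            "nan" if isinstance(right, float) and math.isnan(right) else right,
        )
        if key in seen:
            return False
        seen.add(key)
    return True


print(intervals_are_unique([0, 0, 0], [1, 2, 3]))  # True: pairs all differ
print(intervals_are_unique([0, 0], [1, 1]))        # False: duplicate pair
```

Because it hashes each pair once and short-circuits on the first duplicate, this avoids both the MultiIndex conversion and the object-dtype fallback.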

@jschendel jschendel added Performance Memory or execution speed performance Interval Interval data type labels Feb 6, 2019
@jreback jreback (Contributor) commented Mar 20, 2019

can you merge master and report on the asv's for this (and add if need be)

@makbigc makbigc (Contributor, Author) commented Mar 21, 2019

As jschendel said, this change slows down performance: IntervalArray.unique and algo.unique also convert to an array of Interval objects in the first place.


All benchmarks:

       before           after         ratio
     [33f91d8f]       [9c23aac9]
     <master>         <is_unique>
+         249±4ns          378±3ms 1519132.92  index_object.IntervalIndexMethod.time_is_unique

@jreback jreback (Contributor) commented Mar 22, 2019

@makbigc the asv's that you added look good. can you remove the actual perf change.

@@ -181,4 +181,16 @@ def time_get_loc(self):
         self.ind.get_loc(0)


+class IntervalIndexMethod(object):
+    # GH 24813
Contributor review comment:
is this where we asv on is_unique?

@makbigc makbigc (Contributor, Author) commented Mar 24, 2019

I wrote a new IntervalIndex.is_unique which did improve the performance.

All benchmarks:

       before           after         ratio
     [33f91d8f]       [b382b173]
     <v0.24.0~79>       <is_unique>
-         249±4ns        113±0.5ns     0.45  index_object.IntervalIndexMethod.time_is_unique

+        right = np.append(np.arange(1, N + 1), np.array(1))
+        self.intv = IntervalIndex.from_arrays(left, right)
+
+    def time_is_unique(self):
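A fuller sketch of the benchmark class shown in fragments above. The class name, GH reference, `right` line, and `time_is_unique` come from the diff; the imports, the value of `N`, and the `left` construction are assumptions filled in so the example runs:

```python
import numpy as np
from pandas import IntervalIndex


class IntervalIndexMethod(object):
    # GH 24813
    def setup(self):
        N = 10**3
        # Assumed: `left` mirrors `right`, appending a duplicate endpoint
        # pair so the index contains a repeated interval (0, 1).
        left = np.append(np.arange(N), np.array(0))
        right = np.append(np.arange(1, N + 1), np.array(1))
        self.intv = IntervalIndex.from_arrays(left, right)

    def time_is_unique(self):
        self.intv.is_unique
```

asv calls `setup` once per benchmark run, then times `time_is_unique` repeatedly, so the duplicate interval ensures `is_unique` cannot trivially short-circuit.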
Contributor review comment:
can you add another benchmark with N**3, e.g. a small case

@@ -463,7 +463,13 @@ def is_unique(self):
         """
         Return True if the IntervalIndex contains unique elements, else False
         """
-        return self._multiindex.is_unique
+        left = self.values.left
Contributor review comment:
why isn't the answer just: self.left.is_unique & self.right.is_unique

@jschendel jschendel (Member) commented Mar 24, 2019
That's a little too strict; you don't necessarily need left or right uniqueness, only pairwise uniqueness, as you can have duplicate endpoints on one side as long as the other side also isn't the same, e.g. [Interval(0, 1), Interval(0, 2), Interval(0, 3)] should be unique.
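The distinction can be sketched with plain tuples standing in for `Interval` objects:

```python
# (left, right) tuples stand in for Interval objects here.
intervals = [(0, 1), (0, 2), (0, 3)]

# Per-side uniqueness fails: the left endpoint 0 repeats.
lefts = [left for left, right in intervals]
left_is_unique = len(set(lefts)) == len(lefts)

# Pairwise uniqueness holds: each (left, right) pair is distinct.
pairwise_is_unique = len(set(intervals)) == len(intervals)

print(left_is_unique, pairwise_is_unique)  # False True
```

So `self.left.is_unique & self.right.is_unique` would wrongly report this index as non-unique.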

Contributor review comment:

I think you can simply construct a numpy array and then check it via np.unique, by first reshaping.

@makbigc makbigc (Contributor, Author) commented Apr 7, 2019

I tried the approach below:

    arr = np.array(list(zip(self.values.left, self.values.right)))
    np.unique(arr, axis=0)

But np.unique cannot filter out the duplicate np.nan values:

    In [2]: arr = np.array([1, 3, 3, np.nan, np.nan])
    In [3]: np.unique(arr)
    Out[3]: array([ 1.,  3., nan, nan])

I'm still looking for another way.
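For context, the root cause is that IEEE 754 NaN compares unequal to itself, so any comparison-based dedup treats each NaN as a new value; a quick illustration in plain Python:

```python
nan = float("nan")

# IEEE 754: NaN is unequal to everything, including itself.
print(nan == nan)  # False

# A Python set short-circuits on object identity, so the *same* NaN
# object dedups...
print(len({nan, nan}))  # 1

# ...but two distinct NaN objects are both kept.
print(len({float("nan"), float("nan")}))  # 2
```

This is why a sort-and-compare routine like np.unique (on the numpy version used at the time) leaves duplicate NaNs behind.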

Contributor review comment:
use pd.unique
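A sketch of why pd.unique works here: it dedups via a hash table that treats all NaNs as the same value, rather than by sorting and comparing neighbors:

```python
import numpy as np
import pandas as pd

arr = np.array([1, 3, 3, np.nan, np.nan])

# pd.unique hashes values, collapsing the duplicate NaN and preserving
# order of first appearance.
print(pd.unique(arr))  # [ 1.  3. nan]
```

This sidesteps the NaN comparison problem shown above without any extra normalization step.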

@jreback jreback (Contributor) commented Apr 5, 2019

can you merge master and update

@jreback jreback (Contributor) commented May 12, 2019

can you merge master

@makbigc makbigc (Contributor, Author) commented May 14, 2019

I broke something when merging. This PR is continued in #26391

@makbigc makbigc closed this May 14, 2019
Labels: Interval (Interval data type), Performance (Memory or execution speed performance)
3 participants