Skip to content

[PERF] Get rid of MultiIndex conversion in IntervalIndex.is_monotonic methods #25820

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Apr 21, 2019

Conversation

makbigc
Copy link
Contributor

@makbigc makbigc commented Mar 21, 2019

All benchmarks:

       before           after         ratio
     [85572de5]       [445fbe56]
     <master>         <GH24813-inc-dec>
-         261±6ns          134±3ns     0.51  index_object.IntervalIndexMethod.time_monotonic_inc

@codecov
Copy link

codecov bot commented Mar 21, 2019

Codecov Report

Merging #25820 into master will decrease coverage by <.01%.
The diff coverage is 75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25820      +/-   ##
==========================================
- Coverage   91.27%   91.26%   -0.01%     
==========================================
  Files         173      173              
  Lines       53002    53007       +5     
==========================================
+ Hits        48375    48379       +4     
- Misses       4627     4628       +1
Flag Coverage Δ
#multiple 89.83% <75%> (-0.01%) ⬇️
#single 41.76% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/interval.py 94.94% <75%> (-0.31%) ⬇️
pandas/util/testing.py 89.4% <0%> (+0.1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f39a9ce...445fbe5. Read the comment docs.

@codecov
Copy link

codecov bot commented Mar 21, 2019

Codecov Report

Merging #25820 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25820      +/-   ##
==========================================
- Coverage   91.98%   91.97%   -0.01%     
==========================================
  Files         175      175              
  Lines       52377    52377              
==========================================
- Hits        48180    48176       -4     
- Misses       4197     4201       +4
Flag Coverage Δ
#multiple 90.53% <100%> (ø) ⬆️
#single 40.71% <0%> (-0.15%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/interval.py 95.22% <100%> (ø) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 96.9% <0%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ab84a0...480b353. Read the comment docs.

@jschendel jschendel added Performance Memory or execution speed performance Interval Interval data type labels Mar 21, 2019
@jreback jreback added this to the 0.25.0 milestone Mar 22, 2019
@jschendel
Copy link
Member

Can you add a whatsnew note to pandas/doc/source/whatsnew/v0.25.0.rst under the "Performance Improvements" section?

class IntervalIndexMethod(object):
# GH 24813
def setup(self):
N = 10**5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add N=10**3 as well

@jreback
Copy link
Contributor

jreback commented Mar 30, 2019

can you update

values = [self.right, self.left]

sort_order = np.lexsort(values)
return Index(sort_order).is_monotonic
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @makbigc,

Thansk for contributing.

I worked on CategoricalIndex.is_monotonic_increasing a while ago and I can remember that instantiating an Index was quite heavy, probably because a lot of things are being set up when instantiation an Index. Can you try using pd._libs.algos.is_monotonic(sort_order, False) instead of making an index. Should be even faster.

I get from a quick timeit test:

>>> N = 10**5
>>> left = np.append(np.arange(N), np.array(0))
>>> right = np.append(np.arange(1, N + 1), np.array(1))
>>> sort_order = np.lexsort([left, right])
>>> %timeit  pd.Index(sort_order).is_monotonic
42.8 µs ± 3.48 µs per loop
>>> %timeit pd._libs.algos.is_monotonic(sort_order, False)[0]
4.15 µs ± 123 ns per loop

So a 10x improvement.

Architectually, can you also see if it makes sense to add InterValTree.is_monotonic_[increasing|decreasing] methods using _libs.algos.is_monotonic and then in IntervalIndex.is_monotonic_increasing just make a call to self._engine.is_monotonic_increasing. That would make the call structure for IntervalIndex similar to the one for other indexes, see e.g. Index.is_monotonic_increasing.

@jreback, do you agree?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@topper-123 Thanks for your elucidative advice. The use of algos.is_monotonic does improve the performance.

All benchmarks:

       before           after         ratio
     [23650ef5]       [3e32db5b]
     <GH24813-inc-dec~2>       <GH24813-inc-dec>
          213±3ns          210±2ns     0.98  index_object.IntervalIndexMethod.time_monotonic_inc(1000)
          268±1ns          260±2ns     0.97  index_object.IntervalIndexMethod.time_monotonic_inc(100000)

While putting is_monotonic_increasing in IntervalIndex._engine slows down the performance. I guess this is due to the formation of IntervalTree is expensive in the _engine method. The order is increased from ns to ms.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're not really measuring the speed of the operation in the above benchmark. Notice that @cache_readonly above the is_monotonic_increasing method. That means that the result of the call to is_monotonic_increasing replaces the method itelf upon first call, so you're just measuring attribute access speed.

The formation of IntervalTree should happen only once, so should be in the ns-range also on subsequent calls. Are you forgtting to cache the result of IntervalTree.is_monotonic_increasing? If it resly is being constructed anew each time, can you post an issue to the issue list.

self.intv = IntervalIndex.from_arrays(left, right)

def time_monotonic_inc(self):
self.intv.is_monotonic_increasing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just note that this measures access speed to the cached value, not the time cost of the underlying operation. To time the underlying method, you need to do pd.IntervalIndex.is_monotonic_increasing.func(self.intv), but that hits a cached value at self.intv._multiindex.is_monotonic_increasing in master, so will still seem to be in the nanosecond range, but really won't be because it's just the caching doing tricks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. The performance is even better. The _engine is set in prior in asv setup method.

All benchmarks:

       before           after         ratio
     [a1fee919]       [6fb6d2cb]
     <master>         <GH24813-inc-dec>
          229±1ns          209±1ns     0.91  index_object.IntervalIndexMethod.time_monotonic_inc(1000)
-        450±10ns          260±2ns     0.58  index_object.IntervalIndexMethod.time_monotonic_inc(100000)

@jreback
Copy link
Contributor

jreback commented Apr 5, 2019

@makbigc can you merge master, also @topper-123 offered some nice suggestions, see if they can work


@cache_readonly
def is_monotonic_decreasing(self):
"""
Return True if the IntervalIndex is monotonic decreasing (only equal or
decreasing values), else False
"""
return self._multiindex.is_monotonic_decreasing
return self[::-1].is_monotonic_increasing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer checking monotonic_decreasing directly, so:

values = [self.right, self.left]
sort_order = np.lexsort(values)
return is_monotonic(sort_order, False)[1]

(But adding the properties to IntervalTree is even better).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried in this way. But is_monotonic(sort_order, False)[1] returns False for stationary IntervalIndex, e.g. IntervalIndex([(0, 1], (0, 1]], closed='right', dtype='interval[int64]'). By definition, is_monotonic_decreasing should be True for equal values.

is_monotonic(sort_order, False)[0] returns True for the stationary IntervalIndex.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, as lexsort does treats equal values as increasing their index location, so is_monotonic(...)[1] will fail.

It would still deliver efficiency gains from calculating is_monotonic_increasing and unique simultaneous using is_monotonic in IntervalTree as described above. Would you be up for adding this? Else this PR is good IMO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When implementing is_unique in this way, I got a similar problem. That is, is_monotonic(sort_order, False)[2] returns False for an IntervalIndex with consecutively identity element, e.g. IntervalIndex([(0, 1], (0, 1], (2, 3]]) .

IntervalIndex.is_unique is being handled in #25159

@jreback
Copy link
Contributor

jreback commented Apr 20, 2019

can you merge master and ping on green.

@jreback jreback merged commit ef3bec2 into pandas-dev:master Apr 21, 2019
@jreback
Copy link
Contributor

jreback commented Apr 21, 2019

thanks @makbigc

@makbigc
Copy link
Contributor Author

makbigc commented Apr 22, 2019

​thanks @topper-123 @jschendel and @jreback

@makbigc makbigc deleted the GH24813-inc-dec branch April 22, 2019 02:41
ryanreh99 added a commit to ryanreh99/pandas that referenced this pull request Apr 22, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Interval Interval data type Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants