-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
[PERF] Get rid of MultiIndex conversion in IntervalIndex.is_monotonic methods #25820
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #25820 +/- ##
==========================================
- Coverage 91.27% 91.26% -0.01%
==========================================
Files 173 173
Lines 53002 53007 +5
==========================================
+ Hits 48375 48379 +4
- Misses 4627 4628 +1
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #25820 +/- ##
==========================================
- Coverage 91.98% 91.97% -0.01%
==========================================
Files 175 175
Lines 52377 52377
==========================================
- Hits 48180 48176 -4
- Misses 4197 4201 +4
Continue to review full report at Codecov.
|
Can you add a whatsnew note to |
asv_bench/benchmarks/index_object.py
Outdated
class IntervalIndexMethod(object): | ||
# GH 24813 | ||
def setup(self): | ||
N = 10**5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add N=10**3 as well
can you update |
pandas/core/indexes/interval.py
Outdated
values = [self.right, self.left] | ||
|
||
sort_order = np.lexsort(values) | ||
return Index(sort_order).is_monotonic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @makbigc,
Thansk for contributing.
I worked on CategoricalIndex.is_monotonic_increasing
a while ago and I can remember that instantiating an Index was quite heavy, probably because a lot of things are being set up when instantiation an Index. Can you try using pd._libs.algos.is_monotonic(sort_order, False)
instead of making an index. Should be even faster.
I get from a quick timeit test:
>>> N = 10**5
>>> left = np.append(np.arange(N), np.array(0))
>>> right = np.append(np.arange(1, N + 1), np.array(1))
>>> sort_order = np.lexsort([left, right])
>>> %timeit pd.Index(sort_order).is_monotonic
42.8 µs ± 3.48 µs per loop
>>> %timeit pd._libs.algos.is_monotonic(sort_order, False)[0]
4.15 µs ± 123 ns per loop
So a 10x improvement.
Architectually, can you also see if it makes sense to add InterValTree.is_monotonic_[increasing|decreasing]
methods using _libs.algos.is_monotonic
and then in IntervalIndex.is_monotonic_increasing
just make a call to self._engine.is_monotonic_increasing
. That would make the call structure for IntervalIndex similar to the one for other indexes, see e.g. Index.is_monotonic_increasing
.
@jreback, do you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@topper-123 Thanks for your elucidative advice. The use of algos.is_monotonic does improve the performance.
All benchmarks:
before after ratio
[23650ef5] [3e32db5b]
<GH24813-inc-dec~2> <GH24813-inc-dec>
213±3ns 210±2ns 0.98 index_object.IntervalIndexMethod.time_monotonic_inc(1000)
268±1ns 260±2ns 0.97 index_object.IntervalIndexMethod.time_monotonic_inc(100000)
While putting is_monotonic_increasing
in IntervalIndex._engine
slows down the performance. I guess this is due to the formation of IntervalTree
is expensive in the _engine
method. The order is increased from ns to ms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're not really measuring the speed of the operation in the above benchmark. Notice that @cache_readonly
above the is_monotonic_increasing
method. That means that the result of the call to is_monotonic_increasing
replaces the method itelf upon first call, so you're just measuring attribute access speed.
The formation of IntervalTree should happen only once, so should be in the ns-range also on subsequent calls. Are you forgtting to cache the result of IntervalTree.is_monotonic_increasing
? If it resly is being constructed anew each time, can you post an issue to the issue list.
self.intv = IntervalIndex.from_arrays(left, right) | ||
|
||
def time_monotonic_inc(self): | ||
self.intv.is_monotonic_increasing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just note that this measures access speed to the cached value, not the time cost of the underlying operation. To time the underlying method, you need to do pd.IntervalIndex.is_monotonic_increasing.func(self.intv)
, but that hits a cached value at self.intv._multiindex.is_monotonic_increasing
in master, so will still seem to be in the nanosecond range, but really won't be because it's just the caching doing tricks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. The performance is even better. The _engine
is set in prior in asv setup method.
All benchmarks:
before after ratio
[a1fee919] [6fb6d2cb]
<master> <GH24813-inc-dec>
229±1ns 209±1ns 0.91 index_object.IntervalIndexMethod.time_monotonic_inc(1000)
- 450±10ns 260±2ns 0.58 index_object.IntervalIndexMethod.time_monotonic_inc(100000)
@makbigc can you merge master, also @topper-123 offered some nice suggestions, see if they can work |
|
||
@cache_readonly | ||
def is_monotonic_decreasing(self): | ||
""" | ||
Return True if the IntervalIndex is monotonic decreasing (only equal or | ||
decreasing values), else False | ||
""" | ||
return self._multiindex.is_monotonic_decreasing | ||
return self[::-1].is_monotonic_increasing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer checking monotonic_decreasing directly, so:
values = [self.right, self.left]
sort_order = np.lexsort(values)
return is_monotonic(sort_order, False)[1]
(But adding the properties to IntervalTree
is even better).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have tried in this way. But is_monotonic(sort_order, False)[1]
returns False
for stationary IntervalIndex, e.g. IntervalIndex([(0, 1], (0, 1]], closed='right', dtype='interval[int64]')
. By definition, is_monotonic_decreasing
should be True
for equal values.
is_monotonic(sort_order, False)[0]
returns True
for the stationary IntervalIndex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, as lexsort does treats equal values as increasing their index location, so is_monotonic(...)[1] will fail.
It would still deliver efficiency gains from calculating is_monotonic_increasing
and unique
simultaneous using is_monotonic
in IntervalTree
as described above. Would you be up for adding this? Else this PR is good IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When implementing is_unique
in this way, I got a similar problem. That is, is_monotonic(sort_order, False)[2]
returns False
for an IntervalIndex with consecutively identity element, e.g. IntervalIndex([(0, 1], (0, 1], (2, 3]])
.
IntervalIndex.is_unique
is being handled in #25159
4f53e80
to
34ad0f2
Compare
can you merge master and ping on green. |
thanks @makbigc |
thanks @topper-123 @jschendel and @jreback |
git diff upstream/master -u -- "*.py" | flake8 --diff