ENH: slicing with decreasing monotonic indexes #8680

shoyer · 2014-10-30T06:15:51Z

The first commit adds Index.is_monotonic_decreasing and Index.is_monotonic_increasing (alias for is_monotonic).

is_monotonic will have a performance degradation (still O(n) time) in cases where the Index is decreasing monotonic. If necessary, we could work around this, but I think we can probably get away with this because the fall-back options are much slower and in many cases (e.g., for slice indexing) the next thing we'll want to know is if it's decreasing monotonic, anyways.

Next up will be handling indexing monotonic decreasing indexes with reversed slices, e.g., s.loc[30:10].

CC @jreback @hugadams @immerrr (thank you, specialities wiki!)

immerrr · 2014-10-30T06:35:05Z

Great addition, I thought that this was already implemented.

I think is_monotonic should become an alias of is_monotonic_increasing, not vice versa. Intuitively it should be an OR of direction-specific ones, but it is mentioned in the public API, so it is probably better off as a property with a deprecation warning till 0.17.0.

jreback · 2014-10-30T06:41:20Z

pandas/src/generate_code.py

-            return False, None
+            is_monotonic_inc = 0
+            if not is_monotonic_dec:
+                return False, False, None


I think you want to short circuit this and return early if both incr and decr monotonoc are 0 (break out of loop)
so each branch should be something like

if is_monotonoc_incr & cur < prev:
....
in the else
if not is_monotonoc_incr & not is_monotonic_decr:
return False, False, False

I think the existing logic did the same thing, but definitely not as clear

shoyer · 2014-10-30T07:01:52Z

@immerrr Agreed on all those points. I switched IndexEngine to use is_monotonic_increasing in the second to last commit (still need to clean up Index).

hughesadam87 · 2014-10-30T15:19:39Z

Looks awesome, thanks!

On Thu, Oct 30, 2014 at 3:02 AM, Stephan Hoyer [email protected]
wrote:

@immerrr https://github.com/immerrr Agreed on all those points. I
switched IndexEngine to use is_monotonic_increasing in the second to last
commit (still need to clean up Index).

—
Reply to this email directly or view it on GitHub
#8680 (comment).

Adam Hughes
Physics Ph.D Candidate
George Washington University

shoyer · 2014-10-30T20:00:37Z

By the way, the current behavior of Index.slice_locs has a bug for non-unique, monotonic integer indexes:

>>> idx = pd.Index([10, 12, 12, 14])
>>> idx.slice_locs(12, 12)
(1, 3)
>>> idx.slice_locs(11, 13)
(11, 13)

I'll add a fix (and tests) when I add support for slicing monotonic decreasing indexes.

jreback · 2014-10-30T21:10:08Z

I'd u want to add this fix separately with tests would be ok

shoyer · 2014-10-30T21:32:06Z

Latest commit adds support for monotonic decreasing indexes and should fix that bug. I ended up just removing the special case for non-unique, monotonic integer indexes, because I couldn't think of any scenarios where that shortcut was valid. Please correct me if I'm mistaken there.

shoyer · 2014-10-31T01:19:40Z

One of the tests is still failing. The current version looks like this (I mangled ts2 to ensure that it wasn't monotonic decreasing, either):

    def test_ix_getitem_not_monotonic(self):
        d1, d2 = self.ts.index[[5, 15]]

        ts2 = self.ts.iloc[lrange(15) + lrange(15, len(self.ts))[::-1]]

        self.assertRaises(KeyError, ts2.ix.__getitem__, slice(d1, d2))
        self.assertRaises(KeyError, ts2.ix.__setitem__, slice(d1, d2), 0)

I think this test should not have been passing before (not quite sure how it was), because d1 and d2 are in the index (a datetimeindex), so slicing should not raise an error, even if isn't monotonic

We don't make those sort of guarantees for slice_loc, and honestly we can't, unless we always check if the index is monotonic first. I'm not even sure if that would actually be desirable, given that I think the new behavior (not raising) is what is actually documented.

Thoughts?

shoyer · 2014-10-31T01:21:24Z

pandas/core/generic.py

@@ -1461,7 +1461,7 @@ def xs(self, key, axis=0, level=None, copy=None, drop_level=True):
                            name=self.index[loc])

        else:
-            result = self[loc]
+            result = self.iloc[loc]


This was surprising to encounter (needed to change it to fix some tests), but maybe it was there for a reason? I don't think there is any reason for loc here to do label based rather than integer indexing?

could be
I have tried to not touch xs recently. it seems s but fragile and multi slicing obviates the need for it anyhow

shoyer · 2014-10-31T20:31:13Z

@jreback thoughts on this test case I want to remove? That is to do on this (besides release notes, etc.).

jreback · 2014-10-31T20:41:46Z

no that test is on ts2 which DOESN't have d1 or d2 in it (they both are missing)

But the index is unique and monotonic (increasing). It tries to find d1 but KeyError so it raises.

I think its correct for both monotonic increase and decrease.

You have to have at least 1 end point in the index for it 2 work

shoyer · 2014-10-31T20:53:03Z

pandas/tests/test_series.py

@@ -1521,7 +1521,7 @@ def test_ix_getitem(self):
    def test_ix_getitem_not_monotonic(self):


This is the failing test (that I think should be removed)

d1 and d2 are in the index, so slicing should not raise a KeyError

nevermind.... I misread the ::2

will add a fix

shoyer · 2014-10-31T21:59:38Z

OK, still needs a rebase, but I added what's new notes and tests are passing now.

jreback · 2014-10-31T22:35:15Z

ok, looks reasonable. pls do a perf test and post if anything comes up.

shoyer · 2014-11-01T00:15:40Z

Here are the perf tests that came up 10% slower or more. This all looks like noise to me (would be nice if vbench included an auto-repeat thing like timeit):

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
frame_dtypes                                 |   0.1370 |   0.1240 |   1.1051 |
groupby_ngroups_100_head                     |   0.9100 |   0.8230 |   1.1056 |
frame_from_records_generator_nrows           |   1.1386 |   1.0297 |   1.1058 |
frame_mult_no_ne                             |   7.7243 |   6.9840 |   1.1060 |
frame_mult                                   |   7.6423 |   6.8657 |   1.1131 |
frame_isnull                                 |   0.8477 |   0.7557 |   1.1217 |
frame_multi_and                              |  33.9894 |  30.2606 |   1.1232 |
groupby_first_float64                        |   3.3354 |   2.9689 |   1.1234 |
timeseries_custom_bmonthend_decr_n           |   0.2897 |   0.2549 |   1.1362 |
index_datetime_intersection                  |  13.2007 |  11.6110 |   1.1369 |
datetime_index_union                         |   0.0977 |   0.0856 |   1.1411 |
strings_startswith                           |   4.4993 |   3.9283 |   1.1453 |
strings_endswith                             |   4.5290 |   3.9484 |   1.1471 |
frame_mask_bools                             |  13.8410 |  12.0500 |   1.1486 |
frame_getitem_single_column                  |  30.3197 |  26.3140 |   1.1522 |
frame_mask_floats                            |   9.1396 |   7.8970 |   1.1574 |
frame_mult_st                                |   8.5073 |   7.3420 |   1.1587 |
frame_shift_axis0                            |  16.8714 |  14.4614 |   1.1667 |
datetime_index_intersection                  |   0.3490 |   0.2923 |   1.1939 |
timeseries_day_apply                         |   0.0103 |   0.0083 |   1.2381 |
dtype_infer_float32                          |   0.4770 |   0.3787 |   1.2596 |
groupby_ngroups_100_first                    |   0.6906 |   0.5420 |   1.2742 |
dtype_infer_int32                            |   0.6863 |   0.5360 |   1.2805 |
reindex_daterange_pad                        |   0.9470 |   0.7266 |   1.3033 |
frame_float_equal                            |   2.9230 |   2.2174 |   1.3182 |
timeseries_1min_5min_mean                    |   0.9743 |   0.7337 |   1.3280 |
reindex_fillna_backfill                      |   0.4797 |   0.3503 |   1.3693 |
timeseries_custom_bday_cal_incr_n            |   0.0437 |   0.0317 |   1.3784 |

any of these look suspicious?

jreback · 2014-11-01T00:28:03Z

these r all ok
FYI - their are a bunch of options to vbench
I think -n -r will do repeating (it's like timeit)

jreback · 2014-11-01T00:32:41Z

can u add some tests if the index had nan/NaT bedded ? (I think will work just verifying)

shoyer · 2014-11-01T00:38:32Z

Good idea on NA checks, have to admit I'm a little worried about that.

Actually, I'm pretty sure the existing code for is_monotonic does not handle NA properly. I guess any NA values should imply not monotonic? I believe Numpy's searchsorted assumes a particular sort order for NaN.

On Fri, Oct 31, 2014 at 5:32 PM, jreback [email protected] wrote:

can u add some tests if the index had nan/NaT bedded ? (I think will work just verifying)

Reply to this email directly or view it on GitHub:
#8680 (comment)

jreback · 2014-11-01T00:43:21Z

I think na should be ignored for monotonic checks? eg it doesn't change anything
you can test with cur != cur (though for i8 u have to check the tslib.iNaT value)

searchsorted will work but I think u have to trick it a bit - eg see how argmax and such are done (in base.py)

shoyer · 2014-11-01T04:27:19Z

@jreback OK, I looked into searchsorted with missing values.

argmax and the like can use the NA skipping versions from bottleneck with no speed penalty, since those functions need to look at all the elements, anyways. However, for binary search there is a steep price, because skipping an arbitrary number of missing values means the algorithm runs in O(n) time instead of O(log(n)).

And indeed, np.searchsorted does not handle out of order NaN:

In [9]: x = np.array([1, 2, np.nan, np.nan, np.nan, 3, 4, np.nan, 5])

In [10]: x.searchsorted(3)
Out[10]: 2

In [11]: x.searchsorted(4)
Out[11]: 2

In [12]: x.searchsorted(5)
Out[12]: 2

So, given that our current implementation is buggy, we could write our own version of searchsorted that skips NA values (at least if the array has any missing values), even though in the worst case scenario is no better than checking every element. But this seems like a lot of work, and given how pathological indexes with NA values are, I question if it is worth the effort.

Indeed, even if searchsorted sometimes makes conceptual sense when there are missing values, we would have to come up with a rule for cases where the missing value is a adjacent to the search value, i.e., what should the result of Index([1, nan, 2]).slice_locs(1.5, None) be?

My preferred choice is to simply declare that the presence of NA means an array is not monotonic. I think this is actually a reasonable interpretation (it should be documented), given that the presence of missing values really means the monotonicity is unknown. Users are also likely to assume that monotonicity implies binary search is possible.

jreback · 2014-11-01T04:32:43Z

that seems reasonable - and easy enough to detect in is_monotonic_* routines

shoyer · 2014-11-01T17:06:38Z

Okay, the latest commit should ensure that the presence of NaN or NaT means the index is not monotonic.

Include NaT in the same template ended up pretty awkward... perhaps it would be best to write another function specifically for time types? Or perhaps we don't even really need to check NaT, given that iNaT has a well defined sort order?

jreback · 2014-11-01T17:36:46Z

we had to do this here as well: https://github.com/pydata/pandas/blob/master/pandas/src/generate_code.py#L1194

I think you can just do:

if cur != cur or cur == iNaT or prev == iNat:
    # is a nan

possibly maybe kill perf though

shoyer · 2014-11-01T21:42:26Z

Turned up a bug for get_loc with np.nan on numpy 1.7. The root cause is numpy was raising an IndexError when it should have raised (and now does raise) ValueError:

In [1]: import numpy as np

In [2]: x = np.array([0.1, np.nan])

In [3]: x.item()
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-3-f89491118da8> in <module>()
----> 1 x.item()

IndexError: index 511 is out of bounds for axis 0 with size 2

Added a test and a note to what's new.

@jreback

Fixes pandas-dev#7640, pandas-dev#8625 This is a work in progress, but it's far enough along that I'd love to get some feedback. TODOs (more called out in the code): - [ ] documentation + docstrings - [ ] finish the index methods: - [ ] `get_loc` - [ ] `get_indexer` - [ ] `slice_locs` - [ ] comparison operations - [ ] fix `is_monotonic` (pending pandas-dev#8680) - [ ] ensure sorting works - [ ] arithmetic operations (not essential for MVP) - [ ] cythonize the bottlenecks: - [ ] `from_breaks` - [ ] `_data` - [ ] `Interval`? - [ ] `MultiIndex` - [ ] `Categorical`/`cut` - [ ] serialization - [ ] lots more tests CC @jreback @cpcloud @immerrr

jreback · 2014-11-02T14:52:40Z

pandas/tseries/tests/test_base.py

@@ -85,7 +85,7 @@ def test_asobject_tolist(self):
    def test_minmax(self):
        for tz in self.tz:
            # monotonic
-            idx1 = pd.DatetimeIndex([pd.NaT, '2011-01-01', '2011-01-02',
+            idx1 = pd.DatetimeIndex(['2011-01-01', '2011-01-02',


maybe add a test here (and below), with the nat and assertsFalse for monotonic_incr/decr

see test_is_monotonic_na

jreback · 2014-11-02T18:24:23Z

ok, if you'd rebase/squash. (and maybe do a perf sanity check) I think ok to merge.

Index.is_monotonic will have a performance degradation (still O(n) time) in cases where the Index is decreasing monotonic. If necessary, we could work around this, but I think we can probably get away with this because the fall- back options are much slower and in many cases (e.g., for slice indexing) the next thing we'll want to know is if it's decreasing monotonic, anyways. see GH7860

also: NaN and NaT imply not monotonic

shoyer · 2014-11-02T22:20:29Z

OK, rebased and squashed (I put the first three commits together, which had all passed earlier on Travis pre-squashing).

Ran another performance test and everything looks OK! (I found the -r option for quickly only redoing slow tests.)

So I think we are good to merge

ENH: slicing with decreasing monotonic indexes

jreback · 2014-11-02T22:41:33Z

thank you sir! (may I have another?)

immerrr · 2014-11-06T10:31:38Z

~~Oh, this made it to master with slice bounds switch in reverse monotonic case. I thought I managed to dissuade @shoyer from doing that, bummer :(~~

UPD: disregard this message, I've misread the code that does searchsorted on monotonically decreasing index.

jorisvandenbossche · 2014-11-06T10:49:58Z

@immerrr Nothing is too late if we are convinced things should change!
But, what do you mean exactly?

This is on master:

In [30]: s = pd.Series([1,2,3,4], index=[4,3,2,1])

In [32]: s.loc[3:1]
Out[32]:
3    2
2    3
1    4
dtype: int64

In [33]: s.loc[1:3]
Out[33]: Series([], dtype: int64)

In [34]: s.index.is_monotonic_decreasing
Out[34]: True

So you argue it should be the other way around? s.loc[1:3] still returning the result, and s.loc[3:1] not? (while in s.loc[3:1] the labels are in order, following the order of the index)

immerrr · 2014-11-06T11:03:04Z

Oh, disregard that message, I've misread the code that does searchsorted on monotonically decreasing index. I agree with your example.

jreback reviewed Oct 30, 2014
View reviewed changes

jreback added Enhancement Indexing Related to indexing on series/frames, not to indexes themselves labels Oct 30, 2014

shoyer force-pushed the reversed-monotonic-slices branch from c9b3da6 to c1627f6 Compare October 31, 2014 00:45

shoyer reviewed Oct 31, 2014
View reviewed changes

shoyer force-pushed the reversed-monotonic-slices branch from c1627f6 to c93afd1 Compare October 31, 2014 21:11

jreback added this to the 0.15.1 milestone Oct 31, 2014

shoyer changed the title ~~WIP: slicing with decreasing monotonic indexes~~ ENH: slicing with decreasing monotonic indexes Oct 31, 2014

shoyer mentioned this pull request Nov 2, 2014

WIP/API/ENH: IntervalIndex #8707

Closed

35 tasks

jreback reviewed Nov 2, 2014
View reviewed changes

shoyer added 2 commits November 2, 2014 15:09

ENH: support monotonic decreasing indexes in Index.slice_locs

db6f8fd

also: NaN and NaT imply not monotonic

shoyer force-pushed the reversed-monotonic-slices branch from d41610f to db6f8fd Compare November 2, 2014 20:12

jreback added a commit that referenced this pull request Nov 2, 2014

Merge pull request #8680 from shoyer/reversed-monotonic-slices

1382aa3

ENH: slicing with decreasing monotonic indexes

jreback merged commit 1382aa3 into pandas-dev:master Nov 2, 2014

shoyer deleted the reversed-monotonic-slices branch November 2, 2014 22:46

jorisvandenbossche mentioned this pull request Feb 20, 2016

API: behaviour of label indexing with floats on integer index #12333

Closed

		@@ -1521,7 +1521,7 @@ def test_ix_getitem(self):
		def test_ix_getitem_not_monotonic(self):

ENH: slicing with decreasing monotonic indexes #8680

ENH: slicing with decreasing monotonic indexes #8680

Conversation

shoyer commented Oct 30, 2014

immerrr commented Oct 30, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shoyer commented Oct 30, 2014

hughesadam87 commented Oct 30, 2014

shoyer commented Oct 30, 2014

jreback commented Oct 30, 2014

shoyer commented Oct 30, 2014

shoyer commented Oct 31, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shoyer commented Oct 31, 2014

jreback commented Oct 31, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shoyer commented Oct 31, 2014

jreback commented Oct 31, 2014

shoyer commented Nov 1, 2014

jreback commented Nov 1, 2014

jreback commented Nov 1, 2014

shoyer commented Nov 1, 2014

can u add some tests if the index had nan/NaT bedded ? (I think will work just verifying)

jreback commented Nov 1, 2014

shoyer commented Nov 1, 2014

jreback commented Nov 1, 2014

shoyer commented Nov 1, 2014

jreback commented Nov 1, 2014

shoyer commented Nov 1, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Nov 2, 2014

shoyer commented Nov 2, 2014

jreback commented Nov 2, 2014

immerrr commented Nov 6, 2014

jorisvandenbossche commented Nov 6, 2014

immerrr commented Nov 6, 2014