Introduction of RangeIndex #9977

ARF1 · 2015-04-24T11:21:15Z

closes #939

RangeIndex(1, 10, 2) is a memory saving alternative to Index(np.arange(1, 10,2)): c.f. #939.

This re-implementation is compatible with the current Index() api and is a drop-in replacement for Int64Index(). It automatically converts to Int64Index() when required by operations.

At present only for a minimum number of operations the type is conserved (e.g. slicing, inner-, left- and right-joins). Most other operations trigger creation of an equivalent Int64Index (or at least an equivalent numpy array) and fall back to its implementation.

This PR also extends the functionality of the Index() constructor to allow creation of RangeIndex() with

Index(20)
Index(2, 20)
Index(0, 20, 2)

in analogy to

range(20)
range(2, 20)
range(0, 20, 2)

TODO:

In need of consensus:

If we disallow Index(10) should we also disallow RangeIndex(10)? That would seem like an unfortunate api break with respect to built-in range(), xrange() and numpy arange(). Decision: allow
Should RangeIndex() be allowed as a constructor alternative to the kludgy Index(np.empty(0, dtype='...'))? Decision: allow
Should RangeIndex accept array inputs? If so, what should it do with them? Should it try to infer a corresponding RangeIndex? What should it do with inputs it cannot infer? Decision: delegate all array inputs to Int64Index.
Should the RangeIndex constructor accept dtype as an argument? Decision: accept, but only for array inputs.
If so, should it accept non-int64 dtypes as valid input? Decision: Irrelevant due to delegation.
If so, does it need to check that the other inputs are within the range of the dtype? Decision: Irrelevant due to delegation.
Related to constructor question, should slicing with a list return? Always Int64Index or RangeIndex when possible and Int64Index when not? Decision: slicing with list will always return Int64Index.
What should RangeIndex(10).view('i8') return? I believe RangeIndex(10). Decision: return RangeIndex(10).
What should RangeIndex(10).view('i4') return? I believe Index(RangeIndex(10), dtype='i4'). Decision: return Index(RangeIndex(10), dtype='i4').

`RangeIndex(1, 10, 2)` is a memory saving alternative to `Index(np.arange(1, 10,2))`: c.f. #939. This re-implementation is compatible with the current `Index()` api and is a drop-in replacement for `Int64Index()`. It automatically converts to Int64Index() when required by operations. At present only for a minimum number of operations the type is conserved (e.g. slicing, inner-, left- and right-joins). Most other operations trigger creation of an equivalent Int64Index (or at least an equivalent numpy array) and fall back to its implementation. This PR also extends the functionality of the `Index()` constructor to allow creation of `RangeIndexes()` with ``` Index(20) Index(2, 20) Index(0, 20, 2) ``` in analogy to ``` range(20) range(2, 20) range(0, 20, 2) ```

jreback · 2015-04-24T19:07:10Z

@ARF1 looks like a good start, but need some more testing

https://github.com/pydata/pandas/blob/master/pandas/core/common.py#L2145 needs to default to a RangeIndex, otherwise the entire point of this change is lost. There might be other things like this where Int64Index are hard-coded. I think need to change them (but look thru them as well)
need a section in tests/test_frame.py, where you check .set_index/reset_index and assigning an index. e.g. df.index = range(5) should tests that its doing the right thing (and work !)
need a section (make a separate class like I did for TestCategoricalIndex in tests/test_indexing) to cover various indexing cases. This might be somewhat duplicative, but you can simply cover the most common (e.g.. []/iloc/loc with scalar/slice/list/index/array indexers)
eventually will need a small doc/warning for 0.17.0 (you can't actually write this atm as it doesn't exist)
my brief look saw a couple of changed/deleted tests. you cannot do that for this kind of non-API change. That is the biggest issue here, that you inadvertantly change some API guarantees in (maybe subtle) ways. That is why we need a lot of tests/validation for this.
some kind of vbench / metric to see that this indeed a good idea (I think we agree it is, but a couple of simple metrics for showing memory savings / faster ops, whatever). These tend to become part of the whatsnew (like hey pandas changed this and here's why it was done).

put up a list of todo's at the top of the PR, as detailed as possible, and check off as you are done.

We can do an in-depth review once some of these are checked off.

jreback · 2015-04-24T19:09:39Z

pandas/core/index.py

+                    copy = None
+                elif isinstance(copy, int):
+                    range_constructor = True
+                else:


this seems very kludgy, what is the issue here?

No real issue. It is not required for RangeIndex. I just thought it might be convenient for people to be able to use e.g. Index(5) to create a RangeIndex(5).

It is kludgy since it needs to ensure, that the intent of the user really was to create a RangeIndex when calling Index(...) with a given set of parameters. I don't want to grab inputs that I shouldn't.

If the whole Index(...) calling options extension is an issue or just too complicated I am happy to take it out.

ok, see my other comment. if data is a scalar, then its an error (and btw use lib.isscalar)

I would rather we only implement the range like construction when using RangeIndex explicitly. To make a range index from the index constructor, you could write something like Index(range(...)) (or xrange on Python 2). This could call an alternate constructor, which might just be _simple_new. I'll follow up with a more explicit example when I'm not on my phone.

shoyer · 2015-04-24T20:34:52Z

Here's thought on the constructor:

class Index(IndexOpsMixin, PandasObject):
    # ...

    def __new__(cls, data=None, dtype=None, copy=False, name=None, fastpath=False,
                 tupleize_cols=True, **kwargs):
        # ...

        if all(hasattr(data, a) for a in ['start', 'stop', 'step']):
            return RangeIndex._simple_new(data, name)

# ...

class RangeIndex(Int64Index):
   def __new__(cls, *args, **kwargs):
       # range should be xrange on Python 2
       range_obj = range(*args)
       name = kwargs.pop('name', None)
       if kwargs:
           raise TypeError('too many keyword arguments')
       return _simple_new(range_obj, name)

    @classmethod
    def _simple_new(cls, range_obj, name=None):
        result = object.__new__(cls)
        result._start = range_obj.start
        result._stop = range_obj.stop
        result._step = range_obj.step
        result.name = name
        return result

There are two nice things about this approach:

1, the Index constructor remains entirely backwards compatible.
2. you can explicitly make a range index with either RangeIndex(...) or Index(range(...)).

jreback · 2015-04-24T20:56:24Z

I think there are a ton of edge cases that can be dealt with by having a proper constructor. E.g.

Index(np.arange(5)) -> should return RangeIndex
Index(range(5)) -> RangeIndex
Index([0,1,2,3,4,5]) -> what does this return?

I don't think there will be much explict usage of RangeIndex, but it should be a drop-in replacement for Int64Index, thus should be able to accept the same exact input

ARF1 · 2015-04-24T21:06:06Z

@jreback Thank you very much for taking the time for extensively commenting the code. I really appreciate it.

On some points I will need clarification (see comments in the line notes). Two more fundamental questions I think we should discuss follow below:

Making `RangeIndex` the default index

I held off on making RangeIndex the default index class since I thought a "soft phase-in" might be a better approach than an "all or nothing" approach: Once it is in the code base, people could start to use it and remaining bugs / performance issues could be ironed out before making it the default.

Also, until really all conceivable operations are conserving the RangeIndex type, there will probably be performance implications in some cases since materialization to Int64Index will be triggered by many operations (probably) which will be slower than the use of an already existing index as is the case currently.

Materialization of the index will happen repeatedly if the user calls non-RangeIndex-conserving functions repeatedly on an object indexed with a RangeIndex. (See also my line comments regarding the slicing operation.)

In its present form I do not see RangeIndex as a performance PR. In the long term I have a hunch that it could be but I think we are still far from it. Making it so, is a far larger effort I believe. Also I am not sure I am really up to that task. - I am kind of surprised I managed to get it working at all as a pandas novice. ;-)

Are you 100% certain you want to make it the default index straight away? I really was trying to keep the PR manageable to start with and optimize its performance with smaller, more manageable subsequent PRs.

changed/deleted tests

There is only the one test_constructor_corner(self) test that I deleted.

This became necessary since I suggested allowing more inputs to the Index(...) constructor. If I change that constructor I have to change its tests accordingly. (See also line comments.)

I could restore the Index(...) constructor to its current, more limited functionality in which case I can then also restore the test. Should I to that?

shoyer · 2015-04-24T21:11:51Z

RE the cost of repeatedly allocating the values -- I would cache the array of values on the RangeIndex once accessed.

jreback · 2015-04-24T21:30:27Z

@ARF1 np, always nice to have someone work on the internals!

I think the original intention WAS to have RangeIndex be the default index created. However as you have noted then the semantics of the operations need to be quite clearly defined, e.g. when and when not to materialize to a baser type (e.g. we experience similar discussion on when to morph CategoricalIndex to a Index).

@shoyer does have a good point that simple, predictible rules are useful.

If its release in a more limited scope (e.g. say non-default and slices only preserving, with a similar to existing constructor). Then how/when would this actually be created / where would it be useful enough that one would explicity create it?

See that's the rub. Pandas does lots of inference for users (arguable points whether too much at times of course!).

So if somethings natural (and I think we agree this is), then it should be put in the natural places for utility.

shoyer · 2015-04-24T22:16:47Z

@jreback My notion of a drop in replacement is different. I don't think we should allow for RangeIndex(np.arange(...)) and we certainly should not try to infer a range index from Index(np.arange(...)). That is simply too magical, and will have unexpected performance implications -- converting an integer ndarray to an Index should take constant, not linear time. If a user wants this sort of thing, it's not hard to construct the arguments to pass to RangeIndex on their own (e.g., usually this will just be RangeIndex(array[0], array[-1])).

I would like to see this enabled as the default index (when none is explicitly provided) for two reasons:

Currently, the default index is very wasteful. It often gets created and then immediately thrown away. This happens all the time in user code.
Testing RangeIndex as the default will be necessary to be confident that it works properly, and at that point, we've already done most of the hard work here.

ARF1 · 2015-04-25T09:19:39Z

@shoyer Regarding your constructor suggestion above: the xrange object in python 2 is much more limited than the range object in python 3. In particular it does not have start, stop or step properties. We could infer these by testing the return from the object - except for stop for which we could at best determine a new value that will lead to the same results.

To be honest, I do not really see the point however. What sane user would write Index(xrange(0,10)) when Index(0, 10) does the same thing?

In short: of course we can do it, but is it worth the effort?

@shoyer Regarding your suggestion of caching the values of RangeIndex when they are accessed: I probably misunderstood you here but:

If we cache all the values in a RangeIndex why not just convert to a Int64Index in the first place? With your suggestion we are carrying around two objects describing the same thing that we need to keep synchronized. - Sounds like a nightmare to me.
If we cache only parts of the values in the RangeIndex: how would that work in practice?

ARF1 · 2015-04-25T09:23:35Z

@jreback @shoyer I added a list of topics on which we still need a consensus (or alternatively an authoritative decision) to the OP. - I am off for today. Have a nice weekend.

ARF1 · 2015-04-25T17:06:50Z

@shoyer Re caching values: Never mind, I was a bit slow. I think I now understand what you meant: For RangeIndex to have a sort of lazy instantiation of a private Int64Index object the first time the values or a Int64Index is required. I am going to implement it and we'll see whether we mean the same thing.

shoyer · 2015-04-25T17:17:03Z

For caching values, you can use the cache_readonly decorator. So this looks something like:

@cache_readonly
def _data(self):
    # ...

shoyer · 2015-04-25T17:27:01Z

As for handling range/xrange as an argument:

I think this is nice to support, simply because it's slightly fewer characters to type than pd.RangeIndex. It's also nice because it gives us an unambiguous way to use the generic Index constructor to make a RangeIndex. (I think both @jreback and I have noted that supporting Index(start, stop, step) seems like a bad idea.)

It's definitely unfortunate Python 2 doesn't define start, step and stop for xrange. But that information is all still accessible via repr. So you could write something like:

if PY2 and isinstance(data, xrange):
    g, = re.match('xrange\((.+)\)', repr(data)).groups()
    return RangeIndex(*map(int, m.split(', ')))

Not so elegant, but it gets the job done.

shoyer · 2015-04-25T17:33:03Z

pandas/core/index.py

+        if np.isscalar(key):
+            n = int(key)
+            if n != key:
+                return super_getitem(key)


I think this should always raise KeyError? Or do we actually allow for non-integer __getitem__ on a Int64Index? If I recall correctly, numpy has deprecated that sort of indexing...

IMHO, accepting floats that can be cast to integers is correct:

In [6]: pd.Int64Index([1,2,3,4])[1.0] Out[6]: 2

Another issue is, that np.isscalar is too permissive. (It allows single-element numpy arrays.) Probably better to use isinstance(key, numbers.Integer) or something similar to replicate numpy __getitem__ behaviour.

this is deprecated for all other index types except Float64Index ; we do not allow float indexers for a lot of reasons

Ok, that changes things. Can we then for consistency reject all non-int-type inputs for RangeIndex? Especially for the constructor.

Yes, please do.

Cache a private Int64Index object the first time it or its values are required. Restore Index(5) as error. Restore its test. Allow Index(0, 5) and Index(0, 5, 1). Make RangeIndex immutable. See start, stop, step properties. In test_constructor(): check class, attributes (possibly including dtype). In test_copy(): check that copy is not identical (but equal) to the existing. In test_duplicates(): Assert is_unique and has_duplicates return correct values.

ARF1 · 2015-04-27T10:38:44Z

@jreback @shoyer I spun-off work on the "default index issue" to a new PR (#9999) to avoid messing up this PR with its more limited scope and to keep the discussions separate.

* enh: set RangeIndex as default index * fix: pandas.io.packers: encode() and decode() for RangeIndex * enh: array argument pass-through * fix: reindex * fix: use _default_index() in pandas.core.frame.extract_index() * fix: pandas.core.index.Index._is() * fix: add RangeIndex to ABCIndexClass * fix: use _default_index() in _get_names_from_index() * fix: pytables tests * fix: MultiIndex.get_level_values() * fix: RangeIndex._shallow_copy() * fix: null-size RangeIndex equals() comparison * enh: make RangeIndex.is_unique immutable

* optimize argsort() * optimize tolist() * comment clean-up

ARF1 · 2015-05-07T16:33:28Z

I am considering making RangeIndex a subclass of slice to be able to use a RangeIndex for slicing objects without instantiating a numpy array. (Since the __getitem__() functions seem to consistently check for instances of slice before checking for instances of ndarray.)

What do you think?

jreback · 2015-05-07T16:39:47Z

no, don't subclass except for Index. This just makes things harder / more confusing.

shoyer · 2015-05-07T16:42:48Z

Yeah, that seems likely to be quite problematic. For example, a negative start has a totally different meaning between range (start at that value) and slice (count from the end).

On Thu, May 7, 2015 at 9:39 AM, jreback [email protected] wrote:

no, don't subclass except for Index. This just makes things harder / more confusing.

Reply to this email directly or view it on GitHub:
#9977 (comment)

auvipy · 2015-12-03T04:51:59Z

@ARF1 could u plz fix the merge conflict?

jreback · 2015-12-23T22:11:22Z

replace by #11892
cc @ARF1

ARF added 2 commits April 24, 2015 13:11

restore Index() fastpath precedence

872775d

This was referenced Apr 24, 2015

Introduction of RangeIndex #9961

Closed

Add a more memory-efficient RangeIndex-sort of thing to avoid large arange(N) indexes in some cases #939

Closed

No way to construct mixed dtype DataFrame without total copy, proposed solution #9216

Closed

jreback added this to the 0.17.0 milestone Apr 24, 2015

jreback added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance API Design labels Apr 24, 2015

jreback reviewed Apr 24, 2015
View reviewed changes

shoyer reviewed Apr 25, 2015
View reviewed changes

ARF1 mentioned this pull request Apr 27, 2015

RangeIndex as default index #9999

Closed

ARF added 4 commits May 2, 2015 00:03

fix slicing

e8c31c5

fix view

b4d1085

enh: various performance optimizations

c9ac12f

* optimize argsort() * optimize tolist() * comment clean-up

ARF1 mentioned this pull request May 7, 2015

PERF: BlockPlacement.copy() speed-up #10073

Closed

jreback modified the milestones: Next Major Release, 0.17.0 Aug 15, 2015

jreback added the Prio-high label Aug 22, 2015

jreback mentioned this pull request Dec 23, 2015

ENH: RangeIndex redux #11892

Merged

2 tasks

jreback closed this Dec 30, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduction of RangeIndex #9977

Introduction of RangeIndex #9977

ARF1 commented Apr 24, 2015

jreback commented Apr 24, 2015

jreback Apr 24, 2015

ARF1 Apr 24, 2015

jreback Apr 24, 2015

shoyer Apr 24, 2015

shoyer commented Apr 24, 2015

jreback commented Apr 24, 2015

ARF1 commented Apr 24, 2015

shoyer commented Apr 24, 2015

jreback commented Apr 24, 2015

shoyer commented Apr 24, 2015

ARF1 commented Apr 25, 2015

ARF1 commented Apr 25, 2015

ARF1 commented Apr 25, 2015

shoyer commented Apr 25, 2015

shoyer commented Apr 25, 2015

shoyer Apr 25, 2015

ARF1 May 2, 2015

jreback May 2, 2015

ARF1 May 2, 2015

shoyer May 2, 2015

ARF1 commented Apr 27, 2015

ARF1 commented May 7, 2015

jreback commented May 7, 2015

shoyer commented May 7, 2015

no, don't subclass except for `Index`. This just makes things harder / more confusing.

auvipy commented Dec 3, 2015

jreback commented Dec 23, 2015

Introduction of RangeIndex #9977

Introduction of RangeIndex #9977

Conversation

ARF1 commented Apr 24, 2015

jreback commented Apr 24, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shoyer commented Apr 24, 2015

jreback commented Apr 24, 2015

ARF1 commented Apr 24, 2015

Making RangeIndex the default index

changed/deleted tests

shoyer commented Apr 24, 2015

jreback commented Apr 24, 2015

shoyer commented Apr 24, 2015

ARF1 commented Apr 25, 2015

ARF1 commented Apr 25, 2015

ARF1 commented Apr 25, 2015

shoyer commented Apr 25, 2015

shoyer commented Apr 25, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ARF1 commented Apr 27, 2015

ARF1 commented May 7, 2015

jreback commented May 7, 2015

shoyer commented May 7, 2015

no, don't subclass except for Index. This just makes things harder / more confusing.

auvipy commented Dec 3, 2015

jreback commented Dec 23, 2015

Making `RangeIndex` the default index

no, don't subclass except for `Index`. This just makes things harder / more confusing.