Skip to content

API: should the RangeIndex constructor accept start=range_index? #12067

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
shoyer opened this issue Jan 17, 2016 · 6 comments · Fixed by #40241
Closed

API: should the RangeIndex constructor accept start=range_index? #12067

shoyer opened this issue Jan 17, 2016 · 6 comments · Fixed by #40241
Labels
API Design Constructors Series/DataFrame/Index/pd.array Constructors Enhancement Index Related to the Index class or subclasses
Milestone

Comments

@shoyer
Copy link
Member

shoyer commented Jan 17, 2016

xref #11892 (see changes and discussion)

RangeIndex, a new Int64Index subclass that will be used as the new default index (currently) has the following signature: RangeIndex(start=None, stop=None, step=None, name=None...)

For converting an xrange/range object into a RangeIndex, we have the constructor RangeIndex.from_range.

Although it isn't documented, the current constructor supports accepting a RangeIndex instances as the start argument, in which case a copy of the original index is produced, e.g., RangeIndex(RangeIndex(0, 10, 1)) -> RangeIndex(0, 10, 1).

I think this is a mistake: it is reusing the start parameter for a different type of argument, at odds with the documented behavior and predictable function signatures. If a user wants to create another RangeIndex from a RangeIndex, they can simply use the Index construtor, e.g., Index(RangeIndex(...)).

@jreback argues that not accepting a RangeIndex as the first argument to RangeIndex would be an API break, because it would create an Index constructor that is not idempotent. It would be "a major break with the current Index design".

Thoughts from anyone else?

@shoyer shoyer mentioned this issue Jan 17, 2016
2 tasks
@wesm
Copy link
Member

wesm commented Jan 17, 2016

In general constructors in pandas are "too smart", but we are stuck in a lot of cases.

I don't know that Index constructor idempotency is an API guarantee we necessarily need to make. It was part of the original effort to be similar to np.array. Further, are we expecting people to actually use pd.RangeIndex in many direct ways? I only ever thought of RangeIndex as being an Int64Index-like performance optimization. You can see my original patch here

ef878f0

For internal pandas development, we should probably start biasing toward more precise internal APIs where possible, but when user convenience is concerned (where I've typically leaned toward more "polymorphic" / intuitive constructors at the expense of purity) there's a decent argument.

So I know it is a significant improvement and lowers memory usage, but is RangeIndex something we even wish to advertise heavily, since for most users it will just "be there" incidentally through use of read_csv or DataFrame or other constructors that create default indexes?

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves API Design labels Jan 17, 2016
@jreback
Copy link
Contributor

jreback commented Jan 17, 2016

RangeIndex was/is really an internal performance substitute for Int64Index. However we DO have an (maybe implicit) guarantee on idempotency for a Index objects.

So happy to discuss:

  • I think its much easier in a new issue, as I doubt that long twisting conversation (about multiple topics) was easy to follow
  • as discussed this is mostly internal, though we are exposing it as a range-like object, hence the default constructor
  • I agree with a bit too much magic in places, but I find this convenient and unambiguous, as I indicated to @shoyer , I don't find any good reason not to do this.
  • contrary to @shoyer 's statement, RangeIndex(RangeIndex(...)) merely returns the original object (you can of course get a copy via copy=True, but it is False by default).
  • many-many times internally we have to guarantee that something is an index, usually via _ensure_index, but this goes thru Index, not always what you want. I don't think this is actually used very much (e.g. almost always self._shallow_copy() is called), but I don't see any good reason to break from the: self._constructor(self) possibility. Then you have special casing all around. Something which I feel is SO important at the size of the library. Special cases just causes bugs.

@shoyer made two points: is this user-facing or a performance enhancement.

It is both, hence the intertanglement of the two.

@kawochen
Copy link
Contributor

@shoyer range does the same (10 in range(10) is/means stop=10). Do you think adequate documentation would be a good enough compromise (current error message of RangeIndex constructor says start must be integers)?

@jreback
Copy link
Contributor

jreback commented Jan 17, 2016

In [1]: Index(xrange(0,10))
Out[1]: RangeIndex(start=0, stop=10, step=1)

In [5]: Index(pd.RangeIndex(0, 10))
Out[5]: RangeIndex(start=0, stop=10, step=1)

In [2]: pd.RangeIndex(pd.RangeIndex(0,10))
Out[2]: RangeIndex(start=0, stop=10, step=1)

In [3]: pd.RangeIndex.from_range(xrange(0,10))
Out[3]: RangeIndex(start=0, stop=10, step=1)

to satisfy @shoyer concern about magic and my own about idempotency/consistency

I would suggest we do this:

# [1] (though impl changes slightly),, but API the same
In [1]: Index(xrange(0,10))
Out[1]: RangeIndex(start=0, stop=10, step=1)

In [5]: Index(pd.RangeIndex(xrange(0, 10)))
Out[5]: RangeIndex(start=0, stop=10, step=1)

# [2] stays the same
In [2]: pd.RangeIndex(pd.RangeIndex(0,10))
Out[2]: RangeIndex(start=0, stop=10, step=1)

# but also accepts range (so this covers [3] from above)
In [2]: pd.RangeIndex(xrange(0,10))
Out[2]: RangeIndex(start=0, stop=10, step=1)

# make the 'range like' constructor something else (from_steps)
In [3]: pd.RangeIndex.from_steps(0, 10)
Out[3]: RangeIndex(start=0, stop=10, step=1)
  • so this makes RangeIndex constructor very much like the other Index constructor
def __new__(cls, data=None, dtype=None, copy=False, name=None, fastpath=False, **kwargs):
  • allows idempotency, and no possibility of confusion/magic
  • makes de-facto equal actions yield equal results (e.g. xrange/RangeIndex) have a single argument constructor from a 'range' object. These are quite similar, but very different from passing 3 integers.

if consensus, this is a pretty easy change.

@shoyer
Copy link
Member Author

shoyer commented Jan 17, 2016

@jreback the only downside of your new proposal is that the repr now no longer matches the constructor. I suppose we could make it RangeIndex(range(0, 10, 1)) but that does look a little strange to me.

@jreback
Copy link
Contributor

jreback commented Jan 17, 2016

In [20]: df.index
Out[20]: RangeIndex(start=0, stop=5, step=1)

hmm, right.

could do

RangeIndex.from_steps(0, 10, 1)

bug agreed that is a bit more verbose

of course could accept start,stop,step as **kwargs and correctly handle that
(though we are then back to the original constructor that I had)

@toobaz toobaz added Index Related to the Index class or subclasses and removed Indexing Related to indexing on series/frames, not to indexes themselves labels Jun 28, 2019
@jbrockmendel jbrockmendel added the Constructors Series/DataFrame/Index/pd.array Constructors label Feb 22, 2020
@jreback jreback added this to the 1.3 milestone Mar 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Constructors Series/DataFrame/Index/pd.array Constructors Enhancement Index Related to the Index class or subclasses
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants