-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Introduction of RangeIndex #9977
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
`RangeIndex(1, 10, 2)` is a memory saving alternative to `Index(np.arange(1, 10,2))`: c.f. #939. This re-implementation is compatible with the current `Index()` api and is a drop-in replacement for `Int64Index()`. It automatically converts to Int64Index() when required by operations. At present only for a minimum number of operations the type is conserved (e.g. slicing, inner-, left- and right-joins). Most other operations trigger creation of an equivalent Int64Index (or at least an equivalent numpy array) and fall back to its implementation. This PR also extends the functionality of the `Index()` constructor to allow creation of `RangeIndexes()` with ``` Index(20) Index(2, 20) Index(0, 20, 2) ``` in analogy to ``` range(20) range(2, 20) range(0, 20, 2) ```
@ARF1 looks like a good start, but need some more testing
put up a list of todo's at the top of the PR, as detailed as possible, and check off as you are done. We can do an in-depth review once some of these are checked off. |
copy = None | ||
elif isinstance(copy, int): | ||
range_constructor = True | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems very kludgy, what is the issue here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No real issue. It is not required for RangeIndex
. I just thought it might be convenient for people to be able to use e.g. Index(5)
to create a RangeIndex(5)
.
It is kludgy since it needs to ensure, that the intent of the user really was to create a RangeIndex
when calling Index(...)
with a given set of parameters. I don't want to grab inputs that I shouldn't.
If the whole Index(...)
calling options extension is an issue or just too complicated I am happy to take it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, see my other comment. if data is a scalar, then its an error (and btw use lib.isscalar
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather we only implement the range like construction when using RangeIndex explicitly. To make a range index from the index constructor, you could write something like Index(range(...))
(or xrange on Python 2). This could call an alternate constructor, which might just be _simple_new
. I'll follow up with a more explicit example when I'm not on my phone.
Here's thought on the constructor: class Index(IndexOpsMixin, PandasObject):
# ...
def __new__(cls, data=None, dtype=None, copy=False, name=None, fastpath=False,
tupleize_cols=True, **kwargs):
# ...
if all(hasattr(data, a) for a in ['start', 'stop', 'step']):
return RangeIndex._simple_new(data, name)
# ...
class RangeIndex(Int64Index):
def __new__(cls, *args, **kwargs):
# range should be xrange on Python 2
range_obj = range(*args)
name = kwargs.pop('name', None)
if kwargs:
raise TypeError('too many keyword arguments')
return _simple_new(range_obj, name)
@classmethod
def _simple_new(cls, range_obj, name=None):
result = object.__new__(cls)
result._start = range_obj.start
result._stop = range_obj.stop
result._step = range_obj.step
result.name = name
return result There are two nice things about this approach: 1, the |
I think there are a ton of edge cases that can be dealt with by having a proper constructor. E.g.
I don't think there will be much explict usage of |
@jreback Thank you very much for taking the time for extensively commenting the code. I really appreciate it. On some points I will need clarification (see comments in the line notes). Two more fundamental questions I think we should discuss follow below: Making
|
RE the cost of repeatedly allocating the values -- I would cache the array of values on the RangeIndex once accessed. |
@ARF1 np, always nice to have someone work on the internals! I think the original intention WAS to have @shoyer does have a good point that simple, predictible rules are useful. If its release in a more limited scope (e.g. say non-default and slices only preserving, with a similar to existing constructor). Then how/when would this actually be created / where would it be useful enough that one would explicity create it? See that's the rub. Pandas does lots of inference for users (arguable points whether too much at times of course!). So if somethings natural (and I think we agree this is), then it should be put in the natural places for utility. |
@jreback My notion of a drop in replacement is different. I don't think we should allow for I would like to see this enabled as the default index (when none is explicitly provided) for two reasons:
|
@shoyer Regarding your constructor suggestion above: the To be honest, I do not really see the point however. What sane user would write In short: of course we can do it, but is it worth the effort? @shoyer Regarding your suggestion of caching the values of
|
@shoyer Re caching values: Never mind, I was a bit slow. I think I now understand what you meant: For |
For caching values, you can use the
|
As for handling I think this is nice to support, simply because it's slightly fewer characters to type than It's definitely unfortunate Python 2 doesn't define start, step and stop for
Not so elegant, but it gets the job done. |
if np.isscalar(key): | ||
n = int(key) | ||
if n != key: | ||
return super_getitem(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should always raise KeyError
? Or do we actually allow for non-integer __getitem__
on a Int64Index
? If I recall correctly, numpy has deprecated that sort of indexing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, accepting floats that can be cast to integers is correct:
In [6]: pd.Int64Index([1,2,3,4])[1.0]
Out[6]: 2
Another issue is, that np.isscalar
is too permissive. (It allows single-element numpy arrays.) Probably better to use isinstance(key, numbers.Integer)
or something similar to replicate numpy __getitem__
behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is deprecated for all other index types except Float64Index ; we do not allow float indexers for a lot of reasons
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, that changes things. Can we then for consistency reject all non-int-type inputs for RangeIndex
? Especially for the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please do.
Cache a private Int64Index object the first time it or its values are required. Restore Index(5) as error. Restore its test. Allow Index(0, 5) and Index(0, 5, 1). Make RangeIndex immutable. See start, stop, step properties. In test_constructor(): check class, attributes (possibly including dtype). In test_copy(): check that copy is not identical (but equal) to the existing. In test_duplicates(): Assert is_unique and has_duplicates return correct values.
* enh: set RangeIndex as default index * fix: pandas.io.packers: encode() and decode() for RangeIndex * enh: array argument pass-through * fix: reindex * fix: use _default_index() in pandas.core.frame.extract_index() * fix: pandas.core.index.Index._is() * fix: add RangeIndex to ABCIndexClass * fix: use _default_index() in _get_names_from_index() * fix: pytables tests * fix: MultiIndex.get_level_values() * fix: RangeIndex._shallow_copy() * fix: null-size RangeIndex equals() comparison * enh: make RangeIndex.is_unique immutable
* optimize argsort() * optimize tolist() * comment clean-up
I am considering making What do you think? |
no, don't subclass except for |
Yeah, that seems likely to be quite problematic. For example, a negative start has a totally different meaning between range (start at that value) and slice (count from the end). On Thu, May 7, 2015 at 9:39 AM, jreback [email protected] wrote:
|
@ARF1 could u plz fix the merge conflict? |
closes #939
RangeIndex(1, 10, 2)
is a memory saving alternative toIndex(np.arange(1, 10,2))
: c.f. #939.This re-implementation is compatible with the current
Index()
api and is a drop-in replacement forInt64Index()
. It automatically converts to Int64Index() when required by operations.At present only for a minimum number of operations the type is conserved (e.g. slicing, inner-, left- and right-joins). Most other operations trigger creation of an equivalent Int64Index (or at least an equivalent numpy array) and fall back to its implementation.
This PR also extends the functionality of the
Index()
constructor to allow creation ofRangeIndex()
within analogy to
TODO:
Int64Index
object the first time it or its values are required.Index(5)
as error. Restore its test. AllowIndex(0, 5)
andIndex(0, 5, 1)
.RangeIndex
immutable. Seestart
,stop
,step
properties.test_constructor()
: check class, attributes (possibly including dtype).test_copy()
: check that copy is not identical (but equal) to the existing.test_join_non_unique()
: test .append for ri.append(int64index) and int64index.append(ri).test_duplicates()
: Assertis_unique
andhas_duplicates
return correct values.test_view()
.test_constructor()
and possible add further tests.RangeIndex()
validity, adjusttest_pickle_compat_construction()
.test_slice_specialised()
.xrange
objects as inputRangeIndex
as the default index.Index
constructor: use ABCs and named arguments__getitem__()
: replacenp.isscalar
to exclude singe-element numpy arrayspandas.io.packers
In need of consensus:
Index(10)
should we also disallowRangeIndex(10)
? That would seem like an unfortunate api break with respect to built-inrange()
,xrange()
and numpyarange()
. Decision: allowRangeIndex()
be allowed as a constructor alternative to the kludgyIndex(np.empty(0, dtype='...'))
? Decision: allowRangeIndex
accept array inputs? If so, what should it do with them? Should it try to infer a correspondingRangeIndex
? What should it do with inputs it cannot infer? Decision: delegate all array inputs toInt64Index
.RangeIndex
constructor acceptdtype
as an argument? Decision: accept, but only for array inputs.Int64Index
orRangeIndex
when possible andInt64Index
when not? Decision: slicing with list will always returnInt64Index
.RangeIndex(10).view('i8')
return? I believeRangeIndex(10)
. Decision: returnRangeIndex(10)
.RangeIndex(10).view('i4')
return? I believeIndex(RangeIndex(10), dtype='i4')
. Decision: returnIndex(RangeIndex(10), dtype='i4')
.