Skip to content

API: consistency with .ix and .loc for getitem operations (GH8613) #9566

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Mar 4, 2015

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Feb 28, 2015

closes #8613

In [1]:      df = DataFrame(np.random.randn(5,4), columns=list('ABCD'), index=date_range('20130101',periods=5))

In [2]:      df
Out[2]: 
                   A         B         C         D
2013-01-01 -1.380065  1.221596  0.279703 -0.119258
2013-01-02  1.684202 -0.202251 -0.961145 -1.595015
2013-01-03 -0.623634 -2.377577 -3.024554 -0.298809
2013-01-04 -1.251699  0.456356  1.257002  1.465878
2013-01-05  0.687818 -2.125079  1.454911  1.914207

In [5]:      s = Series(range(5),[-2,-1,1,2,3])

In [6]:      s
Out[6]: 
-2    0
-1    1
 1    2
 2    3
 3    4
dtype: int64

In [7]:      df.loc['2013-01-02':'2013-01-10']
Out[7]: 
                   A         B         C         D
2013-01-02  1.684202 -0.202251 -0.961145 -1.595015
2013-01-03 -0.623634 -2.377577 -3.024554 -0.298809
2013-01-04 -1.251699  0.456356  1.257002  1.465878
2013-01-05  0.687818 -2.125079  1.454911  1.914207

In [9]:      s.loc[-10:3]
Out[9]: 
-2    0
-1    1
 1    2
 2    3
 3    4
dtype: int64
# this used to be a KeyError
In [15]: df.loc[2:3]
TypeError: Cannot index datetime64 with <type 'int'> keys

Slicing with float indexers; now works for ix (loc worked before)

In [3]: s.loc[-1.0:2]
Out[3]: 
-1    1
 1    2
 2    3
dtype: int64

In [4]: s.ix[-1.0:2]
Out[4]: 
-1    1
 1    2
 2    3
dtype: int64

# [] raises
In [5]: s[-1.0:2]
TypeError: cannot do slice start value indexing on <class 'pandas.core.index.Int64Index'> with these indexers [-1.0] of <type 'float'>

@jreback jreback added Bug Indexing Related to indexing on series/frames, not to indexes themselves API Design labels Feb 28, 2015
@jreback jreback added this to the 0.16.0 milestone Feb 28, 2015
@jreback jreback force-pushed the loc branch 4 times, most recently from f5b9f7e to d61b9b5 Compare February 28, 2015 22:42
@jreback
Copy link
Contributor Author

jreback commented Feb 28, 2015

cc @immerrr
@jorisvandenbossche
@shoyer

I think this should resolve things and make .loc strict, while matching .ix (notably in that .ix tries for positional access when it can, but .loc will refuse to guess).


The behavior of a small sub-set of edge cases for indexers have changed (:issue:`8613`):

- to conform the intent of ``.loc`` to allow the same behavior for slice indexing as ``.ix`` when indexers are not found
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would here also explain it in words, something like "slicing where the start and/or stop bound is not found in the index is now allowed (previously gave KeyError), as is the case for ix. The change is only for slicing, not when indecxing with a single label."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jreback jreback force-pushed the loc branch 2 times, most recently from 1141055 to 2c2cf61 Compare March 1, 2015 16:06
@@ -1243,25 +1243,7 @@ def _has_valid_type(self, key, axis):
# boolean

if isinstance(key, slice):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 to deleting lines from core.indexing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whoo hoo!

@shoyer
Copy link
Member

shoyer commented Mar 1, 2015

First of all, big thanks for working on this, @jreback!

However, I don't think we need to enforce distinctions between float and int for slice bounds, because there is no ambiguity in the ordering between integers and floats (whereas the ordering with respect to strings, for example, is indeed problematic). Users shouldn't have to check the type of their index (if they know it's numeric) before doing indexing.

In particular:

  1. I think it should be OK to index an Int64Index with floats, like .loc[0.5:2.5] or .loc[-1.0:2].
  2. I think it should be OK to index a Float64Index with integers, like .loc[0:10]. (Of course, these are still to be interpreted as labels, not integer positions)

The binary search logic for determining slice bound locations should already work for both of these cases.

@shoyer
Copy link
Member

shoyer commented Mar 1, 2015

To give a more concrete example from my experience: an index corresponding to a variable like longitude might variously consist of integer (e.g., if it's in multiples of whole degrees) or as floating point numbers (e.g., if it's in multiples of half degrees). It's a bad user experience to need to cast their slice bounds to the type of index. Code like s.loc[min_lon:max_lon] should work regardless of dtypes.

@jreback
Copy link
Contributor Author

jreback commented Mar 1, 2015

@shoyer

I added in these checks, mainly because numpy is deprecating float indexers (IIRC it has a warning now), and we have been warning on it.

Its easy to take out, but I think that indexing with a float on an integer index is really not in the .loc domain. This is moving back to the .ix functionaility (which actually doesn't work in 0.15.2 or atm), though .loc did (which is odd).

though it IS convenient.

currently we don't allow this for [],.ix (but allow for .loc in 0.15.2.

simple example from 0.15.2

In [1]: s = Series(range(5))

In [2]: s[1.5:3]
TypeError: the slice start value [1.5] is not a proper indexer for this index type (Int64Index)

In [3]: s.ix[1.5:3]
TypeError: the slice start value [1.5] is not a proper indexer for this index type (Int64Index)

In [4]: s.loc[1.5:3]
Out[4]: 
2    2
3    3
dtype: int64

These all fail with
TypeError: cannot do slice start value indexing on <class 'pandas.core.index.Int64Index'> with these indexers [1.5] of <type 'float'> in master/0.16.0

@jreback
Copy link
Contributor Author

jreback commented Mar 1, 2015

@jorisvandenbossche what do you think?

@shoyer
Copy link
Member

shoyer commented Mar 1, 2015

I agree that using float indexers is a based idea for position based indexing (.iloc). Position based indexing should always use integers. We should be consistent with numpy for this.

But indexing with .loc uses labels. Here (e.g., with searchsorted), numpy is more flexible -- the important thing is that there is a well defined sort order.

@jreback
Copy link
Contributor Author

jreback commented Mar 1, 2015

this should then also work for .ix/[]?

@shoyer
Copy link
Member

shoyer commented Mar 1, 2015

To elaborate:

There is never a good reason to use floats for position based indexing. Numpy warns and/or raises because using a float is an indication that your code probably has a bug.

In contrast, there certainly are good reasons to use floats for label based indexing, because the labels might include floats themselves (or might not, depending on dtypes).

@shoyer
Copy link
Member

shoyer commented Mar 1, 2015

.ix and [] are an unpredictable mess! We hadn't even written down the rules for exactly how indexing with [] works last time I checked.

If I had my way, we would make some hard choices and deprecate/remove fall-back indexing all together (#9213). In the mean-time, I suspect the best approach is to simply try to break/change as little existing code as possible.

@jreback
Copy link
Contributor Author

jreback commented Mar 1, 2015

ok, here is the change (plus making some testing more robust)

In [3]:  index = tm.makeIntIndex()
In [4]:  s = Series(np.arange(len(index))+10,index+5)
In [5]: s
Out[5]: 
5     10
6     11
7     12
8     13
9     14
10    15
11    16
12    17
13    18
14    19
dtype: int64

In [6]: s.loc[6.0:8.5]
Out[6]: 
6    11
7    12
8    13
dtype: int64

In [7]: s.loc[5.5:8.5]
Out[7]: 
6    11
7    12
8    13
dtype: int64

In [8]: s.loc[5.5:8.0]
Out[8]: 
6    11
7    12
8    13
dtype: int64

I could enable for []/ix, but then would s[5.0:8.0] become a fallback scenario (yes I think)
so what would s[5.5,8.5] do (non-fallback I think), so they become very sensitive to the actual values, hence bad

@jorisvandenbossche
Copy link
Member

I agree with the logic of allowing float indexers for integer indexes when using loc (so as you now implemented)

Ideally, I think it should also work for ix (to keep ix as much as possible to the logic of try: s.loc[]; except: s.iloc[]. In principle , we say that for an integer index axis, we do never integer location fallback (http://pandas.pydata.org/pandas-docs/stable/indexing.html#different-choices-for-indexing), so following the loc behaviour should not be ambiguous in this case I think.

For [], it is something else. For that, we should first have a better documented overview of it behaviour (I started with that last week, will try to put it up as a PR)

@jreback
Copy link
Contributor Author

jreback commented Mar 2, 2015

ok, I added the .ix support for float indexers on an Int index. It is trivial to make this work for []. I am thinking that this should have the same semantics as .ix (e.g. label based with an int index). It works the same.

@jorisvandenbossche
Copy link
Member

But for [] slicing is 'integer-location' based (an exception on the other types of slicing), so this should follow the semantics of iloc in this case, disallowing floats.

@jorisvandenbossche
Copy link
Member

Example:

In [46]: s = Series(range(4), index=[-1,0,1,2])

In [47]: s
Out[47]:
-1    0
 0    1
 1    2
 2    3
dtype: int64

In [48]: s[0:2]    # <--- slicing = integer location based
Out[48]:
-1    0
 0    1
dtype: int64

In [49]: s[0]    # <--- single label = label based
Out[49]: 1

@jreback
Copy link
Contributor Author

jreback commented Mar 2, 2015

ok, is updated, any more comments
@jorisvandenbossche @shoyer

Notes
-----
Value of `side` parameter should be validated in caller.

"""

# pass thru float indexers if we have a numeric type index
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic here seems OK, but a little confusing in context. To me it would make more sense to put it on the subclass _maybe_cast_slice_bound methods, even if that means a little more duplication. Otherwise, things will break, e.g., as soon as we add IntervalIndex.

@shoyer
Copy link
Member

shoyer commented Mar 3, 2015

Generally seems fine by me.

@jreback
Copy link
Contributor Author

jreback commented Mar 4, 2015

indexing is such a rabbit hole :)

I cleaned up _convert_slice_indexer a bit. I tried to use _maybe_cast_slice_bound directly there (I added the kind argument which is the type of indexing, e.g. ix/loc/iloc/getitem). But so many cases are handled that its really hard to get this right. Because the different kinds are handled in different places

its tricky because you have the kinds of indexers (ix/loc/iloc/getitem) x index_types (Index,Float,Int,Datetime....) x kinds of indexing (slice/scalar)

So its better (meaning slightly cleaner), with the semantics we discussed above.

rename typ -> kind for _convert_*_indexer

add kind argument to _maybe_cast_slice_bound

cleaned up _convert_slice_indexer a bit
jreback added a commit that referenced this pull request Mar 4, 2015
API: consistency with .ix and .loc for getitem operations (GH8613)
@jreback jreback merged commit 8992129 into pandas-dev:master Mar 4, 2015
@jorisvandenbossche
Copy link
Member

Thanks a lot for this!

@shoyer
Copy link
Member

shoyer commented Mar 4, 2015

Indeed, greatly appreciated!

@immerrr
Copy link
Contributor

immerrr commented Mar 13, 2015

That's great that you guys (and especially @jreback) actually went through this 👍

@jreback
Copy link
Contributor Author

jreback commented Mar 13, 2015

trying to make indexing slightly less of a rabbit hole :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Bug Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: label-based slicing with not-included labels
4 participants