Skip to content

API: behaviour of label indexing with floats on integer index #12333

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jreback opened this issue Feb 15, 2016 · 21 comments
Closed

API: behaviour of label indexing with floats on integer index #12333

jreback opened this issue Feb 15, 2016 · 21 comments
Labels
API Design Compat pandas objects compatability with Numpy or Python functions Indexing Related to indexing on series/frames, not to indexes themselves
Milestone

Comments

@jreback
Copy link
Contributor

jreback commented Feb 15, 2016

from #12246

@jorisvandenbossche this looks odd

In [11]: s = pd.Series([1,2,3])

In [12]: s.loc[0.0] = 10

In [13]: s.loc[0.0]
TypeError: cannot do label indexing on <class 'pandas.indexes.range.RangeIndex'>
 with these indexers [0.0] of <type 'float'>
@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Compat pandas objects compatability with Numpy or Python functions labels Feb 15, 2016
@jreback jreback added this to the 0.18.0 milestone Feb 15, 2016
@jorisvandenbossche jorisvandenbossche changed the title DEPR: odd indexing with float indexers API: behaviour of label indexing with floats on integer index Feb 15, 2016
@jorisvandenbossche
Copy link
Member

I think there is clear agreement about the following:

  • when doing integer positional indexing, using floats are not allowed.
    Eg:

    In [22]: s = pd.Series([1,2,3])
    
    In [23]: s.iloc[0.0]
    TypeError: cannot do positional indexing on <class 'pandas.indexes.range.RangeIn
    dex'> with these indexers [0.0] of <type 'float'>
    
    In [24]: s2 = pd.Series([1,2,3], index=['a', 'b', 'c'])
    
    In [25]: s2[0.0]
    TypeError: cannot do label indexing on <class 'pandas.indexes.base.Index'> with
    these indexers [0.0] of <type 'float'>
    

    This was deprecated and now errors after Jeff's PR.

  • when doing label indexing on a non-numerical index (this is logical, but just to be complete)

There is less agreement on when floats should be allowed when doing label indexing on a numerical but not-float index (so integer index or RangeIndex) in case the float can be interpreted as an integer:

  • Previously, these were interpreted as integers, but raised a deprecation warning

  • In current master, when setting, the float will be coerced to an integer when possible:

    In [38]: s = pd.Series([1,2,3])
    
    In [39]: s[1.0] = 10
    
    In [40]: s
    Out[40]:
    0     1
    1    10
    2     3
    dtype: int64
    
  • But (also in current master), when getting a value, this still raises, also for coercables floats:

    In [42]: s[1.0]
    TypeError: cannot do label indexing on <class 'pandas.indexes.range.RangeIndex'>
    with these indexers [1.0] of <type 'float'>
    

    giving the strange contradiction between s[1.0] = 10 and s[1.0] raising

I think for setting, the rule is if the label you want to set evaluates as equal to one of the labels in the index, then it works without changing the index (1 == 1.0 and 1 != '1'). If it does not evaluates to equal, then it is seen as a new label and the index is expanded.

I would argue that the same rule should apply for getting as for setting.
On the other hand, setting is already more flexible (as for example s[1.5] = 10 is allowed and will upcast the index, while getting with this s[1.5] obviously raises).

cc @shoyer @TomAugspurger

@jreback
Copy link
Contributor Author

jreback commented Feb 15, 2016

The issue I have with @jorisvandenbossche last

I think for setting, the rule is if the label you want to set evaluates as equal to one of the labels in the index, then it works without changing the index (1 == 1.0 and 1 != '1'). If it does not evaluates to equal, then it is seen as a new label and the index is expanded.

as this would only apply to [], .ix and NOT .loc as .loc is strict about labels while the others can do positional AND/OR labels.

But since this in an integer index we are talking about we have the odd fallback where it is treated like a label (.ix), so this would be inconsistent then? (of course I may be misrepresenting things here as fallback is simply confusing).

@TomAugspurger
Copy link
Contributor

as this would only apply to [], .ix and NOT .loc as .loc is strict about labels while the others can do positional AND/OR labels.

We could say that a python dict is also strict about labels, but {1: 'one'}[1.0] goes through fine.

I think I'm in agreement with Joris; haven't though how it affects fallback indexing entirely yet.

@jorisvandenbossche
Copy link
Member

AFAIK, integer indexes don't have any fallback to positional, and the indexer is always interpreted as label. So I don't see how this could mess up with the fallback indexing.

.loc is strict about labels while the others can do positional AND/OR labels.

I don't see why that is a reason to make .loc have different behaviour, as we are speaking here only about label indexing, not positional.

@jorisvandenbossche
Copy link
Member

You also have this discrepancy:

In [71]: 1.0 in s.index
Out[71]: True

In [72]: s[1.0]
TypeError: cannot do label indexing on <class 'pandas.indexes.numeric.Int64Index
'> with these indexers [1.0] of <type 'float'>

(to be honest, it would only be a full discrepancy if this gave a "KeyError 1.0 not found in index")

@jreback
Copy link
Contributor Author

jreback commented Feb 16, 2016

ok, so @jorisvandenbossche what do you think should work then. give me an example for setting/getting with the various indexers.

@shoyer
Copy link
Member

shoyer commented Feb 16, 2016

I agree with @jorisvandenbossche, as I stated in the last PR:

  • __setitem__ and __getitem__ should be entirely consistent
  • We should use equality (like the Python dict) to determine whether or not a key matches
  • If inserting a key that doesn't exist, then the index type is upcast if necessary

@jreback
Copy link
Contributor Author

jreback commented Feb 16, 2016

so @shoyer then essentially you want to have full support back for float indexers? (excluding .iloc)

thats what your conditions imply.

@shoyer
Copy link
Member

shoyer commented Feb 16, 2016

@jreback Yes, I guess so. If pandas were stricter about not upcasting types (or even better, if we did not support reindexing with .loc at all), then it could make sense to make the distinction. But as is, I think we should be OK with float indexers. Obviously float indexers that don't match an integer will still result in KeyError, though

@jreback
Copy link
Contributor Author

jreback commented Feb 17, 2016

I suppose this means we are reverting this as well.

In [1]: s = Series(range(3))

In [2]: s
Out[2]: 
0    0
1    1
2    2
dtype: int64

In [3]: s[1.0:2]
TypeError: cannot do slice start value indexing on <class 'pandas.indexes.range.RangeIndex'> with these indexers [1.0] of <type 'float'>

and have it return equiv of s[1:2] (and allow setting).

Note that this will only apply to integer-like indices as string index will still raise TypeError.

@jorisvandenbossche
Copy link
Member

I suppose this means we are reverting this as well.

In [3]: s[1.0:2]
TypeError: cannot do slice start value indexing on <class 'pandas.indexes.range.RangeIndex'>

No, that should still raise, as this is positional indexing (alas ..):

In [84]: s = pd.Series([1,2,3,4], index=range(1,5))

In [85]: s
Out[85]:
1    1
2    2
3    3
4    4
dtype: int64

In [86]: s[1:3]
Out[86]:
2    2
3    3
dtype: int64

In [87]: s[1.0:3]
TypeError: cannot do slice start value indexing on <class 'pandas.indexes.numeri
c.Int64Index'> with these indexers [1.0] of <type 'float'>

@jorisvandenbossche
Copy link
Member

But you are correct for the case of slicing in .ix (and .loc). But strangely this does not raise on master for me at the moment:

In [88]: s.ix[1.0:3]
Out[88]:
1    1
2    2
3    3
dtype: int64

In [89]: pd.__version__
Out[89]: '0.18.0rc1+10.gcac5f8b'

@jreback
Copy link
Contributor Author

jreback commented Feb 17, 2016

that last has always been allowed.

@jorisvandenbossche
Copy link
Member

@jreback I don't really understand the confusion between us. My main issue is (here and in https://github.com/pydata/pandas/pull/12370/files#r53281590): I don't see why we should have differences in behaviour between scalar indexing and slicing (with regard to labels being found or not). It should just be:

  • positional: floats are disallowed, only integers
  • label-based: we use equality to see if a label matches

This last rule is then applicable to loc/ix/[] for scalar indexing (as these are all label based), and to loc/ix for slicing (as [] is positional and not label-based in the case of slicing, and so should follow the positional rule).

Why would this last one be different for loc?
If you find a matching label with s.loc[1.0], why should this exact same label not match in s.loc[1.0:2] ?

@jreback
Copy link
Contributor Author

jreback commented Feb 18, 2016

the confusion is we had agreement on disallowing floats in indexers except for float indices

which is s very simple rule

now we are allowing them all over the place essentially going back to where we were

creating another indexing mess where nothing is simple

@jorisvandenbossche
Copy link
Member

now we are allowing them all over the place essentially going back to where we were
creating another indexing mess where nothing is simple

And I am sorry for that, as you are doing the hard work in dealing with the complex indexing code, and then changing behaviour every time is not making it easier

the confusion is we had agreement on disallowing floats in indexers except for float indices
which is s very simple rule

OK, I agree that that is indeed a clear and simple rule. However, I personally think the rule should be how @shoyer formulated it: "We should use equality to determine whether or not a key matches".
I understand that you are maybe in favor of the "disallow all floats except for FloatIndex" rule (is that the case?).
But if we accept the equality-rule (which I thought you did, as in the merged PR #12246, you applied it for the setitem case), I think this rule should be entirely consistent for both setitem and getitem, for both scalar indexing as slicing, for both ix and loc.

It's from that point of vue (accepting the equality-rule), that I don't see why positional based slicing in loc should be different.

@TomAugspurger
Copy link
Contributor

I want to echo @jorisvandenbossche thanks for all your work on this. I'm still too scared to open up core/indexing.py 😄

The rules of

  • equality for key / label indexing (like python's dict)
  • strict (integer) for location indexing (like where NumPy's going)
  • symmetry between getitem and setitem

are the best.

@jorisvandenbossche
Copy link
Member

Hmm, the loc slicing case is a bit more complicated than I thought ... I didn't think about the case where the float does not match one of the labels, because that is currently allowed in slicing

(Using 0.17.1 here for the examples) Assuming we follow the rules as @shoyer summarised them above (#12333 (comment)), accessing one value with a float that matches should work:

In [1]: s = Series(range(3))

In [2]: s
Out[2]:
0    0
1    1
2    2
dtype: int64

In [4]: s.loc[1.0]
Out[4]: 1

Then I would say, slicing with matching items should also work:

In [5]: s.loc[1.0:2]
Out[5]:
1    1
2    2
dtype: int64

But what to do with slicing with a float that does not match one of the labels in the index? When slicing with integers, the slice values don't need to be contained in the index, eg:

In [6]: s.loc[1:10]
Out[6]:
1    1
2    2
dtype: int64

So should this also work with floats? And currently, this also works with floats with a fractional part, eg:

In [7]: s.loc[0.5:2.5]
Out[7]:
1    1
2    2
dtype: int64

So in the slicing case, there is no equality required (only an ability to do searchsorted?), in the case of a monotonic index. So the defined rules do not really cover this case ..

@jorisvandenbossche
Copy link
Member

Slicing with floats in on an integer index is actually (now I remember) deliberately implemented, and even extended to work with decreasing monotonic indexes by @shoyer (#8680) even after the initial deprecation for float indexers was put in place.

This also did not raise a deprecation warning:

In [1]: pd.__version__
Out[1]: u'0.17.1'

In [2]: s = Series(range(3))

In [3]: s.loc[1.0:2.5]
Out[3]:
1    1
2    2
dtype: int64

while the above now raises in master. So @jreback even if we decide that we don't want this behaviour any longer, we should at least first deprecate it IMO.

@jreback
Copy link
Contributor Author

jreback commented Feb 20, 2016

I don't mind fixing it to be the right API, but I don't agree with @shoyer at all here. The indexing should strictly depend on the index type. I don't see why a float should be coereced to integer if they happen to be equal.

I heare preaching all the time about how data shouldn't determine indexing. Well isn't that exactly what you are doing here?

We should not allow float indexer/slicers at all except with Float64Index, full-stop. This is a natural and simple rule. Everything else leads to confusion, special cases and whatnot.

A python dictionary and numpy are just too simple for this case. They don't have to simultaneously deal with different TYPES of indexers and they only care about strict label matching (dicts) or positional (numpy).

To be honest I don't care what the rules are for .ix, whatever goes is fine, its already pretty screwed up. However for .loc we have some very strict rules. Every case above you mentioned is bending them some more, pretty soon this just be another .ix.

Pandas must deal with both.

I think this is getting pretty lost in the fact that a user should have really well laid out rules for what is expected of indexing. They by definition HAVE to be simple. They are already way way too complex and special cased.

@shoyer
Copy link
Member

shoyer commented Feb 20, 2016

This discussion reminds me of the one we had back in #8613.

We ended resolving this by not making a distinction between floats and integers for label based indexing: #9566

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Compat pandas objects compatability with Numpy or Python functions Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants