`Series.__getitem__` semantics vary according to runtime values #9213

dandavison · 2015-01-08T12:14:10Z

Depending on the runtime value used in [] indexing on a Series, the semantics may be positional (array-style) indexing or label-based slicing on a MultiIndex. This is extremely error-prone behavior in a very basic operation.

Is the answer that production code should always use .loc/.ix instead of []? If so, I don't think the docs make that clear.

>>> import pandas
>>> s = pandas.Series({(1, 1): 11, (1, 2): 12, (2, 1): 21, (2, 2): 22})
>>> s
1  1    11
   2    12
2  1    21
   2    22
dtype: int64

# 0 isn't recognized as something to do with the index,
# so treat it as an array index.
>>> s[0]
11

# But this could be asking for a slice,
# so return a slice rather than an array-indexed element.
>>> s[1]
1    11
2    12
dtype: int64

>>> pandas.__version__
'0.15.2'

See also #3390

jreback · 2015-01-09T02:38:17Z

You should simply use .loc/.iloc to be very explicit (and avoid .ix, which does fallback indexing and is even more confusing). [] tries to do the right thing, but in edge and ambiguous cases that is not always possible.

If you would like to propose an addition to the docs that you think is clearer, I'm all ears.
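
A minimal sketch of the explicit spelling, reusing the Series from the report. The MultiIndex is built with pd.MultiIndex.from_tuples here for clarity (the report builds it from a dict with tuple keys), and the variable names are only illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([(1, 1), (1, 2), (2, 1), (2, 2)])
s = pd.Series([11, 12, 21, 22], index=index)

# Positional: always the first element, whatever the labels are.
first = s.iloc[0]

# Label-based: all rows whose first index level equals 1.
group = s.loc[1]

# Label-based with a full key: a single element.
item = s.loc[(1, 1)]

# A label that does not exist raises instead of falling back to position.
try:
    s.loc[0]
except KeyError:
    missing_label_raises = True
else:
    missing_label_raises = False
```

With .loc and .iloc the caller states the intent, so the result no longer depends on whether the runtime value happens to match a label.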

ischwabacher · 2015-01-29T18:24:50Z

It's frustrating also that there's no concise way to index df.ix[:-1,'a'] such that the operation succeeds when df.index is an Int64Index, especially when you want to write to this subobject, because the obvious solution df.iloc[:-1].loc[:,'a'] is a chained index.

The only spelling I can think of for this is df.iloc.loc[:-1,'a'], which seems kind of wacky but might be doable. At least you can assign to this without having to call __getitem__.

jreback · 2015-01-30T14:48:05Z

@ischwabacher

In [9]: df = DataFrame(np.arange(10).reshape(5,2),columns=list('ab'))

In [10]: df
Out[10]: 
   a  b
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9

In [11]: df.loc[df.index[:-1],'a'] *= 2

In [12]: df
Out[12]: 
    a  b
0   0  1
1   4  3
2   8  5
3  12  7
4   8  9

I think the indexing is already quite confusing. I suppose we could add a flag, something like

df.ix(strict=False)[:-1,'a'] to do this. It would not be that difficult and would keep API compatibility.
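
For what it's worth, the mixed positional-row, label-column write can also be spelled as a single .iloc call without any new flag, by converting the column label to a position first. A sketch, reusing the frame from In [9]; Index.get_loc is existing pandas API, while the name col_pos is just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=list("ab"))

# Translate the column label to its integer position once...
col_pos = df.columns.get_loc("a")

# ...then index rows and columns positionally in one (non-chained) call.
df.iloc[:-1, col_pos] *= 2
```

This should leave column a as 0, 4, 8, 12, 8, the same result as the df.loc[df.index[:-1],'a'] spelling, and it does not depend on the dtype of the row index.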

ischwabacher · 2015-01-30T16:30:07Z

I guess that works, even though it feels inefficient (said the premature optimizer!). But for the strict=False idea, what does this do:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(index=[0, 3, 1, 2], columns=['a', 'b'])

In [3]: df
Out[3]: 
     a    b
0  NaN  NaN
3  NaN  NaN
1  NaN  NaN
2  NaN  NaN

In [4]: df.ix(strict=False)[:2,'a']
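
As a point of reference for that question, here is how the two existing explicit indexers treat :2 on the same unsorted integer index. This is only a sketch: the strict=False flag is a proposal and does not exist, so only .loc and .iloc are shown.

```python
import pandas as pd

df = pd.DataFrame(index=[0, 3, 1, 2], columns=["a", "b"])

# Label-based: from the start through the row labelled 2, inclusive.
# Label 2 is the last row here, so every row is selected.
by_label = df.loc[:2, "a"]

# Position-based: the first two rows, end-exclusive, like a Python list.
by_position = df.iloc[:2, df.columns.get_loc("a")]
```

The label-based slice returns the rows labelled 0, 3, 1, 2, while the positional slice returns only the rows labelled 0 and 3, so any fallback rule has to pick one of these two answers.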

hughesadam87 · 2015-02-27T00:40:20Z

@dandavison

Looking at this more carefully, it seems you've indeed stumbled on the ambiguity that occurs when your indices are numeric types. When you do s[0] there is no label equal to 0, so pandas looks up by position. When you do s[1], because 1 is actually being used to label your Series, it looks up by label. When I spoke to you, I didn't realize this is the same issue I encountered early on when working with spectral data in pandas. Soon thereafter, Jeff introduced .loc and .iloc and a nice guide to indexing and selecting data in pandas (http://pandas.pydata.org/pandas-docs/dev/indexing.html). To avoid this problem, you might consider using non-numeric labels instead of integers. However, if you continue to use numeric labels, then as Jeff said, always use .iloc and .loc for positional and label slicing, respectively.

@jreback @shoyer .iloc and .loc have been around long enough that most of the userbase is probably comfortable with them. Would it be too aggressive to say that item slicing (__getitem__) is deprecated? Would it be possible to at least raise a warning or even an error that forbids users to call __getitem__ on objects with numeric labels? Would you consider this a corner case for the most part? All my data is numerically labeled, so I'm biased.

When new users learn pandas, they are instinctively going to slice with [], and since this usually outputs the desired result, they're not going to anticipate this corner case, even if they've heard of .loc and .iloc. When I teach undergrads pandas slicing, I hammer this point home now because it is such a pitfall.

In regard to the less extreme suggestion of changing the docs, I think the simplest change would be to introduce .iloc and .loc earlier in the docs. Their current order is categorically well-designed, but human beings are predisposed to pay more attention to what comes first.

shoyer · 2015-02-27T09:05:21Z

I only use __getitem__ for two use-cases that I know are safe:

1. Getting columns by label from a frame, by supplying either a single column name or a list of column names.
2. Indexing with a boolean array of the same length as the frame/series.
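
A short sketch of those two use cases, on a small made-up frame with string column labels (the data and variable names are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# (1) Columns by label: a single name gives a Series, a list gives a DataFrame.
col = df["a"]
sub = df[["a", "b"]]

# (2) Boolean mask with one entry per row: keeps the rows where it is True.
mask = df["a"] > 1
filtered = df[mask]
```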

I agree, the current indexing semantics are a land-mine. Now that I think about it, I'm not even sure (1) is safe if my column labels are integers, though I guess I usually use strings.

It would definitely be a good idea to mention .loc and .iloc as preferred options in the docs -- if you have concrete suggestions, please make a PR for that.

We might also consider making a thorough list of what __getitem__ behavior could be deprecated. Though honestly, it's hard to see how that could be done smoothly without breaking lots of code.

jorisvandenbossche · 2015-02-27T09:10:18Z

In any case, good __getitem__ documentation with a clear description of what it does in all cases is missing (the docs are quite thorough on ix/loc/iloc etc., but not on []). I started on this in #9316, but still have to add __getitem__. Yesterday I was also making an overview of all possible cases; I will open a new issue about this.

jorisvandenbossche · 2015-02-27T09:12:04Z

Your case (1) is safe in any case, I think, as this is always label-based (so with an integer column axis it also works on the labels, not on integer location).
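
To illustrate the point about integer column labels, a minimal sketch with a made-up frame whose column labels are deliberately out of positional order:

```python
import pandas as pd

# Column labels are the integers 1 and 0, in that order.
df = pd.DataFrame([[10, 20], [30, 40]], columns=[1, 0])

by_label = df[0]              # the column *labelled* 0, i.e. the second one
by_position = df.iloc[:, 0]   # the first column, which is labelled 1
```

Here df[0] picks the column by its label, and the positional lookup has to be requested explicitly through .iloc.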

dandavison · 2015-02-27T18:24:23Z

> Would it be too aggressive to say that item slicing (__getitem__) is deprecated?

> We might also consider making a thorough list of what __getitem__ behavior could be deprecated. Though honestly, it's hard to see how that could be done smoothly without breaking lots of code.

Python developers new to pandas are always going to use [] unless it raises an exception. If pandas is going to become a widely used Python library in production code, then I think it's essential that [] at least has the same semantics for all inputs. In general I really appreciate pandas, but I don't feel comfortable recommending it for use in my organization with the current behavior.

hughesadam87 · 2015-02-27T19:01:29Z

To bikeshed further, if one had to choose between index (positional) and value (label) slicing, what do you guys think [] should return? While I believe value slicing is more useful, index lookup is the default for most applications, especially numpy. It might be easy to just stick with the mantra that [] is identical to numpy. That would be easy for most users to digest, and then one could just refer to the numpy docs. Of course this would break a lot of backwards compatibility and require serious refactoring, so it's obviously easier said than done.

@dandavison While I don't speak for the pandas development team, I can say from experience that this particular behavior is not indicative of my overall personal experiences with pandas, and the library does have a lot to offer. Hopefully your team will keep a foot in the door so-to-speak. As someone with moderate experience working with numpy, IPython and matplotlib, workflows with pandas usually feel more natural than those without. Of course I don't have the constraint of writing production quality workflows :)

jreback added the Docs and Indexing labels Jan 9, 2015

jreback added the API Design label Jan 30, 2015

This was referenced Mar 1, 2015

API: consistency with .ix and .loc for getitem operations (GH8613) #9566

Merged

Overview of [] (__getitem__) API #9595

Open

jreback mentioned this issue Mar 20, 2017

Proposal to change behaviour with .loc and missing keys #15747

Closed

mroeschke removed the API Design label Apr 12, 2021

mroeschke added the MultiIndex label Sep 12, 2024

mroeschke changed the title ~~Indexing semantics vary according to runtime values~~ Series.__getitem__ semantics vary according to runtime values Sep 12, 2024
