Series.__getitem__ semantics vary according to runtime values #9213

Open
dandavison opened this issue Jan 8, 2015 · 10 comments
Labels
Docs Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex

Comments

@dandavison

Depending on the runtime value used in [ indexing on a series, the semantics may be array-indexing, or slicing based on a MultiIndex. This is extremely error-prone behavior in a very basic operation.

Is the answer that production code should always use .loc/.ix instead of [? If so, I don't think the docs make that clear.

>>> s = pandas.Series({(1, 1): 11, (1, 2): 12, (2, 1): 21, (2, 2): 22})

>>> s

1  1    11
   2    12
2  1    21
   2    22
dtype: int64

# 0 isn't recognized as something to do with the index,
# so treat it as an array index
>>> s[0]  
    11

# But this could be asking for a slice,
# so return a slice rather than an array-indexed element.
>>> s[1] 

1    11
2    12
dtype: int64

>>> pandas.__version__
    '0.15.2'

See also #3390
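
For comparison, here is a minimal sketch of the same lookups written with the explicit indexers, whose behavior does not depend on whether the integer happens to be a label. It uses only `.loc` (label-based) and `.iloc` (position-based) on the series from the report. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series({(1, 1): 11, (1, 2): 12, (2, 1): 21, (2, 2): 22})

# Position-based: always the first stored element, whatever the labels are.
first = s.iloc[0]

# Label-based: always the sub-series under first-level label 1.
block = s.loc[1]

# Label-based with a full key: a single element.
single = s.loc[(1, 2)]
```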

@jreback
Contributor

jreback commented Jan 9, 2015

You should simply use .loc/.iloc to be very explicit (and avoid .ix, which does fallback indexing and is even more confusing). [] tries to do the right thing, but in edge and ambiguous cases that is not always possible.

If you would like to propose an addition to the docs that you think is clearer, I'm all ears.

@jreback jreback added Docs Indexing Related to indexing on series/frames, not to indexes themselves labels Jan 9, 2015
@ischwabacher
Contributor

It's frustrating also that there's no concise way to index df.ix[:-1,'a'] such that the operation succeeds when df.index is an Int64Index, especially when you want to write to this subobject, because the obvious solution df.iloc[:-1].loc[:,'a'] is a chained index.

The only spelling I can think of for this is df.iloc.loc[:-1,'a'], which seems kind of wacky but might be doable. At least you can assign to this without having to call __getitem__.

@jreback
Contributor

jreback commented Jan 30, 2015

@ischwabacher

In [9]: df = DataFrame(np.arange(10).reshape(5,2),columns=list('ab'))

In [10]: df
Out[10]: 
   a  b
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9

In [11]: df.loc[df.index[:-1],'a'] *= 2

In [12]: df
Out[12]: 
    a  b
0   0  1
1   4  3
2   8  5
3  12  7
4   8  9
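
The same write can also be spelled purely by position, as a sketch: `Index.get_loc` translates the column label into an integer position, so a single `.iloc` call does the assignment without chained indexing. The frame mirrors the one above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=list("ab"))

# Translate the column label 'a' into its integer position.
col = df.columns.get_loc("a")

# All rows except the last, selected by position, in one indexing operation.
df.iloc[:-1, col] *= 2
```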

I think the indexing is already quite confusing. I suppose we could add a flag, something like

df.ix(strict=False)[:-1,'a'] to do this. Not that difficult and would have API compat.

@ischwabacher
Contributor

I guess that works, even though it feels inefficient (said the premature optimizer!). But for the strict=False idea, what does this do:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame(index=[0, 3, 1, 2], columns=['a', 'b'])

In [3]: df
Out[3]: 
     a    b
0  NaN  NaN
3  NaN  NaN
1  NaN  NaN
2  NaN  NaN

In [4]: df.ix(strict=False)[:2,'a']
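
For reference, here is a sketch of how the existing explicit indexers treat `:2` on an index ordered like this one. The `strict=False` flag is only a proposal in this thread, so it is not used here. The column values are made up so the selected rows are easy to tell apart.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [10, 11, 12, 13], "b": [20, 21, 22, 23]},
    index=[0, 3, 1, 2],
)

# Label-based: every row up to and including the one labeled 2,
# in the order the rows are stored.
by_label = df.loc[:2, "a"]

# Position-based: the first two stored rows, regardless of their labels.
by_position = df.iloc[:2, df.columns.get_loc("a")]
```

The two selections differ because the row labeled 2 is stored last, which is the ambiguity a non-strict mode would have to resolve one way or the other.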

@hughesadam87

@dandavison

Looking at this more carefully, it seems you've indeed stumbled on the ambiguity that occurs when your indices are numeric types. When you do s[0] there is no label equal to 0, so pandas looks up by index (position). When you do s[1], because 1 is actually being used to label your Series, it looks up by value (label). When I spoke to you, I didn't realize this is the same issue I encountered early on when working with spectral data in pandas. Soon thereafter, Jeff introduced .loc and .iloc and a nice guide to indexing and selecting data in pandas (http://pandas.pydata.org/pandas-docs/dev/indexing.html). To avoid this problem, you might consider using non-integer labels instead of integers. However, if you continue to use numeric labels, then as Jeff said, always use .iloc and .loc for index and label slicing, respectively.
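
A small sketch of that last recommendation, using a made-up integer-labeled series whose labels deliberately disagree with the storage order:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[0]      # the element whose label is 0
by_position = s.iloc[0]  # the element stored first
```

With the explicit indexers the two lookups give different elements for the same integer, and neither depends on which labels happen to exist.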

@jreback @shoyer .iloc and .loc have been around long enough that most of the userbase is probably comfortable with them. Would it be too aggressive to say that item slicing (__getitem__) is deprecated? Would it be possible to at least raise a warning, or even an error, that forbids users from calling __getitem__ on objects with numeric labels? Would you consider this a corner case for the most part? All my data is numerically labeled, so I'm biased.

When new users learn pandas, they are instinctively going to slice with [ ], and since this usually outputs the desired result, they're not going to anticipate this corner case, even if they've heard of .loc and .iloc. When I teach undergrads pandas slicing, I hammer this point home now because it is such a pitfall.

In regard to the less extreme suggestion of changing the docs, I think the simplest change would be to introduce .iloc and .loc earlier in the docs. Their current order is categorically well-designed, but human beings are predisposed to pay more attention to what comes first.

@shoyer
Member

shoyer commented Feb 27, 2015

I only use __getitem__ for two use-cases that I know are safe:

  1. Getting columns by label from a frame, by supplying either a single column name or a list of column names.
  2. Indexing with a boolean array of the same length as the frame/series.
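
As a sketch, the two use-cases look like this on a small made-up frame with string column labels:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# (1) Columns by label: a single name gives a Series, a list gives a DataFrame.
one_column = df["a"]
two_columns = df[["a", "c"]]

# (2) Boolean array of the same length as the frame: keeps the True rows.
mask = df["a"] > 1
filtered = df[mask]
```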

I agree, the current indexing semantics are a land-mine. Now that I think about it, I'm not even sure (1) is safe if my column labels are integers, though I guess I usually use strings.

It would definitely be a good idea to mention .loc and .iloc as preferred options in the docs -- if you have concrete suggestions, please make a PR for that.

We might also consider making a thorough list of what __getitem__ behavior could be deprecated. Though honestly, it's hard to see how that could be done smoothly without breaking lots of code.

@jorisvandenbossche
Member

In any case, good __getitem__ documentation with a clear description of what it does in all cases is missing (the docs are quite thorough on ix/loc/iloc etc., but not on []). I started on this in #9316, but still have to add __getitem__. Yesterday I was also making an overview of all possible cases; I will open a new issue about this.

@jorisvandenbossche
Member

Your case (1) is safe in any case, I think, as this is always label based (so with an integer column axis it also works on the labels, not on integer location).
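
A quick sketch of that claim, using a made-up frame whose integer column labels are deliberately in the opposite order to their positions:

```python
import pandas as pd

# Column labeled 1 is stored first, column labeled 0 is stored second.
df = pd.DataFrame([[1, 2], [3, 4]], columns=[1, 0])

# A scalar key on a DataFrame selects the column with that label.
col = df[0]
```

Here `df[0]` picks the column labeled 0 (the second one stored), not the column at position 0.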

@dandavison
Author

> Would it be too aggressive to say that item slicing (__getitem__) is deprecated?

> We might also consider making a thorough list of what __getitem__ behavior could be deprecated. Though honestly, it's hard to see how that could be done smoothly without breaking lots of code.

Python developers new to pandas are always going to use [ unless it raises an exception. If pandas is going to become a widely-used python library in production code then I think that it's essential that [ at least has the same semantics for all inputs. In general I really appreciate pandas, but I don't feel comfortable recommending it for use in my organization with the current behavior.

@hughesadam87

To bikeshed further, if one had to choose between index or value slicing, what do you guys think the [] should return? While I believe value slicing is more useful, index lookup is the default for most applications, especially numpy. It might be easy to just stick with the mantra [ ] is identical to numpy. That would be easy for most users to digest, and then one could just refer to the numpy docs. Of course this would break much backwards compat. and require serious refactoring, so it's obviously easier said than done.

@dandavison While I don't speak for the pandas development team, I can say from experience that this particular behavior is not indicative of my overall personal experiences with pandas, and the library does have a lot to offer. Hopefully your team will keep a foot in the door so-to-speak. As someone with moderate experience working with numpy, IPython and matplotlib, workflows with pandas usually feel more natural than those without. Of course I don't have the constraint of writing production quality workflows :)

@mroeschke mroeschke changed the title Indexing semantics vary according to runtime values Series.__getitem__ semantics vary according to runtime values Sep 12, 2024