MultiIndex row indexing with .loc fail with tuple but work with list of indices #16943

Closed
mansenfranzen opened this issue Jul 15, 2017 · 15 comments
Labels
Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex

Comments

@mansenfranzen

mansenfranzen commented Jul 15, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd

data = {"ID1": [1, 1, 1, 2, 2],
        "ID2": [1001, 1001, 1002, 1001, 1002],
        "ID3": [1, 2, 1, 1, 2],
        "Value": [1, 2, 9, 3, 4]}

df = pd.DataFrame(data).set_index(["ID1", "ID2", "ID3"])
desired_rows = ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2)) # the rows to be extracted

print(df)

Out[3]: 
              Value
ID1 ID2  ID3       
1   1001 1        1
         2        2
    1002 1        9
2   1001 1        3
    1002 2        4

Problem description

Extracting the desired rows with loc fails here: it returns only one row, (1, 1001, 2), instead of all three:

In [5]: df.loc[desired_rows, :]
Out[5]: 
              Value
ID1 ID2  ID3       
1   1001 2        2

Expected Output

One solution would be to convert the tuple to a list internally, because a list of indices works correctly:

In [6]: df.loc[list(desired_rows), :]
Out[6]: 
              Value
ID1 ID2  ID3       
1   1001 1        1
         2        2
2   1002 2        4

Another solution would be to raise an error if a tuple of indices is passed as the row indexer to loc, in order to prevent unexpected results.
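As a user-side workaround in the meantime, the conversion to a list can be done before calling loc. The following is only a minimal sketch: `select_rows` is a hypothetical helper name, not part of pandas, and it assumes every key is a complete label (one value per index level).

```python
import pandas as pd

data = {"ID1": [1, 1, 1, 2, 2],
        "ID2": [1001, 1001, 1002, 1001, 1002],
        "ID3": [1, 2, 1, 1, 2],
        "Value": [1, 2, 9, 3, 4]}
df = pd.DataFrame(data).set_index(["ID1", "ID2", "ID3"])


def select_rows(frame, keys):
    """Hypothetical helper: select complete MultiIndex keys from any iterable of tuples."""
    # A list of tuples is read by loc as a list of complete row labels,
    # so the outer container is normalised to a list first.
    return frame.loc[list(keys), :]


desired_rows = ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2))
selected = select_rows(df, desired_rows)
print(selected)
```

Only the outer container is converted; the inner tuples stay tuples, since those are what identify the individual rows.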

Output of pd.show_versions()

In [8]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.8.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

One solution would be to convert the tuple to a list

I suspect this would break other tests, where tuples have different meanings than lists for slicing. I may be wrong though, if you want to give it a shot.

@TomAugspurger TomAugspurger added Difficulty Intermediate Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Jul 15, 2017
@TomAugspurger TomAugspurger added this to the Next Major Release milestone Jul 15, 2017
@jreback
Contributor

jreback commented Jul 15, 2017

this technically is not covered by the doc-string

- A list or array of labels, e.g. ['a', 'b', 'c'].

but we almost always accept array-likes (which include tuples). The reason this is slightly confusing is that a non-nested tuple is also valid as a single indexer.

@cbrnr
Contributor

cbrnr commented Feb 1, 2018

I came across a slightly related issue: using a multi-index dataframe, why can I only use a tuple as an indexer and not a list (i.e. why do they give different results)?

Using the example data, if I want to pull out rows where ID1=1 and ID2=1001, I can only use a tuple inside loc:

df.loc[(1, 1001)]

This returns the desired slice:

     Value
ID3       
1        1
2        2

I can't use a list:

 df.loc[[1, 1001]]

This seems to imply that I want values 1 and 1001 for the first level of the index only:

              Value
ID1 ID2  ID3       
1   1001 1        1
         2        2
    1002 1        9

It took me quite some time to figure this out. Is this intended behavior? If yes, is this documented (I thought it should be mentioned here but didn't find anything)?

@jorisvandenbossche
Member

@cbrnr Yes, that is intended behaviour. For single "labels" of a MultiIndex (so one value for each level), we always use tuples and not a list, because it would otherwise be difficult to distinguish. I think for this case we are quite consistent within pandas.
It is the other way around (where we expect a list-like, do we accept a tuple?) that there can be more discussion. Typically we allow tuples as list-likes, but exactly for the reason above (tuples are used to indicate labels of a MI) we might not want to do that in the case of the original issue here.

So your assessment is correct: it tries to look for those values of the list in the first index level. You could interpret the list as "give me the combination of indexing the dataframe with each element of the list", so df.loc[1] and df.loc[1001]: in both cases you select rows based on the first index level.
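To make the distinction concrete, here is a small sketch on the example data from the original post. It deliberately uses only labels that exist on the first level (1 and 2) for the list case, on the assumption that newer pandas versions may raise a KeyError when a list indexer contains a missing label such as 1001.

```python
import pandas as pd

data = {"ID1": [1, 1, 1, 2, 2],
        "ID2": [1001, 1001, 1002, 1001, 1002],
        "ID3": [1, 2, 1, 1, 2],
        "Value": [1, 2, 9, 3, 4]}
df = pd.DataFrame(data).set_index(["ID1", "ID2", "ID3"])

# Tuple: one (partial) label, read level by level: ID1 == 1 and ID2 == 1001.
# The matched levels are dropped, leaving ID3 as the index.
by_tuple = df.loc[(1, 1001), :]

# List: several labels, all looked up on the first level: ID1 in {1, 2}.
# All three index levels are kept.
by_list = df.loc[[1, 2], :]

print(by_tuple)
print(by_list)
```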

@jorisvandenbossche
Member

For the original issue: given the possible confusion between the two, I think it might be better in this case to not interpret the tuple as a list-like.
But, in that case, shouldn't it raise an error? As if we interpret ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2)) as a single label, it should not find it?

cc @toobaz interesting case :-)

@cbrnr
Contributor

cbrnr commented Feb 1, 2018

Thanks @jorisvandenbossche, this makes sense! I usually don't distinguish between lists and tuples in plain Python since they are both list-like objects. So this Pandas behavior tripped me up a bit - is this documented clearly somewhere?

@jorisvandenbossche
Member

But, in that case, shouldn't it raise an error? As if we interpret ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2)) as a single label, it should not find it?

Ah, no, I suppose this is wrong. It seems that it does interpret ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2)) as a list: not as a list of labels, but as a list of lists (each inner one being a list of indexers into one level).

So it is indexing as:

In [21]: df.loc[pd.IndexSlice[[1, 1001, 1], [1, 1001, 2], [2, 1002, 2]], :]
Out[21]: 
              Value
ID1 ID2  ID3       
1   1001 2        2

To make it more confusing: strangely, the actual list of lists (df.loc[[[1, 1001, 1], [1, 1001, 2], [2, 1002, 2]], :]) does not work in this case but raises an error that " '[1, 1001, 1]' is an invalid key". So the list of lists is interpreted as a list of tuples (list of labels).
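The two readings can be written out explicitly, which avoids relying on how a bare nested tuple is interpreted. This is a sketch on the example data, using only labels that exist on each level; the names `per_level` and `complete_keys` are illustrative.

```python
import pandas as pd

data = {"ID1": [1, 1, 1, 2, 2],
        "ID2": [1001, 1001, 1002, 1001, 1002],
        "ID3": [1, 2, 1, 1, 2],
        "Value": [1, 2, 9, 3, 4]}
df = pd.DataFrame(data).set_index(["ID1", "ID2", "ID3"])

idx = pd.IndexSlice

# Tuple of lists: one list of allowed values per level.
# Selects rows where ID1 in {1, 2} and ID2 == 1001 and ID3 == 1.
per_level = df.loc[idx[[1, 2], [1001], [1]], :]

# List of tuples: each tuple is one complete row label.
complete_keys = df.loc[[(1, 1001, 1), (2, 1002, 2)], :]

print(per_level)
print(complete_keys)
```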

@jorisvandenbossche
Member

I usually don't distinguish between lists and tuples in plain Python since they are both list-like objects. So this Pandas behavior tripped me up a bit - is this documented clearly somewhere?

Yes, this is one of the gotchas due to the complexity of MultiIndexing: we somehow need to distinguish between the two.
And the documentation can certainly be better about those things. But in general this is also an area where we would need more extensive testing of the different cases, and then better documentation of those cases (e.g. see my comment above: even for me it is difficult to really predict how something will be interpreted in certain cases).

@cbrnr
Contributor

cbrnr commented Feb 1, 2018

I could add a statement that tuples are needed in the case of multiple indexers on a multiindex: http://pandas.pydata.org/pandas-docs/stable/advanced.html#using-slicers. Not a warning box, but maybe there's a note or an info box? Or is there a better place to put such a note? Let me know and I can take care of that in a PR.

@jorisvandenbossche
Member

Ah, so we are actually using tuples there in the docs :-) So I just might have had the wrong assumption that a list would work (regarding the last part of my comment above, #16943 (comment)).
But yes, adding a note there that those multiple indexers need to be contained in a tuple is a good idea (and using the IndexSlice makes it even more explicit).

@toobaz
Member

toobaz commented Feb 1, 2018

I could add a statement that tuples are needed in the case of multiple indexers on a multiindex:

Great idea! I think such a statement should actually go at the beginning of http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-indexing-with-hierarchical-index

You could maybe start by stating that MultiIndex keys take the form of tuples, then you could swap the first two examples currently provided (move the complete indexing one first), then introduce partial indexing, and mention that when doing partial indexing on the first level, you are allowed to only pass the first element of the tuple ('bar' stands for ('bar',)). Finally, I think a warning box could then clarify that (for the reasons above) tuples and lists are not equivalent in pandas, and in particular, tuples should not be used as lists of keys (for MultiIndexes, and not only).

You might want to show examples of the fact that lists of tuples in general refer to multiple complete (MultiIndex) keys, while tuples of lists in general refer to multiple values on each level, that is something like

In [2]: s = pd.Series(-1, index=pd.MultiIndex.from_product([[1, 2], [3, 4]]))

In [3]: s.loc[[(1, 3), (2, 4)]]
Out[3]: 
1  3   -1
2  4   -1
dtype: int64

In [4]: s.loc[([1, 2], [3, 4])]
Out[4]: 
1  3   -1
   4   -1
2  3   -1
   4   -1
dtype: int64

Aside from the possible docs improvements: yes, in some cases we interpret tuples as lists, but I think it should be seen as an undesired implementation legacy. Vice versa, I see no harm (in general; caveats clearly can apply to specific cases) in interpreting generators, dicts or other list-likes as lists.

@cbrnr
Contributor

cbrnr commented Feb 2, 2018

See #19507

@toobaz
Member

toobaz commented Aug 3, 2018

I think this is fixed by #19507 . Anyone feel free to reopen if you disagree.

@toobaz toobaz closed this as completed Aug 3, 2018
@AntonVlasenko

@toobaz

This is almost bizarre how you helped me with getting a dataframe by multiindex. Thank you!
