MultiIndex row indexing with .loc fail with tuple but work with list of indices #16943

mansenfranzen · 2017-07-15T10:30:26Z

Code Sample, a copy-pastable example if possible

import pandas as pd

data = {"ID1": [1, 1, 1, 2, 2],
        "ID2": [1001, 1001, 1002, 1001, 1002],
        "ID3": [1, 2, 1, 1, 2],
        "Value": [1, 2, 9, 3, 4]}

df = pd.DataFrame(data).set_index(["ID1", "ID2", "ID3"])
desired_rows = ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2)) # the rows to be extracted

print(df)

Out[3]: 
              Value
ID1 ID2  ID3       
1   1001 1        1
         2        2
    1002 1        9
2   1001 1        3
    1002 2        4

Problem description

Now, extracting the desired rows with loc fails: it returns only a single row instead of the three requested:

In [5]: df.loc[desired_rows, :]
Out[5]: 
              Value
ID1 ID2  ID3       
1   1001 2        2

Expected Output

One solution would be to convert the tuple to a list internally, because a list of indices works correctly:

In [6]: df.loc[list(desired_rows), :]
Out[6]: 
              Value
ID1 ID2  ID3       
1   1001 1        1
         2        2
2   1002 2        4

Another solution is to raise an error if a tuple of indices is provided as the row indexer of loc, in order to prevent unexpected results.
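Until the behaviour is settled, the list conversion can also be done on the caller's side. The following is a minimal sketch, not part of pandas: `loc_rows` is a hypothetical helper name, and it assumes every key is a complete MultiIndex label (one value per level).

```python
import pandas as pd


def loc_rows(frame, keys):
    """Hypothetical helper: select complete MultiIndex keys from any iterable of tuples."""
    # Build an explicit list of tuples so .loc reads it as a list of labels,
    # never as one nested tuple of per-level indexers.
    key_list = [tuple(key) for key in keys]
    return frame.loc[key_list, :]


data = {"ID1": [1, 1, 1, 2, 2],
        "ID2": [1001, 1001, 1002, 1001, 1002],
        "ID3": [1, 2, 1, 1, 2],
        "Value": [1, 2, 9, 3, 4]}
df = pd.DataFrame(data).set_index(["ID1", "ID2", "ID3"])
desired_rows = ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2))

result = loc_rows(df, desired_rows)
print(result)
```

Because the helper always passes a list, the same call works for tuples, generators, or other iterables of keys.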

Output of `pd.show_versions()`

In [8]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.8.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None


TomAugspurger · 2017-07-15T12:54:16Z

One solution would be to convert the tuple to a list

I suspect this would break other tests, where tuples have different meanings than lists for slicing. I may be wrong though, if you want to give it a shot.

jreback · 2017-07-15T14:38:57Z

this technically is not covered by the doc-string

- A list or array of labels, e.g. ['a', 'b', 'c'].

but we almost always accept array-like (which includes tuples). The reason this is slightly confusing is that a non-nested tuple is also valid as a single indexer.

cbrnr · 2018-02-01T10:23:39Z

I came across a slightly related issue: using a multi-index dataframe, why can I only use a tuple as an indexer and not a list (i.e. why do they give different results)?

Using the example data, if I want to pull out rows where ID1=1 and ID2=1001, I can only use a tuple inside loc:

df.loc[(1, 1001)]

This returns the desired slice.

I can't use a list:

df.loc[[1, 1001]]

This seems to imply that I want values 1 and 1001 for the first level of the index only:

              Value
ID1 ID2  ID3       
1   1001 1        1
         2        2
    1002 1        9

It took me quite some time to figure this out. Is this intended behavior? If yes, is this documented (I thought it should be mentioned here but didn't find anything)?

jorisvandenbossche · 2018-02-01T12:16:46Z

@cbrnr Yes, that is intended behaviour. For single "labels" of a MultiIndex (so one value for each level), we always use tuples and not a list, because it would otherwise be difficult to distinguish. I think for this case we are quite consistent within pandas.
It is the other way around (in a case where we want list-like, do we accept tuple?) that there can be more discussion. Typically we allow tuples as list-like, but exactly for the reason above (tuples are used to indicate labels of a MI) we might not want to do that in the case of the original issue here.

So your assessment is correct: it tries to look for those values of the list in the first index level. You could interpret the list as "give me the combination of indexing the dataframe with each element of the list", so df.loc[1] and df.loc[1001]: in both cases you select rows based on the first index level.
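The distinction can be sketched with the example data from the original post. This is only an illustration: the list `[1, 2]` is an assumption chosen here (it does not appear in the thread) so that every label exists on the first level and no missing-label handling is involved.

```python
import pandas as pd

data = {"ID1": [1, 1, 1, 2, 2],
        "ID2": [1001, 1001, 1002, 1001, 1002],
        "ID3": [1, 2, 1, 1, 2],
        "Value": [1, 2, 9, 3, 4]}
df = pd.DataFrame(data).set_index(["ID1", "ID2", "ID3"])

# A tuple is ONE (possibly partial) MultiIndex key: ID1 == 1 and ID2 == 1001.
by_tuple = df.loc[(1, 1001), :]

# A list is SEVERAL keys, each matched against the first level only:
# ID1 == 1 or ID1 == 2.
by_list = df.loc[[1, 2], :]

print(by_tuple)
print(by_list)
```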

jorisvandenbossche · 2018-02-01T12:20:33Z

For the original issue: given the possible confusion between the two, I think it might be better in this case to not interpret the tuple as a list-like.
But, in that case, shouldn't it raise an error? As if we interpret ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2)) as a single label, it should not find it?

cc @toobaz interesting case :-)

cbrnr · 2018-02-01T12:20:46Z

Thanks @jorisvandenbossche, this makes sense! I usually don't distinguish between lists and tuples in plain Python since they are both list-like objects. So this Pandas behavior tripped me up a bit - is this documented clearly somewhere?

jorisvandenbossche · 2018-02-01T12:25:08Z

But, in that case, shouldn't it raise an error? As if we interpret ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2)) as a single label, it should not find it?

Ah, no, I suppose this is wrong. It seems that it does interpret ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2)) as a list, but not as a list of labels, but as a list of lists (a list of indexers into one level).

So it is indexing as:

In [21]: df.loc[pd.IndexSlice[[1, 1001, 1], [1, 1001, 2], [2, 1002, 2]], :]
Out[21]: 
              Value
ID1 ID2  ID3       
1   1001 2        2

To make it a bit more confusing: strangely, the actual list of lists (df.loc[[[1, 1001, 1], [1, 1001, 2], [2, 1002, 2]], :]) does not work in this case but raises an error that " '[1, 1001, 1]' is an invalid key". So the list of lists is interpreted as a list of tuples (list of labels).
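One way to see why only a single row comes back is to reproduce the per-level reading by hand. The sketch below does not call the pandas code path discussed above; it assumes that the i-th inner tuple is treated as the set of allowed values for the i-th index level, and builds the equivalent boolean mask explicitly.

```python
import numpy as np
import pandas as pd

data = {"ID1": [1, 1, 1, 2, 2],
        "ID2": [1001, 1001, 1002, 1001, 1002],
        "ID3": [1, 2, 1, 1, 2],
        "Value": [1, 2, 9, 3, 4]}
df = pd.DataFrame(data).set_index(["ID1", "ID2", "ID3"])
desired_rows = ((1, 1001, 1), (1, 1001, 2), (2, 1002, 2))

# Start with every row selected, then keep only rows whose value on each
# level is contained in the inner tuple at the same position.
mask = np.ones(len(df), dtype=bool)
for level, allowed in enumerate(desired_rows):
    mask &= df.index.get_level_values(level).isin(list(allowed))

per_level = df[mask]
print(per_level)
```

Only the key (1, 1001, 2) satisfies all three per-level conditions, which matches the single row shown in the original report.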

jorisvandenbossche · 2018-02-01T12:27:38Z

I usually don't distinguish between lists and tuples in plain Python since they are both list-like objects. So this Pandas behavior tripped me up a bit - is this documented clearly somewhere?

Yes, this is one of the gotchas due to the complexity of MultiIndexing: we somehow need to distinguish between the two.
And the documentation can certainly be better about those things. But in general this is also an area where we would need more extensive testing of the different cases, and then better documentation of those cases (e.g. see my comment above, even for me it is difficult to really predict how something will be interpreted in certain cases).

cbrnr · 2018-02-01T12:31:20Z

I could add a statement that tuples are needed in the case of multiple indexers on a multiindex: http://pandas.pydata.org/pandas-docs/stable/advanced.html#using-slicers. Not a warning box, but maybe there's a note or an info box? Or is there a better place to put such a note? Let me know and I can take care of that in a PR.

jorisvandenbossche · 2018-02-01T12:51:08Z

Ah, so we are actually using tuples there in the docs :-) So I just might have had the wrong assumption that a list would work (regarding the last part of my comment above).
But yes, adding a note there that those multiple indexers need to be contained in a tuple is a good idea (and using the IndexSlice makes it even more explicit).

toobaz · 2018-02-01T13:19:13Z

I could add a statement that tuples are needed in the case of multiple indexers on a multiindex:

Great idea! I think such statement should actually go at the beginning of http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-indexing-with-hierarchical-index

You could maybe start by stating that MultiIndex keys take the form of tuples, then you could swap the first two examples currently provided (move the complete indexing one first), then introduce partial indexing, and mention that when doing partial indexing on the first level, you are allowed to only pass the first element of the tuple ('bar' stands for ('bar',)). Finally, I think a warning box could then clarify that (for the reasons above), tuples and lists are not equivalent in pandas, and in particular, tuples should not be used as lists of keys (for MultiIndexes, and not only).

You might want to show examples of the fact that lists of tuples in general refer to multiple complete (MultiIndex) keys, while tuples of lists in general refer to multiple values on each level, that is, something like

In [2]: s = pd.Series(-1, index=pd.MultiIndex.from_product([[1, 2], [3, 4]]))

In [3]: s.loc[[(1, 3), (2, 4)]]
Out[3]: 
1  3   -1
2  4   -1
dtype: int64

In [4]: s.loc[([1, 2], [3, 4])]
Out[4]: 
1  3   -1
   4   -1
2  3   -1
   4   -1
dtype: int64

Aside from the possible docs improvements: yes, in some cases we interpret tuples as lists, but I think it should be seen as an undesired implementation legacy. Vice versa, I see no harm (in general - caveats clearly can apply to specific cases) in interpreting generators, dicts or other list-likes as lists.

cbrnr · 2018-02-02T08:27:07Z

See #19507

toobaz · 2018-08-03T07:58:00Z

I think this is fixed by #19507. Anyone feel free to reopen if you disagree.

AntonVlasenko · 2019-02-10T21:16:32Z

@toobaz

This is almost bizarre how you helped me with getting a dataframe by multiindex. Thank you!

TomAugspurger added the Difficulty Intermediate, Indexing, and MultiIndex labels Jul 15, 2017

TomAugspurger added this to the Next Major Release milestone Jul 15, 2017

cbrnr mentioned this issue Feb 2, 2018

DOC: improve docs to clarify MultiIndex indexing #19507

Merged

toobaz closed this as completed Aug 3, 2018
