Adding support for indexing a MultiIndex with a DataFrame and/or bi-dimensional np.array #15438

toobaz · 2017-02-17T15:41:38Z

Currently, (non-Multi)Indexes can be indexed with Series indexers. This also applies to MultiIndexes, in which case you select from the first level. Hence, it seems natural for MultiIndexes to also accept DataFrame indexers.

Moreover, once #15434 is fixed, we will have a bi-dimensional object (MultiIndex) which can be indexed with np.arrays... but only one-dimensional ones! This is also strange.

The feature per se is certainly useful. As a simple real-world example, I am currently working with a subjects DataFrame to which I must add two columns from design, another DataFrame, depending on the group and time columns of subjects, which are also levels of the MultiIndex of design. I would like to just do

subjects[design.columns] = design.loc[subjects[["group", "time"]]]

Now, I know this could be solved by joining the two DataFrames with .join... but this is conceptually more complicated (I currently don't even know whether I can join one DataFrame on columns and the other on index levels... but this is OT), to the point that I'm instead doing:

to_mi = lambda df : df.set_index(list(df.columns)).index
subjects[design.columns] = design.loc[to_mi(subjects[["group", "time"]])]
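On the off-topic question of joining one DataFrame on columns and the other on index levels: as far as I can tell, DataFrame.join accepts a list of column names in `on` and matches them against the MultiIndex of the other frame. The sketch below uses made-up data; the column names dose and arm and all values are assumptions for illustration only, not the real subjects/design frames.

```python
import pandas as pd

# Hypothetical stand-in for `design`: indexed by (group, time).
design = pd.DataFrame(
    {"dose": [10, 20, 30, 40], "arm": ["x", "y", "x", "y"]},
    index=pd.MultiIndex.from_product(
        [["A", "B"], [1, 2]], names=["group", "time"]
    ),
)

# Hypothetical stand-in for `subjects`: group and time are plain columns.
subjects = pd.DataFrame({"group": ["B", "A", "B"], "time": [1, 2, 2]})

# Join the columns of `subjects` against the index levels of `design`.
# The order of `on` must match the order of the levels in design.index.
joined = subjects.join(design, on=["group", "time"])
print(joined)
```

This keeps the row order and index of subjects, but it is still a join rather than an indexing operation, so it does not replace the feature requested here.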

@jorisvandenbossche suggests this feature would add complexity to indexing, "eg, should the column names align on the level names?". I'm personally fine with both answers:

Yes: then we just use something like to_mi above (transforming the DataFrame into a MultiIndex, and then using it to actually index)
No: then it's really simple (we just transform the DataFrame into tuples - I had actually already done this in "Mi indexing" #15425 before rolling back)

"Yes" is probably the cleanest answer (possibly together with allowing indexing with bi-dimensional np.arrays, to obtain the equivalent of the "No" answer). In any case, once we decide, I can take care of this.
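To make the two answers concrete, here is a minimal sketch of what each could look like as a user-level helper. The function names indexer_by_name and indexer_by_position are hypothetical, nothing like them exists in pandas, and MultiIndex.from_frame is assumed to be available (it is in recent pandas versions).

```python
import pandas as pd

s = pd.Series(
    range(4),
    index=pd.MultiIndex.from_product(
        [[1, 2], ["a", "b"]], names=["first", "second"]
    ),
)

# Columns deliberately listed in a different order than the levels of s.
df = pd.DataFrame([["a", 1], ["b", 2]], columns=["second", "first"])


def indexer_by_name(frame, target_index):
    """'Yes' answer: align the frame's columns on the level names."""
    ordered = frame[list(target_index.names)]
    return pd.MultiIndex.from_frame(ordered)


def indexer_by_position(frame):
    """'No' answer: ignore the names, turn each row into a tuple."""
    return list(frame.itertuples(index=False, name=None))


# Name-based alignment finds the rows even with swapped columns.
by_name = s.loc[indexer_by_name(df, s.index)]

# Positional indexing needs the columns to already be in level order.
by_position = s.loc[indexer_by_position(df[["first", "second"]])]

print(by_name)
print(by_position)
```

With the swapped column order, the positional variant applied to df directly would look up keys such as ('a', 1), which are not in the index, so the two answers really do differ in behavior.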


jreback · 2017-02-17T15:49:04Z

can you show a copy-paste example that creates the actual frame you are talking about (a generic simple version is fine, just give generic names to levels and such).

toobaz · 2017-02-17T16:45:05Z

import pandas as pd

s = pd.Series(range(4),
              index=pd.MultiIndex.from_product([[1,2], ['a', 'b']],
                                               names=['first', 'second']))

df = pd.DataFrame([[1, 'a'], [2, 'b']],
                  columns=['first', 'second'])

What I would like to do:

s.loc[df]

what I end up doing:

to_mi = lambda df : df.set_index(list(df.columns)).index
s.loc[to_mi(df)]

outputting indeed:

first  second
1      a         0
2      b         3
dtype: int64

However, only now do I realize that column names do not matter above. That is: when you index a MultiIndex using a MultiIndex as indexer, level names are not considered! See the following example:

idf = pd.MultiIndex.from_tuples([[1, 'a'], [2, 'b']], names=['other', 'names'])
s.loc[idf]

which outputs

other  names
1      a        0
2      b        3
dtype: int64

So either this current behavior is undesired, or the answer to @jorisvandenbossche's question is trivial: "we don't respect names for MultiIndex indexers, why should we for DataFrame indexers?!"
(But I think the behavior we want to keep with MultiIndex indexers - which e.g. currently behave strangely if they have fewer levels than the indexed one - is better discussed separately. Let's just say we want DataFrame indexing to be coherent with it.)

toobaz · 2017-03-10T08:32:44Z

Oh, by the way,

In [2]: s = pd.Series(range(4), index=pd.MultiIndex.from_product([['a', 'b'], ['c', 'd']]))

In [3]: s.loc[pd.DataFrame([['a', 'c'], ['b', 'c']]).values] = [100, 101]
Exception ignored in: 'pandas._libs.lib.is_bool_array'
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

In [4]: s
Out[4]: 
a  c    100
   d      1
b  c    101
   d      3
dtype: int64

I don't exactly understand what's going on with the "Exception ignored", but the setter works with a 2-d ndarray (by the way: .loc.__getitem__ and .loc.__setitem__ should really share a lot more code).
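For the getter side, a user-level workaround is to turn each row of the 2-d array into a tuple before passing it to .loc. This is only a sketch of what the positional ("No") behavior would amount to; the variable names are illustrative and this is not how pandas handles the array internally.

```python
import numpy as np
import pandas as pd

s = pd.Series(
    range(4),
    index=pd.MultiIndex.from_product([["a", "b"], ["c", "d"]]),
)

# A 2-d array where each row is one full key of the MultiIndex.
arr = np.array([["a", "c"], ["b", "c"]])

# Convert each row into a tuple of plain Python objects.
keys = [tuple(row) for row in arr.tolist()]

selected = s.loc[keys]
print(selected)
```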

toobaz · 2018-04-26T15:29:29Z

xref #12550

jreback added API Design MultiIndex labels Feb 17, 2017

toobaz mentioned this issue Mar 8, 2017

Clean multiindex keys #15615

Closed

4 tasks

mroeschke added Enhancement Needs Discussion Requires discussion from core team before further action and removed API Design labels May 8, 2021
