
Adding support for indexing a MultiIndex with a DataFrame and/or bi-dimensional np.array #15438

Open
toobaz opened this issue Feb 17, 2017 · 4 comments
Labels
Enhancement MultiIndex Needs Discussion Requires discussion from core team before further action

Comments

@toobaz
Member

toobaz commented Feb 17, 2017

(From #15425 )

Currently, plain (non-Multi) Indexes can be indexed with Series indexers. This also applies to MultiIndexes, in which case you select from the first level. It therefore seems natural for MultiIndexes to also accept DataFrame indexers.

Moreover, once #15434 is fixed, we will have a bi-dimensional object (MultiIndex) which can be indexed with np.arrays... but only one-dimensional ones! This is also strange.

The feature per se is certainly useful. As a simple real-world example, I am currently working with a subjects DataFrame to which I must add two columns from design, another DataFrame, based on the group and time columns of subjects, which are also levels of design's MultiIndex. I would like to just do:

subjects[design.columns] = design.loc[subjects[["group", "time"]]]

Now, I know this could be solved by .joining the two DataFrames... but this is conceptually more complicated (I don't even know whether I can join one DataFrame on columns and the other on index levels... but this is OT), to the point that I'm instead doing:

to_mi = lambda df : df.set_index(list(df.columns)).index
subjects[design.columns] = design.loc[to_mi(subjects[["group", "time"]])]
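
On the off-topic join question above: as far as I know, DataFrame.join accepts a list of caller columns in `on` and matches them against the other frame's MultiIndex. A minimal sketch, where the contents of `subjects` and `design` (including the `dose` and `arm` columns) are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-ins for the two frames described above.
design = pd.DataFrame(
    {"dose": [10, 20, 30, 40], "arm": ["x", "y", "x", "y"]},
    index=pd.MultiIndex.from_product(
        [["g1", "g2"], [0, 1]], names=["group", "time"]
    ),
)
subjects = pd.DataFrame({"group": ["g2", "g1", "g2"], "time": [1, 0, 1]})

# Match subjects' "group"/"time" columns against design's index levels.
joined = subjects.join(design, on=["group", "time"])
print(joined)
```

This keeps the row order of `subjects` and does not replace the proposed `design.loc[DataFrame]` syntax; it only shows that the join route is available.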

@jorisvandenbossche suggests this feature would add complexity to indexing, "eg, should the column names align on the level names?". I'm personally fine with both answers:

  • Yes: then we just use something like to_mi above (transforming the DataFrame into a MultiIndex, and then using it to actually index)
  • No: then it's really really simple (we just transform the DataFrame into tuples - I had actually already done this in Mi indexing #15425 before rolling back)
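
To make the two answers concrete, here is a rough sketch of what each conversion could look like in user code. Both helpers (`to_mi_aligned`, `to_tuples`) are hypothetical names for illustration, not pandas API, and `pd.MultiIndex.from_frame` only exists in pandas versions newer than this discussion:

```python
import pandas as pd

s = pd.Series(
    range(4),
    index=pd.MultiIndex.from_product(
        [[1, 2], ["a", "b"]], names=["first", "second"]
    ),
)

# Indexer whose columns are in the opposite order to the index levels.
df = pd.DataFrame({"second": ["a", "b"], "first": [1, 2]})


def to_mi_aligned(frame, target):
    """'Yes' variant: reorder columns by the target's level names."""
    return pd.MultiIndex.from_frame(frame[list(target.names)])


def to_tuples(frame):
    """'No' variant: ignore names and take the columns in their given order."""
    return list(frame.itertuples(index=False, name=None))


# Name-aligned lookup works even though df's columns are swapped.
by_name = s.loc[to_mi_aligned(df, s.index)]

# Positional lookup needs the caller to put the columns in level order.
by_position = s.loc[to_tuples(df[["first", "second"]])]
```

With the "No" variant, passing `df` as is would look up `('a', 1)` and `('b', 2)`, which are not in the index, so the column order becomes the caller's responsibility.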

"Yes" is probably the cleanest answer (possibly together with allowing indexing with bi-dimensional np.arrays, to obtain the equivalent of the "No" answer). In any case, once we decide, I can take care of this.

@jreback
Contributor

jreback commented Feb 17, 2017

can you show a copy-paste example that creates the actual frame you are talking about (a generic simple version is fine, just give generic names to levels and such).

@toobaz
Member Author

toobaz commented Feb 17, 2017

import pandas as pd

s = pd.Series(range(4),
              index=pd.MultiIndex.from_product([[1, 2], ['a', 'b']],
                                               names=['first', 'second']))

df = pd.DataFrame([[1, 'a'], [2, 'b']],
                  columns=['first', 'second'])

What I would like to do:

s.loc[df]

what I end up doing:

to_mi = lambda df : df.set_index(list(df.columns)).index
s.loc[to_mi(df)]

outputting indeed:

first  second
1      a         0
2      b         3
dtype: int64

However, only now I realize that column names do not matter above. That is: when you index a MultiIndex using a MultiIndex as indexer, level names are not considered! See the following example:

idf = pd.MultiIndex.from_tuples([[1, 'a'], [2, 'b']], names=['other', 'names'])
s.loc[idf]

which outputs

other  names
1      a        0
2      b        3
dtype: int64

So either this current behavior is undesired, or the answer to @jorisvandenbossche's question is trivial: "we don't respect names for MultiIndex indexers, why should we for DataFrame indexers?!"
(But I think the behavior we want for MultiIndex indexers, which e.g. currently behave strangely if they have fewer levels than the indexed one, is better discussed separately. Let's just say we want DataFrame indexing to be coherent with it.)

@toobaz toobaz mentioned this issue Mar 8, 2017
4 tasks
@toobaz
Member Author

toobaz commented Mar 10, 2017

Oh, by the way,

In [2]: s = pd.Series(range(4), index=pd.MultiIndex.from_product([['a', 'b'], ['c', 'd']]))

In [3]: s.loc[pd.DataFrame([['a', 'c'], ['b', 'c']]).values] = [100, 101]
Exception ignored in: 'pandas._libs.lib.is_bool_array'
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

In [4]: s
Out[4]: 
a  c    100
   d      1
b  c    101
   d      3
dtype: int64

I don't exactly understand what's going on with the "Exception ignored", but the setter works with a 2-d ndarray (by the way: .loc.__getitem__ and .loc.__setitem__ should really share a lot more code).
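
For the getter side, a workaround that avoids passing the 2-d array directly is to turn each row into a tuple first. This is only a sketch of that conversion, not a description of what .loc does internally:

```python
import numpy as np
import pandas as pd

s = pd.Series(
    range(4), index=pd.MultiIndex.from_product([["a", "b"], ["c", "d"]])
)

# 2-d array of keys, one row per (level 0, level 1) pair.
arr = np.array([["a", "c"], ["b", "c"]])

# tolist() yields plain Python strings; each row becomes one tuple key.
keys = [tuple(row) for row in arr.tolist()]

selected = s.loc[keys]
print(selected)
```

For keys that mix types (e.g. integers and strings), the array would presumably need `dtype=object` so that the values are not coerced to strings.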

@toobaz
Member Author

toobaz commented Apr 26, 2018

xref #12550

@mroeschke mroeschke added Enhancement Needs Discussion Requires discussion from core team before further action and removed API Design labels May 8, 2021

3 participants