Wrong behaviour/error when indexing MultiIndex with list #12416

Closed
toobaz opened this issue Feb 22, 2016 · 12 comments · Fixed by #16029
Labels
Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Comments

@toobaz
Member

toobaz commented Feb 22, 2016

In [2]: df = pd.DataFrame(index=range(2), columns=pd.MultiIndex.from_product([[10,20], ['a', 'b']]))

In [3]: df[[20]]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-3-8422ebb3f356> in <module>()
----> 1 df[[20]]

/home/pietro/nobackup/repo/pandas/pandas/core/frame.pyc in __getitem__(self, key)
   1975         if isinstance(key, (Series, np.ndarray, Index, list)):
   1976             # either boolean or fancy integer index
-> 1977             return self._getitem_array(key)
   1978         elif isinstance(key, DataFrame):
   1979             return self._getitem_frame(key)

/home/pietro/nobackup/repo/pandas/pandas/core/frame.pyc in _getitem_array(self, key)
   2020         else:
   2021             indexer = self.ix._convert_to_indexer(key, axis=1)
-> 2022             return self.take(indexer, axis=1, convert=True)
   2023 
   2024     def _getitem_multilevel(self, key):

/home/pietro/nobackup/repo/pandas/pandas/core/generic.pyc in take(self, indices, axis, convert, is_copy)
   1590         new_data = self._data.take(indices,
   1591                                    axis=self._get_block_manager_axis(axis),
-> 1592                                    convert=True, verify=True)
   1593         result = self._constructor(new_data).__finalize__(self)
   1594 

/home/pietro/nobackup/repo/pandas/pandas/core/internals.pyc in take(self, indexer, axis, verify, convert)
   3617         n = self.shape[axis]
   3618         if convert:
-> 3619             indexer = maybe_convert_indices(indexer, n)
   3620 
   3621         if verify:

/home/pietro/nobackup/repo/pandas/pandas/core/indexing.pyc in maybe_convert_indices(indices, n)
   1803     mask = (indices >= n) | (indices < 0)
   1804     if mask.any():
-> 1805         raise IndexError("indices are out-of-bounds")
   1806     return indices
   1807 

IndexError: indices are out-of-bounds

and

In [4]: df[[1]]
Out[4]: 
    10
     b
0  NaN
1  NaN

Both should raise KeyError (see #12369 (comment)).

@jorisvandenbossche
Member

Why should it raise a KeyError (the first case of df[[20]]) ?
It seems this should just work? (as it works for strings)

In [117]: df = pd.DataFrame(index=range(2), columns=pd.MultiIndex.from_product([['a','b'], ['c', 'd']]))

In [120]: df[['a']]
Out[120]:
     a
     c    d
0  NaN  NaN
1  NaN  NaN

However, I agree that df[[1]] should raise. But this seems to indicate that when a list of integers is used to access MultiIndexed columns, the integers are interpreted as positions instead of labels.
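To make the label-versus-position distinction concrete, here is a minimal sketch, assuming a recent pandas release in which the list lookup is label-based (as the later comments in this thread report). The variable names are illustrative only. Note that the explicitly positional selection yields the same column, (10, 'b'), that df[[1]] returned in the original report.

```python
import pandas as pd

columns = pd.MultiIndex.from_product([[10, 20], ["a", "b"]])
df = pd.DataFrame(index=range(2), columns=columns)

# Label-based lookup: 20 is a label on the first column level,
# so both columns under it are selected.
by_label = df[[20]]

# Explicitly positional lookup: the column at position 1.
by_position = df.iloc[:, [1]]

print(list(by_label.columns))
print(list(by_position.columns))
```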

@jorisvandenbossche jorisvandenbossche added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Feb 22, 2016
@toobaz
Member Author

toobaz commented Feb 22, 2016

@jorisvandenbossche What you say is true only when you omit some levels. For instance, df[[(10, 'a')]] interprets the given column as a label instead. While it would probably be cool to just have something like

obj[a_list] == pd.concat([obj[key] for key in a_list if key in NDFrame])

whatever key is, I am probably oversimplifying a lot, and @jreback instead suggested that indexing with lists of incomplete MultiIndex columns should not be supported at all.
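The concat idea above can be sketched as follows, restricted to first-level keys only. The helper name select_present is hypothetical and is not a pandas API; this only illustrates the proposed semantics, under which keys that are not present are silently skipped.

```python
import pandas as pd


def select_present(obj, keys):
    """Hypothetical helper: concatenate obj[key] for each key found in obj."""
    present = [key for key in keys if key in obj]
    # obj[key] drops the first column level, so pass keys= to restore it.
    # pd.concat raises ValueError if no key is present at all.
    return pd.concat([obj[key] for key in present], axis=1, keys=present)


columns = pd.MultiIndex.from_product([[10, 20], ["a", "b"]])
df = pd.DataFrame(index=range(2), columns=columns)

picked = select_present(df, [20, 1])  # 1 is not a label, so it is skipped
print(list(picked.columns))
```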

@toobaz
Member Author

toobaz commented Feb 22, 2016

To complement the above...

In [2]: df = pd.DataFrame(index=pd.MultiIndex.from_product([[1, 2], [3, 4]]), columns=range(2))

In [3]: df.loc[[1]]
Out[3]: 
       0    1
1 3  NaN  NaN
  4  NaN  NaN

In [4]: df.loc[[1,4]]
Out[4]: 
       0    1
1 3  NaN  NaN
  4  NaN  NaN

In [5]: df.loc[[3,4]]
Out[5]: 
Empty DataFrame
Columns: [0, 1]
Index: []

In [6]: df.loc[[4]]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-6-bdf10e6956ab> in <module>()
----> 1 df.loc[[4]]

/home/pietro/nobackup/repo/pandas/pandas/core/indexing.pyc in __getitem__(self, key)
   1279             return self._getitem_tuple(key)
   1280         else:
-> 1281             return self._getitem_axis(key, axis=0)
   1282 
   1283     def _getitem_axis(self, key, axis=0):

/home/pietro/nobackup/repo/pandas/pandas/core/indexing.pyc in _getitem_axis(self, key, axis)
   1412                     raise ValueError('Cannot index with multidimensional key')
   1413 
-> 1414                 return self._getitem_iterable(key, axis=axis)
   1415 
   1416             # nested tuple slicing

/home/pietro/nobackup/repo/pandas/pandas/core/indexing.pyc in _getitem_iterable(self, key, axis)
   1055                         if (hasattr(result, 'ndim') and
   1056                                 not np.prod(result.shape) and len(keyarr)):
-> 1057                             raise KeyError("cannot index a multi-index axis "
   1058                                            "with these keys")
   1059 

KeyError: 'cannot index a multi-index axis with these keys'

The last two examples have no valid excuse for not resulting in the same behaviour, I think.
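For comparison, the level being indexed can be stated explicitly with pd.IndexSlice, which avoids the ambiguity of a bare list. This is a minimal sketch, assuming a recent pandas release and a lexsorted index (which from_product produces here). It is not a statement of how the bare-list cases above should behave.

```python
import pandas as pd

index = pd.MultiIndex.from_product([[1, 2], [3, 4]])
df = pd.DataFrame(index=index, columns=range(2))

idx = pd.IndexSlice

# Select on the first level only: every row whose first label is 1.
first_level = df.loc[idx[[1], :], :]

# Select on the second level only: every row whose second label is 4.
second_level = df.loc[idx[:, [4]], :]

print(list(first_level.index))
print(list(second_level.index))
```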

@toobaz
Member Author

toobaz commented Feb 22, 2016

Even better:

In [2]: df = pd.DataFrame(2, index=pd.MultiIndex.from_product([[1, 2], [3, 4], [5,6]]), columns=range(2))

In [3]: df.loc[(1,3)]
Out[3]: 
   0  1
5  2  2
6  2  2

In [4]: df.loc[[(1,3)]]
Out[4]: 
      0   1
1 3 NaN NaN

@EricPrescottGagnon

I am seeing similar issues with a DatetimeIndex.

In[1]: df = pandas.DataFrame(0, index=pandas.date_range('1/8/2011', periods=5, freq='W'), columns=['a', 'b'])
In[2]: df
Out[2]: 
            a  b
2011-01-09  0  0
2011-01-16  0  0
2011-01-23  0  0
2011-01-30  0  0
2011-02-06  0  0

In[3]: df.loc[list(pandas.date_range('1/1/2011', periods=4, freq='W'))]
Out[3]: 
              a    b
2011-01-02  NaN  NaN
2011-01-09  0.0  0.0
2011-01-16  0.0  0.0
2011-01-23  0.0  0.0

In[4]: df.loc[tuple(pandas.date_range('1/1/2011', periods=4, freq='W'))]
Traceback (most recent call last):
  File "C:\Anaconda3\lib\site-packages\pandas\core\indexing.py", line 1395, in _has_valid_type
    error()
  File "C:\Anaconda3\lib\site-packages\pandas\core\indexing.py", line 1390, in error
    (key, self.obj._get_axis_name(axis)))
KeyError: 'the label [2011-01-02 00:00:00] is not in the [index]'

@toobaz
Member Author

toobaz commented Apr 15, 2017

I think this bug can be closed. The initial example now works correctly:

In [2]: df = pd.DataFrame(index=range(2), columns=pd.MultiIndex.from_product([[10,20], ['a', 'b']]))

In [3]: df[[20]]
Out[3]: 
    20     
     a    b
0  NaN  NaN
1  NaN  NaN

The example I reported in my second comment (df.loc[[3,4]]) is already reported as #15452.

The example I reported in my third comment is probably still wrong but unrelated, and is also discussed in #15452.

The example reported by @EricPrescottGagnon is not really a bug: df.loc[tuple(something)] and df.loc[list(something)] are not the same thing for pandas. The first looks in the index for the first element of something only; since that element is a single label (not a list of labels) and is not in the index, it has every right to fail, exactly as df.loc[something[0]] would.
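The list-versus-tuple distinction can be sketched as below, using only labels that exist in the frame so that neither lookup fails. This assumes a recent pandas release. The two-element tuple is chosen for illustration because .loc treats a tuple on a non-MultiIndex frame as one key per axis.

```python
import pandas as pd

df = pd.DataFrame(
    0,
    index=pd.date_range("2011-01-09", periods=5, freq="W"),
    columns=["a", "b"],
)

# A list is read as several row labels: this selects two rows.
by_list = df.loc[list(df.index[:2])]

# A tuple is read as one key per axis: first a row label, then a column label.
by_tuple = df.loc[(df.index[0], "a")]

print(by_list.shape)
print(by_tuple)
```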

@toobaz toobaz closed this as completed Apr 15, 2017
@jorisvandenbossche
Member

@toobaz would you like to add a test for the initial examples?

@toobaz
Member Author

toobaz commented Apr 16, 2017

@toobaz would you like to add a test for the initial examples?

I suspect this is already tested somewhere, but I'm making a note to check.

@jorisvandenbossche
Member

Based on a quick search in the 0.20 whatsnew file, nothing listed about MultiIndexes matches this exactly, so quite possibly it was fixed indirectly as part of another issue.

@toobaz
Member Author

toobaz commented Apr 17, 2017

I'm confused: this seems to test precisely what we want... except it was already in place when I filed this bug. I don't know if the difference in behaviour in my example came from the labels being integers, or from the level of the df having more than one label, or even from the NaNs. Anyway, I can add my initial example as a test if you want.

@jorisvandenbossche
Member

Yes, the issue here was related to the fact that the integer was (wrongly) regarded as positional instead of as label

@jorisvandenbossche
Member

Test would be good!
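A regression check along these lines might look like the sketch below. It is not the test that was eventually merged; the function name is hypothetical, and the KeyError expectation for df[[1]] follows the behaviour requested at the top of this issue, assuming a recent pandas release.

```python
import pandas as pd


def check_getitem_int_label_list():
    """Hypothetical check: integer lists select MultiIndex columns by label."""
    columns = pd.MultiIndex.from_product([[10, 20], ["a", "b"]])
    df = pd.DataFrame(index=range(2), columns=columns)

    # 20 is a first-level label, so both columns under it are returned.
    result = df[[20]]
    expected = df.loc[:, [(20, "a"), (20, "b")]]
    pd.testing.assert_frame_equal(result, expected)

    # 1 is a valid position but not a label, so the lookup should fail.
    try:
        df[[1]]
    except KeyError:
        pass
    else:
        raise AssertionError("df[[1]] should raise KeyError")


check_getitem_int_label_list()
```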
