-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Partial Selection on Hierarichal Index Not Working as Expected Beyond 2 Levels #2995
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Here is a workaround for solving your particular issue. The problem is passing a tuple of indexers is tantamount to this:
so passing a tuple is ambiguous; this is handled for a len 2 tuple however (there is currently no method to 'decide' if you are passing a tuple to index axis=0 with a tuple on a multi-index or index axis0 and axis1)
|
Good catch on the df.ix[loc[0],loc[1]] not correctly selecting what I intended. Indeed I meant to select df.ix[loc[0]].ix[loc[1]]. Thanks for the partial_indexer function. If the df.ix[row, col] method converts the index arguments to a tuple(row, col) are the following correctly interpreted as I have shown: df.ix[1, 4, 1, 3, :] correctly partially selects the data. df.ix[(1, 4, 1, 3), :] does not work Key Error df.ix[1, 4, 1, 3, 1] does not work --> Too many Indexers. Not sure how to partially select using all indices. df.ix[1, 4, 1, 3, 1, :] does not work either --> Too many Indexers but this works... df.ix[df[0:1].index] Perhaps it would make sense to accept a tuple of tuples to partition the row/col selections. ((ordered tuple of row indexers),(ordered tuple of col indexers)) then the outer tuple can be unpacked and you have two separate objects for partially selecting from both rows and cols. -Gagi |
These are scenarios where i prefer In [44]: df.xs([loc[0], loc[1]], level=[0, 1])
Out[44]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4003 entries, (0, 2, 3) to (2, 4, 4)
Data columns:
A5 4003 non-null values
A6 4003 non-null values
A7 4003 non-null values
A8 4003 non-null values
A9 4003 non-null values
dtypes: int64(5)
|
I see that df.xs does seem to support partial indexing as expected. Is it possible to call df.xs without explicitly passing the level? I level is not passed could it default to a list from 1 to the number of dimensions passed? |
It should default to the first n levels (n=length of the key), but it does not (at least not on master any more, see also #1796). In [28]: df.xs([loc[0], loc[1]])
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
...
KeyError: 'MultiIndex lexsort depth 0, key was length 2'
In [29]: df.xs([loc[0], loc[1]], level=[0, 1])
Out[29]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4047 entries, (2, 1, 3) to (1, 1, 0)
Data columns:
A5 4047 non-null values
A6 4047 non-null values
A7 4047 non-null values
A8 4047 non-null values
A9 4047 non-null values
dtypes: int64(5) |
With a similar DataFrame as above, df.xs can't partially select rows where the passed tuple length equals the total number of levels in the MultIndex. For example below we know (0, 1, 0, 4, 3) will return at least the first row of the data set. Using df.xs cant access rows indexed by (0, 1, 0, 4, 3). However using df.ix[df[0:1].index] to directly access rows that match the first rows index works. In [35]: df
Out[35]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (0, 1, 0, 4, 3) to (2, 0, 4, 1, 1)
Data columns:
A5 100000 non-null values
A6 100000 non-null values
A7 100000 non-null values
A8 100000 non-null values
A9 100000 non-null values
dtypes: int64(5)
In [36]: loc
Out[36]: (2, 3, 2, 4)
In [37]: df.xs(loc, level=[0,1,2,3])
Out[37]: <class 'pandas.core.frame.DataFrame'>
Int64Index: 201 entries, 1 to 3
Data columns:
A5 201 non-null values
A6 201 non-null values
A7 201 non-null values
A8 201 non-null values
A9 201 non-null values
dtypes: int64(5)
In [38]: df.xs((0, 1, 0, 4, 3), level=[0,1,2,3,4])
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-38-d8180b507e6b> in <module>()
----> 1 df.xs((0, 1, 0, 4, 3), level=[0,1,2,3,4])
C:\Python27\lib\site-packages\pandas\core\frame.py in xs(self, key, axis, level, copy)
2291 labels = self._get_axis(axis)
2292 if level is not None:
-> 2293 loc, new_ax = labels.get_loc_level(key, level=level)
2294
2295 if not copy and not isinstance(loc, slice):
C:\Python27\lib\site-packages\pandas\core\index.py in get_loc_level(self, key, level)
2243
2244 result = loc if result is None else result & loc
-> 2245 return result, _drop_levels(result, level)
2246
2247 level = self._get_level_number(level)
C:\Python27\lib\site-packages\pandas\core\index.py in _drop_levels(indexer, levels)
2227 levels = [self._get_level_number(i) for i in levels]
2228 for i in sorted(levels, reverse=True):
-> 2229 new_index = new_index.droplevel(i)
2230 return new_index
2231
AttributeError: 'Int64Index' object has no attribute 'droplevel'
In [39]: df.ix[df[0:1].index]
Out[39]: A5 A6 A7 A8 A9
A0 A1 A2 A3 A4
0 1 0 4 3 1 1 0 4 2
3 1 1 3 1 0
3 3 2 3 0 0
3 4 3 1 1 1
3 2 0 1 2 1
3 4 1 1 2 4
3 0 4 2 1 0
3 2 0 2 0 4
3 0 0 3 0 3
3 1 0 1 2 3
3 2 0 3 0 3
3 1 3 4 2 3
3 3 1 0 1 2
3 0 0 1 4 0
3 3 1 3 1 4
3 4 0 2 2 1
3 2 1 0 1 2
3 0 1 4 3 2
3 4 0 4 4 3
3 1 4 0 4 0
3 2 1 3 3 0
3 3 2 1 0 4
3 1 2 0 1 3
3 4 1 0 1 4
3 3 2 0 0 0
3 3 3 3 1 0
3 0 0 0 0 3
3 1 0 1 0 2
3 2 3 2 1 0
3 3 0 3 2 4
3 3 0 3 3 3
3 1 3 4 0 1
3 4 4 0 1 4
3 0 0 4 3 2
3 1 1 2 4 2 |
Note that your DataFrame has an unsorted index (i somehow did not notice this before). This is important when slicing/indexing on a MultiIndex. For details see "The need for sortedness" In [156]: df
Out[156]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (4, 0, 0, 2, 2) to (1, 4, 1, 2, 0)
Data columns:
A5 100000 non-null values
A6 100000 non-null values
A7 100000 non-null values
A8 100000 non-null values
A9 100000 non-null values
dtypes: int64(5)
In [157]: df.index.lexsort_depth
Out[157]: 0
In [158]: df.xs((0,0,0,0,0))
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
...
KeyError: 'MultiIndex lexsort depth 0, key was length 5'
In [159]: dfs = df.sort_index() # sort the index
In [160]: dfs.index.lexsort_depth
Out[160]: 5
In [161]: dfs.xs((0,0,0,0,0))
Out[161]:
A5 A6 A7 A8 A9
A0 A1 A2 A3 A4
0 0 0 0 0 4 1 0 1 3
0 3 0 3 2 4
0 4 0 2 2 0
0 1 4 4 1 3
0 4 4 2 3 0
0 4 1 0 3 4
0 3 4 3 3 1
0 1 3 2 4 2
0 1 1 3 1 2
0 2 3 2 3 4
0 3 3 2 2 2
0 4 0 1 3 0
0 2 1 4 1 0
0 4 1 3 2 1
0 3 4 2 0 2
0 3 4 2 4 1
0 4 4 3 2 3
0 3 4 4 0 0
0 3 1 4 1 1
0 2 4 0 1 4
0 4 4 2 3 2
0 4 2 0 1 0
0 1 4 4 0 3
0 0 0 1 0 3
0 3 1 1 1 0
0 4 3 1 4 2
0 0 4 1 2 3
0 1 4 3 4 3 Going back to your original examples, the failing parts work fine after sorting In [168]: dfs.ix[loc]
Out[168]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 164 entries, 0 to 4
Data columns:
A5 164 non-null values
A6 164 non-null values
A7 164 non-null values
A8 164 non-null values
A9 164 non-null values
dtypes: int64(5)
In [169]: dfs.ix[loc[0], loc[1], loc[2]]
Out[169]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 785 entries, (0, 0) to (4, 4)
Data columns:
A5 785 non-null values
A6 785 non-null values
A7 785 non-null values
A8 785 non-null values
A9 785 non-null values
dtypes: int64(5) |
Great catch lodagro! Sorting the hierarchical index did fix the df.ix and df.xs methods. The fluke indexing working when appending the null slice was just serving to confuse me further. For a future release it might make sense to have a default bool sort_index parameter when calling df.set_index() ensuring that a sorted index is returned. Issue closed. |
I was playing around with the partial selection and I ran into an inconsistency with how partially selected data frames are returned. When selecting using a subset of levels you get a df is returned with the fixed/selected levels dropped. When you partially select using a tuple on all levels you get a data frame with all indices returned. Is there a way to return all indices when sub-selecting on a subset of levels? Is there a way to not return the entire hierarchical index when selecting across all levels? With a df similar to the one above, indexed using 5 integer columns and with the hierarchical indices sorted. In [79]: df
Out[79]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (0, 0, 0, 0, 0) to (4, 4, 4, 4, 4)
Data columns:
A5 100000 non-null values
A6 100000 non-null values
A7 100000 non-null values
A8 100000 non-null values
A9 100000 non-null values
dtypes: int64(5)
In [80]: df.ix[(0)]
Out[80]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 20011 entries, (0, 0, 0, 0) to (4, 4, 4, 4)
Data columns:
A5 20011 non-null values
A6 20011 non-null values
A7 20011 non-null values
A8 20011 non-null values
A9 20011 non-null values
dtypes: int64(5)
In [81]: df.ix[(0,1)]
Out[81]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 4007 entries, (0, 0, 0) to (4, 4, 4)
Data columns:
A5 4007 non-null values
A6 4007 non-null values
A7 4007 non-null values
A8 4007 non-null values
A9 4007 non-null values
dtypes: int64(5)
In [82]: df.ix[(0,1,2)]
Out[82]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 817 entries, (0, 0) to (4, 4)
Data columns:
A5 817 non-null values
A6 817 non-null values
A7 817 non-null values
A8 817 non-null values
A9 817 non-null values
dtypes: int64(5)
In [83]: df.ix[(0,1,2,3)]
Out[83]: <class 'pandas.core.frame.DataFrame'>
Int64Index: 162 entries, 0 to 4
Data columns:
A5 162 non-null values
A6 162 non-null values
A7 162 non-null values
A8 162 non-null values
A9 162 non-null values
dtypes: int64(5)
In [84]: df.ix[(0,1,2,3,4)]
Out[84]: A5 A6 A7 A8 A9
A0 A1 A2 A3 A4
0 1 2 3 4 1 2 2 4 2
4 1 4 4 1 0
4 2 1 4 1 3
4 2 4 2 1 1
4 1 1 2 1 4
4 0 0 2 1 1
4 2 0 0 3 1
4 2 2 3 3 1
4 3 0 3 4 1
4 1 1 0 0 1
4 2 1 0 2 4
4 3 4 1 2 3
4 0 4 3 1 0
4 4 1 4 1 2
4 1 3 4 3 3
4 0 1 1 3 1
4 2 2 2 0 3
4 0 0 1 4 0
4 1 0 1 4 2
4 1 4 2 2 0
4 4 2 0 3 1
4 2 1 2 3 2
4 4 2 0 1 4
4 1 4 1 1 4
4 1 0 1 2 4
4 2 3 0 1 3
4 2 1 3 3 3
4 1 2 0 4 2
4 3 0 4 4 0
4 4 4 2 3 0
4 0 0 1 3 2
4 4 0 0 0 3
4 2 0 3 4 2
4 3 3 3 0 2
4 4 2 2 0 1
4 2 1 3 4 0
In [86]: df.index.lexsort_depth
Out[86]: 5 |
There is no way to control how the index is returned with xs, either full partial or not. fyi to remove the index you can do Since the discussion has moved away from the original issue (which is resolved i think), would it be ok for you to close this issue and if needed open a new one? |
Closing this issue and opened new issue #3057 for controlling what index levels are returned when sub-selecting. |
The Pandas documentation states that a hierarchically indexed DataFrame can be partially selected from by passing a partial list of index values. For example if we have a 5-level index, then df.ix['A','B','C','D'] should return a DataFrame with the subset of rows where levels[0,1,2,3] match values ['A','B','C','D'] indexed with by the remaining lowest index level 5. I have found that this only works when partially selecting on the first two levels, anything beyond the first 2 levels requires the user to append a null slice at the end for .ix to work. This becomes cumbersome especially if you do not know exactly how many levels are available in a DataFrame after several processing steps. The same behavior persists regardless if your MultiIndex is int64 or string or mixed types.
It would be best if any partial tuple of index values passed to the .ix method returns a partial selection matching that tuple without requiring users to explicitly append the empty slice for the remaining levels. Let me know if this is how the .ix method is intended to work. It might be a simple fix of by checking if the tuple length == # levels and if not appending the empty slice within the .ix method.
Example Showing the Issue:
Take an example data set with 5 index levels, group by the first 4 levels and count the group sizes. Select the group with the largest size.
Results: (Clean Code Pasted Below IPython Output)
Code:
The text was updated successfully, but these errors were encountered: