Partial indexing with a list and hierarchical index #13501

Closed
jseabold opened this issue Jun 23, 2016 · 4 comments · Fixed by #41482
Labels: good first issue, Needs Tests (unit test(s) needed to prevent regressions)

@jseabold
Contributor

Code Sample, a copy-pastable example if possible

Teaching a pandas course. Attendee just came across this. Note that we index with a list instead of a tuple at the bottom.

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame
# Note: the row indexer below is a list, not a tuple
frame.loc[['b', 2], 'Colorado']
frame.loc[['b', 1], 'Colorado']

Returns

color      Green
key1 key2
b    1         8
     2        11

in both cases on pandas 0.18.1

Expected Output

Error?
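
For comparison, here is a minimal sketch of the tuple form that the list was presumably meant to be. It rebuilds the frame from the report; the variable name `cell` is only illustrative. A tuple supplies one label per index level, so it addresses a single row rather than a set of first-level labels.

```python
import numpy as np
import pandas as pd

# Same frame as in the report above.
frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
)
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# A tuple names one label per level: row ('b', 2), column ('Colorado', 'Green').
cell = frame.loc[('b', 2), ('Colorado', 'Green')]
```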

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.4.final.0
python-bits: 64
OS: Darwin
OS-release: 15.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: None
pip: 8.1.1
setuptools: 20.3
Cython: None
numpy: 1.11.0
scipy: 0.17.1
statsmodels: None
xarray: None
IPython: 4.0.3
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
httplib2: None
apiclient: None
sqlalchemy: 1.0.8
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None
@jorisvandenbossche
Member

So if you have a non-hierarchical index, you expect a reindex:

In [59]: frame2 = frame.reset_index(level=1, drop=True)

In [60]: frame2
Out[60]:
state  Ohio     Colorado
color Green Red    Green
key1
a         0   1        2
a         3   4        5
b         6   7        8
b         9  10       11

In [63]: frame2.loc[['b', 2], 'Colorado']
Out[63]:
color  Green
key1
b        8.0
b       11.0
2        NaN

So loc is rather liberal with its inputs and does not raise for a list indexer when at least one of the labels is found.
The problem with the MultiIndex case is that reindexing is not really an option if you do not provide full indexers (ones containing all levels; e.g. frame.loc[[('b', 2), ('c', 3)], 'Colorado'] does reindex).

So in this case it would maybe indeed make sense to raise? Other options are to keep it as is and ignore those values, or to do a reindex with an empty label for the second level (like reindexing with [(2, NaN)]).

I thought we already had a discussion on this once, but can't directly find the issue.
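
The reindex-like behaviour of loc shown above is specific to older pandas. As a hedged sketch on a small hypothetical Series (not the frame from this issue), pandas 1.0 and later raise a KeyError when a list passed to loc contains a missing label, and reindex is the explicit way to ask for NaN rows:

```python
import pandas as pd

# Hypothetical flat-index example with unique labels.
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# reindex states the intent explicitly: absent labels come back as NaN.
reindexed = s.reindex(['b', 'z'])

# Assumption: pandas >= 1.0, where a list indexer with a missing label raises.
try:
    s.loc[['b', 'z']]
    loc_raised = False
except KeyError:
    loc_raised = True
```

Note that reindex requires unique labels, so it would not apply directly to frame2 above, whose index contains duplicates.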

@jreback
Contributor

jreback commented Jun 23, 2016

So you are using slicers implicitly here and it's ambiguous; see the docs on using slicers.

You want something like this?

In [12]: idx = pd.IndexSlice

In [19]: frame.loc[idx['b', 1], 'Colorado']
Out[19]: 
color
Green    8
Name: (b, 1), dtype: int64

In [21]: frame.loc[idx[['b'], 1], 'Colorado']
Out[21]: 
color      Green
key1 key2       
b    1         8

This is 'partial' indexing (e.g. I can use ':' to mean "give me everything for that level"):

In [29]: frame.loc[idx['a', :], 'Colorado']
Out[29]: 
color      Green
key1 key2       
a    1         2
     2         5

So giving a list as the entire input is an error; you can give a list for a single level (e.g. with multiple values).

I think this part is a bug:

In [22]: frame.loc[idx[['b', 1]], 'Colorado']
Out[22]: 
color      Green
key1 key2       
b    1         8
     2        11

@jreback added this to the Next Major Release milestone Jun 23, 2016
@jorisvandenbossche
Member

@jreback You don't need IndexSlice to access a single element; doing frame.loc[('b', 1), 'Colorado'] (using a tuple) is perfectly fine IMO?

I suppose the issue was raised (@jseabold correct me if I am wrong) because somebody wanted to do the above (and so should have used a tuple) but used a list, and the output was then a bit unexpected.

frame.loc[['b', 2], 'Colorado'] is interpreted (I think) like frame.loc[(['b', 2], slice(None)), 'Colorado'] or frame.loc[idx[['b', 2], :], 'Colorado'].
This is similar to frame.loc[['a', 'b'], 'Colorado'], which correctly gives you the full frame:

In [11]: frame.loc[['a', 'b'], 'Colorado']
Out[11]:
color      Green
key1 key2
a    1         2
     2         5
b    1         8
     2        11

So the issue is more about what to do when a list indexer on a MultiIndex contains labels that are not all present in the index.

You can also see this issue when using the more explicit IndexSlice:

In [17]: frame.loc[idx[['b', 'c'], :], 'Colorado']
Out[17]:
color      Green
key1 key2
b    1         8
     2        11

The 'c' is ignored in this case (for a single index, you would get a reindex operation introducing NaNs).
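
Independently of how loc treats the absent label, one way to surface it explicitly is to compare the requested labels against the level values before selecting. This is only a sketch: the names `wanted`, `missing`, `mask` and `subset` are illustrative, and the boolean mask is a swapped-in technique rather than what loc does internally.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
)
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

wanted = ['b', 'c']
key1 = frame.index.get_level_values('key1')

# Labels that were requested but do not occur in the first level.
missing = [label for label in wanted if label not in key1]

# Boolean mask over the rows whose first-level label was requested.
mask = key1.isin(wanted)
subset = frame.loc[mask]
```

A caller can then decide whether a non-empty `missing` list should be an error or simply be ignored.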

@mroeschke
Member

Looks like this example raises now, which makes sense to me. Could use a test.

In [52]: frame = pd.DataFrame(np.arange(12).reshape(( 4, 3)),
    ...:                   index =[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    ...:                   columns =[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
    ...: frame.index.names = ['key1', 'key2']
    ...: frame.columns.names = ['state', 'color']

In [53]: frame
Out[53]:
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

In [54]: frame.loc[['b', 2], 'Colorado']
KeyError: '[2] not in index'
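
A regression check for the behaviour reported here might look roughly like the sketch below. It assumes a pandas version in which the lookup raises, as in the session above; the function name is hypothetical and this is not necessarily the test that was eventually merged. In a pytest suite the try/except would normally be written with pytest.raises(KeyError).

```python
import numpy as np
import pandas as pd


def partial_list_indexer_raises():
    """Return True if a list indexer with an absent first-level label raises."""
    frame = pd.DataFrame(
        np.arange(12).reshape((4, 3)),
        index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
        columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
    )
    frame.index.names = ['key1', 'key2']
    frame.columns.names = ['state', 'color']
    try:
        # 2 is not a label in the first level (key1), so this should fail.
        frame.loc[['b', 2], 'Colorado']
    except KeyError:
        return True
    return False
```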

@mroeschke added the good first issue and Needs Tests (unit test(s) needed to prevent regressions) labels and removed the API Design, Error Reporting (incorrect or improved errors from pandas), and MultiIndex labels May 1, 2021
@mroeschke modified the milestones: Contributions Welcome, 1.3 May 15, 2021

5 participants