ERR: reindexed non-included labels on a multiindex are dropped #7886

Closed
jreback opened this issue Jul 31, 2014 · 17 comments
Labels: Docs; Error Reporting (incorrect or improved errors from pandas); Indexing (related to indexing on series/frames, not to indexes themselves); MultiIndex
Comments

@jreback
Contributor

jreback commented Jul 31, 2014

related #4088
related #7867

I think this should raise, as it is not clear how this should work (e.g. should you get all the other levels set to NaN?)

or see my comment below, maybe just document?

In [11]: s = pd.Series(np.arange(9),index=pd.MultiIndex.from_product([['A','B','C'],['foo','bar','baz']],names=['one','two'])).sortlevel()

In [12]: s
Out[12]: 
one  two
A    bar    1
     baz    2
     foo    0
B    bar    4
     baz    5
     foo    3
C    bar    7
     baz    8
     foo    6
dtype: int64

In [13]: s.reindex(['A','B','D'],level=0)
Out[13]: 
one  two
A    bar    1
     baz    2
     foo    0
B    bar    4
     baz    5
     foo    3
dtype: int64
@jreback jreback added this to the 0.15.0 milestone Jul 31, 2014
@jreback
Contributor Author

jreback commented Jul 31, 2014

cc @immerrr

@immerrr
Contributor

immerrr commented Jul 31, 2014

FWIW, I agree that it's better to refuse the temptation to guess and raise in such cases.

@cpcloud
Member

cpcloud commented Jul 31, 2014

me three, this is a weird operation

@jreback
Contributor Author

jreback commented Jul 31, 2014

though by virtue of fixing #7866

s.loc[['A','B','D']] gives the same result

even though these are basically the same type of operation, conceptually

.loc is selection, while reindex is making an index the same

so maybe just doc this?

@jreback jreback added the Docs label Jul 31, 2014
@immerrr
Contributor

immerrr commented Jul 31, 2014

I think incomplete missing keys must raise in any case: there's not enough information to insert new labels and there's no data to be retrieved with those (well, save for searchsorted-like lookups)

For loc there's a chance to provide missing keys in full: s.loc[['A', 'B', ('D', 0)]]. That also looks weird, but not unthinkable, IMO. For reindex there's just no such possibility.

@jreback
Contributor Author

jreback commented Jul 31, 2014

hmm, so @immerrr do you disagree with #7866 then?

@immerrr
Contributor

immerrr commented Jul 31, 2014

As of now, yes, I don't see anything broken there to be fixed.

@jreback
Contributor Author

jreback commented Jul 31, 2014

@immerrr
Contributor

immerrr commented Jul 31, 2014

I'd go for a boolean mask, like you proposed there. To me that sounds closer to the problem definition: find all rows where A is one of the following.

In fact, it would probably be nice to have a level= kwarg in the Index.isin method, as part of a unified API, both for saving some keystrokes and for optimization potential (as in look up levels once and match labels afterwards).
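A minimal sketch of the boolean-mask approach described above, using the Series from the top of the thread. It assumes a pandas release in which `isin` on a MultiIndex accepts the `level=` keyword proposed here, and it uses `sort_index()` in place of the older `sortlevel()`. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the example Series from the top of the thread.
index = pd.MultiIndex.from_product(
    [["A", "B", "C"], ["foo", "bar", "baz"]], names=["one", "two"]
)
s = pd.Series(np.arange(9), index=index).sort_index()

wanted = ["A", "B", "D"]  # "D" does not occur in level "one"

# Spelled out: extract the level's values, then test membership.
mask_explicit = s.index.get_level_values("one").isin(wanted)

# Same mask via the level= keyword (assumes a pandas version that has it).
mask_level = s.index.isin(wanted, level="one")

# Selecting with the mask keeps the rows under "A" and "B";
# the absent label "D" simply matches nothing.
selected = s[mask_level]
print(selected)
```

With a mask, a label that is not in the index is a non-match rather than an error or a silently dropped key, which is the "find all rows where the level is one of the following" reading.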

@jreback
Contributor Author

jreback commented Jul 31, 2014

hmm, I like that idea for isin. want to create an issue? pull-request welcome for that as well.

@immerrr
Contributor

immerrr commented Jul 31, 2014

Or maybe even Index.isin({'A': set(['foo', 'bar']), 'B': set(['baz', 'qux'])})

@jreback
Contributor Author

jreback commented Jul 31, 2014

Index.isin([set(['foo', 'bar']), set(['baz', 'qux'])], level=['A','B'])

is more consistent (with how we use set_names/levels/labels). I mean we could accept the dict, but should do that separately
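The multi-level form suggested above can be emulated by combining one mask per level. This is only a sketch: `isin_levels` is a hypothetical helper, not a pandas function, and it assumes the intended semantics are a logical AND across levels (a row matches only if every listed level matches its own set of values). The example index is made up for illustration.

```python
import numpy as np
import pandas as pd


def isin_levels(index, values_per_level, levels):
    """Hypothetical helper (not a pandas API): AND together per-level isin masks."""
    if len(values_per_level) != len(levels):
        raise ValueError("need exactly one collection of values per level")
    mask = np.ones(len(index), dtype=bool)
    for values, level in zip(values_per_level, levels):
        # Membership test against a single level of the MultiIndex.
        mask &= index.get_level_values(level).isin(list(values))
    return mask


# Illustrative index whose level names match the 'A' and 'B' used above.
index = pd.MultiIndex.from_product(
    [["foo", "bar", "zap"], ["baz", "qux", "zip"]], names=["A", "B"]
)

mask = isin_levels(index, [{"foo", "bar"}, {"baz", "qux"}], levels=["A", "B"])
matched = index[mask]
print(list(matched))
```

The dict spelling from the previous comment could be layered on top by passing `list(d.values())` and `list(d.keys())` to the same helper.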

@immerrr
Contributor

immerrr commented Jul 31, 2014

Yup, that last one is probably me getting too carried away with syntax sugar.

@jreback
Contributor Author

jreback commented Jul 31, 2014

ok, if you'd open a new issue (for the isin enhancement), that would be great. I'll close this one.

have to think about reverting #7866 though (I agree it's a bit of a stretch, but it IS convenient)

@immerrr
Contributor

immerrr commented Jul 31, 2014

Ok, done

@immerrr
Contributor

immerrr commented Jul 31, 2014

Speaking of not enough information, I remembered that there's some kind of "variable length" multiindex emulation with empty string keys:

In [70]: df
Out[70]: 
    a   b
    1   1
0   0   3
1   6   9
2  12  15
3  18  21
4  24  27

In [71]: df.loc[:, ('c','')] = 100.

In [72]: df
Out[72]: 
    a   b    c
    1   1     
0   0   3  100
1   6   9  100
2  12  15  100
3  18  21  100
4  24  27  100

In [73]: df['c']
Out[73]: 
0    100
1    100
2    100
3    100
4    100
Name: c, dtype: float64

I'm not sure how it works across the library, though; it was so slow that we didn't even consider it. But I suppose it can be made to work nicely (e.g. change empty string to nan/nat to include numeric and datetime indices, optimize here and there) and then incomplete missing keys would be ok.
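A self-contained sketch of the empty-string key pattern shown in the session above. The construction of `df` is an assumption, since the original session does not show how the frame was built; the data values are copied from the printed output and the second-level labels are taken to be the string "1".

```python
import pandas as pd

# Two-level columns similar to the frame printed above
# (assumed construction; the second-level labels are strings here).
columns = pd.MultiIndex.from_tuples([("a", "1"), ("b", "1")])
df = pd.DataFrame(
    {("a", "1"): [0, 6, 12, 18, 24], ("b", "1"): [3, 9, 15, 21, 27]},
    columns=columns,
)

# Add a column whose second-level key is the empty string.
df[("c", "")] = 100.0

# Selecting by the first-level key alone reaches the new column.
selected = df["c"]
print(selected)
```

The empty string stands in for a missing second-level label here, which is what makes the column addressable by its first-level key alone.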

@jreback
Contributor Author

jreback commented Jul 31, 2014

I think this is actually a candidate for adding to a MultiIndex (maybe via an attribute or something).

Separate issue though.
