-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: Please make ".loc" return type depend on index, not on specific labels #9519
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I agree, this is messy. However, I don't think we can change the behavior of To get your desired behavior, you could use a list/array instead of a single value:
|
Sorry, I was not clear. I don't think the behavior of s.loc[2] should change in general, but just that it should depend on s.index.is_unique (i.e. taking care of what you rightly mention as a "special case"), not on the count of "2" in the index. (sorry again if this was obvious) |
Oh, okay. I could get behind that change. |
(great... by the way: it will also be coherent with
) |
If you want to give this a go, I believe it should suffice to change this line and the same corresponding lines in the other Instead of |
Ouch... I had wrongly assumed that each call to .loc was preceded by setting "is_unique". So the change I am proposing would make the first use of ".loc" on any given index cost O(log(N)) rather than O(N). Even considering that in unique indices it would then allow a speedup of ~ 2x for any subsequent call to ".loc", I doubt this is acceptable. |
@toobaz the perf impact or getting s single element is generally irrelevant and you pay the indexing cost on the first use always this API change is good idea but if u care about perf you shouldn't use duplicates in a single level of an index |
@jreback Sorry. I had missed the fact that the first call to .loc creates the hashtable anyway (through IndexEngine.contains). So forget my last message, I'll send a PR tonight. Just one question: shouldn't |
Not so easy as I hoped. I applied the suggested fixes and catched one more: The good/scaring part is that all (previously existing) tests pass. |
It is worth mentioning that In [2]: s = pd.Series(range(4),
index=pd.MultiIndex.from_tuples([(1,2), (1, 2), (3, 4), (5, 6)]))
In [3]: s.loc[1,2].shape
Out[3]: (2,)
In [4]: s.loc[3,4].shape
Out[4]: (1,) |
I already mentioned this in #9466 but I think it deserves its own bug report:
Quoting #5678 , "You are selecting out of a duplicated index Series. You could argue that you should get back another Series"
I really think life would be easier if s.loc[2] returned a Series of length one (and DataFrames and Panels behaved similarly). One is assumed to know (and can check in O(1)) if an index is unique, but maybe not if a given label is unique.
With higher dimensions structures it's even more messy because if e.g. .loc[lab_a, lab_b, lab_c] yields a lower dimension structure, but still a pandas structure, you have to find out which dimensions have been lost/kept (i.e. which of the labels were duplicates).
I don't think I have the skills to propose a PR, but I would volunteer to fix the broken tests.
The text was updated successfully, but these errors were encountered: