Skip to content

API: Please make ".loc" return type depend on index, not on specific labels #9519

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
toobaz opened this issue Feb 19, 2015 · 10 comments
Open
Labels
Enhancement Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@toobaz
Copy link
Member

toobaz commented Feb 19, 2015

I already mentioned this in #9466 but I think it deserves its own bug report:

In [2]: s = pd.Series([1, 2, 3], index=(1,1,2))

In [3]: s
Out[3]: 
1    1
1    2
2    3
dtype: int64

In [4]: s.loc[1]
Out[4]: 
1    1
1    2
dtype: int64

In [5]: type(s.loc[1])
Out[5]: pandas.core.series.Series

In [6]: s.loc[2]
Out[6]: 3

In [7]: type(s.loc[2])
Out[7]: numpy.int64

Quoting #5678 , "You are selecting out of a duplicated index Series. You could argue that you should get back another Series"

I really think life would be easier if s.loc[2] returned a Series of length one (and DataFrames and Panels behaved similarly). One is assumed to know (and can check in O(1)) if an index is unique, but maybe not if a given label is unique.

With higher dimensions structures it's even more messy because if e.g. .loc[lab_a, lab_b, lab_c] yields a lower dimension structure, but still a pandas structure, you have to find out which dimensions have been lost/kept (i.e. which of the labels were duplicates).

I don't think I have the skills to propose a PR, but I would volunteer to fix the broken tests.

@shoyer
Copy link
Member

shoyer commented Feb 19, 2015

I agree, this is messy.

However, I don't think we can change the behavior of s.loc[2]. The location based indexer primarily exists for pulling out individual values -- indexes with duplicate values are a bit of a special case. Changing this would break a huge amount of existing code.

To get your desired behavior, you could use a list/array instead of a single value:

In [10]: s.loc[[1]]
Out[10]:
1    1
1    2
dtype: int64

In [11]: s.loc[[2]]
Out[11]:
2    3
dtype: int64

@toobaz
Copy link
Member Author

toobaz commented Feb 19, 2015

Sorry, I was not clear. I don't think the behavior of s.loc[2] should change in general, but just that it should depend on s.index.is_unique (i.e. taking care of what you rightly mention as a "special case"), not on the count of "2" in the index.

(sorry again if this was obvious)

@shoyer
Copy link
Member

shoyer commented Feb 19, 2015

Oh, okay. I could get behind that change.

@toobaz
Copy link
Member Author

toobaz commented Feb 19, 2015

(great... by the way: it will also be coherent with

In [2]: type(Series([1,2,3], index=MultiIndex.from_tuples([('A', 'a'), ('A', 'b'), ('C', 'a')]))['C'])
Out[2]: pandas.core.series.Series

)

@shoyer
Copy link
Member

shoyer commented Feb 19, 2015

If you want to give this a go, I believe it should suffice to change this line and the same corresponding lines in the other _maybe_get_bool_indexer methods in that file: https://github.com/pydata/pandas/blob/bceb342d76ca4061fc8d3a1089bc4e83a32da656/pandas/index.pyx#L202

Instead of return last_true, you want return slice(last_true, last_true + 1). Using a slice with a single item is just as fast as indexing with an integer, but doesn't remove a dimension.

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves API Design labels Feb 19, 2015
@jreback jreback added this to the 0.16.0 milestone Feb 19, 2015
@toobaz
Copy link
Member Author

toobaz commented Feb 19, 2015

Ouch... I had wrongly assumed that each call to .loc was preceded by setting "is_unique".

So the change I am proposing would make the first use of ".loc" on any given index cost O(log(N)) rather than O(N). Even considering that in unique indices it would then allow a speedup of ~ 2x for any subsequent call to ".loc", I doubt this is acceptable.

@jreback
Copy link
Contributor

jreback commented Feb 19, 2015

@toobaz the perf impact or getting s single element is generally irrelevant
if you need to do this multiple times you would do it as a list-like (or Boolean) selection which has much better perf guarantees

and you pay the indexing cost on the first use always
we always determine uniqueness via a hashtable
indexing into a duplicate indexing is almost always O(n)

this API change is good idea but if u care about perf you shouldn't use duplicates in a single level of an index

@toobaz
Copy link
Member Author

toobaz commented Feb 19, 2015

@jreback Sorry. I had missed the fact that the first call to .loc creates the hashtable anyway (through IndexEngine.contains). So forget my last message, I'll send a PR tonight.

Just one question: shouldn't
https://github.com/pydata/pandas/blob/bceb342d76ca4061fc8d3a1089bc4e83a32da656/pandas/index.pyx#L272
have one less indentation level?

@toobaz
Copy link
Member Author

toobaz commented Feb 19, 2015

Not so easy as I hoped. I applied the suggested fixes and catched one more:
https://github.com/toobaz/pandas/tree/nonuniqueslices
... they are sufficient for .loc[label], but not for .loc[:,label], which passes to the IndexEngine an already filtered index. I won't have much time to investigate in the next few days.

The good/scaring part is that all (previously existing) tests pass.

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 5, 2015
@toobaz
Copy link
Member Author

toobaz commented Oct 1, 2017

It is worth mentioning that MultiIndexes already work as desired:

In [2]: s = pd.Series(range(4),
                      index=pd.MultiIndex.from_tuples([(1,2), (1, 2), (3, 4), (5, 6)]))

In [3]: s.loc[1,2].shape
Out[3]: (2,)

In [4]: s.loc[3,4].shape
Out[4]: (1,)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

No branches or pull requests

4 participants