API: Please make ".loc" return type depend on index, not on specific labels #9519

toobaz · 2015-02-19T07:28:54Z

I already mentioned this in #9466 but I think it deserves its own bug report:

In [2]: s = pd.Series([1, 2, 3], index=(1,1,2))

In [3]: s
Out[3]: 
1    1
1    2
2    3
dtype: int64

In [4]: s.loc[1]
Out[4]: 
1    1
1    2
dtype: int64

In [5]: type(s.loc[1])
Out[5]: pandas.core.series.Series

In [6]: s.loc[2]
Out[6]: 3

In [7]: type(s.loc[2])
Out[7]: numpy.int64

Quoting #5678 , "You are selecting out of a duplicated index Series. You could argue that you should get back another Series"

I really think life would be easier if s.loc[2] returned a Series of length one (and DataFrames and Panels behaved similarly). One is assumed to know (and can check in O(1)) if an index is unique, but maybe not if a given label is unique.

With higher dimensions structures it's even more messy because if e.g. .loc[lab_a, lab_b, lab_c] yields a lower dimension structure, but still a pandas structure, you have to find out which dimensions have been lost/kept (i.e. which of the labels were duplicates).

I don't think I have the skills to propose a PR, but I would volunteer to fix the broken tests.

The text was updated successfully, but these errors were encountered:

shoyer · 2015-02-19T09:27:23Z

I agree, this is messy.

However, I don't think we can change the behavior of s.loc[2]. The location based indexer primarily exists for pulling out individual values -- indexes with duplicate values are a bit of a special case. Changing this would break a huge amount of existing code.

To get your desired behavior, you could use a list/array instead of a single value:

In [10]: s.loc[[1]]
Out[10]:
1    1
1    2
dtype: int64

In [11]: s.loc[[2]]
Out[11]:
2    3
dtype: int64

toobaz · 2015-02-19T09:34:09Z

Sorry, I was not clear. I don't think the behavior of s.loc[2] should change in general, but just that it should depend on s.index.is_unique (i.e. taking care of what you rightly mention as a "special case"), not on the count of "2" in the index.

(sorry again if this was obvious)

shoyer · 2015-02-19T09:50:12Z

Oh, okay. I could get behind that change.

toobaz · 2015-02-19T09:53:10Z

(great... by the way: it will also be coherent with

In [2]: type(Series([1,2,3], index=MultiIndex.from_tuples([('A', 'a'), ('A', 'b'), ('C', 'a')]))['C'])
Out[2]: pandas.core.series.Series

)

shoyer · 2015-02-19T10:01:36Z

If you want to give this a go, I believe it should suffice to change this line and the same corresponding lines in the other _maybe_get_bool_indexer methods in that file: https://github.com/pydata/pandas/blob/bceb342d76ca4061fc8d3a1089bc4e83a32da656/pandas/index.pyx#L202

Instead of return last_true, you want return slice(last_true, last_true + 1). Using a slice with a single item is just as fast as indexing with an integer, but doesn't remove a dimension.

toobaz · 2015-02-19T11:38:52Z

Ouch... I had wrongly assumed that each call to .loc was preceded by setting "is_unique".

So the change I am proposing would make the first use of ".loc" on any given index cost O(log(N)) rather than O(N). Even considering that in unique indices it would then allow a speedup of ~ 2x for any subsequent call to ".loc", I doubt this is acceptable.

jreback · 2015-02-19T11:53:32Z

@toobaz the perf impact or getting s single element is generally irrelevant
if you need to do this multiple times you would do it as a list-like (or Boolean) selection which has much better perf guarantees

and you pay the indexing cost on the first use always
we always determine uniqueness via a hashtable
indexing into a duplicate indexing is almost always O(n)

this API change is good idea but if u care about perf you shouldn't use duplicates in a single level of an index

toobaz · 2015-02-19T12:48:22Z

@jreback Sorry. I had missed the fact that the first call to .loc creates the hashtable anyway (through IndexEngine.contains). So forget my last message, I'll send a PR tonight.

Just one question: shouldn't
https://github.com/pydata/pandas/blob/bceb342d76ca4061fc8d3a1089bc4e83a32da656/pandas/index.pyx#L272
have one less indentation level?

toobaz · 2015-02-19T21:22:23Z

Not so easy as I hoped. I applied the suggested fixes and catched one more:
https://github.com/toobaz/pandas/tree/nonuniqueslices
... they are sufficient for .loc[label], but not for .loc[:,label], which passes to the IndexEngine an already filtered index. I won't have much time to investigate in the next few days.

The good/scaring part is that all (previously existing) tests pass.

toobaz · 2017-10-01T07:47:50Z

It is worth mentioning that MultiIndexes already work as desired:

In [2]: s = pd.Series(range(4),
                      index=pd.MultiIndex.from_tuples([(1,2), (1, 2), (3, 4), (5, 6)]))

In [3]: s.loc[1,2].shape
Out[3]: (2,)

In [4]: s.loc[3,4].shape
Out[4]: (1,)

jreback added Indexing Related to indexing on series/frames, not to indexes themselves API Design labels Feb 19, 2015

jreback added this to the 0.16.0 milestone Feb 19, 2015

jreback modified the milestones: 0.16.0, Next Major Release Mar 5, 2015

samueljohn mentioned this issue Nov 16, 2015

.loc on DataFrame returning coerced dtype for single rows #11617

Closed

toobaz mentioned this issue Jun 4, 2018

DataFrame.[key] and .loc[:, key] with non-unique columns returns frame if (and only if) columns are MultiIndex #21309

Open

toobaz mentioned this issue Jul 3, 2018

BUG: fix DataFrame.__getitem__ and .loc with non-list listlikes #21313

Merged

5 tasks

toobaz mentioned this issue Aug 3, 2018

Multi-Index .loc with nlevels > 2 fails when nlevels - i > 1 levels are specified in a tuple. #22151

Closed

jreback mentioned this issue Jan 28, 2020

loc on Series returning scalar when result size is 1 #31381

Closed

jorisvandenbossche mentioned this issue Apr 7, 2020

API: require scalar result from the scalar .at indexer #33153

Open

mroeschke added Enhancement and removed API Design labels Apr 12, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API: Please make ".loc" return type depend on index, not on specific labels #9519

API: Please make ".loc" return type depend on index, not on specific labels #9519

toobaz commented Feb 19, 2015

shoyer commented Feb 19, 2015

toobaz commented Feb 19, 2015

shoyer commented Feb 19, 2015

toobaz commented Feb 19, 2015

shoyer commented Feb 19, 2015

toobaz commented Feb 19, 2015

jreback commented Feb 19, 2015

toobaz commented Feb 19, 2015

toobaz commented Feb 19, 2015

toobaz commented Oct 1, 2017

API: Please make ".loc" return type depend on index, not on specific labels #9519

API: Please make ".loc" return type depend on index, not on specific labels #9519

Comments

toobaz commented Feb 19, 2015

shoyer commented Feb 19, 2015

toobaz commented Feb 19, 2015

shoyer commented Feb 19, 2015

toobaz commented Feb 19, 2015

shoyer commented Feb 19, 2015

toobaz commented Feb 19, 2015

jreback commented Feb 19, 2015

toobaz commented Feb 19, 2015

toobaz commented Feb 19, 2015

toobaz commented Oct 1, 2017