Discussion: Why is loc label-based slicing right-inclusive? #27059

ghost · 2019-06-26T17:43:02Z

I've searched the documentation and old issues and I can't find anything on the reason for this unusual choice. The original historic PR was #2922.

One user raised the question before in #14900 and aggressively shut down by jeff without an answer.

It has caused bugs in my code in the past, and I've seen it do the same for other people,
It is the cause of subtle bugs seen in the wild, #26959 (comment). It's inconsistency with python slice conventions is confusing to newbies who ask on SO but no explanation is given.

Every pandas tutorial has to mention this special case:

.loc includes the last value with slice notation. In other data containers such as Python lists, the last value is excluded.

users often need to slice something with closed='left' behavior, and try to add it in similar situations where it isn't available.

This behavior is fully documented. That's not the problem. For example,
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html says

Note that contrary to usual python slices, both the start and the stop are 
included, when present in the index! See "Slicing with labels".).

The "Slicing with labels" section documents the behavior, but gives no reason why the choice to break with python conventions was taken.

Another user who asked this question was given a workaround which is much more cumbersome compared to the convenience of .loc in #16571 (comment).

He points to DataFrame.between_time, which has kwds for requesting this behavior, but infuriatingly, accepts only times and not datetimes.

A few related cookbook entries were added
https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#dataframes
which explains

There are 2 explicit slicing methods, with a third general case

    Positional-oriented (Python slicing style : exclusive of end)
    Label-oriented (Non-Python slicing style : inclusive of end)

this again documents how loc works, but again offers no reason why pandas originally broke with python conventions.

In every case I've seen someone ask "why is label-based slicing right-inclusive?" the answer has always been "because it's label based, not position based", which doesn't really explain anything.

The same issue exists with string based partial time slicing such like df.loc[:"2018"],
which will include rows with 2018-01-01 prefix.

So, over the years several people have found this undesirable and/or have tried to find out why, but with an hour's worth of gooling and reading, I can't find an explanation ever have been given.

I'm not saying there's no good reason, I understand that indexing can get complicated, and mixed cases are tricky, etc'. But I want to understand why this was necessary or desirable, and why making .loc pythonic was unfeasible.

I'll be opening a very small POC PR for discussion in a few minutes, which adds a new indexer called locs. ~~It is far from complete, but it seems to do exactly what I want for the single-index case, in what I've tried so far.~~ it passes all the equivalent tests loc does.

To summarize:

it's been asked before, but not answered.
the behavior is documented, but is surprising and never explained.
It's not immediately obvious why it's impossible to implement the pythonic version.

So I ask, why is .loc right-inclusive?

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2019-06-26T18:21:17Z

Just FYI, before you put too much time into this, I don't think this is likely something we'll change.

…

On Wed, Jun 26, 2019 at 12:43 PM pilkibun ***@***.***> wrote: I've searched the documentation and old issues and I can't find anything on the reason for this unusual choice. The original historic PR was #2922 <#2922>. One user raised the question before in #14900 <#14900> and aggressively shut down by jeff without an answer. It has caused bugs in my code in the past, and I've seen it do the same for other people, It is the cause of subtle bugs seen in the wild <https://gist.github.com/betatim/c59039682d92fab89859358e8c585313>. It's inconsistency with python slice conventions is confusing to newbies who ask on SO <https://stackoverflow.com/questions/55187559/why-is-loc-slicing-in-pandas-inclusive-of-stop-contrary-to-typical-python-slic> but no explanation is given. users often need to slice something with closed='left' behavior, and try to add it <#12398> in similar situations where it isn't available. This behavior is fully documented. That's not the problem. For example, https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html says Note that contrary to usual python slices, both the start and the stop are included, when present in the index! See "Slicing with labels".). The "Slicing with labels" section documents the behavior, but gives no reason why the choice to break with python conventions was taken. Another user who asked this question was given a workaround which is much more cumbersomen compared to the convenience of .loc in #16571 (comment) <#16571 (comment)> . He points to DataFrame.between_time <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.between_time.html>, which has kwds for requesting this behavior, but infuriatingly, accepts only times and not datetimes. A few related cookbook entries were added https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#dataframes which explains There are 2 explicit slicing methods, with a third general case Positional-oriented (Python slicing style : exclusive of end) Label-oriented (Non-Python slicing style : inclusive of end) this again documents how loc works, but again offers no reason why pandas originally broke with python conventions. In every case I've seen someone ask "why is label-based slicing right-inclusive?" the answer has always been "because it's label based, not position based", which doesn't really explain anything. The same issue exists with string based partial time slicing such like df.loc[:"2018"], which will include rows with 2018-01-01 prefix. So, over the years several people have found this undesirable and/or have tried to find out why, but with an hour's worth of gooling and reading, I can't find an explanation ever have been given. I'm not saying there's no good reason, I understand that indexing can get complicated, and mixed cases are tricky, etc'. But I want to understand why this was necessary or desirable, and why making .loc pythonic was unfeasible. I'll be opening a very small POC PR for discussion in a few minutes, which adds a new indexer called locs. It is far from complete, but it seems to do exactly what I want for the single-index case, in what I've tried so far. To summarize: - it's been asked before, but not answered. - the behavior is documented, but is surprising and never explained. - It's not immediately obvious why it's impossible to implement the pythonic version. So I ask, why is .loc right-inclusive? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#27059?email_source=notifications&email_token=AAKAOIV2DMWZUM2TBBTPMRTP4OTDDA5CNFSM4H3USRXKYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4G332O3Q>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAKAOIUAVUZDN3CNUN34KH3P4OTDDANCNFSM4H3USRXA> .

jreback · 2019-06-26T18:29:09Z

-1 on this. This has long been standard in pandas. Also very likely impossible to change in a non-breaking manner (even if this was desired).

shoyer · 2019-06-27T07:38:46Z

I recall reading the claim that this is preferable because with that with label-based slicing, you can't always predict what the next element would be. For lists with elements in no particular order, this sort of makes sense, e.g., consider a DataFrame with columns ['one', 'two', 'three', 'a', 'b', 'c]. Arguably it is clearer to write df.loc[:, 'one':'three'] then df.loc[:, 'one':'a'].

That said, I don't think label based slicing is very useful for non-sorted indexes, because there's no guarantee about what exists between "one" and "three". And with sorted indexes, this argument goes away: it's always straightforward to construct an element larger than any given element, even if it's less trivial than i+1 with integers.

So I agree that this was a design mistake (one that I have regretted copying in xarray), but I also think it's too late to fix it in pandas.

I don't think the marginally better behavior would be worth the trouble of breaking all existing uses of slice based indexing with .loc, nor is it worth the confusion of adding another almost identical but subtly different indexing API like .locs.

TomAugspurger · 2019-06-27T12:08:22Z

Well said. We could *consider* deprecating this kind of slicing for non-sorted indexes, but even that may not be worth the churn on users.

…

On Thu, Jun 27, 2019 at 2:38 AM Stephan Hoyer ***@***.***> wrote: I recall reading the claim that this is preferable because with that with label-based slicing, you can't always predict what the next element would be. For lists with elements in no particular order, this sort of makes sense, e.g., consider a DataFrame with columns ['one', 'two', 'three', 'a', 'b', 'c]. Arguably it is clearer to write df.loc[:, 'one':'three'] then df.loc[:, 'one':'a']. That said, I don't think label based slicing is very useful for non-sorted indexes, because there's no guarantee about what exists *between* "one" and "three". And with sorted indexes, this argument goes away: it's always straightforward to construct an element larger than any given element, even if it's less trivial than i+1 with integers. So I agree that this was a design mistake (one that I have regretted copying in xarray), but I also think it's too late to fix it in pandas. I don't think the marginally better behavior would be worth the trouble of breaking all existing uses of slice based indexing with .loc, nor is it worth the confusion of adding another almost identical but subtly different indexing API like .locs. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#27059?email_source=notifications&email_token=AAKAOIRF7H3OS7TFHVCJBYLP4RVA5A5CNFSM4H3USRXKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODYWHRQA#issuecomment-506230976>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAKAOITNYWTZG4ZWM6UQMXDP4RVA5ANCNFSM4H3USRXA> .

ghost · 2019-06-28T20:09:03Z

Thanks for responding. I did not, anywhere, suggest loc should be deprecated or changed.

Thanks @shoyer for trying to explain the reasoning. I believe I've found your reference, hidden on page 36 of the documentation section about MultiIndex, It isn't linked from the primary discussion of loc indexing. I'll fix that in a PR.

I don't think the marginally better behavior

not better - equally valid. The reasoning given in the documentation boils down to 2 reasons. The first doesn't hold for sorted indexes as you've explained. The second claims that closed intervals are a common use case, so loc supports that case. But, half-closed intervals are just as common, if not more so. That doesn't convince either.

The conclusion is it might have easily gone the other way, and out of the two options, the unpythonic variant was picked up.

There's also no essential technical reason why a half-closed indexer couldn't work. The POC (updated yesterday) is tiny, and now passes the full loc test suite. This is seemingly just an arbitrary decision to support one use case and not another. And it's the arbitrariness that makes it so irritating.

In retrospect, I don't understand why the question was framed as having to choose between one or the other in the first place. These are two very similar and equally common scenarios, yet one is cleanly supported while the other necessitates "code noise".

If loc hadn't been put forth as "the way it's done in pandas" originally, I'm not sure there'd be such resistance to the idea of supporting both cases.

Hell, all 4 cases. Several recent pandas additions, like Interval and Rolling, support a "closed" argument. Why does it make sense to treat all 4 cases as equally important in other contexts, but not in the essential context of slicing pandas objects?

"We can't do anything about it because it's been that way for years" is a really disappointing response.

You don't like the idea of adding a new near-identical indexer or three. ok. How about a more minimalist solution, which can undo some of the damage by reusing a python staple name? by miracle, the identifier is still unclaimed in the Series/DataFrame namespace.

df.range('2011','2012') # closed='left', like python's range()
df.range('2011','2012',2) # same, with step
df.range('2011','2012','both')  # the 3rd arg can be an int or a string. for conciseness.
df.range('2011','2012', 2, 'right') 
df.range('2011','2012',closed='neither')
dt.T.range(...).T # slicing columns

It's working right now (modulo locs name change).

def _slice(self,start=None,end=None, *args, **kwds): 
    """A minimalist, pythonic, function for slicing a Pandas-Object
    
    Parameters
    ----------
    start: optional, default None
    end: optional, default None
    step: integer, optional, default 1
    closed: str
        'left'/'right'/'both'/'neither'     
    """
    closed_mapping = {
        'left':'left',
        'right':'right',
        'both':'both',
        'neither':'neither',
            '[)':'left',
            '(]':'right',
            '[]':'both',
            '()':'neither',
            } 
    if args and isinstance(args[0],int):
        kwds['step'],args =  args[0],args[1:]
        if args and isinstance(args[0],str):
            kwds['closed'],args =  args[0],args[1:]        
    elif args and isinstance(args[0],str): 
        kwds['closed'],args =  args[0],args[1:] 
        if args and isinstance(args[0],int):
            assert 'step' not in kwds
            kwds['step'],args =  args[0],args[1:]
    elif args:
        raise ValueError("Unrecogonized position argument {arg0}".format(arg0=args[0]))
    step =  kwds.get('step')
    closed = kwds['closed'] = closed_mapping.get(kwds.get('closed'), kwds['closed'])
    assert closed in ['left', 'right', 'both', 'neither']
    
    if closed == 'both':
        return self.loc[slice(start,end,step)]
    if closed == 'left':
        return self.loc_left[slice(start,end,step)]
    if closed == 'right':
        return self.loc[slice(None,end,step)][::-1].loc_left[:start][::-1]
    if closed == 'neither':
        return self.loc_left[slice(None,end,step)][::-1].loc_left[:start][::-1]       
 
DataFrame.slice=_slice

shoyer · 2019-06-28T21:00:28Z

Thanks @shoyer for trying to explain the reasoning. I believe I've found your reference, hidden on page 36 of the documentation section about MultiIndex, It isn't linked from the primary discussion of loc indexing. I'll fix that in a PR.

Thanks for surfacing this doc!

I have to say that the specific use case of "[limiting] a time series to start and end at two specific dates" a little ironic, because this is a perfect example of cleaner code with exclusive endpoint slicing: Pandas has a bunch of messy code specifically to ensure string slicing in datetime index like df.loc['2019-01':'2019-12'] is interpreted like df.loc['2019-01-01T00:00:00':'2019-12-31T23:59:59'] rather than like df.loc['2019-01-01T00:00:00':'2019-12-01T00:00:00']. This would entirely go away if slicing was exclusive of endpoints -- you could just write df.loc['2019-01':'2020-01'], with both datetimes being expanded to the first day/hour/minute.

A dedicated method for slicing that could add these option seems like a reasonable sort of thing to support, though we might call it DataFrame.slice rather than DataFrame.range. The lack of an ability to include optional/keyword argument is definitely a downside to using getitem [] style indexing rather than a method call.

In xarray, our sel method lets us refer to axes by name, which also lets us support approximate look-ups, e.g., ds.sel(x=slice(0, 5), method='nearest'). Here we could (and probably should) add an closed option for specifying how slices work. So another option would be to implement .sel() in pandas, e.g., df.sel(index=slice(0, 5), columns=slice('a', 'b'), closed='left').

ghost · 2019-07-01T17:15:16Z

It's heartening to have some support in this, thanks!. One thing that I like about the name range is that it makes choosing closed='left' as the default behavior (which I would like) loc more palatable,
since if it differs from loc's.

Adding a new indexing method is a tough sell, and choosing a different default behavior (even pythonic), without either - a naming convention (`loc_left`, `loc_right`, `loc_both`, `loc_neither`) OR - a well-known relative to imitate (python's `range`)
as justification for the difference, would make it an even tougher sell. Since having four (additional) variants of loc seems like a non-starter, I think the name range has its merits.

ghost · 2019-07-04T16:27:17Z

I don't use python's slice directly that often but on second thought that is better choice. I've updated the PR overview accordingly. Is there support for this from additional team members?

ghost · 2019-07-08T22:38:54Z

Updated the PR with initial implementation of slice. Very sad that the pandas team won't engage, this is meant to benefit the users, after all.

TomAugspurger · 2019-07-08T22:51:31Z

Personally, I’ve been busy with the US holiday and the 0.25 release. Comments like that make me less likely to review.

ghost · 2019-07-08T23:05:49Z

That's totally understandable. You are not the only team member though, and no one so far has.

ghost · 2019-07-12T23:33:16Z

The PR has been updated with a (cleaner) slice method and is now tested. The cases for 'both'/'left' are supported. The 'neither'/'right' cases will require more work which I'm disinclined to do without some indication that the pandas team will consider merging.

ghost · 2019-08-08T18:51:20Z

I've closed the PR for lack of interest. And I've suggested a new alternative: migrating loc to become a function. This can be done with minimal disruption by repurposing loc.__call__ #27363 (comment).

jreback · 2020-01-01T15:58:48Z

closing. this is a feature of pandas and not likely to change. if more documentation is desired would be ok, but this would be a major disruption with relatively little upside.

ghost mentioned this issue Jun 26, 2019

POC: Implement pythonic label-based index slicing NDFrame.slice(closed='left') #27060

Closed

3 tasks

ghost mentioned this issue Jun 29, 2019

DOC: add internal links to "Endpoints are inclusive" section #27131

Merged

simonjayhawkins added API Design Indexing Related to indexing on series/frames, not to indexes themselves labels Aug 1, 2019

jreback added this to the No action milestone Jan 1, 2020

jreback closed this as completed Jan 1, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discussion: Why is loc label-based slicing right-inclusive? #27059

Discussion: Why is loc label-based slicing right-inclusive? #27059

ghost commented Jun 26, 2019 •

edited by ghost

Loading

TomAugspurger commented Jun 26, 2019 via email

jreback commented Jun 26, 2019

shoyer commented Jun 27, 2019

TomAugspurger commented Jun 27, 2019 via email

ghost commented Jun 28, 2019 •

edited by ghost

Loading

shoyer commented Jun 28, 2019

ghost commented Jul 1, 2019 •

edited by ghost

Loading

ghost commented Jul 4, 2019 •

edited by ghost

Loading

ghost commented Jul 8, 2019 •

edited by ghost

Loading

TomAugspurger commented Jul 8, 2019

ghost commented Jul 8, 2019

ghost commented Jul 12, 2019 •

edited by ghost

Loading

ghost commented Aug 8, 2019 •

edited by ghost

Loading

jreback commented Jan 1, 2020

Discussion: Why is loc label-based slicing right-inclusive? #27059

Discussion: Why is loc label-based slicing right-inclusive? #27059

Comments

ghost commented Jun 26, 2019 • edited by ghost Loading

TomAugspurger commented Jun 26, 2019 via email

jreback commented Jun 26, 2019

shoyer commented Jun 27, 2019

TomAugspurger commented Jun 27, 2019 via email

ghost commented Jun 28, 2019 • edited by ghost Loading

shoyer commented Jun 28, 2019

ghost commented Jul 1, 2019 • edited by ghost Loading

ghost commented Jul 4, 2019 • edited by ghost Loading

ghost commented Jul 8, 2019 • edited by ghost Loading

TomAugspurger commented Jul 8, 2019

ghost commented Jul 8, 2019

ghost commented Jul 12, 2019 • edited by ghost Loading

ghost commented Aug 8, 2019 • edited by ghost Loading

jreback commented Jan 1, 2020

ghost commented Jun 26, 2019 •

edited by ghost

Loading

ghost commented Jun 28, 2019 •

edited by ghost

Loading

ghost commented Jul 1, 2019 •

edited by ghost

Loading

ghost commented Jul 4, 2019 •

edited by ghost

Loading

ghost commented Jul 8, 2019 •

edited by ghost

Loading

ghost commented Jul 12, 2019 •

edited by ghost

Loading

ghost commented Aug 8, 2019 •

edited by ghost

Loading