Skip to content

Discussion: Why is loc label-based slicing right-inclusive? #27059

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ghost opened this issue Jun 26, 2019 · 14 comments
Closed

Discussion: Why is loc label-based slicing right-inclusive? #27059

ghost opened this issue Jun 26, 2019 · 14 comments
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@ghost
Copy link

ghost commented Jun 26, 2019

I've searched the documentation and old issues and I can't find anything on the reason for this unusual choice. The original historic PR was #2922.

One user raised the question before in #14900 and aggressively shut down by jeff without an answer.

It has caused bugs in my code in the past, and I've seen it do the same for other people,
It is the cause of subtle bugs seen in the wild, #26959 (comment). It's inconsistency with python slice conventions is confusing to newbies who ask on SO but no explanation is given.

Every pandas tutorial has to mention this special case:

.loc includes the last value with slice notation. In other data containers such as Python lists, the last value is excluded.

users often need to slice something with closed='left' behavior, and try to add it in similar situations where it isn't available.

This behavior is fully documented. That's not the problem. For example,
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html says

Note that contrary to usual python slices, both the start and the stop are 
included, when present in the index! See "Slicing with labels".).

The "Slicing with labels" section documents the behavior, but gives no reason why the choice to break with python conventions was taken.

Another user who asked this question was given a workaround which is much more cumbersome compared to the convenience of .loc in #16571 (comment).

He points to DataFrame.between_time, which has kwds for requesting this behavior, but infuriatingly, accepts only times and not datetimes.

A few related cookbook entries were added
https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#dataframes
which explains

There are 2 explicit slicing methods, with a third general case

    Positional-oriented (Python slicing style : exclusive of end)
    Label-oriented (Non-Python slicing style : inclusive of end)

this again documents how loc works, but again offers no reason why pandas originally broke with python conventions.

In every case I've seen someone ask "why is label-based slicing right-inclusive?" the answer has always been "because it's label based, not position based", which doesn't really explain anything.

The same issue exists with string based partial time slicing such like df.loc[:"2018"],
which will include rows with 2018-01-01 prefix.

So, over the years several people have found this undesirable and/or have tried to find out why, but with an hour's worth of gooling and reading, I can't find an explanation ever have been given.

I'm not saying there's no good reason, I understand that indexing can get complicated, and mixed cases are tricky, etc'. But I want to understand why this was necessary or desirable, and why making .loc pythonic was unfeasible.

I'll be opening a very small POC PR for discussion in a few minutes, which adds a new indexer called locs. It is far from complete, but it seems to do exactly what I want for the single-index case, in what I've tried so far. it passes all the equivalent tests loc does.

To summarize:

  • it's been asked before, but not answered.
  • the behavior is documented, but is surprising and never explained.
  • It's not immediately obvious why it's impossible to implement the pythonic version.

So I ask, why is .loc right-inclusive?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jun 26, 2019 via email

@jreback
Copy link
Contributor

jreback commented Jun 26, 2019

-1 on this. This has long been standard in pandas. Also very likely impossible to change in a non-breaking manner (even if this was desired).

@shoyer
Copy link
Member

shoyer commented Jun 27, 2019

I recall reading the claim that this is preferable because with that with label-based slicing, you can't always predict what the next element would be. For lists with elements in no particular order, this sort of makes sense, e.g., consider a DataFrame with columns ['one', 'two', 'three', 'a', 'b', 'c]. Arguably it is clearer to write df.loc[:, 'one':'three'] then df.loc[:, 'one':'a'].

That said, I don't think label based slicing is very useful for non-sorted indexes, because there's no guarantee about what exists between "one" and "three". And with sorted indexes, this argument goes away: it's always straightforward to construct an element larger than any given element, even if it's less trivial than i+1 with integers.

So I agree that this was a design mistake (one that I have regretted copying in xarray), but I also think it's too late to fix it in pandas.

I don't think the marginally better behavior would be worth the trouble of breaking all existing uses of slice based indexing with .loc, nor is it worth the confusion of adding another almost identical but subtly different indexing API like .locs.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jun 27, 2019 via email

@ghost
Copy link
Author

ghost commented Jun 28, 2019

Thanks for responding. I did not, anywhere, suggest loc should be deprecated or changed.

Thanks @shoyer for trying to explain the reasoning. I believe I've found your reference, hidden on page 36 of the documentation section about MultiIndex, It isn't linked from the primary discussion of loc indexing. I'll fix that in a PR.

I don't think the marginally better behavior

not better - equally valid. The reasoning given in the documentation boils down to 2 reasons. The first doesn't hold for sorted indexes as you've explained. The second claims that closed intervals are a common use case, so loc supports that case. But, half-closed intervals are just as common, if not more so. That doesn't convince either.

The conclusion is it might have easily gone the other way, and out of the two options, the unpythonic variant was picked up.

There's also no essential technical reason why a half-closed indexer couldn't work. The POC (updated yesterday) is tiny, and now passes the full loc test suite. This is seemingly just an arbitrary decision to support one use case and not another. And it's the arbitrariness that makes it so irritating.

In retrospect, I don't understand why the question was framed as having to choose between one or the other in the first place. These are two very similar and equally common scenarios, yet one is cleanly supported while the other necessitates "code noise".

If loc hadn't been put forth as "the way it's done in pandas" originally, I'm not sure there'd be such resistance to the idea of supporting both cases.

Hell, all 4 cases. Several recent pandas additions, like Interval and Rolling, support a "closed" argument. Why does it make sense to treat all 4 cases as equally important in other contexts, but not in the essential context of slicing pandas objects?

"We can't do anything about it because it's been that way for years" is a really disappointing response.

You don't like the idea of adding a new near-identical indexer or three. ok. How about a more minimalist solution, which can undo some of the damage by reusing a python staple name? by miracle, the identifier is still unclaimed in the Series/DataFrame namespace.

df.range('2011','2012') # closed='left', like python's range()
df.range('2011','2012',2) # same, with step
df.range('2011','2012','both')  # the 3rd arg can be an int or a string. for conciseness.
df.range('2011','2012', 2, 'right') 
df.range('2011','2012',closed='neither')
dt.T.range(...).T # slicing columns

It's working right now (modulo locs name change).

def _slice(self,start=None,end=None, *args, **kwds): 
    """A minimalist, pythonic, function for slicing a Pandas-Object
    
    Parameters
    ----------
    start: optional, default None
    end: optional, default None
    step: integer, optional, default 1
    closed: str
        'left'/'right'/'both'/'neither'     
    """
    closed_mapping = {
        'left':'left',
        'right':'right',
        'both':'both',
        'neither':'neither',
            '[)':'left',
            '(]':'right',
            '[]':'both',
            '()':'neither',
            } 
    if args and isinstance(args[0],int):
        kwds['step'],args =  args[0],args[1:]
        if args and isinstance(args[0],str):
            kwds['closed'],args =  args[0],args[1:]        
    elif args and isinstance(args[0],str): 
        kwds['closed'],args =  args[0],args[1:] 
        if args and isinstance(args[0],int):
            assert 'step' not in kwds
            kwds['step'],args =  args[0],args[1:]
    elif args:
        raise ValueError("Unrecogonized position argument {arg0}".format(arg0=args[0]))
    step =  kwds.get('step')
    closed = kwds['closed'] = closed_mapping.get(kwds.get('closed'), kwds['closed'])
    assert closed in ['left', 'right', 'both', 'neither']
    
    if closed == 'both':
        return self.loc[slice(start,end,step)]
    if closed == 'left':
        return self.loc_left[slice(start,end,step)]
    if closed == 'right':
        return self.loc[slice(None,end,step)][::-1].loc_left[:start][::-1]
    if closed == 'neither':
        return self.loc_left[slice(None,end,step)][::-1].loc_left[:start][::-1]       
 
DataFrame.slice=_slice

@shoyer
Copy link
Member

shoyer commented Jun 28, 2019

Thanks @shoyer for trying to explain the reasoning. I believe I've found your reference, hidden on page 36 of the documentation section about MultiIndex, It isn't linked from the primary discussion of loc indexing. I'll fix that in a PR.

Thanks for surfacing this doc!

I have to say that the specific use case of "[limiting] a time series to start and end at two specific dates" a little ironic, because this is a perfect example of cleaner code with exclusive endpoint slicing: Pandas has a bunch of messy code specifically to ensure string slicing in datetime index like df.loc['2019-01':'2019-12'] is interpreted like df.loc['2019-01-01T00:00:00':'2019-12-31T23:59:59'] rather than like df.loc['2019-01-01T00:00:00':'2019-12-01T00:00:00']. This would entirely go away if slicing was exclusive of endpoints -- you could just write df.loc['2019-01':'2020-01'], with both datetimes being expanded to the first day/hour/minute.

A dedicated method for slicing that could add these option seems like a reasonable sort of thing to support, though we might call it DataFrame.slice rather than DataFrame.range. The lack of an ability to include optional/keyword argument is definitely a downside to using getitem [] style indexing rather than a method call.

In xarray, our sel method lets us refer to axes by name, which also lets us support approximate look-ups, e.g., ds.sel(x=slice(0, 5), method='nearest'). Here we could (and probably should) add an closed option for specifying how slices work. So another option would be to implement .sel() in pandas, e.g., df.sel(index=slice(0, 5), columns=slice('a', 'b'), closed='left').

@ghost
Copy link
Author

ghost commented Jul 1, 2019

It's heartening to have some support in this, thanks!. One thing that I like about the name range is that it makes choosing closed='left' as the default behavior (which I would like) loc more palatable,
since if it differs from loc's.

Adding a new indexing method is a tough sell, and choosing a different default behavior (even pythonic), without either - a naming convention (`loc_left`, `loc_right`, `loc_both`, `loc_neither`) OR - a well-known relative to imitate (python's `range`)

as justification for the difference, would make it an even tougher sell. Since having four (additional) variants of loc seems like a non-starter, I think the name range has its merits.

@ghost
Copy link
Author

ghost commented Jul 4, 2019

I don't use python's slice directly that often but on second thought that is better choice. I've updated the PR overview accordingly. Is there support for this from additional team members?

@ghost
Copy link
Author

ghost commented Jul 8, 2019

Updated the PR with initial implementation of slice. Very sad that the pandas team won't engage, this is meant to benefit the users, after all.

@TomAugspurger
Copy link
Contributor

Personally, I’ve been busy with the US holiday and the 0.25 release. Comments like that make me less likely to review.

@ghost
Copy link
Author

ghost commented Jul 8, 2019

That's totally understandable. You are not the only team member though, and no one so far has.

@ghost
Copy link
Author

ghost commented Jul 12, 2019

The PR has been updated with a (cleaner) slice method and is now tested. The cases for 'both'/'left' are supported. The 'neither'/'right' cases will require more work which I'm disinclined to do without some indication that the pandas team will consider merging.

@simonjayhawkins simonjayhawkins added API Design Indexing Related to indexing on series/frames, not to indexes themselves labels Aug 1, 2019
@ghost
Copy link
Author

ghost commented Aug 8, 2019

I've closed the PR for lack of interest. And I've suggested a new alternative: migrating loc to become a function. This can be done with minimal disruption by repurposing loc.__call__ #27363 (comment).

@jreback jreback added this to the No action milestone Jan 1, 2020
@jreback
Copy link
Contributor

jreback commented Jan 1, 2020

closing. this is a feature of pandas and not likely to change. if more documentation is desired would be ok, but this would be a major disruption with relatively little upside.

@jreback jreback closed this as completed Jan 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

No branches or pull requests

4 participants