-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Discussion: Why is loc label-based slicing right-inclusive? #27059
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Just FYI, before you put too much time into this, I don't think this is
likely something we'll change.
…On Wed, Jun 26, 2019 at 12:43 PM pilkibun ***@***.***> wrote:
I've searched the documentation and old issues and I can't find anything
on the reason for this unusual choice. The original historic PR was #2922
<#2922>.
One user raised the question before in #14900
<#14900> and aggressively shut
down by jeff without an answer.
It has caused bugs in my code in the past, and I've seen it do the same
for other people,
It is the cause of subtle bugs seen in the wild
<https://gist.github.com/betatim/c59039682d92fab89859358e8c585313>. It's
inconsistency with python slice conventions is confusing to newbies who ask
on SO
<https://stackoverflow.com/questions/55187559/why-is-loc-slicing-in-pandas-inclusive-of-stop-contrary-to-typical-python-slic>
but no explanation is given.
users often need to slice something with closed='left' behavior, and try to
add it <#12398> in similar
situations where it isn't available.
This behavior is fully documented. That's not the problem. For example,
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html says
Note that contrary to usual python slices, both the start and the stop are
included, when present in the index! See "Slicing with labels".).
The "Slicing with labels" section documents the behavior, but gives no
reason why the choice to break with python conventions was taken.
Another user who asked this question was given a workaround which is much
more cumbersomen compared to the convenience of .loc in #16571 (comment)
<#16571 (comment)>
.
He points to DataFrame.between_time
<https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.between_time.html>,
which has kwds for requesting this behavior, but infuriatingly, accepts
only times and not datetimes.
A few related cookbook entries were added
https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#dataframes
which explains
There are 2 explicit slicing methods, with a third general case
Positional-oriented (Python slicing style : exclusive of end)
Label-oriented (Non-Python slicing style : inclusive of end)
this again documents how loc works, but again offers no reason why pandas
originally broke with python conventions.
In every case I've seen someone ask "why is label-based slicing
right-inclusive?" the answer has always been "because it's label based, not
position based", which doesn't really explain anything.
The same issue exists with string based partial time slicing such like
df.loc[:"2018"],
which will include rows with 2018-01-01 prefix.
So, over the years several people have found this undesirable and/or have
tried to find out why, but with an hour's worth of gooling and reading, I
can't find an explanation ever have been given.
I'm not saying there's no good reason, I understand that indexing can get
complicated, and mixed cases are tricky, etc'. But I want to understand why
this was necessary or desirable, and why making .loc pythonic was
unfeasible.
I'll be opening a very small POC PR for discussion in a few minutes, which
adds a new indexer called locs. It is far from complete, but it seems to
do exactly what I want for the single-index case, in what I've tried so far.
To summarize:
- it's been asked before, but not answered.
- the behavior is documented, but is surprising and never explained.
- It's not immediately obvious why it's impossible to implement the
pythonic version.
So I ask, why is .loc right-inclusive?
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#27059?email_source=notifications&email_token=AAKAOIV2DMWZUM2TBBTPMRTP4OTDDA5CNFSM4H3USRXKYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4G332O3Q>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAKAOIUAVUZDN3CNUN34KH3P4OTDDANCNFSM4H3USRXA>
.
|
-1 on this. This has long been standard in pandas. Also very likely impossible to change in a non-breaking manner (even if this was desired). |
I recall reading the claim that this is preferable because with that with label-based slicing, you can't always predict what the next element would be. For lists with elements in no particular order, this sort of makes sense, e.g., consider a DataFrame with columns That said, I don't think label based slicing is very useful for non-sorted indexes, because there's no guarantee about what exists between "one" and "three". And with sorted indexes, this argument goes away: it's always straightforward to construct an element larger than any given element, even if it's less trivial than So I agree that this was a design mistake (one that I have regretted copying in xarray), but I also think it's too late to fix it in pandas. I don't think the marginally better behavior would be worth the trouble of breaking all existing uses of slice based indexing with |
Well said.
We could *consider* deprecating this kind of slicing for non-sorted
indexes, but even that may not be worth the churn on users.
…On Thu, Jun 27, 2019 at 2:38 AM Stephan Hoyer ***@***.***> wrote:
I recall reading the claim that this is preferable because with that with
label-based slicing, you can't always predict what the next element would
be. For lists with elements in no particular order, this sort of makes
sense, e.g., consider a DataFrame with columns ['one', 'two', 'three',
'a', 'b', 'c]. Arguably it is clearer to write df.loc[:, 'one':'three']
then df.loc[:, 'one':'a'].
That said, I don't think label based slicing is very useful for non-sorted
indexes, because there's no guarantee about what exists *between* "one"
and "three". And with sorted indexes, this argument goes away: it's always
straightforward to construct an element larger than any given element, even
if it's less trivial than i+1 with integers.
So I agree that this was a design mistake (one that I have regretted
copying in xarray), but I also think it's too late to fix it in pandas.
I don't think the marginally better behavior would be worth the trouble of
breaking all existing uses of slice based indexing with .loc, nor is it
worth the confusion of adding another almost identical but subtly different
indexing API like .locs.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#27059?email_source=notifications&email_token=AAKAOIRF7H3OS7TFHVCJBYLP4RVA5A5CNFSM4H3USRXKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODYWHRQA#issuecomment-506230976>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAKAOITNYWTZG4ZWM6UQMXDP4RVA5ANCNFSM4H3USRXA>
.
|
Thanks for responding. I did not, anywhere, suggest Thanks @shoyer for trying to explain the reasoning. I believe I've found your reference, hidden on page 36 of the documentation section about MultiIndex, It isn't linked from the primary discussion of loc indexing. I'll fix that in a PR.
not better - equally valid. The reasoning given in the documentation boils down to 2 reasons. The first doesn't hold for sorted indexes as you've explained. The second claims that closed intervals are a common use case, so loc supports that case. But, half-closed intervals are just as common, if not more so. That doesn't convince either. The conclusion is it might have easily gone the other way, and out of the two options, the unpythonic variant was picked up. There's also no essential technical reason why a half-closed indexer couldn't work. The POC (updated yesterday) is tiny, and now passes the full loc test suite. This is seemingly just an arbitrary decision to support one use case and not another. And it's the arbitrariness that makes it so irritating. In retrospect, I don't understand why the question was framed as having to choose between one or the other in the first place. These are two very similar and equally common scenarios, yet one is cleanly supported while the other necessitates "code noise". If Hell, all 4 cases. Several recent pandas additions, like "We can't do anything about it because it's been that way for years" is a really disappointing response. You don't like the idea of adding a new near-identical indexer or three. ok. How about a more minimalist solution, which can undo some of the damage by reusing a python staple name? by miracle, the identifier is still unclaimed in the Series/DataFrame namespace. df.range('2011','2012') # closed='left', like python's range()
df.range('2011','2012',2) # same, with step
df.range('2011','2012','both') # the 3rd arg can be an int or a string. for conciseness.
df.range('2011','2012', 2, 'right')
df.range('2011','2012',closed='neither')
dt.T.range(...).T # slicing columns It's working right now (modulo def _slice(self,start=None,end=None, *args, **kwds):
"""A minimalist, pythonic, function for slicing a Pandas-Object
Parameters
----------
start: optional, default None
end: optional, default None
step: integer, optional, default 1
closed: str
'left'/'right'/'both'/'neither'
"""
closed_mapping = {
'left':'left',
'right':'right',
'both':'both',
'neither':'neither',
'[)':'left',
'(]':'right',
'[]':'both',
'()':'neither',
}
if args and isinstance(args[0],int):
kwds['step'],args = args[0],args[1:]
if args and isinstance(args[0],str):
kwds['closed'],args = args[0],args[1:]
elif args and isinstance(args[0],str):
kwds['closed'],args = args[0],args[1:]
if args and isinstance(args[0],int):
assert 'step' not in kwds
kwds['step'],args = args[0],args[1:]
elif args:
raise ValueError("Unrecogonized position argument {arg0}".format(arg0=args[0]))
step = kwds.get('step')
closed = kwds['closed'] = closed_mapping.get(kwds.get('closed'), kwds['closed'])
assert closed in ['left', 'right', 'both', 'neither']
if closed == 'both':
return self.loc[slice(start,end,step)]
if closed == 'left':
return self.loc_left[slice(start,end,step)]
if closed == 'right':
return self.loc[slice(None,end,step)][::-1].loc_left[:start][::-1]
if closed == 'neither':
return self.loc_left[slice(None,end,step)][::-1].loc_left[:start][::-1]
DataFrame.slice=_slice |
Thanks for surfacing this doc! I have to say that the specific use case of "[limiting] a time series to start and end at two specific dates" a little ironic, because this is a perfect example of cleaner code with exclusive endpoint slicing: Pandas has a bunch of messy code specifically to ensure string slicing in datetime index like A dedicated method for slicing that could add these option seems like a reasonable sort of thing to support, though we might call it In xarray, our |
It's heartening to have some support in this, thanks!. One thing that I like about the name as justification for the difference, would make it an even tougher sell. Since having four (additional) variants of |
I don't use python's |
Updated the PR with initial implementation of |
Personally, I’ve been busy with the US holiday and the 0.25 release. Comments like that make me less likely to review. |
That's totally understandable. You are not the only team member though, and no one so far has. |
The PR has been updated with a (cleaner) |
I've closed the PR for lack of interest. And I've suggested a new alternative: migrating |
closing. this is a feature of pandas and not likely to change. if more documentation is desired would be ok, but this would be a major disruption with relatively little upside. |
I've searched the documentation and old issues and I can't find anything on the reason for this unusual choice. The original historic PR was #2922.
One user raised the question before in #14900 and aggressively shut down by jeff without an answer.
It has caused bugs in my code in the past, and I've seen it do the same for other people,
It is the cause of subtle bugs seen in the wild, #26959 (comment). It's inconsistency with python slice conventions is confusing to newbies who ask on SO but no explanation is given.
Every pandas tutorial has to mention this special case:
users often need to slice something with
closed='left'
behavior, and try to add it in similar situations where it isn't available.This behavior is fully documented. That's not the problem. For example,
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html says
The "Slicing with labels" section documents the behavior, but gives no reason why the choice to break with python conventions was taken.
Another user who asked this question was given a workaround which is much more cumbersome compared to the convenience of
.loc
in #16571 (comment).He points to DataFrame.between_time, which has kwds for requesting this behavior, but infuriatingly, accepts only times and not datetimes.
A few related cookbook entries were added
https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#dataframes
which explains
this again documents how
loc
works, but again offers no reason why pandas originally broke with python conventions.In every case I've seen someone ask "why is label-based slicing right-inclusive?" the answer has always been "because it's label based, not position based", which doesn't really explain anything.
The same issue exists with string based partial time slicing such like
df.loc[:"2018"]
,which will include rows with
2018-01-01
prefix.So, over the years several people have found this undesirable and/or have tried to find out why, but with an hour's worth of gooling and reading, I can't find an explanation ever have been given.
I'm not saying there's no good reason, I understand that indexing can get complicated, and mixed cases are tricky, etc'. But I want to understand why this was necessary or desirable, and why making
.loc
pythonic was unfeasible.I'll be opening a very small POC PR for discussion in a few minutes, which adds a new indexer called
locs
.It is far from complete, but it seems to do exactly what I want for the single-index case, in what I've tried so far.it passes all the equivalent tests loc does.To summarize:
So I ask, why is
.loc
right-inclusive?The text was updated successfully, but these errors were encountered: