API: label-based slicing with not-included labels #8613
Comments
Ah, seems there was a (closed) issue related to this: #5223 |
Also, and this was the actual use case, how would the following best be done?
So you have a dataframe, and for some reason there are some NaNs, and these data have to be removed. With the resulting dataframe, I now want to select all data up to the end of January for columns a and d. So my code did:
as I don't know beforehand which indexes will be missing. I wanted to update this to use |
IMO this should work with loc given the corresponding axis is monotonic. |
@immerrr Well, that is indeed the 'logic' for |
I think this is a bug. That said, we should document the missing values handling (for scalar/slice) a bit more |
I shouldn't say bug, rather unintended non-compliance with |
In #8740, I noted that this is already inconsistent between float and int indexes (float indexes don't check bounds). In addition to consistency considerations, there may also be an efficiency argument. For float and interval indexes, you can't check whether a number is within the index bounds without doing binary search. This essentially doubles the amount of work necessary for doing slice lookups. |
@jreback It sounds like you are concerned about how weird it would be to get back an empty Series if one or both of the indexers are out of bounds. But in fact, this is exactly what Python (and numpy) already does when indexing a list/ndarray with out of bound integers:
Based on the precedent from Python, I would only raise an exception if one of the slice bounds has the wrong type to look up its order in the index -- but I suspect this case is already handled in |
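The plain-Python behaviour referred to above can be checked with a short sketch. It covers only the built-in `list` case, using the standard library; NumPy arrays clip slice bounds in the same way. The variable names are illustrative and not taken from the thread.

```python
# Plain Python slicing clips out-of-range bounds instead of raising.
values = [10, 20, 30]

beyond_end = values[5:10]     # both bounds lie past the end -> empty list
partly_inside = values[1:10]  # only the stop lies past the end -> clipped

# Scalar indexing, by contrast, does raise for an out-of-range position.
try:
    values[5]
    scalar_raised = False
except IndexError:
    scalar_raised = True

print(beyond_end, partly_inside, scalar_raised)
```

The asymmetry shown here (slices are lenient, scalar lookups are strict) is the precedent the comment appeals to.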
Speaking of unintended non-compliance, it seems that #7525 "fixed" this issue for
If the index is non-monotonic, out-of-bound label lookup should raise. Also, existing, but non-unique bound that doesn't occupy contiguous slots of storage should raise as a slice bound (think There's also a rather shady case of looking up dates with string literals, e.g. you are allowed to do |
@immerrr Agreed about all those cases. But again, I'm pretty sure all of those are already handled in slice_locs. The out of bounds slices bounds check is extra, and it's done in _LocIndexer (sp?). So it's also poor separation of concerns. |
Yup, I don't like this either. As a matter of fact I was thinking about redesigning |
Another thing: "out-of-bound" is not allowed at the moment, but the key does not have to be in the index:
So the error message is also confusing, as the reason is not really that the key is not found in the index, but that it lies outside the range of the index. Also a bit confusing and inconsistent I think. |
you guys seem to be missing the point. Various index types DO handle this properly. Datetimelike handle out-of-bounds label (string) based slicing to enable partial indexing (e.g. using '2014-01'). While when specifying a Timestamp they must be exactly in the index. Floats are exactly the same (the above example has an Int64Index, so float slicing does not apply). Int64Index by definition NEVER has label semantics. And an object index, CANNOT ever have out-of-bounds slicing, it is never a monotonic index type, e.g.
So what is the example that you are concerned about here? @jorisvandenbossche |
is purely positional based, so I WOULD expect this work (and But label based is a completely different animal. Aside from partial string indexing, I think you either have to have the label in the index (to know when to stop), or the current behavior of allowing a nonexistent index to be replaced by the end-points (I personally find this confusing but it does make sense). |
I don't get why's that. I mean, I don't see user-level semantical difference between
That might depend on how you read the slicing operation. I read I kind of like the idea of separating strict and lax lookups and making the user decide which one they want, but I'd wildly guess that most of the time, especially interactively, they would go for lax lookups and thus it should be as convenient to use as it is now. |
@jreback didn't you mean 'always' instead of 'never'? (at least for
But isn't it a general known feature that you can do this with a string index (the example you gave), although it only works with
|
@immerrr I disagree, I think object indices (when they contain strings) are by definition non-monotonic (I can see an ordering of course, and maybe that IS the difference here). @jorisvandenbossche as far as your second issue. I think that was your original example, |
@jreback, our notions themselves may be different, because I'm often confused about what you say on that topic. I usually think of ordering as a binary relation defined over a set of objects that possesses several properties (asymmetrical, transitive and Now, less-than operation is not necessarily defined for two arbitrary Python objects, so in general, I agree, object index, unlike integer one, does not necessarily have an ordering. And it is under this assumption that NaT and NaN values technically break any ordering. But if it contains only strings and less-than operation between any two strings is defined, then it has an ordering, by definition of ordering, regardless of the actual order of elements in that index. As for monotonicity, Index represents a certain location-to-label mapping which can be monotonically increasing (nondecreasing for non-unique) if for any two locations Using these definitions I can't see how an object index can NEVER be monotonic if less-than (less-or-equal for non-unique) operation is defined for both locs — which are int — and for labels — which are str — and the monotonicity condition holds. |
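The monotonicity condition described in this comment can be written down as a minimal sketch. The helper `is_monotonic_increasing` below is a hypothetical standalone function that only borrows its name from the pandas attribute discussed in the thread; it is not the pandas implementation, and it assumes every pair of labels supports `<=`.

```python
def is_monotonic_increasing(labels):
    """Return True if labels[i] <= labels[j] for every pair of locations i < j.

    Checking adjacent pairs is enough because <= is transitive.
    """
    return all(a <= b for a, b in zip(labels, labels[1:]))


# Strings have a well-defined less-than in Python, so the check applies to them.
sorted_strings = ["aa", "ab", "ac", "az"]
shuffled_strings = ["ab", "aa", "az", "ac"]

print(is_monotonic_increasing(sorted_strings))    # labels arranged in ascending order
print(is_monotonic_increasing(shuffled_strings))  # same labels, different arrangement
```

Under this definition the result depends on how the labels are arranged in the index, not on whether their type is `str` or `int`.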
what I mean is a The problem is that we may be giving an ordering to something that actually is not implied at all. e.g. 'aa','ab','ac','az' I think we are making an assumption that this is monotonic increasing in the strictest sense. That said maybe a user would expect this, but is IMHO not obvious at all, and thus we shouldn't do it. By definition labels are NOT ordered. (Categories of course can be so this will solve the entire problem once we have not sure if that is more clear or not :) |
What is then the point/logic of |
So you're saying that unordered categorical values don't have ordering relation at all. I agree with that.
If index labels have an ordering and they are arranged in ascending order, they are monotonic, by definition. What you meant was probably that object labels should never have an implicit ordering. That is reasonable for categorical data, but I don't think that arbitrary objects taken straight from python runtime should be interpreted as categorical values by default. In other words, given
True == (objs[0] < objs[1])
# and in the same time
False == objs.is_monotonic_increasing
But again, I agree that
False == CategoricalIndex(objs, ordered=False).is_monotonic_increasing
# and even
True == CategoricalIndex(objs, categories=objs[::-1]).is_monotonic_decreasing |
Actually, for monotonic indexes, I think we should map the labels to integer locations, and then indexing should be exactly the same as standard numpy/python indexing. So the non-existent label is inserted in the location that maintains the order. This is consistent and simpler than the current rules. I don't think we should have different rules for int/float indexes -- that is very surprising to me. Side note: is there a good reason why we have not deprecated |
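The proposal above (map label bounds to integer locations, then slice as Python does) could look roughly like the following sketch. It uses the standard-library `bisect` module on a plain sorted list; `label_slice` is a hypothetical helper, not a pandas function, and it assumes the labels are sorted ascending and that the stop label is inclusive, as in label-based slicing.

```python
from bisect import bisect_left, bisect_right


def label_slice(labels, start=None, stop=None):
    """Translate label bounds into an integer slice over sorted labels.

    A missing label maps to the position where it would be inserted to keep
    the order, so out-of-range bounds are clipped rather than rejected.
    """
    lo = 0 if start is None else bisect_left(labels, start)
    # bisect_right places the stop after any equal labels, making it inclusive.
    hi = len(labels) if stop is None else bisect_right(labels, stop)
    return slice(lo, hi)


labels = [1, 3, 5, 7]
data = ["a", "b", "c", "d"]

inside = data[label_slice(labels, 2, 5)]      # start label 2 is not in the index
outside = data[label_slice(labels, 8, 10)]    # both bounds lie past the last label
everything = data[label_slice(labels, 0, 100)]

print(inside, outside, everything)
```

With this mapping an out-of-range pair of bounds yields an empty selection, which matches what standard Python slicing does for out-of-range integer bounds.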
I just took a look into implementing this change (commented out some lines of code, really) and ran into an unfortunate limitation: comparing an integer to a string does not raise an exception in Python 2. Given that we want to raise However, every other type of index does have well defined types, so in principle we can replace the |
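For contrast with the Python 2 limitation mentioned above: in Python 3, ordering comparisons between `int` and `str` raise `TypeError`, so a binary search over integer labels with a string bound fails by itself. The sketch below illustrates this with the standard library only; `bound_has_comparable_type` is a hypothetical helper and does not describe how pandas performs the check.

```python
from bisect import bisect_left


def bound_has_comparable_type(labels, bound):
    """Return False if locating `bound` among `labels` raises TypeError."""
    try:
        bisect_left(labels, bound)
    except TypeError:
        # Python 3: '<' is not supported between int and str.
        return False
    return True


int_labels = [1, 3, 5]

print(bound_has_comparable_type(int_labels, 4))    # int bound, comparable
print(bound_has_comparable_type(int_labels, "a"))  # str bound, not comparable
```

Note that an empty list performs no comparisons at all, so this probe cannot detect a type mismatch there.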
Another note about |
@shoyer can u show me the example that you say needs fixing u can try to change indexing but there are lots of special cases - there is a reason for has_valid_type - but appreciate someone wading in |
There should be a method that casts the special case to a generic one or raises an error if it cannot, which is what EAFP is about. Speaking of wading in, I think I'm on the way to something interesting in #8753, but I'm already concerned about merging it in a non-disruptive manner. |
@immerrr @shoyer oh don't get me wrong, I am +++1 for you guys wading in. My experience in the past has shown that:
so, your approaches are actually good to divide-and-conquer by starting with cleaning up slicing |
@jreback OK, here's an example: Having this sort of logic on the index instead of the indexer also means that libraries like my project xray can make use of these sort of checks (we reuse pandas indexes but not indexers). Honestly, I'm not entirely certain it's worth the trouble of wading in to this. I do understand that (almost) every awkward special case is there to fix a real bug. That scares me! :) |
Disclaimer: i am not (yet) familiar with the indexing internals, so maybe my following comment will be stupid :-) But, the things we are talking about for |
You are suggesting that [10] work I believe. I am big -1 on this. The entire point of indexing is basically to raise |
Safeguarding against typos with such judgement may require you to name attributes and methods according to the Levenshtein distance between them; this doesn't seem reasonable to me. |
And it's not about convenience, but rather about consistency and predictability. The last example should work for all integer slice bounds. Or it should NOT, documentation should state explicitly that it does NOT (regardless of dtype) and there should be a method/indexer that does work (regardless of dtype), because selecting/setting/deleting all items with labels between bounds seems a very useful operation to me. |
I think the last example is very predictable, it won't work as the docs state unless BOTH bounds are included. full stop. pandas has gone down this road before with allowing I get consistency. I push for it. But what exactly is inconsistent about the current behavior? @timmie I believe And certainly the docs on One further thought. It is my belief that the indexers are NOT orthogonal at all, and have much overlap. Whether that was originally a good decision is a point of view:
So this all has to be a balance. Cover the edge cases, and allow users to not have to use a bunch of different methods. |
One thing is that Floating-point case kind of supports the idea of having a lax label indexer: you know that floats are better compared approximately, so you generally use an approximate indexer. But if you know that the exact float value should be in the index, you go for a strict one and run into an error early if it does not. |
@jorisvandenbossche If my example works for @jreback I want both your examples 10 and 11 to work, like this:
As far as consistency goes, I think it depends on your mental model for "strict indexing". If an index is monotonic, my mental model of
Here we have a "general rule" (bounds must be included) that is ignored for quite a few types of indexes for the sake of practicality. We could "fix" all these cases 1-3 to make them consistent, but they are already consistent with each other, and this functionality can indeed be quite useful. As a general rule, I usually assume that something will just work if it can clearly be interpreted in an unambiguous way and excluding it would require extra work. Obviously this was also surprising to at least two other heavy pandas users besides myself (@jorisvandenbossche and @immerrr). And the truth is, |
|
yes, I don't see a need to deprecate
So I guess ONLY |
@jreback small comment on what you said above:
I think that when the docs say "both the start and the stop are included" (http://pandas.pydata.org/pandas-docs/stable/indexing.html#different-choices-for-indexing), this is about the fact that in label-based indexing the stop label is included, contrary to the usual python (integer location based) slices. And this is not about what is allowed for the start and stop label in such slices.
|
going to work on this soon |
* BUG: boundary_slice assumes sorted indexes
  Changes `boundary_slice` to handle cases where
  - the index is not sorted
  - using label-based indexing (loc)
  - the start or stop is missing
  See pandas-dev/pandas#8613 for details on the pandas side.
* Avoid sorting in `boundary_slice`
* Additional tests, falsey endpoints
I didn't directly find an issue about it, or an explanation in the docs, but I stumbled today on the following, which did surprise me a bit:
Considering the following dataframe:
Slicing with a label that is not included in the index works with `.ix`, but not with `.loc`:
Context: I was updating some older code, and I wanted to replace `.ix` with `.loc` (as this is what we recommend if it is purely label based to prevent confusion).
Some things:
- `[]`, `.ix[]` and `.loc[]` is a bit surprising here
- `iloc` -> that behaviour was changed in 0.14 to allow out of bound slicing (http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#whatsnew-0140-api)
- `df.loc['2012-01-03':'2012-01']` will work and do the expected while `df.loc['2012-01-03':'2012-01-31']` fails