-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: add regex parameter to .loc #27363
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
f668877
to
4a992e5
Compare
pandas/core/indexing.py
Outdated
@@ -1822,10 +1855,21 @@ def _getitem_axis(self, key, axis: int): | |||
indexer[axis] = locs | |||
return self.obj.iloc[tuple(indexer)] | |||
|
|||
if self.regex and isinstance(key, str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would we want to raise if regex and key is not str?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO no.
The .loc interface should anyway accept one axis selector that is not a string (e.g. df.loc(regex=True)[df.A ==1, "^col_"]
, so letting neither be a string would not be a big difference and will ust mean no effect from setting the regex parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see both points but tend to agree more with @simonjayhawkins. At the very least wouldn't we want to accept other types (bytes, re.pater)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it could accept bytes. What do you mean by re.pater?
EDIT: I guess you mean re.Pattern. Is it possible to do a regex searches against a array of re.Pattern objects?
pandas/core/indexing.py
Outdated
|
||
def __setitem__(self, key, value): | ||
if self.regex: | ||
raise TypeError("Inserting with regex not supported") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be NotImplementedError
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I will change it.
Hello @topper-123! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-07-26 14:04:40 UTC |
pandas/core/indexing.py
Outdated
@@ -1761,6 +1794,27 @@ def _get_partial_string_timestamp_match_key(self, key, labels): | |||
|
|||
return key | |||
|
|||
def _get_regex_mappings(self, key, axis=None): | |||
import re |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can move this to the top of the module
pandas/core/indexing.py
Outdated
@@ -1822,10 +1855,21 @@ def _getitem_axis(self, key, axis: int): | |||
indexer[axis] = locs | |||
return self.obj.iloc[tuple(indexer)] | |||
|
|||
if self.regex and isinstance(key, str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see both points but tend to agree more with @simonjayhawkins. At the very least wouldn't we want to accept other types (bytes, re.pater)?
I am personally -1 on this proposal, and rather instead would vote for the approach in #27340 (but using It's of course a personal taste, but those keywords to |
Agreed with @jorisvandenbossche |
OTOH using keywords to extend the loc indexer could make things more consistent as different capabilities are added eg. closed argument #27209 (comment) instead of adding various different additional methods and apis to extend indexing functionality. |
Would allowing |
do you mean allow a regular expression object to be part of the key? if so, i'd be wary about indexing performance. in #27209 which is just a POC, allowing another type to be part of a key requires additional checks to allow the key to 'pass through' to the appropriate logic. i liked the suggestion to use a keyword instead, #27209 (comment) as I think this eliminates the need for these additional checks until the appropriate part of the code is reached and therefore also reducing code complexity (which seemed to be an argument against this approach #27363 (comment)) |
I actually agree with @jorisvandenbossche on this. This PR has been significantly more complex to implement than #27340 and will therefore also be more difficult to maintain. This PR required understanding of the class hierarchy of #27340 OTOH was very simple to implement and I more-or-less considered it syntactic sugar for the common use case of selecting columns from a regex. So the approach in #27340 will be better for code maintainability. |
I think we need to decide how we want to add this functionailty. @topper-123 that is a good data point that this is more complex (which of course we always want to limit as much as possible). OTOH introducing a new method on DataFrame needs a very high bar. So I am -1 on the first for the new method name. I actually like the simplicity of this one. All that said. I think this just needs more discussion and consensus. |
I am not at all concerned with indexing perf. These are all vectorized things; mis-using (say in a loop) is non-idiomatic and not something we really can do much about generally. So #27209 is actually fine from a perf perspective. |
To keep the discussion going:
In that respect I think |
Codecov Report
@@ Coverage Diff @@
## master #27363 +/- ##
==========================================
- Coverage 93% 92.98% -0.03%
==========================================
Files 182 182
Lines 50311 50343 +32
==========================================
+ Hits 46793 46809 +16
- Misses 3518 3534 +16
Continue to review full report at Codecov.
|
1 similar comment
Codecov Report
@@ Coverage Diff @@
## master #27363 +/- ##
==========================================
- Coverage 93% 92.98% -0.03%
==========================================
Files 182 182
Lines 50311 50343 +32
==========================================
+ Hits 46793 46809 +16
- Misses 3518 3534 +16
Continue to review full report at Codecov.
|
I've got a simpler implementation for this, so please wait to merge. Will look into it tomorrow. |
I've made a simpler implementation, where it doesn't quite feels as if the various parts of .loc/.iloc./.at/.iat step on each toes quite as much. So this version will IMO be much easier to maintain, compared to the previous version. That makes my worry about adding regex functionality to .loc go away, and I'm neutral whether we add a special |
just a thought! a bit out of scope for this PR but may help in deciding on the best course of action. in this case going further off topic, but could still be relevant... if we also used a similar api for a closed argument, then we wouldn't be able to limit the scope the same way as just suggested. so would probably require something like but this would still limit the flexibility as the default would only be overridden on one axis. so maybe starting to get a bit clunky now! accept a dictionary for closed? ``.loc(closed=dict{'rows':'right', 'columns':'left'})[..., ...]` so back to regex... if the scope keyword was added
would now be confusing and less flexible. is the scope for closed or regex? so the dictionary implementation for closed seems more appropriate. and back to regex... does being able to limit the scope using regex='columns'/'rows' make sense? not necessarily in this PR though.
could then potentially be ..
to allow additional validation of the keys. |
This roundabout way of passing arguments df.loc(...)[...] # NO
df.loc[...] # continues to work
df.loc(...) # equivalent
df.loc(..., closed='left') # no longer contentious?
df.loc(..., regex='foo') # possible There might be some impedance mismatch with how |
I think it would be better to pick a new name for the method. It's too easy to confuse parentheses/brackets, especially for inexperienced Python users. A new method would also gives the ability to keep-up the API, by using keyword arguments rather than position for selecting on the index or columns, e.g., |
A function would drop the ability to slice inputs ( |
But you know I argued for that vigorously and was completely ignored. Do you have a plan for achieving a different outcome? And this |
a function is a nonstarter because you can not assign values; that’s they reason we use getitem/setitem |
Should we punt this one for now and close? |
I think this is stalled so going to close to clean up the queue. @topper-123 feel free to reopen if you disagree |
I would like #27617 to get accepted, but that requires that its regex filtering capabilities get a home elsewhere, so the proposal here was to locate it in .loc. I've had another propoasl, where we make a new Th underlying question is probably if we're satisfied with the capabilities of |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Adds a
regex
parameter toNDFrame.loc.__call__
, so we can now do:This is intended to be followed up by deprecating
NDFrame.filter
.This implementation does not allow setting values using
regex=True
, as the most urgent motivation is to get a replacement forNDFrame.filter
. Setting values can come later, even after version 1.0.