ENH: add regex parameter to .loc #27363

topper-123 · 2019-07-12T16:26:09Z

closes Rename NDFrame.filter to .select? #26642
xref ENH: add NDFrame.select_str #27340
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Adds a regex parameter to NDFrame.loc.__call__, so we can now do:

>>> df = pd.DataFrame(1, index=["A", "AB", "BC", "CD"], columns=["CD", "BC", "AB", "A"])
>>> df.loc(regex=True)["B", "B"]
    BC  AB
AB   1   1
BC   1   1

This is intended to be followed up by deprecating NDFrame.filter.

This implementation does not allow setting values using regex=True, as the most urgent motivation is to get a replacement for NDFrame.filter. Setting values can come later, even after version 1.0.

simonjayhawkins · 2019-07-12T19:39:35Z

pandas/core/indexing.py

@@ -1822,10 +1855,21 @@ def _getitem_axis(self, key, axis: int):
                indexer[axis] = locs
                return self.obj.iloc[tuple(indexer)]

+        if self.regex and isinstance(key, str):


would we want to raise if regex and key is not str?

IMO no.

The .loc interface should anyway accept one axis selector that is not a string (e.g. df.loc(regex=True)[df.A ==1, "^col_"], so letting neither be a string would not be a big difference and will ust mean no effect from setting the regex parameter.

I see both points but tend to agree more with @simonjayhawkins. At the very least wouldn't we want to accept other types (bytes, re.pater)?

Yeah, it could accept bytes. What do you mean by re.pater?

EDIT: I guess you mean re.Pattern. Is it possible to do a regex searches against a array of re.Pattern objects?

simonjayhawkins · 2019-07-12T19:48:47Z

pandas/core/indexing.py

+
+    def __setitem__(self, key, value):
+        if self.regex:
+            raise TypeError("Inserting with regex not supported")


should this be NotImplementedError?

Yeah, I will change it.

pep8speaks · 2019-07-12T21:24:20Z

Hello @topper-123! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-07-26 14:04:40 UTC

WillAyd · 2019-07-12T22:23:59Z

pandas/core/indexing.py

@@ -1761,6 +1794,27 @@ def _get_partial_string_timestamp_match_key(self, key, labels):

        return key

+    def _get_regex_mappings(self, key, axis=None):
+        import re


Can move this to the top of the module

WillAyd · 2019-07-12T22:26:42Z

pandas/core/indexing.py

@@ -1822,10 +1855,21 @@ def _getitem_axis(self, key, axis: int):
                indexer[axis] = locs
                return self.obj.iloc[tuple(indexer)]

+        if self.regex and isinstance(key, str):


I see both points but tend to agree more with @simonjayhawkins. At the very least wouldn't we want to accept other types (bytes, re.pater)?

jorisvandenbossche · 2019-07-13T02:34:38Z

I am personally -1 on this proposal, and rather instead would vote for the approach in #27340 (but using select, if that is possible?)

It's of course a personal taste, but those keywords to loc are IMO rather unnatural, difficult to discover (you don't typically go looking for the keyword arguments of a [] indexing method), and make the behaviour of what is passed to loc[..] even more complex.

TomAugspurger · 2019-07-15T14:17:57Z

Agreed with @jorisvandenbossche

simonjayhawkins · 2019-07-15T14:55:16Z

Agreed with @jorisvandenbossche

OTOH using keywords to extend the loc indexer could make things more consistent as different capabilities are added eg. closed argument #27209 (comment) instead of adding various different additional methods and apis to extend indexing functionality.

WillAyd · 2019-07-15T15:09:55Z

Would allowing loc to natively just handle a regex be an option then? Obviously in other parts of our API we have regex=True as a parameter in some function but that seems unnecessary, so if we weren't tied to that we could theoretically implement this without a lot of API additions

simonjayhawkins · 2019-07-15T15:24:07Z

Would allowing loc to natively just handle a regex be an option then?

do you mean allow a regular expression object to be part of the key?

if so, i'd be wary about indexing performance. in #27209 which is just a POC, allowing another type to be part of a key requires additional checks to allow the key to 'pass through' to the appropriate logic.

i liked the suggestion to use a keyword instead, #27209 (comment) as I think this eliminates the need for these additional checks until the appropriate part of the code is reached and therefore also reducing code complexity (which seemed to be an argument against this approach #27363 (comment))

topper-123 · 2019-07-15T20:18:25Z

I actually agree with @jorisvandenbossche on this.

This PR has been significantly more complex to implement than #27340 and will therefore also be more difficult to maintain. This PR required understanding of the class hierarchy of _LocIndexer and there's always the danger of stepping on other stuff in _LocIndexer or on its siblings when doing changes here. This will make changing other stuff in .loc/.iloc/.at/.iat more complex also.

#27340 OTOH was very simple to implement and I more-or-less considered it syntactic sugar for the common use case of selecting columns from a regex.

So the approach in #27340 will be better for code maintainability.

@jreback?

jreback · 2019-07-15T20:40:34Z

I think we need to decide how we want to add this functionailty. @topper-123 that is a good data point that this is more complex (which of course we always want to limit as much as possible). OTOH introducing a new method on DataFrame needs a very high bar.

So I am -1 on the first for the new method name. I actually like the simplicity of this one.

All that said. I think this just needs more discussion and consensus.

jreback · 2019-07-15T20:44:03Z

@simonjayhawkins

if so, i'd be wary about indexing performance. in #27209 which is just a POC, allowing another type to be part of a key requires additional checks to allow the key to 'pass through' to the appropriate logic.

I am not at all concerned with indexing perf. These are all vectorized things; mis-using (say in a loop) is non-idiomatic and not something we really can do much about generally.

So #27209 is actually fine from a perf perspective.

topper-123 · 2019-07-16T23:05:11Z

To keep the discussion going:

.loc is a general purpose method, so it will be very difficult to add type specific functionality for many axis types.

In that respect I think select could become the specialized single axis type specific filtering option. So if the user needs something specific to a axis data type (regex here, but it could be other things), they would reach for select, e.g. through an accessor. We would then move type specific subsetting methods from the DataFrame name space, so instead of df.at_time, it would be df.select.at_time etc.

codecov · 2019-07-25T23:50:20Z

Codecov Report

Merging #27363 into master will decrease coverage by 0.02%.
The diff coverage is 96.42%.

@@            Coverage Diff             @@
##           master   #27363      +/-   ##
==========================================
- Coverage      93%   92.98%   -0.03%     
==========================================
  Files         182      182              
  Lines       50311    50343      +32     
==========================================
+ Hits        46793    46809      +16     
- Misses       3518     3534      +16

Flag	Coverage Δ
#multiple	`91.65% <96.42%> (-0.02%)`	⬇️
#single	`42.37% <46.42%> (-0.12%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexing.py	`94.85% <96.42%> (-0.21%)`	⬇️
pandas/io/gbq.py	`88.88% <0%> (-11.12%)`	⬇️
pandas/core/series.py	`93.32% <0%> (-0.54%)`	⬇️
pandas/core/frame.py	`97.09% <0%> (-0.18%)`	⬇️
pandas/core/internals/blocks.py	`95.6% <0%> (-0.15%)`	⬇️
pandas/core/groupby/generic.py	`88.93% <0%> (-0.09%)`	⬇️
pandas/core/generic.py	`94.28% <0%> (-0.05%)`	⬇️
pandas/core/base.py	`98.65% <0%> (-0.01%)`	⬇️
pandas/core/groupby/base.py	`100% <0%> (ø)`	⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ebcfee4...01619b5. Read the comment docs.

codecov · 2019-07-25T23:50:23Z

Codecov Report

Merging #27363 into master will decrease coverage by 0.02%.
The diff coverage is 96.42%.

@@            Coverage Diff             @@
##           master   #27363      +/-   ##
==========================================
- Coverage      93%   92.98%   -0.03%     
==========================================
  Files         182      182              
  Lines       50311    50343      +32     
==========================================
+ Hits        46793    46809      +16     
- Misses       3518     3534      +16

Flag	Coverage Δ
#multiple	`91.65% <96.42%> (-0.02%)`	⬇️
#single	`42.37% <46.42%> (-0.12%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexing.py	`94.85% <96.42%> (-0.21%)`	⬇️
pandas/io/gbq.py	`88.88% <0%> (-11.12%)`	⬇️
pandas/core/series.py	`93.32% <0%> (-0.54%)`	⬇️
pandas/core/frame.py	`97.09% <0%> (-0.18%)`	⬇️
pandas/core/internals/blocks.py	`95.6% <0%> (-0.15%)`	⬇️
pandas/core/groupby/generic.py	`88.93% <0%> (-0.09%)`	⬇️
pandas/core/generic.py	`94.28% <0%> (-0.05%)`	⬇️
pandas/core/base.py	`98.65% <0%> (-0.01%)`	⬇️
pandas/core/groupby/base.py	`100% <0%> (ø)`	⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ebcfee4...01619b5. Read the comment docs.

topper-123 · 2019-07-26T01:11:02Z

I've got a simpler implementation for this, so please wait to merge. Will look into it tomorrow.

topper-123 · 2019-07-26T14:18:40Z

I've made a simpler implementation, where it doesn't quite feels as if the various parts of .loc/.iloc./.at/.iat step on each toes quite as much. So this version will IMO be much easier to maintain, compared to the previous version.

That makes my worry about adding regex functionality to .loc go away, and I'm neutral whether we add a special select_str method or add regex functionality to .loc as in this PR (but I do think this should be added one way or another).

simonjayhawkins · 2019-08-01T12:12:03Z

To keep the discussion going:

just a thought!

a bit out of scope for this PR but may help in deciding on the best course of action.

in this case regex is a boolean. so we could also have regex='rows' and regex='columns' if we wanted to limit the scope of regex.

going further off topic, but could still be relevant...

if we also used a similar api for a closed argument, then we wouldn't be able to limit the scope the same way as just suggested.

so would probably require something like .loc(closed='right', scope='rows')[..., ...]

but this would still limit the flexibility as the default would only be overridden on one axis.

so maybe .loc(closed='right', scope='rows')(closed='left', scope='columns')[..., ...]

starting to get a bit clunky now!

accept a dictionary for closed?

``.loc(closed=dict{'rows':'right', 'columns':'left'})[..., ...]`

so back to regex...

if the scope keyword was added

.loc(regex=True, closed='right', scope='rows')

would now be confusing and less flexible. is the scope for closed or regex?

so the dictionary implementation for closed seems more appropriate.

and back to regex...

does being able to limit the scope using regex='columns'/'rows' make sense? not necessarily in this PR though.

The .loc interface should anyway accept one axis selector that is not a string (e.g. df.loc(regex=True)[df.A ==1, "^col_"], so letting neither be a string would not be a big difference and will ust mean no effect from setting the regex parameter.

could then potentially be ..

df.loc(regex='columns')[df.A ==1, "^col_"]

to allow additional validation of the keys.

ghost · 2019-08-08T18:48:01Z

This roundabout way of passing arguments __getitem__ is unidiomatic and I share the reluctance that's already been expressed. I do have an interest in adopting a non-intrusive way to extend loc. As you know, I want it loc to grow a closed='left' argument (xref #27060). So here's a new new alternative. Could we repurpose loc.__call__ in its entirety and migrate loc into becoming a function?

df.loc(...)[...] # NO

df.loc[...] # continues to work
df.loc(...) # equivalent
df.loc(..., closed='left') # no longer contentious?
df.loc(..., regex='foo') # possible

There might be some impedance mismatch with how __getitem__ handles its arguments, but I think it should be fairly minimal. This might actually enable cleanups of some ambiguous cases in loc.

shoyer · 2019-08-08T19:31:53Z

Could we repurpose loc.__call__ in its entirety and migrate loc into becoming a function?

I think it would be better to pick a new name for the method. It's too easy to confuse parentheses/brackets, especially for inexperienced Python users.

A new method would also gives the ability to keep-up the API, by using keyword arguments rather than position for selecting on the index or columns, e.g., df.sel(index='a') and df.sel(columns='b') vs the current df.loc['a', :] and df.loc[:, 'b'].

topper-123 · 2019-08-08T21:20:59Z

A function would drop the ability to slice inputs (df.loc['a':'c']), which I think we should keep. It's too late to change fundamental functionality IMO. I'd be ok with new functionality, e.g. select as I've mentioned previously.

ghost · 2019-08-08T22:04:35Z

I think it would be better to pick a new name for the method.

But you know I argued for that vigorously and was completely ignored. Do you have a plan for achieving a different outcome? And this loc()[] suggestion has divided the maintainers evenly down the middle and seems like another unlikely prospect. How do we move this issue forward?

jreback · 2019-08-08T23:24:55Z

a function is a nonstarter because you can not assign values; that’s they reason we use getitem/setitem

WillAyd · 2019-10-11T21:55:11Z

Should we punt this one for now and close?

WillAyd · 2019-10-22T01:55:07Z

I think this is stalled so going to close to clean up the queue. @topper-123 feel free to reopen if you disagree

topper-123 · 2019-10-24T02:57:38Z

I would like #27617 to get accepted, but that requires that its regex filtering capabilities get a home elsewhere, so the proposal here was to locate it in .loc. I've had another propoasl, where we make a new select or select_str method in #27340.

Th underlying question is probably if we're satisfied with the capabilities of .filter (lots of overlap with other methods, only the regex capability is unique) and that it filters on the axes; I would like the method called filter to filter on values.

topper-123 force-pushed the loc_regex branch 2 times, most recently from f668877 to 4a992e5 Compare July 12, 2019 16:34

topper-123 mentioned this pull request Jul 12, 2019

ENH: add NDFrame.select_str #27340

Closed

5 tasks

topper-123 added Indexing Related to indexing on series/frames, not to indexes themselves Enhancement labels Jul 12, 2019

topper-123 force-pushed the loc_regex branch from 4a992e5 to beb9d67 Compare July 12, 2019 17:31

simonjayhawkins reviewed Jul 12, 2019

View reviewed changes

topper-123 force-pushed the loc_regex branch from 2fdbdce to a12aa31 Compare July 12, 2019 21:30

WillAyd requested changes Jul 12, 2019

View reviewed changes

topper-123 force-pushed the loc_regex branch from a12aa31 to 191ef8d Compare July 12, 2019 22:56

topper-123 force-pushed the loc_regex branch from 191ef8d to 51d2c73 Compare July 13, 2019 06:19

ENH: add regex parameter to .loc

0c656fd

topper-123 force-pushed the loc_regex branch from 51d2c73 to 1037527 Compare July 25, 2019 21:31

topper-123 added 2 commits July 26, 2019 00:07

NotImplementedError + doc string

87a8ce6

Move to 1.0

01619b5

topper-123 force-pushed the loc_regex branch from 1037527 to 01619b5 Compare July 25, 2019 23:50

topper-123 force-pushed the loc_regex branch from 3aef647 to 52f5ce0 Compare July 26, 2019 11:14

Refactored regex handling in LocIndexer

5a7c20d

topper-123 force-pushed the loc_regex branch from 52f5ce0 to 5a7c20d Compare July 26, 2019 14:04

topper-123 mentioned this pull request Jul 26, 2019

DEPR: Deprecate NDFrame.filter #27617

Closed

5 tasks

simonjayhawkins mentioned this pull request Aug 1, 2019

POC: Implement pythonic label-based index slicing NDFrame.slice(closed='left') #27060

Closed

3 tasks

ghost mentioned this pull request Aug 8, 2019

Discussion: Why is loc label-based slicing right-inclusive? #27059

Closed

WillAyd closed this Oct 22, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: add regex parameter to .loc #27363

ENH: add regex parameter to .loc #27363

topper-123 commented Jul 12, 2019

simonjayhawkins Jul 12, 2019

topper-123 Jul 12, 2019 •

edited

Loading

WillAyd Jul 12, 2019

topper-123 Jul 12, 2019 •

edited

Loading

simonjayhawkins Jul 12, 2019

topper-123 Jul 12, 2019

pep8speaks commented Jul 12, 2019 •

edited

Loading

WillAyd Jul 12, 2019

WillAyd Jul 12, 2019

jorisvandenbossche commented Jul 13, 2019

TomAugspurger commented Jul 15, 2019

simonjayhawkins commented Jul 15, 2019

WillAyd commented Jul 15, 2019

simonjayhawkins commented Jul 15, 2019

topper-123 commented Jul 15, 2019

jreback commented Jul 15, 2019

jreback commented Jul 15, 2019 •

edited

Loading

topper-123 commented Jul 16, 2019

codecov bot commented Jul 25, 2019

codecov bot commented Jul 25, 2019

topper-123 commented Jul 26, 2019

topper-123 commented Jul 26, 2019

simonjayhawkins commented Aug 1, 2019

ghost commented Aug 8, 2019 •

edited by ghost

Loading

shoyer commented Aug 8, 2019

topper-123 commented Aug 8, 2019

ghost commented Aug 8, 2019 •

edited by ghost

Loading

jreback commented Aug 8, 2019

WillAyd commented Oct 11, 2019

WillAyd commented Oct 22, 2019

topper-123 commented Oct 24, 2019

ENH: add regex parameter to .loc #27363

ENH: add regex parameter to .loc #27363

Conversation

topper-123 commented Jul 12, 2019

simonjayhawkins Jul 12, 2019

Choose a reason for hiding this comment

topper-123 Jul 12, 2019 • edited Loading

Choose a reason for hiding this comment

WillAyd Jul 12, 2019

Choose a reason for hiding this comment

topper-123 Jul 12, 2019 • edited Loading

Choose a reason for hiding this comment

simonjayhawkins Jul 12, 2019

Choose a reason for hiding this comment

topper-123 Jul 12, 2019

Choose a reason for hiding this comment

pep8speaks commented Jul 12, 2019 • edited Loading

Comment last updated at 2019-07-26 14:04:40 UTC

WillAyd Jul 12, 2019

Choose a reason for hiding this comment

WillAyd Jul 12, 2019

Choose a reason for hiding this comment

jorisvandenbossche commented Jul 13, 2019

TomAugspurger commented Jul 15, 2019

simonjayhawkins commented Jul 15, 2019

WillAyd commented Jul 15, 2019

simonjayhawkins commented Jul 15, 2019

topper-123 commented Jul 15, 2019

jreback commented Jul 15, 2019

jreback commented Jul 15, 2019 • edited Loading

topper-123 commented Jul 16, 2019

codecov bot commented Jul 25, 2019

Codecov Report

codecov bot commented Jul 25, 2019

Codecov Report

topper-123 commented Jul 26, 2019

topper-123 commented Jul 26, 2019

simonjayhawkins commented Aug 1, 2019

ghost commented Aug 8, 2019 • edited by ghost Loading

shoyer commented Aug 8, 2019

topper-123 commented Aug 8, 2019

ghost commented Aug 8, 2019 • edited by ghost Loading

jreback commented Aug 8, 2019

WillAyd commented Oct 11, 2019

WillAyd commented Oct 22, 2019

topper-123 commented Oct 24, 2019

topper-123 Jul 12, 2019 •

edited

Loading

topper-123 Jul 12, 2019 •

edited

Loading

pep8speaks commented Jul 12, 2019 •

edited

Loading

jreback commented Jul 15, 2019 •

edited

Loading

ghost commented Aug 8, 2019 •

edited by ghost

Loading

ghost commented Aug 8, 2019 •

edited by ghost

Loading