Skip to content

ENH: add regex parameter to .loc #27363

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

topper-123
Copy link
Contributor

Adds a regex parameter to NDFrame.loc.__call__, so we can now do:

>>> df = pd.DataFrame(1, index=["A", "AB", "BC", "CD"], columns=["CD", "BC", "AB", "A"])
>>> df.loc(regex=True)["B", "B"]
    BC  AB
AB   1   1
BC   1   1

This is intended to be followed up by deprecating NDFrame.filter.

This implementation does not allow setting values using regex=True, as the most urgent motivation is to get a replacement for NDFrame.filter. Setting values can come later, even after version 1.0.

@topper-123 topper-123 force-pushed the loc_regex branch 2 times, most recently from f668877 to 4a992e5 Compare July 12, 2019 16:34
@topper-123 topper-123 mentioned this pull request Jul 12, 2019
5 tasks
@topper-123 topper-123 added Indexing Related to indexing on series/frames, not to indexes themselves Enhancement labels Jul 12, 2019
@@ -1822,10 +1855,21 @@ def _getitem_axis(self, key, axis: int):
indexer[axis] = locs
return self.obj.iloc[tuple(indexer)]

if self.regex and isinstance(key, str):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would we want to raise if regex and key is not str?

Copy link
Contributor Author

@topper-123 topper-123 Jul 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO no.

The .loc interface should anyway accept one axis selector that is not a string (e.g. df.loc(regex=True)[df.A ==1, "^col_"], so letting neither be a string would not be a big difference and will ust mean no effect from setting the regex parameter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see both points but tend to agree more with @simonjayhawkins. At the very least wouldn't we want to accept other types (bytes, re.pater)?

Copy link
Contributor Author

@topper-123 topper-123 Jul 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it could accept bytes. What do you mean by re.pater?

EDIT: I guess you mean re.Pattern. Is it possible to do a regex searches against a array of re.Pattern objects?


def __setitem__(self, key, value):
if self.regex:
raise TypeError("Inserting with regex not supported")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be NotImplementedError?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I will change it.

@pep8speaks
Copy link

pep8speaks commented Jul 12, 2019

Hello @topper-123! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-07-26 14:04:40 UTC

@@ -1761,6 +1794,27 @@ def _get_partial_string_timestamp_match_key(self, key, labels):

return key

def _get_regex_mappings(self, key, axis=None):
import re
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can move this to the top of the module

@@ -1822,10 +1855,21 @@ def _getitem_axis(self, key, axis: int):
indexer[axis] = locs
return self.obj.iloc[tuple(indexer)]

if self.regex and isinstance(key, str):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see both points but tend to agree more with @simonjayhawkins. At the very least wouldn't we want to accept other types (bytes, re.pater)?

@jorisvandenbossche
Copy link
Member

I am personally -1 on this proposal, and rather instead would vote for the approach in #27340 (but using select, if that is possible?)

It's of course a personal taste, but those keywords to loc are IMO rather unnatural, difficult to discover (you don't typically go looking for the keyword arguments of a [] indexing method), and make the behaviour of what is passed to loc[..] even more complex.

@TomAugspurger
Copy link
Contributor

Agreed with @jorisvandenbossche

@simonjayhawkins
Copy link
Member

Agreed with @jorisvandenbossche

OTOH using keywords to extend the loc indexer could make things more consistent as different capabilities are added eg. closed argument #27209 (comment) instead of adding various different additional methods and apis to extend indexing functionality.

@WillAyd
Copy link
Member

WillAyd commented Jul 15, 2019

Would allowing loc to natively just handle a regex be an option then? Obviously in other parts of our API we have regex=True as a parameter in some function but that seems unnecessary, so if we weren't tied to that we could theoretically implement this without a lot of API additions

@simonjayhawkins
Copy link
Member

Would allowing loc to natively just handle a regex be an option then?

do you mean allow a regular expression object to be part of the key?

if so, i'd be wary about indexing performance. in #27209 which is just a POC, allowing another type to be part of a key requires additional checks to allow the key to 'pass through' to the appropriate logic.

i liked the suggestion to use a keyword instead, #27209 (comment) as I think this eliminates the need for these additional checks until the appropriate part of the code is reached and therefore also reducing code complexity (which seemed to be an argument against this approach #27363 (comment))

@topper-123
Copy link
Contributor Author

I actually agree with @jorisvandenbossche on this.

This PR has been significantly more complex to implement than #27340 and will therefore also be more difficult to maintain. This PR required understanding of the class hierarchy of _LocIndexer and there's always the danger of stepping on other stuff in _LocIndexer or on its siblings when doing changes here. This will make changing other stuff in .loc/.iloc/.at/.iat more complex also.

#27340 OTOH was very simple to implement and I more-or-less considered it syntactic sugar for the common use case of selecting columns from a regex.

So the approach in #27340 will be better for code maintainability.

@jreback?

@jreback
Copy link
Contributor

jreback commented Jul 15, 2019

I think we need to decide how we want to add this functionailty. @topper-123 that is a good data point that this is more complex (which of course we always want to limit as much as possible). OTOH introducing a new method on DataFrame needs a very high bar.

So I am -1 on the first for the new method name. I actually like the simplicity of this one.

All that said. I think this just needs more discussion and consensus.

@jreback
Copy link
Contributor

jreback commented Jul 15, 2019

@simonjayhawkins

if so, i'd be wary about indexing performance. in #27209 which is just a POC, allowing another type to be part of a key requires additional checks to allow the key to 'pass through' to the appropriate logic.

I am not at all concerned with indexing perf. These are all vectorized things; mis-using (say in a loop) is non-idiomatic and not something we really can do much about generally.

So #27209 is actually fine from a perf perspective.

@topper-123
Copy link
Contributor Author

To keep the discussion going:

.loc is a general purpose method, so it will be very difficult to add type specific functionality for many axis types.

In that respect I think select could become the specialized single axis type specific filtering option. So if the user needs something specific to a axis data type (regex here, but it could be other things), they would reach for select, e.g. through an accessor. We would then move type specific subsetting methods from the DataFrame name space, so instead of df.at_time, it would be df.select.at_time etc.

@codecov
Copy link

codecov bot commented Jul 25, 2019

Codecov Report

Merging #27363 into master will decrease coverage by 0.02%.
The diff coverage is 96.42%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #27363      +/-   ##
==========================================
- Coverage      93%   92.98%   -0.03%     
==========================================
  Files         182      182              
  Lines       50311    50343      +32     
==========================================
+ Hits        46793    46809      +16     
- Misses       3518     3534      +16
Flag Coverage Δ
#multiple 91.65% <96.42%> (-0.02%) ⬇️
#single 42.37% <46.42%> (-0.12%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexing.py 94.85% <96.42%> (-0.21%) ⬇️
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/core/series.py 93.32% <0%> (-0.54%) ⬇️
pandas/core/frame.py 97.09% <0%> (-0.18%) ⬇️
pandas/core/internals/blocks.py 95.6% <0%> (-0.15%) ⬇️
pandas/core/groupby/generic.py 88.93% <0%> (-0.09%) ⬇️
pandas/core/generic.py 94.28% <0%> (-0.05%) ⬇️
pandas/core/base.py 98.65% <0%> (-0.01%) ⬇️
pandas/core/groupby/base.py 100% <0%> (ø) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ebcfee4...01619b5. Read the comment docs.

1 similar comment
@codecov
Copy link

codecov bot commented Jul 25, 2019

Codecov Report

Merging #27363 into master will decrease coverage by 0.02%.
The diff coverage is 96.42%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #27363      +/-   ##
==========================================
- Coverage      93%   92.98%   -0.03%     
==========================================
  Files         182      182              
  Lines       50311    50343      +32     
==========================================
+ Hits        46793    46809      +16     
- Misses       3518     3534      +16
Flag Coverage Δ
#multiple 91.65% <96.42%> (-0.02%) ⬇️
#single 42.37% <46.42%> (-0.12%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexing.py 94.85% <96.42%> (-0.21%) ⬇️
pandas/io/gbq.py 88.88% <0%> (-11.12%) ⬇️
pandas/core/series.py 93.32% <0%> (-0.54%) ⬇️
pandas/core/frame.py 97.09% <0%> (-0.18%) ⬇️
pandas/core/internals/blocks.py 95.6% <0%> (-0.15%) ⬇️
pandas/core/groupby/generic.py 88.93% <0%> (-0.09%) ⬇️
pandas/core/generic.py 94.28% <0%> (-0.05%) ⬇️
pandas/core/base.py 98.65% <0%> (-0.01%) ⬇️
pandas/core/groupby/base.py 100% <0%> (ø) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ebcfee4...01619b5. Read the comment docs.

@topper-123
Copy link
Contributor Author

I've got a simpler implementation for this, so please wait to merge. Will look into it tomorrow.

@topper-123
Copy link
Contributor Author

I've made a simpler implementation, where it doesn't quite feels as if the various parts of .loc/.iloc./.at/.iat step on each toes quite as much. So this version will IMO be much easier to maintain, compared to the previous version.

That makes my worry about adding regex functionality to .loc go away, and I'm neutral whether we add a special select_str method or add regex functionality to .loc as in this PR (but I do think this should be added one way or another).

@simonjayhawkins
Copy link
Member

To keep the discussion going:

just a thought!

a bit out of scope for this PR but may help in deciding on the best course of action.

in this case regex is a boolean. so we could also have regex='rows' and regex='columns' if we wanted to limit the scope of regex.

going further off topic, but could still be relevant...

if we also used a similar api for a closed argument, then we wouldn't be able to limit the scope the same way as just suggested.

so would probably require something like .loc(closed='right', scope='rows')[..., ...]

but this would still limit the flexibility as the default would only be overridden on one axis.

so maybe .loc(closed='right', scope='rows')(closed='left', scope='columns')[..., ...]

starting to get a bit clunky now!

accept a dictionary for closed?

``.loc(closed=dict{'rows':'right', 'columns':'left'})[..., ...]`

so back to regex...

if the scope keyword was added

.loc(regex=True, closed='right', scope='rows')

would now be confusing and less flexible. is the scope for closed or regex?

so the dictionary implementation for closed seems more appropriate.

and back to regex...

does being able to limit the scope using regex='columns'/'rows' make sense? not necessarily in this PR though.

The .loc interface should anyway accept one axis selector that is not a string (e.g. df.loc(regex=True)[df.A ==1, "^col_"], so letting neither be a string would not be a big difference and will ust mean no effect from setting the regex parameter.

could then potentially be ..

df.loc(regex='columns')[df.A ==1, "^col_"]

to allow additional validation of the keys.

@ghost
Copy link

ghost commented Aug 8, 2019

This roundabout way of passing arguments __getitem__ is unidiomatic and I share the reluctance that's already been expressed. I do have an interest in adopting a non-intrusive way to extend loc. As you know, I want it loc to grow a closed='left' argument (xref #27060). So here's a new new alternative. Could we repurpose loc.__call__ in its entirety and migrate loc into becoming a function?

df.loc(...)[...] # NO

df.loc[...] # continues to work
df.loc(...) # equivalent
df.loc(..., closed='left') # no longer contentious?
df.loc(..., regex='foo') # possible

There might be some impedance mismatch with how __getitem__ handles its arguments, but I think it should be fairly minimal. This might actually enable cleanups of some ambiguous cases in loc.

@shoyer
Copy link
Member

shoyer commented Aug 8, 2019

Could we repurpose loc.__call__ in its entirety and migrate loc into becoming a function?

I think it would be better to pick a new name for the method. It's too easy to confuse parentheses/brackets, especially for inexperienced Python users.

A new method would also gives the ability to keep-up the API, by using keyword arguments rather than position for selecting on the index or columns, e.g., df.sel(index='a') and df.sel(columns='b') vs the current df.loc['a', :] and df.loc[:, 'b'].

@topper-123
Copy link
Contributor Author

A function would drop the ability to slice inputs (df.loc['a':'c']), which I think we should keep. It's too late to change fundamental functionality IMO. I'd be ok with new functionality, e.g. select as I've mentioned previously.

@ghost
Copy link

ghost commented Aug 8, 2019

I think it would be better to pick a new name for the method.

But you know I argued for that vigorously and was completely ignored. Do you have a plan for achieving a different outcome? And this loc()[] suggestion has divided the maintainers evenly down the middle and seems like another unlikely prospect. How do we move this issue forward?

@jreback
Copy link
Contributor

jreback commented Aug 8, 2019

a function is a nonstarter because you can not assign values; that’s they reason we use getitem/setitem

@WillAyd
Copy link
Member

WillAyd commented Oct 11, 2019

Should we punt this one for now and close?

@WillAyd
Copy link
Member

WillAyd commented Oct 22, 2019

I think this is stalled so going to close to clean up the queue. @topper-123 feel free to reopen if you disagree

@WillAyd WillAyd closed this Oct 22, 2019
@topper-123
Copy link
Contributor Author

I would like #27617 to get accepted, but that requires that its regex filtering capabilities get a home elsewhere, so the proposal here was to locate it in .loc. I've had another propoasl, where we make a new select or select_str method in #27340.

Th underlying question is probably if we're satisfied with the capabilities of .filter (lots of overlap with other methods, only the regex capability is unique) and that it filters on the axes; I would like the method called filter to filter on values.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rename NDFrame.filter to .select?
8 participants