-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Series.str.extract returns regex matches more conveniently #4696
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
# Invert keys/values of regex.groupsindex | ||
names = dict(zip(regex.groupindex.values(), regex.groupindex.keys())) | ||
result.columns = \ | ||
[names.get(1 + i, str(1 + i)) for i in range(regex.groups)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason not to start at zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following the re module's convention of counting groups at 1. Now that I think about pandas's get
method ignores this and just starts at 0. I'll do the same....
@danielballan in addition to making the change in the ValueError, could you add a few test cases? (edge cases, named groups, unnamed groups, etc) |
@danielballan how's this coming? I really want to use something like this in work that's coming up so if you don't have time to finish it up, I'd be happy to take it up. |
@jtratner If that offer stands, would you handle the docs? I fixed the py3 problem flagged by Travis. I'll add some tests before I quit for the day. |
OK, I added tests analogous to the ones used for P.S. I may have flubbed something while rebasing. Not sure why "Move time operations for Series..." is showing up there. |
@danielballan since it's only one commit, can you just cherry-pick it on
Also, could you add at least one test case that creates something with Other notes, you don't need to check for isinstance with DataFrame because Finally, we'd discussed using named match groups and converting those into Also, need to test what happens if you make a match group optional. I.e. One thing that I was wondering with reading your changes: would it make |
nvm on unicode - I see you already have that. |
I'm of the opinion this is cool/useful enough to go in What's New. |
also in doc section on strings.....you can use the same example |
I found that isinstance + assert_frame_equal pattern in I wrote a test for one named group, multiple named groups, and a mix of named and unnamed groups. (Unnamed groups fall back to numbering.) I wrote tests for optional groups. Happily, they already work as you'd expect. I don't think the list-of-lists thing sounds very useful, and I don't think the overheard here is very serious. I imagine a common use case to be starting with a Series of strings and "converting" them into a Series/DataFrame of cleaned-up or more useful strings with the same index. Thoughts? |
Letter/number example from previous issue was good, it showed off groupnames, Should have row with no match. Also an example with dynamic number of groups (enumerated) ? Is that possible? |
Also, I think I might have found an edge case, but I'm not sure if I'm On Mon, Sep 16, 2013 at 3:08 PM, Andy Hayden [email protected]:
|
Looks good to me:
I went ahead and wrote some docs; as someone suggested, #4685 has all the examples we need. I put it in basics.rst, and I'm about to put something in v0.13.0. My question: Is this considered an API Change, an Enhancement, or both? |
I don't think there is an api change here. Just an enhancement (new feature). |
I was going to suggest example with * or + in the groupname, but I think this behaviour is strange in
perhaps worth mentioning :s |
@hayd Yes:
Separate issue though, no? More of an
Latest rebase includes docs. Otherwise code has not been touched, so I expect this to pass Travis. |
@hayd Are you thinking of a test case with * or + in the group name? From the way this is implemented, if there's a problem with that, the problem would have to be in |
Yeah, this is a subtlety of re. Not to worry, was just throwing it out there. |
OK, now I get you. I will add an example like this to the docs. If nothing else, we'll have something to link to when this comes up on SO. :- ) |
@hayd I think that's the correct behavior with re... |
One other test case (sorry to pile on here!) - can you test that a non-capturing group (e.g., |
@jtratner yeah that's what I was expecting it to do... I think it is worth mentioning (but not correcting), that writing something like (a)* doesn't do the obvious thing and capture multiple groups (but captures the last one), I expect because groups are stored as a dict which gets updated on each sighting :( |
Can we live with this? N.B. The subtle (strange) behavior of .. ipython:: python re.findall('(a)(b|c)', 'abc') also appears in .. ipython:: python Series(['abc']).str.extract('(a)(b|c)') |
Non-capturing groups work correctly:
Will add these as test cases tomorrow. Are there others I should be thinking of? |
I was pretty sure they'd work, just wanted to make sure we covered the |
Great. I hope I do not sound impatient. Thanks for your diligent support, all. |
Latest rebase added the two non-capturing tests above, plus a test showing that a pattern with only non-capturing groups raises, just as a pattern with no groups does. |
Thank you for putting this together (I know pandas can be a bit of a moving |
@danielballan is this ready to go? Would be nice if you could quickly rebase it first, so the history looks nicer. |
@jtratner Yes, it is ready to go. It's already rebased to one commit. If I need to do something else with the rebase, please explain. Thanks. |
@danielballan can you do a quick release notes entry? (one-liner ok) |
release notes is |
@danielballan see the on-line version of the release.rst file: http://pandas.pydata.org/pandas-docs/dev/release.html, |
OK, thanks. Done. |
@danielballan perfect thanks |
ENH: Series.str.extract returns regex matches more conveniently
thanks for the PR! |
awesome! |
Closes #4685