Skip to content

ENH: Series.str.extract returns regex matches more conveniently #4696

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 20, 2013

Conversation

danielballan
Copy link
Contributor

Closes #4685

# Invert keys/values of regex.groupsindex
names = dict(zip(regex.groupindex.values(), regex.groupindex.keys()))
result.columns = \
[names.get(1 + i, str(1 + i)) for i in range(regex.groups)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason not to start at zero?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the re module's convention of counting groups at 1. Now that I think about pandas's get method ignores this and just starts at 0. I'll do the same....

@ghost ghost assigned jtratner Aug 31, 2013
@jtratner
Copy link
Contributor

@danielballan in addition to making the change in the ValueError, could you add a few test cases? (edge cases, named groups, unnamed groups, etc)

@jtratner
Copy link
Contributor

@danielballan how's this coming? I really want to use something like this in work that's coming up so if you don't have time to finish it up, I'd be happy to take it up.

@danielballan
Copy link
Contributor Author

@jtratner If that offer stands, would you handle the docs? I fixed the py3 problem flagged by Travis. I'll add some tests before I quit for the day.

@danielballan
Copy link
Contributor Author

OK, I added tests analogous to the ones used for str.extract, plus tests for edge cases including no groups, one group with no matches and with some matches, and two groups with no matches and with some matches.

P.S. I may have flubbed something while rebasing. Not sure why "Move time operations for Series..." is showing up there.

@jtratner
Copy link
Contributor

@danielballan since it's only one commit, can you just cherry-pick it on
top of master? Probably much easier. Also, if at all possible, could you
strip out the trailing whitespace in your tests? :)

git remote add upstream https://www.github.com/pydata/pandas.git
git fetch upstream
git checkout upstream/master
git cherry-pick bc82745
git branch str_extract_backup str_extract
git checkout -b str_extract
git push origin --force

Also, could you add at least one test case that creates something with
unicode and make sure that works without issue?

Other notes, you don't need to check for isinstance with DataFrame because
assert_frame_equal will do that for you already.

Finally, we'd discussed using named match groups and converting those into
named columns. Did you incorporate that here? Can you add a test case for
that?

Also, need to test what happens if you make a match group optional. I.e.
useful to have is something like this
(?P<kind>[LE])?_?(?P<intensity>[0-9]+). I think this is totally trivial
based on how you've set it up, but just useful to have in there.

One thing that I was wondering with reading your changes: would it make
more sense to just create list of lists instead of Series? Seems like that
would have less overhead. Similarly, maybe have empty_row just start our
with nans?

@jtratner
Copy link
Contributor

nvm on unicode - I see you already have that.

@hayd
Copy link
Contributor

hayd commented Sep 16, 2013

I'm of the opinion this is cool/useful enough to go in What's New.

@jreback
Copy link
Contributor

jreback commented Sep 16, 2013

also in doc section on strings.....you can use the same example

@danielballan
Copy link
Contributor Author

I found that isinstance + assert_frame_equal pattern in test_match, and I thought maybe the author (Wes?) knew something I didn't. I will correct both.

I wrote a test for one named group, multiple named groups, and a mix of named and unnamed groups. (Unnamed groups fall back to numbering.)

I wrote tests for optional groups. Happily, they already work as you'd expect.

I don't think the list-of-lists thing sounds very useful, and I don't think the overheard here is very serious. I imagine a common use case to be starting with a Series of strings and "converting" them into a Series/DataFrame of cleaned-up or more useful strings with the same index. Thoughts?

@hayd
Copy link
Contributor

hayd commented Sep 16, 2013

Letter/number example from previous issue was good, it showed off groupnames, Should have row with no match.

Also an example with dynamic number of groups (enumerated) ? Is that possible?

@jtratner
Copy link
Contributor

Also, I think I might have found an edge case, but I'm not sure if I'm
using the most recent version. Can you test whether this raises an error?
Series(['a'] * 1000).str.extract('(a)(b)?')

On Mon, Sep 16, 2013 at 3:08 PM, Andy Hayden [email protected]:

Letter/number example from previous issue was good, it showed off
groupnames, Should have row with no match.

Also an example with dynamic number of groups (enumerated) ? Is that
possible?


Reply to this email directly or view it on GitHubhttps://github.com//pull/4696#issuecomment-24535808
.

@danielballan
Copy link
Contributor Author

Looks good to me:

In [1]: Series(['a'] * 1000).str.extract('(a)(b)?').head()
Out[1]: 
   0   1
0  a NaN
1  a NaN
2  a NaN
3  a NaN
4  a NaN

I went ahead and wrote some docs; as someone suggested, #4685 has all the examples we need. I put it in basics.rst, and I'm about to put something in v0.13.0. My question: Is this considered an API Change, an Enhancement, or both?

@hayd
Copy link
Contributor

hayd commented Sep 16, 2013

I don't think there is an api change here. Just an enhancement (new feature).

@hayd
Copy link
Contributor

hayd commented Sep 16, 2013

I was going to suggest example with * or + in the groupname, but I think this behaviour is strange in re is it the same in this PR:

In [12]: re.findall('(a)(b|c)*', 'abc')
Out[12]: [('a', 'c')]

perhaps worth mentioning :s

@danielballan
Copy link
Contributor Author

@hayd Yes:

In [6]: Series(['abc']).str.extract('(a)(b|c)*')
Out[6]: 
   0  1
0  a  c

Separate issue though, no? More of an re issue? If what you mean is "b or c inclusive," this is an option:

In [8]: Series(['abc']).str.extract('(a)(b)?(c)?')
Out[8]: 
   0  1  2
0  a  b  c

Latest rebase includes docs. Otherwise code has not been touched, so I expect this to pass Travis.

@danielballan
Copy link
Contributor Author

@hayd Are you thinking of a test case with * or + in the group name? From the way this is implemented, if there's a problem with that, the problem would have to be in re, not pandas. I'm happy to add a test case -- just checking that this is what you mean.

@hayd
Copy link
Contributor

hayd commented Sep 16, 2013

Yeah, this is a subtlety of re. Not to worry, was just throwing it out there.

@danielballan
Copy link
Contributor Author

OK, now I get you. I will add an example like this to the docs. If nothing else, we'll have something to link to when this comes up on SO. :- )

@jtratner
Copy link
Contributor

@hayd I think that's the correct behavior with re... (b|c)* means: 'capture one group that can be b|c 0 or more times' right?

@jtratner
Copy link
Contributor

One other test case (sorry to pile on here!) - can you test that a non-capturing group (e.g., (?:ab*)) works correctly? I.e., it shouldn't add a column...

@hayd
Copy link
Contributor

hayd commented Sep 16, 2013

@jtratner yeah that's what I was expecting it to do...

I think it is worth mentioning (but not correcting), that writing something like (a)* doesn't do the obvious thing and capture multiple groups (but captures the last one), I expect because groups are stored as a dict which gets updated on each sighting :(

@danielballan
Copy link
Contributor Author

Can we live with this?

N.B. The subtle (strange) behavior of re.findall with + and *
illustrated here:

.. ipython:: python

re.findall('(a)(b|c)', 'abc')
re.findall('(a)(b|c)*', 'abc')

also appears in extract:

.. ipython:: python

Series(['abc']).str.extract('(a)(b|c)')
Series(['abc']).str.extract('(a)(b|c)*')

@danielballan
Copy link
Contributor Author

Non-capturing groups work correctly:

In [36]: Series(['A1', 'B2', 'C3']).str.extract('([AB])(?:[123])')
Out[36]: 
0      A
1      B
2    NaN
dtype: object

In [44]: Series(['A11', 'B22', 'C33']).str.extract('([AB])([123])(?:[123])')
Out[44]: 
     0    1
0    A    1
1    B    2
2  NaN  NaN

Will add these as test cases tomorrow. Are there others I should be thinking of?

@jtratner
Copy link
Contributor

I was pretty sure they'd work, just wanted to make sure we covered the
bases. I'm going to look at this again tomorrow/tonight, but we're probably
pretty close to merging here :)

@danielballan
Copy link
Contributor Author

Great. I hope I do not sound impatient. Thanks for your diligent support, all.

@danielballan
Copy link
Contributor Author

Latest rebase added the two non-capturing tests above, plus a test showing that a pattern with only non-capturing groups raises, just as a pattern with no groups does.

@jtratner
Copy link
Contributor

Thank you for putting this together (I know pandas can be a bit of a moving
target changes-wise)

@jtratner
Copy link
Contributor

@danielballan is this ready to go? Would be nice if you could quickly rebase it first, so the history looks nicer.

@danielballan
Copy link
Contributor Author

@jtratner Yes, it is ready to go. It's already rebased to one commit. If I need to do something else with the rebase, please explain. Thanks.

@jreback
Copy link
Contributor

jreback commented Sep 20, 2013

@danielballan can you do a quick release notes entry? (one-liner ok)

@danielballan
Copy link
Contributor Author

@jtratner Is this a release notes entry? Does it belong somewhere else as well?

@cpcloud
Copy link
Member

cpcloud commented Sep 20, 2013

release notes is doc/source/release.rst and that's where every change goes. the v0.*.*.txt is where major changes go

@jreback
Copy link
Contributor

jreback commented Sep 20, 2013

@danielballan see the on-line version of the release.rst file: http://pandas.pydata.org/pandas-docs/dev/release.html,
put this in the new features section

@danielballan
Copy link
Contributor Author

OK, thanks. Done.

@jreback
Copy link
Contributor

jreback commented Sep 20, 2013

@danielballan perfect thanks

jreback added a commit that referenced this pull request Sep 20, 2013
ENH: Series.str.extract returns regex matches more conveniently
@jreback jreback merged commit 351ba92 into pandas-dev:master Sep 20, 2013
@jreback
Copy link
Contributor

jreback commented Sep 20, 2013

thanks for the PR!

@hayd
Copy link
Contributor

hayd commented Sep 20, 2013

awesome!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: More convenient regex?
5 participants