ENH: Series.str.extract returns regex matches more conveniently #4696

danielballan · 2013-08-27T19:54:13Z

hayd · 2013-08-27T19:59:12Z

pandas/core/strings.py

+        # Invert keys/values of regex.groupsindex
+        names = dict(zip(regex.groupindex.values(), regex.groupindex.keys()))
+        result.columns = \
+            [names.get(1 + i, str(1 + i)) for i in range(regex.groups)]


any reason not to start at zero?

Following the re module's convention of counting groups at 1. Now that I think about pandas's get method ignores this and just starts at 0. I'll do the same....

jtratner · 2013-08-31T23:31:06Z

@danielballan in addition to making the change in the ValueError, could you add a few test cases? (edge cases, named groups, unnamed groups, etc)

jtratner · 2013-09-15T03:37:41Z

@danielballan how's this coming? I really want to use something like this in work that's coming up so if you don't have time to finish it up, I'd be happy to take it up.

danielballan · 2013-09-16T14:11:00Z

@jtratner If that offer stands, would you handle the docs? I fixed the py3 problem flagged by Travis. I'll add some tests before I quit for the day.

danielballan · 2013-09-16T15:48:49Z

OK, I added tests analogous to the ones used for str.extract, plus tests for edge cases including no groups, one group with no matches and with some matches, and two groups with no matches and with some matches.

P.S. I may have flubbed something while rebasing. Not sure why "Move time operations for Series..." is showing up there.

jtratner · 2013-09-16T16:11:52Z

@danielballan since it's only one commit, can you just cherry-pick it on
top of master? Probably much easier. Also, if at all possible, could you
strip out the trailing whitespace in your tests? :)

git remote add upstream https://www.github.com/pydata/pandas.git
git fetch upstream
git checkout upstream/master
git cherry-pick bc82745
git branch str_extract_backup str_extract
git checkout -b str_extract
git push origin --force

Also, could you add at least one test case that creates something with
unicode and make sure that works without issue?

Other notes, you don't need to check for isinstance with DataFrame because
assert_frame_equal will do that for you already.

Finally, we'd discussed using named match groups and converting those into
named columns. Did you incorporate that here? Can you add a test case for
that?

Also, need to test what happens if you make a match group optional. I.e.
useful to have is something like this
(?P<kind>[LE])?_?(?P<intensity>[0-9]+). I think this is totally trivial
based on how you've set it up, but just useful to have in there.

One thing that I was wondering with reading your changes: would it make
more sense to just create list of lists instead of Series? Seems like that
would have less overhead. Similarly, maybe have empty_row just start our
with nans?

jtratner · 2013-09-16T16:12:18Z

nvm on unicode - I see you already have that.

hayd · 2013-09-16T17:29:10Z

I'm of the opinion this is cool/useful enough to go in What's New.

jreback · 2013-09-16T17:30:28Z

also in doc section on strings.....you can use the same example

danielballan · 2013-09-16T18:12:15Z

I found that isinstance + assert_frame_equal pattern in test_match, and I thought maybe the author (Wes?) knew something I didn't. I will correct both.

I wrote a test for one named group, multiple named groups, and a mix of named and unnamed groups. (Unnamed groups fall back to numbering.)

I wrote tests for optional groups. Happily, they already work as you'd expect.

I don't think the list-of-lists thing sounds very useful, and I don't think the overheard here is very serious. I imagine a common use case to be starting with a Series of strings and "converting" them into a Series/DataFrame of cleaned-up or more useful strings with the same index. Thoughts?

hayd · 2013-09-16T19:08:42Z

Letter/number example from previous issue was good, it showed off groupnames, Should have row with no match.

Also an example with dynamic number of groups (enumerated) ? Is that possible?

jtratner · 2013-09-16T21:22:39Z

Also, I think I might have found an edge case, but I'm not sure if I'm
using the most recent version. Can you test whether this raises an error?
Series(['a'] * 1000).str.extract('(a)(b)?')

On Mon, Sep 16, 2013 at 3:08 PM, Andy Hayden [email protected]:

Letter/number example from previous issue was good, it showed off
groupnames, Should have row with no match.

Also an example with dynamic number of groups (enumerated) ? Is that
possible?

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/4696#issuecomment-24535808
.

danielballan · 2013-09-16T21:28:24Z

Looks good to me:

In [1]: Series(['a'] * 1000).str.extract('(a)(b)?').head()
Out[1]: 
   0   1
0  a NaN
1  a NaN
2  a NaN
3  a NaN
4  a NaN

I went ahead and wrote some docs; as someone suggested, #4685 has all the examples we need. I put it in basics.rst, and I'm about to put something in v0.13.0. My question: Is this considered an API Change, an Enhancement, or both?

hayd · 2013-09-16T21:30:13Z

I don't think there is an api change here. Just an enhancement (new feature).

hayd · 2013-09-16T21:34:46Z

I was going to suggest example with * or + in the groupname, but I think this behaviour is strange in re is it the same in this PR:

In [12]: re.findall('(a)(b|c)*', 'abc')
Out[12]: [('a', 'c')]

perhaps worth mentioning :s

danielballan · 2013-09-16T21:43:21Z

@hayd Yes:

In [6]: Series(['abc']).str.extract('(a)(b|c)*')
Out[6]: 
   0  1
0  a  c

Separate issue though, no? More of an re issue? If what you mean is "b or c inclusive," this is an option:

In [8]: Series(['abc']).str.extract('(a)(b)?(c)?')
Out[8]: 
   0  1  2
0  a  b  c

Latest rebase includes docs. Otherwise code has not been touched, so I expect this to pass Travis.

danielballan · 2013-09-16T21:45:27Z

@hayd Are you thinking of a test case with * or + in the group name? From the way this is implemented, if there's a problem with that, the problem would have to be in re, not pandas. I'm happy to add a test case -- just checking that this is what you mean.

hayd · 2013-09-16T21:46:11Z

Yeah, this is a subtlety of re. Not to worry, was just throwing it out there.

danielballan · 2013-09-16T21:50:21Z

OK, now I get you. I will add an example like this to the docs. If nothing else, we'll have something to link to when this comes up on SO. :- )

jtratner · 2013-09-16T21:53:40Z

@hayd I think that's the correct behavior with re... (b|c)* means: 'capture one group that can be b|c 0 or more times' right?

jtratner · 2013-09-16T21:54:51Z

One other test case (sorry to pile on here!) - can you test that a non-capturing group (e.g., (?:ab*)) works correctly? I.e., it shouldn't add a column...

hayd · 2013-09-16T22:00:13Z

@jtratner yeah that's what I was expecting it to do...

I think it is worth mentioning (but not correcting), that writing something like (a)* doesn't do the obvious thing and capture multiple groups (but captures the last one), I expect because groups are stored as a dict which gets updated on each sighting :(

danielballan · 2013-09-16T22:01:35Z

Can we live with this?

N.B. The subtle (strange) behavior of re.findall with + and *
illustrated here:

.. ipython:: python

re.findall('(a)(b|c)', 'abc')
re.findall('(a)(b|c)*', 'abc')

also appears in extract:

.. ipython:: python

Series(['abc']).str.extract('(a)(b|c)')
Series(['abc']).str.extract('(a)(b|c)*')

danielballan · 2013-09-16T22:08:20Z

Non-capturing groups work correctly:

In [36]: Series(['A1', 'B2', 'C3']).str.extract('([AB])(?:[123])')
Out[36]: 
0      A
1      B
2    NaN
dtype: object

In [44]: Series(['A11', 'B22', 'C33']).str.extract('([AB])([123])(?:[123])')
Out[44]: 
     0    1
0    A    1
1    B    2
2  NaN  NaN

Will add these as test cases tomorrow. Are there others I should be thinking of?

jtratner · 2013-09-16T22:34:54Z

I was pretty sure they'd work, just wanted to make sure we covered the
bases. I'm going to look at this again tomorrow/tonight, but we're probably
pretty close to merging here :)

danielballan · 2013-09-16T23:31:10Z

Great. I hope I do not sound impatient. Thanks for your diligent support, all.

danielballan · 2013-09-16T23:33:10Z

Latest rebase added the two non-capturing tests above, plus a test showing that a pattern with only non-capturing groups raises, just as a pattern with no groups does.

jtratner · 2013-09-17T00:32:53Z

Thank you for putting this together (I know pandas can be a bit of a moving
target changes-wise)

jtratner · 2013-09-20T02:53:33Z

@danielballan is this ready to go? Would be nice if you could quickly rebase it first, so the history looks nicer.

danielballan · 2013-09-20T13:58:18Z

@jtratner Yes, it is ready to go. It's already rebased to one commit. If I need to do something else with the rebase, please explain. Thanks.

jreback · 2013-09-20T14:07:14Z

@danielballan can you do a quick release notes entry? (one-liner ok)

danielballan · 2013-09-20T14:11:47Z

@jtratner Is this a release notes entry? Does it belong somewhere else as well?

cpcloud · 2013-09-20T14:13:41Z

release notes is doc/source/release.rst and that's where every change goes. the v0.*.*.txt is where major changes go

jreback · 2013-09-20T14:17:08Z

@danielballan see the on-line version of the release.rst file: http://pandas.pydata.org/pandas-docs/dev/release.html,
put this in the new features section

danielballan · 2013-09-20T14:17:48Z

OK, thanks. Done.

jreback · 2013-09-20T14:18:06Z

@danielballan perfect thanks

ENH: Series.str.extract returns regex matches more conveniently

jreback · 2013-09-20T14:37:47Z

thanks for the PR!

hayd · 2013-09-20T16:01:23Z

awesome!

hayd reviewed Aug 27, 2013
View reviewed changes

ghost assigned jtratner Aug 31, 2013

ENH: Series.str.extract returns regex matches more conveniently

49eb724

jreback added a commit that referenced this pull request Sep 20, 2013

Merge pull request #4696 from danielballan/str_extract

351ba92

ENH: Series.str.extract returns regex matches more conveniently

jreback merged commit 351ba92 into pandas-dev:master Sep 20, 2013

ENH: Series.str.extract returns regex matches more conveniently #4696

ENH: Series.str.extract returns regex matches more conveniently #4696

Conversation

danielballan commented Aug 27, 2013

hayd Aug 27, 2013

Choose a reason for hiding this comment

danielballan Aug 27, 2013

Choose a reason for hiding this comment

jtratner commented Aug 31, 2013

jtratner commented Sep 15, 2013

danielballan commented Sep 16, 2013

danielballan commented Sep 16, 2013

jtratner commented Sep 16, 2013

jtratner commented Sep 16, 2013

hayd commented Sep 16, 2013

jreback commented Sep 16, 2013

danielballan commented Sep 16, 2013

hayd commented Sep 16, 2013

jtratner commented Sep 16, 2013

danielballan commented Sep 16, 2013

hayd commented Sep 16, 2013

hayd commented Sep 16, 2013

danielballan commented Sep 16, 2013

danielballan commented Sep 16, 2013

hayd commented Sep 16, 2013

danielballan commented Sep 16, 2013

jtratner commented Sep 16, 2013

jtratner commented Sep 16, 2013

hayd commented Sep 16, 2013

danielballan commented Sep 16, 2013

danielballan commented Sep 16, 2013

jtratner commented Sep 16, 2013

danielballan commented Sep 16, 2013

danielballan commented Sep 16, 2013

jtratner commented Sep 17, 2013

jtratner commented Sep 20, 2013

danielballan commented Sep 20, 2013

jreback commented Sep 20, 2013

danielballan commented Sep 20, 2013

cpcloud commented Sep 20, 2013

jreback commented Sep 20, 2013

danielballan commented Sep 20, 2013

jreback commented Sep 20, 2013

jreback commented Sep 20, 2013

hayd commented Sep 20, 2013