Replace with regular expression #2285
It seems to me this would be a good ticket for a first-time contributor who is familiar with (or willing to learn about) regular expressions. You could model your test case after the code at the Stack Overflow post that @changhiskhan linked to here. You would presumably be modifying class DataFrame in pandas/core/frame.py by adding a method to it. If it makes things more convenient, your new method could insist that the regular expression it receives has already been compiled with re.compile, so that your function does not have to compile it.
I would also be in favor of such a feature, provided we can also pass exact matches as literals. I have eyetracker data in ASCII file format in which missing values are indicated with a lone '.' character.
possibly #3276 would do half the job.
I would be happy to implement this. @paulproteus regex compiles are so fast (a string of length 1000 compiles in < 1 us) that it's not really an issue, plus you can always get the original string back from the compiled pattern object. @louist87 A regular expression consisting of no metacharacters (escape metacharacters to match them) does exactly that.
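To illustrate the point about compiled patterns (standard library behavior, not pandas-specific): the source string survives compilation, so an API can accept either form, and compiling is cheap anyway. A minimal sketch:

```python
import re
import timeit

# A compiled regex keeps its source string in the .pattern attribute,
# so accepting either a raw string or a compiled pattern costs nothing.
compiled = re.compile(r"\s+")
assert compiled.pattern == r"\s+"

# Compiling a short pattern takes on the order of microseconds (and the
# re module caches compiled patterns internally besides).
per_compile = timeit.timeit(lambda: re.compile(r"[a-z]{3}\d+"), number=10_000) / 10_000
print(f"{per_compile * 1e6:.2f} us per compile")
```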
@cpcloud FYI replace exists in core/series and in core/internals (right now)
@jreback maybe better to wait until series-as-ndframe is merged before working on this?
@cpcloud up 2 you....series-as-ndframe is coming in 0.12...why don't you add to core/internals Block.replace....that way no wasted code.....(and do for 0.12), and if I recall...I didn't fix replace in series yet anyhow......so go ahead and do it if you would like...
Ok. Cool. This should only work on strings, correct? For example, passing the numeric token regexes from the tokenize module will not match anything unless that number is actually a string.
@cpcloud I think it has to be a valid re.compile-able expression (which i think stringifies)?
actually.....are we adding a keyword argument for this?
@jreback Ok hold on, I think I may be confused about something. Do we want to be able to pass a regex as the thing to replace, or only exact strings?
my understanding may be wrong, but isn't this something like replace_expr -> value? so 'foo' is an exact match but 'foo*' is the re. I guess u could always do re matching
Always re matching was my thinking, since straight strings are just special cases of regexes (in the most formal sense). Of course, once I actually start working on this, things may change.
I would make a dict/list always an exact match (like now). I see the problem: if I pass the number 2, do u match on the string '2'? You could handle it by dtype
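Handling it by dtype is in fact what pandas ended up doing: exact replacement matches by equality within each column's dtype, so the integer 2 and the string '2' stay distinct. A minimal sketch with a current pandas:

```python
import pandas as pd

df = pd.DataFrame({"num": [1, 2, 3], "txt": ["1", "2", "3"]})

# Replacing the integer 2 touches only the numeric column; the string
# "2" in the object column is not equal to the integer 2, so it stays.
out = df.replace(2, -1)
print(out["num"].tolist())  # [1, -1, 3]
print(out["txt"].tolist())  # ['1', '2', '3']
```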
Right, that was my orig. concern. Do we want regexes as replacement values as well? For example, I think the following is intuitive (ASCII EyeLink example from @louist87):

df.replace({r'^\s*\.\s*$': NaN})

But what if the user wants to replace using capture groups?

df.replace({r'\s*(\w{2})(\w{2})\s*': r'\1\2'})

Additionally, how should the empty match case be handled? (This could come for free from pandas' convenient missing-data handling.)
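Both behaviors discussed here did land in the eventual API (the regex=True form in current pandas). A sketch of the two examples above, with string replacement values going through re.sub so backreferences work:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [" . ", "ab", " abcd "]})

# EyeLink-style missing values: a lone '.' (possibly padded) becomes NaN.
out = df.replace({r"^\s*\.\s*$": np.nan}, regex=True)

# Capture groups in the replacement value: strip the whitespace around
# a four-character token by re-emitting the two captured halves.
out = out.replace({r"\s*(\w{2})(\w{2})\s*": r"\1\2"}, regex=True)
print(out["a"].tolist())  # [nan, 'ab', 'abcd']
```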
this makes sense (and if you can, conditionally handle via different block dtypes), e.g. a re is only valid for ObjectBlocks, whereas you would only allow numeric replacements via Int/Float Block types (maybe punt on Series replace for now as this gets tricky). You can prob leave Block.replace alone and just write a new one for the ObjectBlock (though you CAN putmask btw, if you know the values to replace)
ok. will do. i had already started looking at object block replace :) since that sidesteps the issue of re's matching numeric types. well, not sidesteps...but makes it easier
@jreback Are the keys of
what do u mean? how r u using it?
sorry. nvm. preemptive.
Do we want to be able to match regexes on the columns? Similar to:

df.replace({'\w+\.\w+': nan}, {'\w+\.\w+': 1})

I think this is probably a bad idea, after typing that out. However, something like

df.replace({'\w+\.\w+': {'to_replace_regex': nan}})

would be cool, I think. This would mean: get all the columns with names matching the first level of regexes, then within those columns replace the values matching each inner dict's regexes. However, this will obviously take more time.
I would not add the column matching, too much magic
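Column-name matching never went in, but the same effect can be composed from pieces that do exist: df.filter's regex column selector plus a plain value-level replace. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"foo.bar": ["x", "bad"], "other": ["bad", "y"]})

# Select columns whose names match a regex, then replace only in those.
cols = df.filter(regex=r"\w+\.\w+").columns
df[cols] = df[cols].replace("bad", np.nan)
print(df["foo.bar"].tolist())  # ['x', nan]
print(df["other"].tolist())    # ['bad', 'y']
```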
@jreback Can u give a description of what the single dict case should do? I.e., df.replace({'a': 'b'}) will replace all occurrences of 'a' with 'b'?
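For the record, the single-dict case settled on {to_replace: value} semantics, applied across all columns. A minimal sketch with current pandas:

```python
import pandas as pd

df = pd.DataFrame({"x": ["a", "b", "a"], "y": ["c", "a", "d"]})

# A lone dict maps old values to new ones, in every column.
out = df.replace({"a": "b"})
print(out["x"].tolist())  # ['b', 'b', 'b']
print(out["y"].tolist())  # ['c', 'b', 'd']
```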
I have another question. Hope this isn't too annoying. When should one use make_block vs. setting a block's values directly?
almost always use make_block; if u want a specific dtype, pass the klass arg. Very rarely do i set the values directly
FYI there is a possible back compat issue here with my implementation, since every string is now treated as a regex.
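The hazard is easy to demonstrate with the API as it eventually shipped: under regex matching, '.' matches any character, which is why exact matching had to remain available. A sketch with current pandas:

```python
import numpy as np
import pandas as pd

s = pd.Series([".", "a.b", "xyz"])

# As a regex, '.' matches any character, so every element is replaced.
as_regex = s.replace(".", np.nan, regex=True)
print(as_regex.tolist())  # [nan, nan, nan]

# Exact string matching (the default) touches only the literal '.' cell.
exact = s.replace(".", np.nan)
print(exact.tolist())     # [nan, 'a.b', 'xyz']
```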
I think u could do a minimal validation and reject too-general regexes, unless u want to do string exact matching by default
i think a regex kwarg that defaults to exact string matching is the way to go
yes that sounds right
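This is how the merged API behaves: replace defaults to exact matching, and the regex keyword can either flag the first argument as a pattern or carry the pattern itself. A sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series(["foo", "  ", "bar"])

# regex=True flags the to_replace argument as a pattern...
flagged = s.replace(r"^\s*$", np.nan, regex=True)
print(flagged.tolist())  # ['foo', nan, 'bar']

# ...or the regex keyword can hold the pattern directly.
keyword = s.replace(regex=r"^\s*$", value=np.nan)
print(keyword.tolist())  # ['foo', nan, 'bar']
```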
the docs for replace are going to be kind of a beast, since there are so many ways to slice and dice here...
@jreback this is ready 2 go modulo some additional documentation (regexes and examples in the method) and your word on what to do with a single dict
closed via #3584 |
This replace is quite flexible! http://stackoverflow.com/questions/16818871/extracting-value-and-creating-new-column-out-of-it It might be slower than method one though, as that uses the vectorized string methods (which in theory this could use too)
glad u like!
FYI, this is a pretty common idiom
ok. yeah i'm always paranoid about copies even though i know that reshape shares data. matlab and r have scarred me 4 life
hahah....this is all view based, no copies!
actually just made the change (using …)
(this is using …)
ok....so what you are doing is fine; but ought to change all of the string methods to use …
hm, writing them in cython could be done via …
look at #2802, can you just do a quick test using vectorize on the example?
In [15]: timeit vectorize(lambda s: s.endswith('world'))(p.strings)
100 loops, best of 3: 3.35 ms per loop
In [16]: %%timeit
....: for ii in xrange(1000):
....: p['ishello'] = p['strings'].str.endswith('world')
....:
1 loops, best of 3: 3.61 s per loop
In [17]: %%timeit
....: for ii in xrange(1000) :
....: p['isHello'] = [s.endswith('world') for s in p['strings'].values]
....:
1 loops, best of 3: 2.69 s per loop
In [18]: %%timeit
....: for ii in xrange(1000) :
....: p['isHello'] = pandas.Series([s.endswith('world') for s in p['strings'].values])
....:
1 loops, best of 3: 2.27 s per loop
In [19]: %%timeit
....: for ii in xrange(1000) :
....: p['isHello'] = pandas.Series([s.endswith('world') for s in p['strings'].values])
KeyboardInterrupt
In [19]: f = vectorize(lambda x: x.endswith('world'))
In [20]: %%timeit
for ii in xrange(1000) :
p['isHello'] = f(p['strings'])
....:
1 loops, best of 3: 3.49 s per loop
In [21]: %%timeit
for ii in xrange(1000) :
p['isHello'] = f(p['strings'])
....:
1 loops, best of 3: 3.49 s per loop
not really a big difference, and the current methods are faster
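For completeness, the two approaches being timed above produce identical results; assuming numpy and pandas, a sketch of the comparison (without the timing harness):

```python
import numpy as np
import pandas as pd

s = pd.Series(["hello world", "goodbye", "small world"])

# The built-in vectorized string method...
via_str = s.str.endswith("world")

# ...and np.vectorize over the raw values give the same answers;
# per the timings above, neither wins by much.
via_vec = np.vectorize(lambda x: x.endswith("world"))(s.values)
print(via_str.tolist())  # [True, False, True]
```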
yep.....got the same....ok...not a big deal then
xref: http://stackoverflow.com/questions/13445241/replacing-blank-values-white-space-with-nan-in-pandas
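The cross-referenced question is the canonical use case; with the merged feature it becomes a one-liner (sketch, current pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["1", " ", "", "4"]})

# Whitespace-only (or empty) strings become missing values in one pass.
out = df.replace(r"^\s*$", np.nan, regex=True)
print(out["a"].tolist())  # ['1', nan, nan, '4']
```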