WIP: add df.dgrep, df.neighbours #3276

ghost · 2013-04-08T05:25:49Z

Partly fulfills #3269. Should this be extended to an equivalent method on
index labels? (Vectorized? numexpr?) Just experimenting with the API for now;
performance comes later (as much as is possible with unindexed data).

- need to add an option so that, when multiple columns are given, the predicate
  receives the list of row values on those columns, allowing boolean
  expressions across columns rather than a single value at a time.
- tests
- make lazy like groupby?
- need to add axis argument, per jeff's suggestion.
- documentation.

In [7]: pd.options.sandbox.dgrep=True

This is an experimental feature being considered for inclusion in pandas core.
We'd appreciate your feedback on it in the Github issue page:

    http://github.com/pydata/pandas/issues/2460

If you find this useful, lacking in major functionality or buggy please
take a moment to let us know, so we can make pandas (even) better.

Thank you,

The Pandas dev team

P.S.


Series/DataFrame now have a .dgrep method.
See the docstring for usage examples.


In [8]: df=mkdf(30,4,r_idx_nlevels=3)
   ...: df.index=range(30)
   ...: df.iloc[5,0] = "supercool"
   ...: df.iloc[6,0] = "supercool"
   ...: df.iloc[29,0] = "supercool"
   ...: df.iloc[15,1] = "supercool"
   ...: # accepts colname and regex string
   ...: print "\n" + str(df.dgrep(".cool$","C_l0_g0"))
   ...: # accepts lists of cols
   ...: print "\n" + str(df.dgrep(".cool$",["C_l0_g0",'C_l0_g1']))
   ...: # specifying C=2 (or A=/B=) gives grep-style context,
   ...: # i.e. the lines around each hit
   ...: # NB overlapping context lines do not cause line duplication (*)
   ...: print "\n" + str(df.dgrep(".cool$",["C_l0_g0"],C=2))
   ...: # also accepts lambda
   ...: # NB, last match is at end, so only previous line of context displayed
   ...: print "\n" + str(df.dgrep(lambda x: bool(re.search(".cool$",x)),["C_l0_g0"],C=3))
   ...: # split=True returns a series of (index_label_matched, dataframe)
   ...: # pairs, similar to groupby
   ...: # NB some lines appear in more than one group in this case (*)
   ...: print "\n" + "\n".join(map(str,df.dgrep(".cool$",["C_l0_g0"],split=True,C=3)))
   ...: 
   ...: # works on series too
   ...: print "\n" + str(df.C_l0_g0.dgrep(".cool$",C=3))


C0    C_l0_g0 C_l0_g1 C_l0_g2 C_l0_g3
5   supercool    R5C1    R5C2    R5C3
6   supercool    R6C1    R6C2    R6C3
29  supercool   R29C1   R29C2   R29C3

C0    C_l0_g0    C_l0_g1 C_l0_g2 C_l0_g3
5   supercool       R5C1    R5C2    R5C3
6   supercool       R6C1    R6C2    R6C3
15      R15C0  supercool   R15C2   R15C3
29  supercool      R29C1   R29C2   R29C3

C0    C_l0_g0 C_l0_g1 C_l0_g2 C_l0_g3
4        R4C0    R4C1    R4C2    R4C3
5   supercool    R5C1    R5C2    R5C3
6   supercool    R6C1    R6C2    R6C3
28      R28C0   R28C1   R28C2   R28C3
29  supercool   R29C1   R29C2   R29C3

C0    C_l0_g0 C_l0_g1 C_l0_g2 C_l0_g3
4        R4C0    R4C1    R4C2    R4C3
5   supercool    R5C1    R5C2    R5C3
6   supercool    R6C1    R6C2    R6C3
7        R7C0    R7C1    R7C2    R7C3
28      R28C0   R28C1   R28C2   R28C3
29  supercool   R29C1   R29C2   R29C3

(5, C0    C_l0_g0 C_l0_g1 C_l0_g2 C_l0_g3
4        R4C0    R4C1    R4C2    R4C3
5   supercool    R5C1    R5C2    R5C3
6   supercool    R6C1    R6C2    R6C3)
(6, C0    C_l0_g0 C_l0_g1 C_l0_g2 C_l0_g3
5   supercool    R5C1    R5C2    R5C3
6   supercool    R6C1    R6C2    R6C3
7        R7C0    R7C1    R7C2    R7C3)
(29, C0    C_l0_g0 C_l0_g1 C_l0_g2 C_l0_g3
28      R28C0   R28C1   R28C2   R28C3
29  supercool   R29C1   R29C2   R29C3)

4          R4C0
5     supercool
6     supercool
7          R7C0
28        R28C0
29    supercool
Name: C_l0_g0, dtype: object

# can also get the values "applied" onto the function
df.dgrep(lambda c1,c2: "cool" in c1 or "cool" in c2,df.columns[:2])

# which also works with *args
df.dgrep(lambda *args: "supercool" in args,df.columns[:3])
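
For readers on a current pandas, the row-selection part of the examples above can be approximated without the sandbox option. The sketch below is not the PR's implementation: `dgrep_sketch` is a hypothetical helper, it assumes string-valued columns, and it covers only selection (no `C`/`A`/`B` context and no `split`). A regex string is matched with `Series.str.contains`, and a callable receives the row values on the chosen columns as positional arguments, as in the two lambda examples.

```python
import pandas as pd


def dgrep_sketch(df, pred, cols):
    """Hypothetical stand-in for df.dgrep: rows of `df` matching `pred` on `cols`."""
    cols = [cols] if isinstance(cols, str) else list(cols)
    if callable(pred):
        # Row values on `cols` are "applied" onto the function as positional args.
        rows = df[cols].itertuples(index=False, name=None)
        mask = pd.Series([bool(pred(*vals)) for vals in rows],
                         index=df.index, dtype=bool)
    else:
        # Regex string: a row matches if any of the selected columns matches.
        per_col = [df[c].astype(str).str.contains(pred, regex=True) for c in cols]
        mask = pd.concat(per_col, axis=1).any(axis=1)
    return df[mask]


# A frame shaped like the mkdf(30, 4) example: cell (r, c) holds "R{r}C{c}".
df = pd.DataFrame({f"C_l0_g{c}": [f"R{r}C{c}" for r in range(30)] for c in range(4)})
df.iloc[[5, 6, 29], 0] = "supercool"
df.iloc[15, 1] = "supercool"

single = dgrep_sketch(df, ".cool$", "C_l0_g0")
either = dgrep_sketch(df, ".cool$", ["C_l0_g0", "C_l0_g1"])
applied = dgrep_sketch(df, lambda c1, c2: "cool" in c1 or "cool" in c2, df.columns[:2])
star = dgrep_sketch(df, lambda *args: "supercool" in args, df.columns[:3])
```

Treating a multi-column regex as "any column matches" is a choice made here to agree with the second output table above; the PR may have intended other semantics.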

ghost · 2013-04-09T01:22:25Z

Thinking more on this, and seeing some examples, maybe this conflates two things
which are useful independently:

# dgrep to get the rows, via regex match or predicate, useful in itself
ixs = df.dgrep("foo",df.C0)
R0 foo
R5 fools
R12 food

# `context` should be a standalone operation, applicable to results
# no matter how they are reached. It returns a "context" of rows
# (B)efore and (A)fter each matched row.
df.context(ixs,A=1,B=2,as_seq=False)
R0 foo
R1 baz
R3 baz
R4 baz
R5 fools
R6 bazz
R10 pugz
R11 bugs
R12 food
R13 buggy

# because you might want to do this to get a window
# around all points of large change
df.context(df.C0.diff()>thresh,C=5,as_seq=True)
[<index_label,frame of 5rows>,...]
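
The standalone `context` idea can be sketched in current pandas with positional arithmetic. This is an illustration of the proposal rather than code from the PR: `context_sketch` is a hypothetical function, its `before`/`after` parameters stand in for the proposed `B`/`A` arguments, and it assumes `hits` is a boolean mask aligned with the frame.

```python
import numpy as np
import pandas as pd


def context_sketch(df, hits, before=0, after=0, as_seq=False):
    """Hypothetical: rows around each True in `hits` (boolean mask aligned with df)."""
    pos = np.flatnonzero(np.asarray(hits, dtype=bool))
    last = len(df) - 1
    # One positional window per hit, clipped to the bounds of the frame.
    windows = [np.arange(max(p - before, 0), min(p + after, last) + 1) for p in pos]
    if as_seq:
        # One (matched label, frame) pair per hit; windows may overlap.
        return [(df.index[p], df.iloc[w]) for p, w in zip(pos, windows)]
    # Merged view: overlapping windows contribute each row only once.
    keep = np.unique(np.concatenate(windows)) if windows else np.array([], dtype=int)
    return df.iloc[keep]


# Illustrative data: 14 rows labelled R0..R13, with hits at R0, R5 and R12.
values = ["baz"] * 14
values[0], values[5], values[12] = "foo", "fools", "food"
df = pd.DataFrame({"C0": values}, index=[f"R{i}" for i in range(14)])

hits = df["C0"].str.contains("foo")
merged = context_sketch(df, hits, before=2, after=1)
pairs = context_sketch(df, hits, before=2, after=1, as_seq=True)
```

Because the function only needs a mask, a numeric condition such as `df["x"].diff() > thresh` (with hypothetical `x` and `thresh`) could be passed in the same way as the regex hits.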

ghost · 2013-07-22T20:05:13Z

No more time to work on this, closing. Will leave #3269 open if someone wants to pick this
up in the future.

This was referenced Apr 8, 2013

ENH: df.grep(col,pat) and df.dselect(col,"expr") #2460

Closed

ENH: an example of a sandbox feature #3274

Closed

y-p added 4 commits April 12, 2013 11:03

ENH: add df.dgrep to sandbox, add sandbox.dgrep option

951f952

TST: add tests for df.dgrep

e46c85b

ENH: Breakup into dgrep and neighbours

3d05c40

ENH: use numpy operations where possible

283ac0d

ghost mentioned this pull request Apr 12, 2013

Replace with regular expression #2285

Closed

ghost closed this Jul 22, 2013

This pull request was closed.
