
WIP: add df.dgrep, df.neighbours #3276


Closed
wants to merge 4 commits into from


@ghost ghost commented Apr 8, 2013

#2460

Partly fulfills #3269. Should this be extended to an equivalent method on
index labels (vectorized? numexpr?)? Just experimenting with the API for now;
performance comes later (as much as is possible with unindexed data).

  • need to add an option to interpret multiple columns as a predicate expecting
    a list of values corresponding to the row values on the specified columns,
    to allow for boolean expressions across columns, rather than a single value at a time.
  • tests
  • make lazy like groupby?
  • need to add axis argument, per jeff's suggestion.
  • documentation.
In [7]: pd.options.sandbox.dgrep=True

This is an experimental feature being considered for inclusion in pandas core.
We'd appreciate your feedback on it in the Github issue page:

    http://github.com/pydata/pandas/issues/2460

If you find this useful, lacking in major functionality or buggy please
take a moment to let us know, so we can make pandas (even) better.

Thank you,

The Pandas dev team

P.S.


Series/DataFrame now have a .dgrep method.
See the docstring for usage examples.


In [8]: df=mkdf(30,4,r_idx_nlevels=3)
   ...: df.index=range(30)
   ...: df.iloc[5,0] = "supercool"
   ...: df.iloc[6,0] = "supercool"
   ...: df.iloc[29,0] = "supercool"
   ...: df.iloc[15,1] = "supercool"
   ...: # accepts colname and regex string
   ...: print "\n" + str(df.dgrep(".cool$","C_l0_g0"))
   ...: # accepts lists of cols, 
   ...: print "\n" + str(df.dgrep(".cool$",["C_l0_g0",'C_l0_g1']))
   ...: # specifying C=2 (or A=/B=) gives grep-style context, i.e. the
   ...: # lines around each hit
   ...: # NB overlapping context lines do not cause line duplication (*)
   ...: print "\n" + str(df.dgrep(".cool$",["C_l0_g0"],C=2))
   ...: # also accepts lambda
   ...: # NB, last match is at end, so only previous line of context displayed
   ...: print "\n" + str(df.dgrep(lambda x: bool(re.search(".cool$",x)),["C_l0_g0"],C=3))
   ...: # split=True returns a series of (index_label_matched, dataframe)
   ...: # pairs, similar to groupby
   ...: # NB some lines appear in more than one group in this case (*)
   ...: print "\n" + "\n".join(map(str,df.dgrep(".cool$",["C_l0_g0"],split=True,C=3)))
   ...: 
   ...: # works on series too
   ...: print "\n" + str(df.C_l0_g0.dgrep(".cool$",C=3))


C0    C_l0_g0 C_l0_g1 C_l0_g2 C_l0_g3
5   supercool    R5C1    R5C2    R5C3
6   supercool    R6C1    R6C2    R6C3
29  supercool   R29C1   R29C2   R29C3

C0    C_l0_g0    C_l0_g1 C_l0_g2 C_l0_g3
5   supercool       R5C1    R5C2    R5C3
6   supercool       R6C1    R6C2    R6C3
15      R15C0  supercool   R15C2   R15C3
29  supercool      R29C1   R29C2   R29C3

C0    C_l0_g0 C_l0_g1 C_l0_g2 C_l0_g3
4        R4C0    R4C1    R4C2    R4C3
5   supercool    R5C1    R5C2    R5C3
6   supercool    R6C1    R6C2    R6C3
28      R28C0   R28C1   R28C2   R28C3
29  supercool   R29C1   R29C2   R29C3

C0    C_l0_g0 C_l0_g1 C_l0_g2 C_l0_g3
4        R4C0    R4C1    R4C2    R4C3
5   supercool    R5C1    R5C2    R5C3
6   supercool    R6C1    R6C2    R6C3
7        R7C0    R7C1    R7C2    R7C3
28      R28C0   R28C1   R28C2   R28C3
29  supercool   R29C1   R29C2   R29C3

(5, C0    C_l0_g0 C_l0_g1 C_l0_g2 C_l0_g3
4        R4C0    R4C1    R4C2    R4C3
5   supercool    R5C1    R5C2    R5C3
6   supercool    R6C1    R6C2    R6C3)
(6, C0    C_l0_g0 C_l0_g1 C_l0_g2 C_l0_g3
5   supercool    R5C1    R5C2    R5C3
6   supercool    R6C1    R6C2    R6C3
7        R7C0    R7C1    R7C2    R7C3)
(29, C0    C_l0_g0 C_l0_g1 C_l0_g2 C_l0_g3
28      R28C0   R28C1   R28C2   R28C3
29  supercool   R29C1   R29C2   R29C3)

4          R4C0
5     supercool
6     supercool
7          R7C0
28        R28C0
29    supercool
Name: C_l0_g0, dtype: object

# can also get the values "applied" onto the function
df.dgrep(lambda c1,c2: "cool" in c1 or "cool" in c2,df.columns[:2])

# which also works with *args
df.dgrep(lambda *args: "supercool" in args,df.columns[:3]) 

ghost commented Apr 9, 2013

Thinking more on this, and seeing some examples, maybe this conflates two things
which are useful independently:

# dgrep to get the rows, via regex match or predicate, useful in itself
ixs = df.dgrep("foo",df.C0)
R0 foo
R5 fools
R12 food

# `context` should be a standalone operation, applicable to results
# no matter how they are reached, to get a "context" of rows (B)efore
# and (A)fter a matched row.
df.context(ixs,A=1,B=2,as_seq=False)
R0 foo
R1 baz
R3 baz
R4 baz
R5 fools
R6 bazz
R10 pugz
R11 bugs
R12 food
R13 buggy

# because you might want to do this to get a window
# around all points of large change
df.context(df.C0.diff()>thresh,C=5,as_seq=True)
[<index_label, frame of 5 rows>, ...]
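
To make the split concrete, a minimal sketch of what a standalone `context` could look like as a plain function over a boolean mask. Again this is only an illustration under assumptions: `context_sketch` and its `before`/`after` keywords are hypothetical, nothing like it exists in pandas, and the mask is assumed to be positionally aligned with the frame.

```python
import numpy as np
import pandas as pd


def context_sketch(df, mask, before=0, after=0, as_seq=False):
    """Hypothetical standalone `context`: rows around every True in mask.

    mask is any boolean sequence with one entry per row of df, no matter
    how it was produced (regex match, diff() > thresh, ...).
    """
    mask = np.asarray(mask, dtype=bool)
    positions = np.flatnonzero(mask)

    if as_seq:
        # One (matched_label, window) pair per hit; windows may overlap,
        # so a row can show up in more than one window.
        return [
            (df.index[pos], df.iloc[max(pos - before, 0):pos + after + 1])
            for pos in positions
        ]

    # Single frame: merge the windows so no row is repeated.
    keep = np.zeros(len(df), dtype=bool)
    for pos in positions:
        keep[max(pos - before, 0):pos + after + 1] = True
    return df[keep]
```

With this shape, the "large change" case would be something like `context_sketch(df, df.C0.diff() > thresh, before=2, after=2, as_seq=True)`, since comparing the leading NaN from `diff()` against a threshold simply yields False.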

@ghost ghost mentioned this pull request Apr 12, 2013
ghost commented Jul 22, 2013

No more time to work on this, closing. Will leave #3269 open if someone wants to pick this
up in the future.

@ghost ghost closed this Jul 22, 2013