Add df.split with predicate function #3066

Closed
ghost opened this issue Mar 16, 2013 · 7 comments
Comments

@ghost

ghost commented Mar 16, 2013

xref: http://stackoverflow.com/questions/13353233/best-way-to-split-a-dataframe-given-an-edge/15449992#15449992

Here's one for you, Jeff.

@wesm
Member

wesm commented Mar 16, 2013

Maybe this merits a new API? Essentially "split with predicate function", so we'd do:

df.split(lambda x: x == 'B', axis=0)
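
For readers who want to try the idea with today's public API, here is a minimal sketch. `split_at` and its `column` argument are hypothetical names, not part of pandas, and it assumes the predicate is applied to the values of one column, with every matching row starting a new piece.

```python
import pandas as pd


def split_at(df, pred, column):
    """Hypothetical helper: cut df into pieces, starting a new piece at
    every row where pred(value) is true for the given column."""
    starts = df[column].map(pred).astype(bool)
    piece_id = starts.cumsum()  # increases by one at every matching row
    return [piece for _, piece in df.groupby(piece_id)]


df = pd.DataFrame({"a": list("ABCBD"), "x": range(5)})
pieces = split_at(df, lambda v: v == "B", column="a")
for piece in pieces:
    print(piece["a"].tolist())
```

Rows before the first match form their own leading piece, so nothing is dropped.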

@jreback
Contributor

jreback commented Mar 16, 2013

I'm thinking more of a scalar or list of values for that axis, to allow multiple pieces.
But then how do you control where the split value goes?

Maybe interval=open/closed? And don't return empty groups (e.g. if I select the last column).

If cols are list('abcdefg'), then
df.split(['a','c'])

gives groups of
a
bc
defg
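
A rough sketch of how this could behave, under some assumptions: `split_columns` and its `closed` argument are hypothetical (standing in for the interval=open/closed idea), and the split labels are assumed to be unique and given in column order.

```python
import pandas as pd


def split_columns(df, labels, closed="right"):
    """Hypothetical helper: split df column-wise at the given labels.

    closed="right" keeps each split label at the end of its piece;
    closed="left" puts it at the start of the next piece.
    Empty pieces are skipped.
    """
    cols = list(df.columns)
    pieces, start = [], 0
    for label in labels:
        pos = cols.index(label)
        stop = pos + 1 if closed == "right" else pos
        if stop > start:  # never emit an empty group
            pieces.append(df.iloc[:, start:stop])
        start = stop
    if start < len(cols):  # trailing columns after the last split label
        pieces.append(df.iloc[:, start:])
    return pieces


df = pd.DataFrame([[0] * 7], columns=list("abcdefg"))
right = ["".join(p.columns) for p in split_columns(df, ["a", "c"])]
left = ["".join(p.columns) for p in split_columns(df, ["a", "c"], closed="left")]
last = ["".join(p.columns) for p in split_columns(df, ["g"])]
print(right, left, last)
```

With `closed="right"` this reproduces the a / bc / defg grouping above, and splitting on the last column returns a single piece rather than a trailing empty one.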

@ghost
Author

ghost commented Mar 16, 2013

Another addition would be to introspect the lambda for its argcount
and provide a moving window of values:

df.groupby(lambda prev,curr: curr != prev)

and add a win_offset arg to specify nvals before, nvals ahead.
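
The two-argument case can be approximated without changing groupby itself. The sketch below is only an illustration: `groupby_window` is a hypothetical name, it handles just the (prev, curr) window and not the win_offset idea, and it assumes the first value always starts a group.

```python
import pandas as pd


def groupby_window(s, pred):
    """Hypothetical helper: start a new group wherever pred(prev, curr)
    is true; the first value always starts a group."""
    values = s.tolist()
    starts = [i == 0 or bool(pred(values[i - 1], values[i]))
              for i in range(len(values))]
    run_id = pd.Series(starts, index=s.index, dtype=bool).cumsum()
    return s.groupby(run_id)


s = pd.Series(list("AABBBAC"))
runs = [g.tolist() for _, g in groupby_window(s, lambda prev, curr: curr != prev)]
print(runs)
```

Each run of equal consecutive values ends up in its own group, which is what the `curr != prev` predicate asks for.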

@ghost
Author

ghost commented Mar 17, 2013

Implementing that would also answer #414.

@dalejung
Contributor

I wonder if a general edge binner would be useful here.

def edge_groupby(df, edges):
    edges[0] = True
    edges.iloc[-1] = True

    trues = edges[edges].index.values
    trues[-1] = trues[-1] + 1 # make sure we include last value

    bins = lib.generate_bins_dt64(edges.index, trues, closed='left')
    binlabels = [0] + list(bins[:-1]) # label=left
    grouper = BinGrouper(bins, binlabels)
    return df.groupby(grouper)

grouped = edge_groupby(df, df.a == 'B')

That would take in a bool series where the True values are the edges.
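
The snippet above relies on pandas internals (`lib.generate_bins_dt64`, `BinGrouper`). As a rough alternative that sticks to public API, the sketch below labels each group by the index value of its left edge. `edge_groupby_labels` is a hypothetical name, and it assumes `edges` has the same length and order as `df`; it is not a drop-in replacement for the binning approach.

```python
import pandas as pd


def edge_groupby_labels(df, edges):
    """Hypothetical helper: group rows between True edges, labelling each
    group with the index value of its left edge."""
    is_edge = edges.to_numpy(dtype=bool).copy()  # copy: leave the caller's series alone
    if is_edge.size:
        is_edge[0] = True  # the first row always opens a group
    group_pos = is_edge.cumsum() - 1  # 0-based position of each row's group
    labels = df.index.to_numpy()[is_edge][group_pos]  # left-edge label per row
    return df.groupby(labels)


df = pd.DataFrame({"a": list("ABCBD")}, index=[10, 20, 30, 40, 50])
grouped = edge_groupby_labels(df, df.a == "B")
sizes = grouped.size()
print(sizes)
```

Unlike the original, this version does not modify the boolean series passed in.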

@ghost
Author

ghost commented Mar 17, 2013

Maybe a sliding window with a reduce style operation?

df.groupby_reduce(lambda acc,prev,curr: acc + (prev and prev == 'B'))

or

df.groupby_reduce(lambda acc,*vs: acc + (vs[0] and vs[0] == 'B'),2,'right')

with acc=0 on init.
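
To make the reduce-style idea concrete, here is a small sketch. `groupby_reduce` is the hypothetical name from the proposal, not an existing pandas method, and only the three-argument (acc, prev, curr) form is covered. One assumption: `prev` is passed as None for the first value, so the example lambda compares it directly instead of using the `prev and ...` guard.

```python
import pandas as pd


def groupby_reduce(s, func, init=0):
    """Hypothetical helper: thread an accumulator through the values and
    use its running value as the group key for each row."""
    keys, acc, prev = [], init, None
    for curr in s.tolist():
        acc = func(acc, prev, curr)
        keys.append(acc)
        prev = curr
    return s.groupby(pd.Series(keys, index=s.index, dtype=object))


s = pd.Series(list("ABCBD"))
# Bump the key right after a 'B', so each 'B' closes its group.
grouped = groupby_reduce(s, lambda acc, prev, curr: acc + (prev == "B"))
groups = [g.tolist() for _, g in grouped]
print(groups)
```

Because the key only changes on the row after a 'B', the split value lands at the end of its group, the opposite of an edge that opens a group.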

@ghost
Author

ghost commented Nov 22, 2013

Closing this in favor of a cleaner implementation for #4059.

@ghost ghost closed this as completed Nov 22, 2013