Split/Partition Master Issue #7387

cpcloud · 2014-06-07T16:57:58Z

As pointed out by @dsm054, there are multiple lurking split/partition API requests. Here are the issues and a short summary of what they would do (there are some duplicates here, I've checked off those issues/PRs that have been closed in favor of a related issue):

Conditional split/groupby #414: (i think) original issue for these ideas going back 3 years
Add time-length windowing capability to moving statistics #936: windowing with time-length windows like pd.rolling_mean(ts, window='30min') and possibly even arbitrary windows using another column
Add df.split with predicate function #3066: split method on pandas objects, playing around with ideas
API experiment: lambda grouper based on sliding window #3101: a closed PR by @y-p to use the args of lambda to group a frame into views of a sliding window
ENH: resample(..., how='range', start=a, stop=b) #3685: resampling using the first n samples of a bin.
API for splitting pandas objects #4059: np.array_split style API where you can split a pandas object into a list of k groups of possibly unequal size (could be a thin wrapper around np.array_split, or more integrated into the pandas DSL). IMO, this issue provides the best starting point for an API. SO usage
ENH: make contiguous groupby easier #5494: an API to allow pandas' groupby to have itertools.groupby semantics (i.e., preserve the order of duplicated group keys); e.g., 'aabbaa' would yield groups ['aa', 'bb', 'aa'] rather than ['aaaa', 'bb']. There'd have to be some changes to the use of dict in the groupby backend, as noted by @y-p in API for splitting pandas objects #4059 (comment).
API for selecting ranged groups #6675: Ability to select ranged groups via another column, like "select all rows between the values X and Y from column C", e.g., an "events" column where you have start and end markers and you want to get the data in between the markers. There are a couple of ways you can do this, but it would be nice to have an API for this. This is very similar to Add time-length windowing capability to moving statistics #936.
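
For reference, the np.array_split-style behavior described in #4059 can be approximated today by splitting the row positions and indexing with .iloc. This is only a minimal sketch, not a proposed API; split_frame is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def split_frame(df, k):
    """Split df into k consecutive pieces of possibly unequal length.

    Hypothetical helper: splits the row positions with np.array_split
    and selects each chunk of rows with .iloc.
    """
    positions = np.array_split(np.arange(len(df)), k)
    return [df.iloc[pos] for pos in positions]


df = pd.DataFrame({"x": range(10)})
pieces = split_frame(df, 3)
print([len(p) for p in pieces])  # [4, 3, 3]
```

Splitting positions rather than the frame itself keeps the sketch independent of how np.array_split treats pandas objects.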

The toolz library has a partitionby function that provides a nice way to do some of the splitting on sequences and might provide us with some insight on how to approach the API.
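
To make the order-preserving semantics concrete, here is a small standard-library sketch of partitionby-style splitting, built on itertools.groupby. The name partition_by is hypothetical and used only for illustration; this is not the toolz implementation.

```python
from itertools import groupby


def partition_by(key, seq):
    """Split seq into runs of consecutive items that share key(item)."""
    return [list(run) for _, run in groupby(seq, key=key)]


# Contiguous runs are kept separate, so repeated keys are not merged.
runs = partition_by(lambda ch: ch, "aabbaa")
print(["".join(r) for r in runs])  # ['aa', 'bb', 'aa']
```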

cc @jreback @jorisvandenbossche @hayd @danielballan


jreback · 2014-06-07T17:43:57Z

Here's a way to get started, I think. Since pd.Grouper is already a pretty generic grouper, in that it will return a pd.TimeGrouper (if freq is specified), it's pretty easy to make it create a pd.SplitGrouper as needed (more on this in a bit). This is essentially an implementation for this issue, rather than specific API recommendations (though I touch on this a bit).

something like this:

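def split(self, grouper=None, **kwargs):
    # proposed as a method on DataFrame, i.e. DataFrame.split
    if grouper is None:
        grouper = pd.Grouper(**kwargs)

    # Groupby.split is the proposed method described next
    return self.groupby(grouper).split()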

and Groupby.split is the equivalent of this (like calling list on a groupby object, but returning just the groups themselves), so this is 'trivial'

def split(self):
    return [ grp for g, grp in self ]
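
The same idea can be tried against pandas as it exists today without adding any methods. A minimal sketch, assuming a plain function split_groups (a hypothetical name) stands in for the proposed Groupby.split:

```python
import pandas as pd


def split_groups(grouped):
    """Return the sub-frames of a groupby object, dropping the keys."""
    return [grp for _, grp in grouped]


df = pd.DataFrame({"key": ["a", "b", "a", "b"], "val": [1, 2, 3, 4]})
parts = split_groups(df.groupby("key"))
print([p["val"].tolist() for p in parts])  # [[1, 3], [2, 4]]
```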

So then can easily add some more keyword args to the current pd.Grouper:

here is the current:

class Grouper(object):
    """
    A Grouper allows the user to specify a groupby instruction for a target object

    This specification will select a column via the key parameter, or if the level and/or
    axis parameters are given, a level of the index of the target object.

    These are local specifications and will override 'global' settings, that is the parameters
    axis and level which are passed to the groupby itself.

    Parameters
    ----------
    key : string, defaults to None
        groupby key, which selects the grouping column of the target
    level : name/number, defaults to None
        the level for the target index
    freq : string / frequency object, defaults to None
        This will groupby the specified frequency if the target selection (via key or level) is
        a datetime-like object
    axis : number/name of the axis, defaults to None
    sort : boolean, default to False
        whether to sort the resulting labels

    additional kwargs to control time-like groupers (when freq is passed)

    closed : closed end of interval; left or right
    label : interval boundary to use for labeling; left or right
    convention : {'start', 'end', 'e', 's'}
        If grouper is PeriodIndex

    Returns
    -------
    A specification for a groupby instruction
    """

so pd.SplitGrouper would simply create the groups (like TimeGrouper and Grouper do now)

I would propose this functionality:

partition and pad (stealing from here): partition by n; these are both new keywords (API for splitting pandas objects #4059).

I guess partition could also be a tuple, e.g. (2,3), to specify that I want successive groups to have these lengths (and then repeat)
partition and freq (only partition is new): resample the first partition of that freq (could also take -partition for the last partition of that freq) (ENH: resample(..., how='range', start=a, stop=b) #3685, maybe Add time-length windowing capability to moving statistics #936); this could also be a new keyword n so as not to confuse it with partition
partition and sort (only partition is new): with sort=False this would create groups similarly to itertools.groupby (i.e., not merge the groups, but emit them in order) (ENH: make contiguous groupby easier #5494); could also take sort='by_group' to make this unambiguous
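
As a rough illustration of the first item, fixed-size and repeating-tuple partitions can be emulated today with integer group labels. This is a sketch under assumptions: partition_labels is a hypothetical helper, and cycling through the tuple of lengths is only one reading of the "(and then repeat)" behavior.

```python
from itertools import cycle

import numpy as np
import pandas as pd


def partition_labels(n_rows, partition):
    """Build one integer group label per row.

    partition may be an int (rows per group) or a tuple of lengths
    that is repeated until every row has a label.
    """
    sizes = (partition,) if isinstance(partition, int) else tuple(partition)
    if not sizes or any(size <= 0 for size in sizes):
        raise ValueError("partition lengths must be positive")
    labels = []
    for group, size in enumerate(cycle(sizes)):
        if len(labels) >= n_rows:
            break
        labels.extend([group] * size)
    return np.array(labels[:n_rows], dtype=int)


df = pd.DataFrame({"x": range(7)})
by_three = [g["x"].tolist() for _, g in df.groupby(partition_labels(len(df), 3))]
by_tuple = [g["x"].tolist() for _, g in df.groupby(partition_labels(len(df), (2, 3)))]
print(by_three)  # [[0, 1, 2], [3, 4, 5], [6]]
print(by_tuple)  # [[0, 1], [2, 3, 4], [5, 6]]
```

Here the final group is simply left short, which corresponds to not padding.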

something like this:

class SplitGrouper(Grouper):
    """
    Parameters
    ----------
    partition : integer, default None
        number of rows to include in each group
    pad : (need to have options from toolz)
        how to pad groups specified by partition
    """

If partition has too many meanings, then we could add an n as well (but would need to define these cases).

cpcloud · 2014-06-07T19:08:04Z

looks good, only suggestion i have is to use the partition_all behavior instead of having a pad arg, i.e., instead of padding, just split into partition number groups possibly of unequal # of rows
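
For comparison, the two behaviors under discussion can be sketched on a plain list. This follows one reading of the suggestion (fixed-size groups with a short final group); both helpers are hypothetical stand-ins written with the standard library, not the toolz functions themselves.

```python
def chunks_unpadded(seq, n):
    """Groups of n items; the last group may be shorter."""
    return [seq[i:i + n] for i in range(0, len(seq), n)]


def chunks_padded(seq, n, pad=None):
    """Groups of exactly n items; the last group is filled with pad."""
    out = chunks_unpadded(seq, n)
    if out and len(out[-1]) < n:
        out[-1] = out[-1] + [pad] * (n - len(out[-1]))
    return out


rows = [0, 1, 2, 3, 4]
print(chunks_unpadded(rows, 2))  # [[0, 1], [2, 3], [4]]
print(chunks_padded(rows, 2))    # [[0, 1], [2, 3], [4, None]]
```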

cpcloud · 2014-06-07T19:08:49Z

could pad if it's a tuple tho

dsm054 · 2014-06-07T19:20:40Z

I'll make some time to dig through the pandas questions on SO and look for use cases that we should cover while we're addressing this.

mroeschke · 2021-04-11T04:42:54Z

Since there are only 2 issues here, I think it's okay to track those issues separately, as this tracker isn't really used anymore. Closing.

cpcloud added this to the 0.15.0 milestone Jun 7, 2014

cpcloud added Enhancement labels Jun 7, 2014

jreback modified the milestones: 0.15.0, 0.15.1 Jul 7, 2014

jreback modified the milestones: 0.15.1, 0.15.0 Sep 9, 2014

jorisvandenbossche mentioned this issue Nov 18, 2014

splitting pandas dataframe - np.array_split error #8846

Closed

jreback mentioned this issue Jan 26, 2015

ENH: resample(..., how='range', start=a, stop=b) #3685

Closed

jreback added the Master Tracker High level tracker for similar issues label Mar 6, 2015

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

jorisvandenbossche mentioned this issue Jul 10, 2015

ENH: drop_duplicates(consecutive=True) to drop only consecutive duplicates #10540

Open

jreback mentioned this issue Oct 7, 2015

added random_split in generic.py, for DataFrames etc. #11253

Closed

jreback modified the milestones: Next Major Release, High Level Issue Tracking Sep 24, 2017

TomAugspurger removed the Master Tracker High level tracker for similar issues label Jul 6, 2018

TomAugspurger removed this from the High Level Issue Tracking milestone Jul 6, 2018

jreback added the Window rolling, ewma, expanding label Nov 25, 2020

mroeschke closed this as completed Apr 11, 2021
