Split/Partition Master Issue #7387

Closed
6 of 8 tasks
cpcloud opened this issue Jun 7, 2014 · 5 comments

@cpcloud
Member

cpcloud commented Jun 7, 2014

As pointed out by @dsm054, there are multiple lurking split/partition API requests. Here are the issues and a short summary of what they would do (there are some duplicates here, I've checked off those issues/PRs that have been closed in favor of a related issue):

The toolz library has a partitionby function that provides a nice way to do some of the splitting on sequences and might provide us with some insight on how to approach the API.
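For readers unfamiliar with `partitionby`: it walks the sequence and starts a new group every time the output of the key function changes. A minimal standard-library sketch of that behaviour, using `itertools.groupby` as a stand-in (the helper name `partition_by` is hypothetical, and this is not toolz's implementation):

```python
from itertools import groupby


def partition_by(func, seq):
    """Yield tuples of consecutive items for which func returns the same value."""
    # itertools.groupby only merges *adjacent* items with equal keys,
    # which is exactly the "new group whenever the key changes" behaviour.
    for _, run in groupby(seq, key=func):
        yield tuple(run)


parts = list(partition_by(lambda x: x > 10, [1, 2, 1, 99, 88, 33, 99, -1, 5]))
# parts == [(1, 2, 1), (99, 88, 33, 99), (-1, 5)]
```

Note that, unlike a pandas groupby, equal keys that are not adjacent end up in separate groups.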

cc @jreback @jorisvandenbossche @hayd @danielballan

@cpcloud cpcloud added this to the 0.15.0 milestone Jun 7, 2014
@jreback
Contributor

jreback commented Jun 7, 2014

Here's a way to get started, I think. Since pd.Grouper is already a pretty generic grouper, in that it will return a pd.TimeGrouper if freq is specified, it's pretty easy to make it create a pd.SplitGrouper as needed (more on this in a bit). This is essentially an implementation sketch for this issue rather than a set of specific API recommendations (though I touch on those a bit).

something like this:

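def split(self, grouper=None, **kwargs):
    # proposed as a DataFrame method, i.e. DataFrame.split
    if grouper is None:
        # build a grouper from the keyword args (key, level, freq, ...)
        grouper = pd.Grouper(**kwargs)

    # Groupby.split is the proposed method defined below
    return self.groupby(grouper).split()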

and Groupby.split is the equivalent of calling list on a groupby object, except that it returns just the groups themselves rather than (key, group) pairs, so this is 'trivial':

def split(self):
    return [ grp for g, grp in self ]
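With the existing API, the list comprehension above can already be run against any groupby object. A small sketch of that (the frame and its column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "b", "c"], "value": [1, 2, 3, 4, 5]})

# Iterating a GroupBy yields (group_label, sub-frame) pairs;
# keep only the sub-frames, as the proposed split() would.
pieces = [grp for _, grp in df.groupby("key")]
```

Each element of `pieces` is an ordinary DataFrame holding the rows of one group, and together they cover every row of `df`.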

So then we can easily add some more keyword args to the current pd.Grouper.

Here is the current docstring:

class Grouper(object):
    """
    A Grouper allows the user to specify a groupby instruction for a target object

    This specification will select a column via the key parameter, or if the level and/or
    axis parameters are given, a level of the index of the target object.

    These are local specifications and will override 'global' settings, that is the parameters
    axis and level which are passed to the groupby itself.

    Parameters
    ----------
    key : string, defaults to None
        groupby key, which selects the grouping column of the target
    level : name/number, defaults to None
        the level for the target index
    freq : string / frequency object, defaults to None
        This will groupby the specified frequency if the target selection (via key or level) is
        a datetime-like object
    axis : number/name of the axis, defaults to None
    sort : boolean, default to False
        whether to sort the resulting labels

    additional kwargs to control time-like groupers (when freq is passed)

    closed : closed end of interval; left or right
    label : interval boundary to use for labeling; left or right
    convention : {'start', 'end', 'e', 's'}
        If grouper is PeriodIndex

    Returns
    -------
    A specification for a groupby instruction
    """

So pd.SplitGrouper would simply create the groups (like TimeGrouper and Grouper do now).

I would propose functionality something like this:

class SplitGrouper(Grouper):
    """
    Parameters
    ----------
    partition : integer, default None
        number of rows to include in each group
    pad : (need to have options from toolz)
        how to pad groups specified by partition
    """

If partition has too many meanings, then we could add an n argument as well (but we would need to define these cases).
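Until something like `partition` exists, a fixed-rows-per-group split can be approximated with integer division on the row positions. This is only a sketch of the intended semantics, not the proposed `SplitGrouper`; the helper name `split_rows` is hypothetical:

```python
import numpy as np
import pandas as pd


def split_rows(df, partition):
    """Split df into consecutive pieces of at most `partition` rows each."""
    # Floor-dividing the row positions 0..len-1 by the chunk size
    # assigns one integer group id to every row.
    group_ids = np.arange(len(df)) // partition
    return [grp for _, grp in df.groupby(group_ids)]


df = pd.DataFrame({"x": range(7)})
pieces = split_rows(df, 3)
sizes = [len(p) for p in pieces]  # the last piece holds the leftover row
```

No padding is applied here: when the row count is not a multiple of `partition`, the final piece is simply shorter.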

@cpcloud
Member Author

cpcloud commented Jun 7, 2014

looks good, only suggestion i have is to use the partition_all behavior instead of having a pad arg, i.e., instead of padding, just split into partition number groups possibly of unequal # of rows

@cpcloud
Member Author

cpcloud commented Jun 7, 2014

could pad if it's a tuple tho
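The difference between padding and the `partition_all` behaviour discussed here can be sketched on a plain sequence. The code below re-implements both with the standard library rather than calling toolz, and assumes `partition_all` means what it does in toolz (the last chunk may be shorter); `partition_padded` is a hypothetical name:

```python
from itertools import islice


def partition_all(n, seq):
    """Chunks of n items; the last chunk may be shorter (no padding)."""
    it = iter(seq)
    while True:
        chunk = tuple(islice(it, n))
        if not chunk:
            return
        yield chunk


def partition_padded(n, seq, pad):
    """Chunks of exactly n items; a short final chunk is filled with `pad`."""
    for chunk in partition_all(n, seq):
        yield chunk + (pad,) * (n - len(chunk))


rows = [1, 2, 3, 4, 5]
unpadded = list(partition_all(2, rows))         # [(1, 2), (3, 4), (5,)]
padded = list(partition_padded(2, rows, None))  # [(1, 2), (3, 4), (5, None)]
```

For a DataFrame, padding would presumably mean adding filler rows, which is one argument for the unpadded behaviour as the default.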

@dsm054
Contributor

dsm054 commented Jun 7, 2014

I'll make some time to dig through the pandas questions on SO and look for use cases that we should cover while we're addressing this.

@jreback jreback modified the milestones: 0.15.0, 0.15.1 Jul 7, 2014
@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 9, 2014
@jreback jreback added the Master Tracker High level tracker for similar issues label Mar 6, 2015
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@jreback jreback modified the milestones: Next Major Release, High Level Issue Tracking Sep 24, 2017
@TomAugspurger TomAugspurger removed the Master Tracker High level tracker for similar issues label Jul 6, 2018
@TomAugspurger TomAugspurger removed this from the High Level Issue Tracking milestone Jul 6, 2018
@jreback jreback added the Window rolling, ewma, expanding label Nov 25, 2020
@mroeschke
Member

Since there are only 2 issues here, I think it's okay to track those issues separately, as this tracker isn't really used anymore. Closing.


5 participants