
ENH: make contiguous groupby easier #5494


Closed
dsm054 opened this issue Nov 11, 2013 · 6 comments

Comments

@dsm054
Contributor

dsm054 commented Nov 11, 2013

itertools.groupby groups things contiguously -- great for run length encoding, not so great for partitioning. That forces the groupby(sorted(items, key=keyfn), keyfn) dance if you want a true partition. That's not always what you want either, so you wind up writing

def partition(seq, keyfn):
    d = {}
    for x in seq:
        d.setdefault(keyfn(x), []).append(x)
    return d

and so on.
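For what it's worth, a quick run of that helper on some made-up words (the helper is repeated here so the snippet runs on its own):

```python
def partition(seq, keyfn):
    # Same helper as above: bucket items by key, ignoring order.
    d = {}
    for x in seq:
        d.setdefault(keyfn(x), []).append(x)
    return d

words = ["apple", "avocado", "banana", "cherry", "blueberry"]
groups = partition(words, keyfn=lambda w: w[0])
# {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}
```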

DataFrame.groupby is great for data partitioning, but it merges discontiguous groups. Wanting to cluster timeseries -- first x since the last y, etc. -- is a common task. With some cumsum hacks you can do it, but "get a boolean series, see when it's equal to its shifted value to find the transitions, take advantage of the fact that False == 0 and True == 1 to cumsum that to get something which grows for each cluster, and then groupby on that" is maybe a little more than I'd expect a beginner to have to do to get back what itertools.groupby does naturally. And if there's an easier way, then we should at least make it more obvious.

I'm not sure what the best way to proceed is, but I've answered variants of this several times on SO, and people wanting a cumsum/cumprod-with-reset is a pretty common numpy request.
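As a sketch of what that cumsum-with-reset could look like in plain numpy (my own spelling, not an existing numpy API; it assumes non-negative values so the carried offsets stay monotone):

```python
import numpy as np

def cumsum_with_reset(values, reset):
    # Cumulative sum that restarts at every position where `reset` is
    # True, without a Python-level loop over elements.
    values = np.asarray(values)
    reset = np.asarray(reset, dtype=bool)
    raw = np.cumsum(values)
    # At each reset point, remember the running total accumulated
    # before it, and carry the largest such offset forward.
    offsets = np.maximum.accumulate(np.where(reset, raw - values, 0))
    return raw - offsets

cumsum_with_reset([1, 1, 1, 1, 1], [False, False, True, False, False])
# array([1, 2, 1, 2, 3])
```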

@cpcloud
Member

cpcloud commented Nov 11, 2013

Big +1 here. I often wish I could keep the discontinuity of groups. Maybe a merge_groups=True keyword?

@jreback
Contributor

jreback commented Feb 15, 2014

@dsm054 can you put up a simple example (using the cumsum soln) so we can see what this looks like?

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Feb 18, 2014
@dsm054
Contributor Author

dsm054 commented Feb 28, 2014

@jreback: I often do something like

>>> df = pd.DataFrame({"A": [1,1,2,3,2,2,3], "B": [1]*7})
>>> df
   A  B
0  1  1
1  1  1
2  2  1
3  3  1
4  2  1
5  2  1
6  3  1

[7 rows x 2 columns]
>>> df.groupby("A")["B"].sum()
A
1    2
2    3
3    2
Name: B, dtype: int64
>>> df.groupby((df.A != df.A.shift()).cumsum())["B"].sum()
A
1    2
2    1
3    1
4    2
5    1
Name: B, dtype: int64

which seems obvious now but I remember it not being at all obvious the first time I did it. There's also the "new groups start at delimiters" (df.A == header).cumsum() variant.
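The delimiter variant looks like this on toy data (a made-up series where 0 plays the role of the header):

```python
import pandas as pd

# Toy data: the value 0 acts as a delimiter that starts a new block.
s = pd.Series([0, 5, 7, 0, 2, 0, 9, 9])
header = 0

# (s == header).cumsum() increments at every delimiter, so each block
# (the delimiter plus the rows after it) gets its own group id.
group_ids = (s == header).cumsum()
sums = s.groupby(group_ids).sum()
# group ids: 1 1 1 2 2 3 3 3  ->  block sums 12, 2, 18
```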

Maybe this should be closed in favour of #4059 which seems broader in scope.

@jreback
Contributor

jreback commented Feb 28, 2014

ok... do you want to contribute that as a cookbook recipe and in groupby.rst (in the examples section at the end)?

i'll change this issue to a doc issue then

@jreback jreback added the Docs label Feb 28, 2014
@jreback
Contributor

jreback commented Feb 28, 2014

though... not averse to a partition function as well?

@shumpohl

I implemented a little helper for this since I need it quite often and the performance of the workaround was not sufficient:

from typing import Any, Hashable, Iterator, List, Tuple, Union

import numpy as np
import pandas as pd

def consecutive_groupby(df: pd.DataFrame,
                        columns: Union[Hashable, List[Hashable]]) -> Iterator[Tuple[Any, pd.DataFrame]]:
    """Yield (group value, sub-frame) pairs for consecutive runs of equal key values."""
    group_vals = df[columns]

    # True wherever a row's key differs from the previous row's key.
    splits = np.not_equal(group_vals.values[1:, ...], group_vals.values[:-1, ...])
    if splits.ndim > 1:
        # Several key columns: a new run starts when any of them changes.
        splits = splits.any(axis=1)
        def get_group_val(i): return tuple(group_vals.values[i])
    else:
        get_group_val = group_vals.values.__getitem__

    # Row positions where a new run begins (shifted by one because the
    # comparison above pairs row i+1 with row i).
    split_idx = np.flatnonzero(splits)
    split_idx += 1

    start_idx = 0
    for idx in split_idx:
        yield get_group_val(start_idx), df.iloc[start_idx:idx, :]
        start_idx = idx

    # Final run, from the last split to the end of the frame.
    yield get_group_val(start_idx), df.iloc[start_idx:, :]
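For a single key column the helper should produce the same splits as the cumsum recipe from earlier in the thread; here is that spelling as a compact cross-check (pure pandas, not calling the helper, so the snippet stands alone):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 3, 2, 2, 3], "B": [1] * 7})

# Each run of equal values in A becomes its own group, so df splits
# into five contiguous pieces: [1,1], [2], [3], [2,2], [3].
runs = [grp for _, grp in df.groupby((df.A != df.A.shift()).cumsum())]
```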
