
ENH: make contiguous groupby easier #5494


Closed
dsm054 opened this issue Nov 11, 2013 · 6 comments

Comments

@dsm054
Contributor

dsm054 commented Nov 11, 2013

itertools.groupby groups things contiguously -- great for run length encoding, not so great for partitioning. That forces the groupby(sorted(items, key=keyfn), keyfn) dance if you want a true partition. That's not always what you want either, so you wind up writing

def partition(seq, keyfn):
    d = {}
    for x in seq:
        d.setdefault(keyfn(x), []).append(x)
    return d

and so on.
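For what it's worth, a quick run of that helper on some made-up words (the helper is repeated here so the snippet runs on its own):

```python
def partition(seq, keyfn):
    # Same helper as above: bucket items by key, ignoring order.
    d = {}
    for x in seq:
        d.setdefault(keyfn(x), []).append(x)
    return d

words = ["apple", "avocado", "banana", "cherry", "blueberry"]
groups = partition(words, keyfn=lambda w: w[0])
# {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}
```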

DataFrame.groupby is great for data partitioning, but it merges discontiguous groups. Wanting to cluster timeseries -- first x since the last y, etc. -- is a common task. With some cumsum hacks you can do it, but "get a boolean series, see when it's equal to its shifted value to find the transitions, take advantage of the fact that False == 0 and True == 1 to cumsum that to get something which grows for each cluster, and then groupby on that" is maybe a little more than I'd expect a beginner to have to do to get back what itertools.groupby does naturally. And if there's an easier way, then we should at least make it more obvious.

I'm not sure what the best way to proceed is, but I've answered variants of this several times on SO, and people wanting a cumsum/cumprod-with-reset is a pretty common numpy request.
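As a sketch of what that cumsum-with-reset could look like in plain numpy (my own spelling, not an existing numpy API; it assumes non-negative values so the carried offsets stay monotone):

```python
import numpy as np

def cumsum_with_reset(values, reset):
    # Cumulative sum that restarts at every position where `reset` is
    # True, without a Python-level loop over elements.
    values = np.asarray(values)
    reset = np.asarray(reset, dtype=bool)
    raw = np.cumsum(values)
    # At each reset point, remember the running total accumulated
    # before it, and carry the largest such offset forward.
    offsets = np.maximum.accumulate(np.where(reset, raw - values, 0))
    return raw - offsets

cumsum_with_reset([1, 1, 1, 1, 1], [False, False, True, False, False])
# array([1, 2, 1, 2, 3])
```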

@cpcloud
Member

cpcloud commented Nov 11, 2013

Big +1 here. I often wish I could keep the discontinuity of groups. Maybe a merge_groups=True keyword?

@jreback
Contributor

jreback commented Feb 15, 2014

@dsm054 can you put up a simple example (using the cumsum soln) so we can see what this looks like?

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Feb 18, 2014
@dsm054
Contributor Author

dsm054 commented Feb 28, 2014

@jreback: I often do something like

>>> df = pd.DataFrame({"A": [1,1,2,3,2,2,3], "B": [1]*7})
>>> df
   A  B
0  1  1
1  1  1
2  2  1
3  3  1
4  2  1
5  2  1
6  3  1

[7 rows x 2 columns]
>>> df.groupby("A")["B"].sum()
A
1    2
2    3
3    2
Name: B, dtype: int64
>>> df.groupby((df.A != df.A.shift()).cumsum())["B"].sum()
A
1    2
2    1
3    1
4    2
5    1
Name: B, dtype: int64

which seems obvious now but I remember it not being at all obvious the first time I did it. There's also the "new groups start at delimiters" (df.A == header).cumsum() variant.
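The delimiter variant looks like this on toy data (a made-up series where 0 plays the role of the header):

```python
import pandas as pd

# Toy data: the value 0 acts as a delimiter that starts a new block.
s = pd.Series([0, 5, 7, 0, 2, 0, 9, 9])
header = 0

# (s == header).cumsum() increments at every delimiter, so each block
# (the delimiter plus the rows after it) gets its own group id.
group_ids = (s == header).cumsum()
sums = s.groupby(group_ids).sum()
# group ids: 1 1 1 2 2 3 3 3  ->  block sums 12, 2, 18
```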

Maybe this should be closed in favour of #4059 which seems broader in scope.

@jreback
Contributor

jreback commented Feb 28, 2014

ok... do you want to contribute that as a cookbook recipe and in groupby.rst (in the examples section at the end)?

i'll change this issue to a doc issue then

@jreback jreback added the Docs label Feb 28, 2014
@jreback
Contributor

jreback commented Feb 28, 2014

though... not averse to a partition function as well?

@shumpohl

I implemented a little helper for this since I need it quite often and the performance of the workaround was not sufficient:

from typing import Any, Hashable, Iterator, List, Tuple, Union

import numpy as np
import pandas as pd

def consecutive_groupby(df: pd.DataFrame,
                        columns: Union[Hashable, List[Hashable]]) -> Iterator[Tuple[Any, pd.DataFrame]]:
    """Yield (group value, sub-frame) pairs for consecutive runs of equal key values."""
    group_vals = df[columns]

    # True wherever a row's key differs from the previous row's key.
    splits = np.not_equal(group_vals.values[1:, ...], group_vals.values[:-1, ...])
    if splits.ndim > 1:
        # Several key columns: a new run starts when any of them changes.
        splits = splits.any(axis=1)
        def get_group_val(i): return tuple(group_vals.values[i])
    else:
        get_group_val = group_vals.values.__getitem__

    # Row positions where a new run begins (shifted by one because the
    # comparison above pairs row i+1 with row i).
    split_idx = np.flatnonzero(splits)
    split_idx += 1

    start_idx = 0
    for idx in split_idx:
        yield get_group_val(start_idx), df.iloc[start_idx:idx, :]
        start_idx = idx

    # Final run, from the last split to the end of the frame.
    yield get_group_val(start_idx), df.iloc[start_idx:, :]
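For a single key column the helper should produce the same splits as the cumsum recipe from earlier in the thread; here is that spelling as a compact cross-check (pure pandas, not calling the helper, so the snippet stands alone):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 3, 2, 2, 3], "B": [1] * 7})

# Each run of equal values in A becomes its own group, so df splits
# into five contiguous pieces: [1,1], [2], [3], [2,2], [3].
runs = [grp for _, grp in df.groupby((df.A != df.A.shift()).cumsum())]
```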
