ENH: make contiguous groupby easier #5494
Comments
Big +1 here. I often wish I could keep the discontinuity of groups. Maybe a …
@dsm054 can you put up a simple example (and use the …)?
@jreback: I often do something like …

which seems obvious now, but I remember it not being at all obvious the first time I did it. There's also the "new groups start at delimiters" case. Maybe this should be closed in favour of #4059, which seems broader in scope.
ok... do you want to contribute that as a cookbook recipe and in groupby.rst (in the examples section at the end)? I'll change this issue to a doc issue then.
though... not averse to a …
I implemented a little helper for this, since I need it quite often and the workaround's performance was not sufficient:

```python
from typing import Any, Hashable, Iterator, List, Tuple, Union

import numpy as np
import pandas as pd


def consecutive_groupby(df: pd.DataFrame,
                        columns: Union[Hashable, List[Hashable]]) -> Iterator[Tuple[Any, pd.DataFrame]]:
    group_vals = df[columns]
    # Positions where the key value(s) differ from the previous row.
    splits = np.not_equal(group_vals.values[1:, ...], group_vals.values[:-1, ...])
    if splits.ndim > 1:
        # Several key columns: a new group starts when any of them changes.
        splits = splits.any(axis=1)

        def get_group_val(i):
            return tuple(group_vals.values[i])
    else:
        get_group_val = group_vals.values.__getitem__
    split_idx = np.flatnonzero(splits)
    split_idx += 1
    start_idx = 0
    for idx in split_idx:
        group_val = get_group_val(start_idx)
        yield group_val, df.iloc[start_idx:idx, :]
        start_idx = idx
    group_val = get_group_val(start_idx)
    yield group_val, df.iloc[start_idx:, :]
```
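For comparison, the multi-column case can also be expressed with the shift/cumsum idiom discussed in this thread (a sketch with made-up data, not from the thread; `ne(...).any(axis=1)` flags rows where any key column changed):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2, 1],
                   "b": ["x", "x", "x", "y", "y"],
                   "v": range(5)})

keys = df[["a", "b"]]
# True wherever either key column differs from the previous row.
changed = keys.ne(keys.shift()).any(axis=1)
# cumsum turns the change flags into a label that grows per contiguous run.
run_id = changed.cumsum()

result = [(tuple(g[["a", "b"]].iloc[0]), g["v"].tolist())
          for _, g in df.groupby(run_id)]
```

Unlike a plain `df.groupby(["a", "b"])`, this keeps discontiguous runs of the same key pair separate.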
`itertools.groupby` groups things contiguously -- great for run-length encoding, not so great for partitioning. This necessitates the `groupby(sorted(items, key=keyfn), keyfn)` dance if you want to separate it. That's not always what you want either, so you wind up writing … and so on.
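To make that dance concrete (a sketch with made-up data):

```python
from itertools import groupby

items = [("a", 1), ("a", 2), ("b", 3), ("a", 4)]
keyfn = lambda pair: pair[0]

# Contiguous grouping: the trailing "a" item forms its own group.
runs = [(k, [v for _, v in g]) for k, g in groupby(items, keyfn)]

# Partitioning requires sorting first, which throws away the run structure.
parts = [(k, [v for _, v in g]) for k, g in groupby(sorted(items, key=keyfn), keyfn)]
```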
`DataFrame.groupby` is great for data partitioning, but it merges discontiguous groups. Wanting to cluster timeseries -- first x since the last y, etc. -- is a common task. With some `cumsum` hacks you can do it, but "get a boolean series, see when it's equal to its shifted value to find the transitions, take advantage of the fact that False == 0 and True == 1 to `cumsum` that to get something which grows for each cluster, and then `groupby` on that" is maybe a little more than I'd expect a beginner to have to do to get back what `itertools.groupby` does naturally. And if there's an easier way, then we should at least make it more obvious.

I'm not sure what the best way to proceed is, but I've answered variants of this several times on SO, and people wanting a cumsum/cumprod-with-reset is a pretty common numpy request.
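Spelled out, that workaround looks something like this (a sketch with a made-up boolean condition):

```python
import pandas as pd

s = pd.Series([0, 3, 5, 0, 0, 4],
              index=pd.date_range("2013-01-01", periods=6))

# Boolean series marking the condition of interest.
active = s > 0

# Compare to the shifted series to find transitions; since False == 0 and
# True == 1, cumsum yields a label that grows by one per cluster.
cluster = (active != active.shift()).cumsum()

clusters = [grp.tolist() for _, grp in s.groupby(cluster)]
```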