ENH: A new GroupBy method to slice rows preserving index and order #42947

Merged
merged 119 commits into from
Oct 15, 2021
Changes from 16 commits
72fd66d
ENH: A new GroupBy method to slice rows preserving index and order
johnzangwill Aug 9, 2021
d0ebbeb
Formatting
johnzangwill Aug 9, 2021
33d7992
Formatting
johnzangwill Aug 9, 2021
78e9ced
Formatting
johnzangwill Aug 9, 2021
4d098cd
Formatting
johnzangwill Aug 9, 2021
f84c365
Formatting
johnzangwill Aug 9, 2021
d937757
Add iloc to test_tab_completion
johnzangwill Aug 9, 2021
e206912
Add iloc to groupby/base.py
johnzangwill Aug 9, 2021
1788f1b
Documentation
johnzangwill Aug 10, 2021
f6977fa
Cosmetics to make pre-commit happy
johnzangwill Aug 10, 2021
bca4fdd
Improve docstring
johnzangwill Aug 11, 2021
66536b1
Delete a.md
johnzangwill Aug 11, 2021
d075c67
Add to doc and improve test
johnzangwill Aug 19, 2021
df1a767
Tidy-up for pre-commit
johnzangwill Aug 19, 2021
f2e9f79
Update groupbyindexing.py
johnzangwill Aug 19, 2021
a9f9848
Split a long line
johnzangwill Aug 20, 2021
e42c86d
GroupBy.rows implementation
johnzangwill Sep 2, 2021
bab88c9
Add rows to rst file
johnzangwill Sep 2, 2021
a74bd33
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 2, 2021
c77de1d
Change iloc to rows in test_allowlist.py
johnzangwill Sep 2, 2021
0d750bb
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 2, 2021
e952c25
Add to base.py
johnzangwill Sep 2, 2021
2a6aafc
Tidy some whitespace for pep8speaks
johnzangwill Sep 2, 2021
b7f8bfe
Tidied mask code
johnzangwill Sep 4, 2021
86e0c2e
test_rows.py formatting
johnzangwill Sep 4, 2021
6f75502
Correct docstring bullet format
johnzangwill Sep 4, 2021
8de5ff2
Update test_rows.py
johnzangwill Sep 5, 2021
f51fa88
Remove blank line at end of docstring
johnzangwill Sep 6, 2021
3063f3a
Small change to force rebuild
johnzangwill Sep 6, 2021
4228251
Make rows 100% compatible with nth
johnzangwill Sep 8, 2021
41b1c73
Temporarily reroute nth list and slice to rows
johnzangwill Sep 8, 2021
ce36210
Rows for all non-dropna calls + types and tests
johnzangwill Sep 9, 2021
70dcdb5
Merge branch 'master' into groupby_iloc
johnzangwill Sep 9, 2021
c024e41
Changes for flake8
johnzangwill Sep 9, 2021
8abcac3
just one more comma...
johnzangwill Sep 9, 2021
add5727
Add type hints
johnzangwill Sep 10, 2021
bcd1dd9
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 10, 2021
25459f7
Delete my build.cmd. Accidental commit
johnzangwill Sep 12, 2021
fa6b86c
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 12, 2021
fefbacf
jreback 12 Sep requested changes
johnzangwill Sep 13, 2021
fa9f7e3
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 13, 2021
b589420
remove white-space
johnzangwill Sep 13, 2021
89deee3
Get rid of np.int test
johnzangwill Sep 13, 2021
e28cdfb
Revert "Get rid of np.int test"
johnzangwill Sep 13, 2021
424ab14
Try again...
johnzangwill Sep 13, 2021
258530d
More jreback requested changes
johnzangwill Sep 13, 2021
d49e48f
More tweaks
johnzangwill Sep 13, 2021
1dd6258
Whitespace
johnzangwill Sep 13, 2021
f84f5c0
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 13, 2021
c068162
Remove blank lines in conditionals
johnzangwill Sep 14, 2021
536298e
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 14, 2021
0e73278
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 14, 2021
4cfde7b
Mainly variable changes and some formatting
johnzangwill Sep 14, 2021
6343c9f
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 14, 2021
33a2225
Make group_selection_context a private GroupBy class method
johnzangwill Sep 14, 2021
6ca80c2
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 14, 2021
e94d4a8
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 14, 2021
0d91dca
Add conditional typing for groupby import
johnzangwill Sep 14, 2021
acc3993
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 14, 2021
df52694
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 14, 2021
f42ae41
Delete Example section
johnzangwill Sep 15, 2021
898fad4
Changes for @rhshadrach.
johnzangwill Sep 17, 2021
ffaaf25
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 17, 2021
02ec03c
Remove more docstrings from tests
johnzangwill Sep 17, 2021
0691f99
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 17, 2021
7cad2c0
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 17, 2021
44120e1
Don't need to check for None anymore
johnzangwill Sep 17, 2021
88b8ac5
Speed up by checking dropna
johnzangwill Sep 18, 2021
0ee53cd
Implement head, tail. column axis, change _rows to _middle and remove…
johnzangwill Sep 20, 2021
945a482
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 20, 2021
9412e3e
Change _middle to _body
johnzangwill Sep 21, 2021
138b791
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 21, 2021
6b29c82
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 21, 2021
ea45bc6
Change class name to match
johnzangwill Sep 21, 2021
179912e
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 21, 2021
19edf00
Add negative values to test_body.py/test_against_head_and_tail()
johnzangwill Sep 22, 2021
94f6e99
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 22, 2021
ae21059
Add _body docstring
johnzangwill Sep 25, 2021
5b8142b
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 25, 2021
c8e0950
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 25, 2021
6ce90c4
Make nth a link
johnzangwill Sep 25, 2021
4f6cbe1
Improve doc
johnzangwill Sep 26, 2021
7d92c79
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 26, 2021
19b21bb
Simplify examples
johnzangwill Sep 26, 2021
1a055e4
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 26, 2021
ca164cf
Fix FrameOrSeries typing problem
johnzangwill Sep 26, 2021
337b15c
Fix more new typing problems
johnzangwill Sep 26, 2021
69d8956
More typing problems
johnzangwill Sep 26, 2021
cecc674
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 26, 2021
98a9460
More typing woes
johnzangwill Sep 26, 2021
95eb548
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 26, 2021
4c8644b
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 27, 2021
d0e9aa0
Create test_body.py
johnzangwill Sep 28, 2021
10cca16
Merge branch 'master' into groupby_iloc
johnzangwill Sep 28, 2021
9ccebf1
Merge branch 'master' into groupby_iloc
johnzangwill Sep 28, 2021
a3db969
Resolve conflicts
johnzangwill Sep 28, 2021
4c4ba92
Avoid groupby name clash
johnzangwill Sep 28, 2021
13ff29f
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 28, 2021
acf67b1
Delete duplicated test_body.py
johnzangwill Sep 29, 2021
a3db6d1
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 29, 2021
ee8a86b
Merge branch 'master' into groupby_iloc
johnzangwill Sep 29, 2021
f4b24b0
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 30, 2021
82360f5
Rename test_body.py to test_indexing.py
johnzangwill Oct 4, 2021
ba836dc
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 4, 2021
8abcad7
@jreback suggested renames
johnzangwill Oct 4, 2021
86c8e20
Update whatsnew v1.4.0
johnzangwill Oct 4, 2021
ee33df0
Correct typo in doc
johnzangwill Oct 4, 2021
4a1aac9
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 4, 2021
d9671a6
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 5, 2021
a6dbc61
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 6, 2021
f65093c
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 8, 2021
534ea54
Resolve with another branch
johnzangwill Oct 9, 2021
511c8fd
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 9, 2021
97c3ac0
NDFrameT cannot be used like that
johnzangwill Oct 9, 2021
90a4cb8
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 10, 2021
b58b235
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 12, 2021
f5ed6bf
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 12, 2021
21b3637
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 13, 2021
88613a9
Merge branch 'master' into groupby_iloc
johnzangwill Oct 13, 2021
1 change: 1 addition & 0 deletions doc/source/reference/groupby.rst
Expand Up @@ -67,6 +67,7 @@ Computations / descriptive stats
GroupBy.min
GroupBy.ngroup
GroupBy.nth
GroupBy.iloc
GroupBy.ohlc
GroupBy.pad
GroupBy.prod
Expand Down
1 change: 1 addition & 0 deletions pandas/core/groupby/base.py
Expand Up @@ -125,6 +125,7 @@
"groups",
"head",
"hist",
"iloc",
"indices",
"ndim",
"ngroups",
Expand Down
3 changes: 2 additions & 1 deletion pandas/core/groupby/groupby.py
Expand Up @@ -95,6 +95,7 @@ class providing the base-class of operations.
numba_,
ops,
)
from pandas.core.groupby.groupbyindexing import GroupByIndexingMixin
from pandas.core.indexes.api import (
CategoricalIndex,
Index,
Expand Down Expand Up @@ -565,7 +566,7 @@ def group_selection_context(groupby: GroupBy) -> Iterator[GroupBy]:
]


-class BaseGroupBy(PandasObject, SelectionMixin[FrameOrSeries]):
+class BaseGroupBy(PandasObject, SelectionMixin[FrameOrSeries], GroupByIndexingMixin):
_group_selection: IndexLabel | None = None
_apply_allowlist: frozenset[str] = frozenset()
_hidden_attrs = PandasObject._hidden_attrs | {
Expand Down
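The hunk above mixes GroupByIndexingMixin into BaseGroupBy so that `grouped.iloc[...]` works: a property returns an indexer object, and that object's `__getitem__` receives whatever appears between the brackets. A minimal, pandas-free sketch of the pattern follows. The names `Container`, `IndexerMixin`, `_Indexer` and `_select` are hypothetical and chosen only for illustration; they are not part of this PR.

```python
from __future__ import annotations


class _Indexer:
    """Hypothetical indexer: keeps a reference to its owner and handles []."""

    def __init__(self, owner: Container) -> None:
        self.owner = owner

    def __getitem__(self, arg):
        # obj.iloc[a, b] arrives as the tuple (a, b); obj.iloc[a] arrives as a.
        if isinstance(arg, tuple):
            row_key, col_key = arg
        else:
            row_key, col_key = arg, None
        return self.owner._select(row_key, col_key)


class IndexerMixin:
    """Hypothetical mixin: exposes .iloc as a property that builds an indexer."""

    @property
    def iloc(self) -> _Indexer:
        return _Indexer(self)


class Container(IndexerMixin):
    """Toy owner class holding a list of rows (each row is a list)."""

    def __init__(self, rows):
        self.rows = rows

    def _select(self, row_key, col_key):
        # Normalise an integer row key to a one-row result.
        if isinstance(row_key, slice):
            rows = self.rows[row_key]
        else:
            rows = [self.rows[row_key]]
        if col_key is None:
            return rows
        return [row[col_key] for row in rows]


table = Container([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
tail_rows = table.iloc[1:]
block = table.iloc[:2, 1:]
```

Keeping the bracket handling in a separate indexer class means the owner only needs one extra property, which is presumably why the PR adds a mixin rather than more methods on BaseGroupBy.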
251 changes: 251 additions & 0 deletions pandas/core/groupby/groupbyindexing.py
@@ -0,0 +1,251 @@
from __future__ import annotations

import numpy as np

from pandas.util._decorators import doc


class GroupByIndexingMixin:
    """
    Mixin for adding .iloc to GroupBy.
    """

    @property
    def iloc(self) -> _ilocGroupByIndexer:
        """
        Purely integer-location based indexing for selection by position per group.

        ``.iloc[]`` is primarily integer position based (from ``0`` to
        ``length-1`` of the axis).

        Allowed inputs for the first index are:

        - An integer, e.g. ``5``.
        - A slice object with ints and positive step, e.g. ``1:``, ``4:-3:2``.

        Allowed inputs for the second index are as for DataFrame.iloc, namely:

        - An integer, e.g. ``5``.
        - A list or array of integers, e.g. ``[4, 3, 0]``.
        - A slice object with ints, e.g. ``1:7``.
        - A boolean array.
        - A ``callable`` function with one argument (the calling Series or
          DataFrame) that returns valid output for indexing (one of the above).

        The output format is the same as GroupBy.head and GroupBy.tail, namely a subset
        of the original DataFrame or Series with the index and order preserved.

        The effect of ``grouped.iloc[i:j, k:l]`` is similar to

            grouped.apply(lambda x: x.iloc[i:j, k:l])

        but much faster, and it preserves the original index and order.

        The behaviour is different from GroupBy.take:

        - Input to iloc is a slice of indexes rather than a list of indexes.
        - Output from iloc:

          - Is in the same order as the original grouped DataFrame or Series.
          - Has the same index columns as the original grouped DataFrame or
            Series. (GroupBy.take introduces an additional index.)

        - GroupBy.take is extremely slow when there is a high group count.

        The behaviour is different from GroupBy.nth:

        - Input to iloc is a slice of indexes rather than a list of indexes.
        - Output from iloc:

          - Is in the same order as the original grouped DataFrame or Series.
          - Has the same index columns as the original grouped DataFrame or
            Series. (nth behaves like an aggregator and removes the non-grouped
            indexes.)

        - GroupBy.nth is quite fast for a high group count but slower than head,
          tail and iloc.

        Since GroupBy.take and GroupBy.nth only accept a list of individual indexes,
        it is not possible to define a slice that ends relative to the last row of
        each group.

        An important use case for GroupBy.iloc is a multi-indexed DataFrame with a
        large primary index (Date, say) and a secondary index sorted to a different
        order for each Date.
        To reduce the DataFrame to a middle slice of each Date:

            df.groupby("Date").iloc[5:-5]

        This returns a subset of df containing just the middle rows for each Date,
        with its original order and indexing preserved.
        (See test_multiindex() in tests/groupby/test_groupby_iloc.py)

        Returns
        -------
        Series
            The filtered subset of the original grouped Series.
        DataFrame
            The filtered subset of the original grouped DataFrame.

        See Also
        --------
        DataFrame.iloc : Purely integer-location based indexing for selection by
            position.
        GroupBy.head : Return first n rows of each group.
        GroupBy.tail : Return last n rows of each group.
        GroupBy.nth : Take the nth row from each group if n is an int, or a
            subset of rows, if n is a list of ints.
        DataFrameGroupBy.take : Return the elements in the given positional indices
            along an axis.

        Examples
        --------
        >>> df = pd.DataFrame([["a", 1], ["a", 2], ["a", 3], ["b", 4], ["b", 5]],
        ...                   columns=["A", "B"])
        >>> df.groupby("A").iloc[1:2]
           A  B
        1  a  2
        4  b  5
        >>> df.groupby("A").iloc[:-1, -1:]
           B
        0  1
        1  2
        3  4
        """
        return _ilocGroupByIndexer(self)


@doc(GroupByIndexingMixin.iloc)
class _ilocGroupByIndexer:
    def __init__(self, grouped):
        self.grouped = grouped
        self.reversed = False
        self._cached_ascending_count = None
        self._cached_descending_count = None

    def __getitem__(self, arg):
        self.reversed = False

        # grouped.iloc[rows, cols] arrives as a tuple; grouped.iloc[rows] does not.
        if isinstance(arg, tuple):
            return self._handle_item(arg[0], arg[1])

        else:
            return self._handle_item(arg, None)

    def _handle_item(self, arg0, arg1):
        typeof_arg = type(arg0)

        if typeof_arg == slice:
            start = arg0.start
            stop = arg0.stop
            step = arg0.step

            if step is not None and step < 0:
                # The f prefix belongs on the literal that holds the placeholders.
                raise ValueError(
                    "GroupBy.iloc row slice step must be positive."
                    f" Slice was {start}:{stop}:{step}"
                )
                # self.reversed = True
                # start = None if start is None else -start - 1
                # stop = None if stop is None else -stop - 1
                # step = -step

            return self._handle_slice(start, stop, step, arg1)

        elif typeof_arg == int:
            # An integer i selects row i of each group, i.e. the slice i:i+1.
            # For i == -1 the stop must stay open, because the slice -1:0 is empty.
            stop = arg0 + 1 if arg0 != -1 else None
            return self._handle_slice(arg0, stop, 1, arg1)

        else:
            raise ValueError(
                f"GroupBy.iloc row must be an integer or a slice, not a {typeof_arg}"
            )

    def _handle_slice(self, start, stop, step, arg1):
        mask = None
        if step is None:
            step = 1

        self.grouped._reset_group_selection()

        if start is None:
            if step > 1:
                mask = self._ascending_count % step == 0

        else:
            if start >= 0:
                mask = self._ascending_count >= start

                if step > 1:
                    mask &= (self._ascending_count - start) % step == 0

            else:
                mask = self._descending_count < -start

                if step > 1:
                    #
                    # If start is negative and -start exceeds the length of a
                    # group, then step must count from the first row of that
                    # group rather than from the calculated offset.
                    #
                    # _ascending_count + _descending_count + 1 is the length of
                    # the current group, which makes it possible to switch
                    # between offset_array and _ascending_count depending on
                    # whether -start exceeds the group size.
                    #
                    offset_array = self._descending_count + start + 1
                    limit_array = (
                        self._ascending_count + self._descending_count + (start + 1)
                    ) < 0
                    offset_array = np.where(
                        limit_array, self._ascending_count, offset_array
                    )

                    mask &= offset_array % step == 0

        if stop is not None:
            if stop >= 0:
                if mask is None:
                    mask = self._ascending_count < stop

                else:
                    mask &= self._ascending_count < stop
            else:
                if mask is None:
                    mask = self._descending_count >= -stop

                else:
                    mask &= self._descending_count >= -stop

        if mask is None:
            arg0 = slice(None)

        else:
            arg0 = mask

        if arg1 is None:
            return self._selected_obj.iloc[arg0]

        else:
            return self._selected_obj.iloc[arg0, arg1]

    @property
    def _ascending_count(self):
        if self._cached_ascending_count is None:
            self._cached_ascending_count = self.grouped._cumcount_array()
        if self.reversed:
            self._cached_ascending_count = self._cached_ascending_count[::-1]

        return self._cached_ascending_count

    @property
    def _descending_count(self):
        if self._cached_descending_count is None:
            self._cached_descending_count = self.grouped._cumcount_array(
                ascending=False
            )
        if self.reversed:
            self._cached_descending_count = self._cached_descending_count[::-1]

        return self._cached_descending_count

    @property
    def _selected_obj(self):
        if self.reversed:
            return self.grouped._selected_obj.iloc[::-1]

        else:
            return self.grouped._selected_obj
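
The core idea of the file above is to turn a per-group slice into a single boolean mask built from two arrays: each row's position counted from the start of its group and from the end of its group. The sketch below reproduces that idea for `start` and `stop` only, using the public `GroupBy.cumcount` in place of the private `_cumcount_array` the PR calls. The helper name `slice_rows_per_group` is hypothetical, and step handling is deliberately left out.

```python
import numpy as np
import pandas as pd


def slice_rows_per_group(df, by, start=None, stop=None):
    """Hypothetical helper: keep rows start:stop of every group, in original order."""
    grouped = df.groupby(by)
    # Position of each row within its group, counted from the front and the back.
    ascending = grouped.cumcount().to_numpy()
    descending = grouped.cumcount(ascending=False).to_numpy()

    mask = np.ones(len(df), dtype=bool)
    if start is not None:
        # Non-negative start counts from the front, negative start from the back.
        mask &= (ascending >= start) if start >= 0 else (descending < -start)
    if stop is not None:
        mask &= (ascending < stop) if stop >= 0 else (descending >= -stop)

    # A boolean mask keeps the original index and row order untouched.
    return df.iloc[mask]


df = pd.DataFrame(
    [["a", 1], ["a", 2], ["a", 3], ["b", 4], ["b", 5]], columns=["A", "B"]
)
second_rows = slice_rows_per_group(df, "A", start=1, stop=2)
all_but_last = slice_rows_per_group(df, "A", stop=-1)
```

With the DataFrame from the docstring example, `second_rows` keeps index labels 1 and 4, and `all_but_last` keeps 0, 1 and 3, matching the docstring output.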
1 change: 1 addition & 0 deletions pandas/tests/groupby/test_allowlist.py
Expand Up @@ -309,6 +309,7 @@ def test_tab_completion(mframe):
"rank",
"cumprod",
"tail",
"iloc",
"resample",
"cummin",
"fillna",
Expand Down
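The test_allowlist.py change only registers the new name for tab completion. To check the slicing behaviour itself, including steps and negative bounds, an independent oracle is useful. The following is a sketch in plain Python, not code from this PR; `expected_positions` is a hypothetical name. It applies an ordinary Python slice inside each group and reports which original row positions survive.

```python
from collections import defaultdict


def expected_positions(keys, row_slice):
    """Row positions kept when row_slice is applied within each group.

    keys is the group key of every row, in original row order. The result is
    sorted so that it reflects the original row order, as the PR intends.
    """
    positions_by_group = defaultdict(list)
    for position, key in enumerate(keys):
        positions_by_group[key].append(position)

    kept = []
    for positions in positions_by_group.values():
        # Plain list slicing defines the reference semantics for each group.
        kept.extend(positions[row_slice])
    return sorted(kept)


keys = ["a", "a", "a", "b", "b"]
one_row = expected_positions(keys, slice(1, 2))
drop_last = expected_positions(keys, slice(None, -1))
stepped = expected_positions(keys, slice(-3, None, 2))
```

The last call exercises the case the mask code treats specially: group "b" has only two rows, so a start of -3 falls before its first row and the step has to count from row 0 of that group.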