ENH: A new GroupBy method to slice rows preserving index and order #42947

Merged
merged 119 commits into from
Oct 15, 2021
Changes from 10 commits
Commits
72fd66d
ENH: A new GroupBy method to slice rows preserving index and order
johnzangwill Aug 9, 2021
d0ebbeb
Formatting
johnzangwill Aug 9, 2021
33d7992
Formatting
johnzangwill Aug 9, 2021
78e9ced
Formatting
johnzangwill Aug 9, 2021
4d098cd
Formatting
johnzangwill Aug 9, 2021
f84c365
Formatting
johnzangwill Aug 9, 2021
d937757
Add iloc to test_tab_completion
johnzangwill Aug 9, 2021
e206912
Add iloc to groupby/base.py
johnzangwill Aug 9, 2021
1788f1b
Documentation
johnzangwill Aug 10, 2021
f6977fa
Cosmetics to make pre-commit happy
johnzangwill Aug 10, 2021
bca4fdd
Improve docstring
johnzangwill Aug 11, 2021
66536b1
Delete a.md
johnzangwill Aug 11, 2021
d075c67
Add to doc and improve test
johnzangwill Aug 19, 2021
df1a767
Tidy-up for pre-commit
johnzangwill Aug 19, 2021
f2e9f79
Update groupbyindexing.py
johnzangwill Aug 19, 2021
a9f9848
Split a long line
johnzangwill Aug 20, 2021
e42c86d
GroupBy.rows implementation
johnzangwill Sep 2, 2021
bab88c9
Add rows to rst file
johnzangwill Sep 2, 2021
a74bd33
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 2, 2021
c77de1d
Change iloc to rows in test_allowlist.py
johnzangwill Sep 2, 2021
0d750bb
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 2, 2021
e952c25
Add to base.py
johnzangwill Sep 2, 2021
2a6aafc
Tidy some whitespace for pep8speaks
johnzangwill Sep 2, 2021
b7f8bfe
Tidied mask code
johnzangwill Sep 4, 2021
86e0c2e
test_rows.py formatting
johnzangwill Sep 4, 2021
6f75502
Correct docstring bullet format
johnzangwill Sep 4, 2021
8de5ff2
Update test_rows.py
johnzangwill Sep 5, 2021
f51fa88
Remove blank line at end of docstring
johnzangwill Sep 6, 2021
3063f3a
Small change to force rebuild
johnzangwill Sep 6, 2021
4228251
Make rows 100% compatible with nth
johnzangwill Sep 8, 2021
41b1c73
Temporarily reroute nth list and slice to rows
johnzangwill Sep 8, 2021
ce36210
Rows for all non-dropna calls + types and tests
johnzangwill Sep 9, 2021
70dcdb5
Merge branch 'master' into groupby_iloc
johnzangwill Sep 9, 2021
c024e41
Changes for flake8
johnzangwill Sep 9, 2021
8abcac3
just one more comma...
johnzangwill Sep 9, 2021
add5727
Add type hints
johnzangwill Sep 10, 2021
bcd1dd9
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 10, 2021
25459f7
Delete my build.cmd. Accidental commit
johnzangwill Sep 12, 2021
fa6b86c
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 12, 2021
fefbacf
jreback 12 Sep requested changes
johnzangwill Sep 13, 2021
fa9f7e3
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 13, 2021
b589420
remove white-space
johnzangwill Sep 13, 2021
89deee3
Get rid of np.int test
johnzangwill Sep 13, 2021
e28cdfb
Revert "Get rid of np.int test"
johnzangwill Sep 13, 2021
424ab14
Try again...
johnzangwill Sep 13, 2021
258530d
More jreback requested changes
johnzangwill Sep 13, 2021
d49e48f
More tweaks
johnzangwill Sep 13, 2021
1dd6258
Whitespace
johnzangwill Sep 13, 2021
f84f5c0
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 13, 2021
c068162
Remove blank lines in conditionals
johnzangwill Sep 14, 2021
536298e
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 14, 2021
0e73278
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 14, 2021
4cfde7b
Mainly variable changes and some formatting
johnzangwill Sep 14, 2021
6343c9f
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 14, 2021
33a2225
Make group_selection_context a private GroupBy class method
johnzangwill Sep 14, 2021
6ca80c2
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 14, 2021
e94d4a8
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 14, 2021
0d91dca
Add conditional typing for groupby import
johnzangwill Sep 14, 2021
acc3993
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 14, 2021
df52694
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 14, 2021
f42ae41
Delete Example section
johnzangwill Sep 15, 2021
898fad4
Changes for @rhshadrach.
johnzangwill Sep 17, 2021
ffaaf25
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 17, 2021
02ec03c
Remove more docstrings from tests
johnzangwill Sep 17, 2021
0691f99
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 17, 2021
7cad2c0
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 17, 2021
44120e1
Don't need to check for None anymore
johnzangwill Sep 17, 2021
88b8ac5
Speed up by checking dropna
johnzangwill Sep 18, 2021
0ee53cd
Implement head, tail. column axis, change _rows to _middle and remove…
johnzangwill Sep 20, 2021
945a482
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 20, 2021
9412e3e
Change _middle to _body
johnzangwill Sep 21, 2021
138b791
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 21, 2021
6b29c82
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 21, 2021
ea45bc6
Change class name to match
johnzangwill Sep 21, 2021
179912e
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 21, 2021
19edf00
Add negative values to test_body.py/test_against_head_and_tail()
johnzangwill Sep 22, 2021
94f6e99
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 22, 2021
ae21059
Add _body docstring
johnzangwill Sep 25, 2021
5b8142b
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 25, 2021
c8e0950
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 25, 2021
6ce90c4
Make nth a link
johnzangwill Sep 25, 2021
4f6cbe1
Improve doc
johnzangwill Sep 26, 2021
7d92c79
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 26, 2021
19b21bb
Simplify examples
johnzangwill Sep 26, 2021
1a055e4
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 26, 2021
ca164cf
Fix FrameOrSeries typing problem
johnzangwill Sep 26, 2021
337b15c
Fix more new typing problems
johnzangwill Sep 26, 2021
69d8956
More typing problems
johnzangwill Sep 26, 2021
cecc674
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 26, 2021
98a9460
More typing woes
johnzangwill Sep 26, 2021
95eb548
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 26, 2021
4c8644b
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 27, 2021
d0e9aa0
Create test_body.py
johnzangwill Sep 28, 2021
10cca16
Merge branch 'master' into groupby_iloc
johnzangwill Sep 28, 2021
9ccebf1
Merge branch 'master' into groupby_iloc
johnzangwill Sep 28, 2021
a3db969
Resolve conflicts
johnzangwill Sep 28, 2021
4c4ba92
Avoid groupby name clash
johnzangwill Sep 28, 2021
13ff29f
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 28, 2021
acf67b1
Delete duplicated test_body.py
johnzangwill Sep 29, 2021
a3db6d1
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 29, 2021
ee8a86b
Merge branch 'master' into groupby_iloc
johnzangwill Sep 29, 2021
f4b24b0
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 30, 2021
82360f5
Rename test_body.py to test_indexing.py
johnzangwill Oct 4, 2021
ba836dc
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 4, 2021
8abcad7
@jreback suggested renames
johnzangwill Oct 4, 2021
86c8e20
Update whatsnew v1.4.0
johnzangwill Oct 4, 2021
ee33df0
Correct typo in doc
johnzangwill Oct 4, 2021
4a1aac9
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 4, 2021
d9671a6
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 5, 2021
a6dbc61
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 6, 2021
f65093c
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 8, 2021
534ea54
Resolve with another branch
johnzangwill Oct 9, 2021
511c8fd
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 9, 2021
97c3ac0
NDFrameT cannot be used like that
johnzangwill Oct 9, 2021
90a4cb8
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 10, 2021
b58b235
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 12, 2021
f5ed6bf
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 12, 2021
21b3637
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 13, 2021
88613a9
Merge branch 'master' into groupby_iloc
johnzangwill Oct 13, 2021
1 change: 1 addition & 0 deletions pandas/core/groupby/base.py
@@ -125,6 +125,7 @@
"groups",
"head",
"hist",
"iloc",
"indices",
"ndim",
"ngroups",
4 changes: 3 additions & 1 deletion pandas/core/groupby/groupby.py
@@ -109,6 +109,8 @@ class providing the base-class of operations.
maybe_use_numba,
)

from pandas.core.groupby.groupbyindexing import GroupByIndexingMixin

_common_see_also = """
See Also
--------
@@ -565,7 +567,7 @@ def group_selection_context(groupby: GroupBy) -> Iterator[GroupBy]:
]


-class BaseGroupBy(PandasObject, SelectionMixin[FrameOrSeries]):
+class BaseGroupBy(PandasObject, SelectionMixin[FrameOrSeries], GroupByIndexingMixin):
_group_selection: IndexLabel | None = None
_apply_allowlist: frozenset[str] = frozenset()
_hidden_attrs = PandasObject._hidden_attrs | {
214 changes: 214 additions & 0 deletions pandas/core/groupby/groupbyindexing.py
@@ -0,0 +1,214 @@
from __future__ import annotations

from pandas.util._decorators import doc
import numpy as np


class GroupByIndexingMixin:
    """
    Mixin for adding .iloc to GroupBy.
    """

    @property
    def iloc(self) -> _ilocGroupByIndexer:
        """
        Integer location-based indexing for selection by position per group.

        Similar to ``.apply(lambda x: x.iloc[i:j, k:l])``, but much faster and returns
        a subset of rows from the original DataFrame with the original index and order
        preserved.

        The output is compatible with head() and tail().
        The output differs from take() and nth(), which do not preserve the index or order.

        Inputs
        ------
        Allowed inputs for the first index are:

        - An integer, e.g. ``5``.
        - A slice object with ints and positive step, e.g. ``1:``, ``4:-3:2``.

        Allowed inputs for the second index are as for DataFrame.iloc, namely:

        - An integer, e.g. ``5``.
        - A list or array of integers, e.g. ``[4, 3, 0]``.
        - A slice object with ints, e.g. ``1:7``.
        - A boolean array.
        - A ``callable`` function with one argument (the calling Series or
          DataFrame) and that returns valid output for indexing (one of the above).

        Returns
        -------
        Series or DataFrame

        Note
        ----
        Neither GroupBy.nth() nor GroupBy.take() takes a slice argument, and
        neither preserves the original DataFrame order and index.
        Both are slow for large integer lists, and take() is very slow for large group counts.

        Use Case
        --------
        Suppose that we have a multi-indexed DataFrame with a large primary index and a secondary
        index sorted in a different order for each primary value.
        To reduce the DataFrame to a middle slice of each secondary, group by the primary and then
        use iloc.
        This preserves the original DataFrame's order and indexing.
        (See tests/groupby/test_groupby_iloc)

        Examples
        --------
        >>> df = pd.DataFrame([["a", 1], ["a", 2], ["a", 3], ["b", 4], ["b", 5]],
        ...                   columns=["A", "B"])
        >>> df.groupby("A").iloc[1:2]
           A  B
        1  a  2
        4  b  5
        >>> df.groupby("A").iloc[:-1, -1:]
           B
        0  1
        1  2
        3  4
        """
        return _ilocGroupByIndexer(self)


@doc(GroupByIndexingMixin.iloc)
class _ilocGroupByIndexer:
    def __init__(self, grouped):
        self.grouped = grouped
        self.reversed = False
        self._cached_ascending_count = None
        self._cached_descending_count = None

    def __getitem__(self, arg):
        self.reversed = False

        if type(arg) == tuple:
            return self._handle_item(arg[0], arg[1])

        else:
            return self._handle_item(arg, None)

    def _handle_item(self, arg0, arg1):
        typeof_arg = type(arg0)

        if typeof_arg == slice:
            start = arg0.start
            stop = arg0.stop
            step = arg0.step

            if step is not None and step < 0:
                raise ValueError(
                    f"GroupBy.iloc row slice step must be positive. Slice was {start}:{stop}:{step}"
                )
                # self.reversed = True
                # start = None if start is None else -start - 1
                # stop = None if stop is None else -stop - 1
                # step = -step

            return self._handle_slice(start, stop, step, arg1)

        elif typeof_arg == int:
            return self._handle_slice(arg0, arg0 + 1, 1, arg1)

        else:
            raise ValueError(
                f"GroupBy.iloc row must be an integer or a slice, not a {typeof_arg}"
            )

    def _handle_slice(self, start, stop, step, arg1):
        mask = None
        if step is None:
            step = 1

        self.grouped._reset_group_selection()

        if start is None:
            if step > 1:
                mask = self._ascending_count % step == 0

        else:
            if start >= 0:
                mask = self._ascending_count >= start

                if step > 1:
                    mask &= (self._ascending_count - start) % step == 0

            else:
                mask = self._descending_count < -start

                if step > 1:
                    #
                    # if start is -ve and -start exceeds the length of a group
                    # then step must count from the
                    # first row of that group rather than the calculated offset
                    #
                    # _ascending_count + _descending_count + 1 gives the length of the
                    # current group, which makes it possible to switch between
                    # the offset_array and the _ascending_count depending on whether
                    # -start exceeds the group size
                    #
                    offset_array = self._descending_count + start + 1
                    limit_array = (
                        self._ascending_count + self._descending_count + (start + 1)
                    ) < 0
                    offset_array = np.where(
                        limit_array, self._ascending_count, offset_array
                    )

                    mask &= offset_array % step == 0

        if stop is not None:
            if stop >= 0:
                if mask is None:
                    mask = self._ascending_count < stop

                else:
                    mask &= self._ascending_count < stop
            else:
                if mask is None:
                    mask = self._descending_count >= -stop

                else:
                    mask &= self._descending_count >= -stop

        if mask is None:
            arg0 = slice(None)

        else:
            arg0 = mask

        if arg1 is None:
            return self._selected_obj.iloc[arg0]

        else:
            return self._selected_obj.iloc[arg0, arg1]

    @property
    def _ascending_count(self):
        if self._cached_ascending_count is None:
            self._cached_ascending_count = self.grouped._cumcount_array()
            if self.reversed:
                self._cached_ascending_count = self._cached_ascending_count[::-1]

        return self._cached_ascending_count

    @property
    def _descending_count(self):
        if self._cached_descending_count is None:
            self._cached_descending_count = self.grouped._cumcount_array(
                ascending=False
            )
            if self.reversed:
                self._cached_descending_count = self._cached_descending_count[::-1]

        return self._cached_descending_count

    @property
    def _selected_obj(self):
        if self.reversed:
            return self.grouped._selected_obj.iloc[::-1]

        else:
            return self.grouped._selected_obj
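
The mask approach in `_handle_slice` can be approximated with public pandas API only. The following is a minimal sketch, not the PR's implementation: `group_slice` is a hypothetical helper name, it handles only a step of 1, and it assumes the frame has a unique index. It uses `GroupBy.cumcount()` in place of the private `_cumcount_array()`.

```python
import pandas as pd


def group_slice(df, by, start=None, stop=None):
    """Hypothetical helper: per-group positional slice with step 1.

    Rows keep their original index and order because the result is a
    boolean-mask selection on the original frame.
    """
    grouped = df.groupby(by)
    asc = grouped.cumcount()                  # 0, 1, 2, ... from the top of each group
    desc = grouped.cumcount(ascending=False)  # ..., 2, 1, 0 counting down to the group end

    mask = pd.Series(True, index=df.index)
    if start is not None:
        # Non-negative start counts from the top, negative start from the bottom.
        mask &= (asc >= start) if start >= 0 else (desc < -start)
    if stop is not None:
        mask &= (asc < stop) if stop >= 0 else (desc >= -stop)
    return df[mask]


df = pd.DataFrame(
    [["a", 1], ["a", 2], ["a", 3], ["b", 4], ["b", 5]], columns=["A", "B"]
)

print(group_slice(df, "A", 1, 2))      # rows with index 1 and 4
print(group_slice(df, "A", stop=-1))   # rows with index 0, 1 and 3
```

The two calls mirror the docstring examples `iloc[1:2]` and `iloc[:-1]`; column selection, which the PR passes through to `DataFrame.iloc`, is left out of this sketch.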
1 change: 1 addition & 0 deletions pandas/tests/groupby/test_allowlist.py
@@ -309,6 +309,7 @@ def test_tab_completion(mframe):
"rank",
"cumprod",
"tail",
"iloc",
"resample",
"cummin",
"fillna",
146 changes: 146 additions & 0 deletions pandas/tests/groupby/test_groupby_iloc.py
@@ -0,0 +1,146 @@
""" Test positional grouped indexing with iloc GH#42864"""

import pandas as pd
import pandas._testing as tm
import random


def test_doc_examples():
    """Test the examples in the documentation"""

    df = pd.DataFrame(
        [["a", 1], ["a", 2], ["a", 3], ["b", 4], ["b", 5]], columns=["A", "B"]
    )

    grouped = df.groupby("A")
    result = grouped.iloc[1:2, :]
    expected = pd.DataFrame([["a", 2], ["b", 5]], columns=["A", "B"], index=[1, 4])

    tm.assert_frame_equal(result, expected)

    result = grouped.iloc[:-1, -1:]
    expected = pd.DataFrame([1, 2, 4], columns=["B"], index=[0, 1, 3])

    tm.assert_frame_equal(result, expected)


def test_multiindex():
    """Test the multiindex mentioned as the use-case in the documentation"""

    def make_df_from_data(data):
        rows = {}
        for date in dates:
            for level in data[date]:
                rows[(date, level[0])] = {"A": level[1], "B": level[2]}

        df = pd.DataFrame.from_dict(rows, orient="index")
        df.index.names = ("Date", "Item")
        return df

    ndates = 1000
    nitems = 40
    dates = pd.date_range("20130101", periods=ndates, freq="D")
    items = [f"item {i}" for i in range(nitems)]

    data = {}
    for date in dates:
        levels = [
            (item, random.randint(0, 10000) / 100, random.randint(0, 10000) / 100) for item in items
        ]
        levels.sort(key=lambda x: x[1])
        data[date] = levels

    df = make_df_from_data(data)
    result = df.groupby("Date").iloc[3:7]

    sliced = {date: data[date][3:7] for date in dates}
    expected = make_df_from_data(sliced)

    tm.assert_frame_equal(result, expected)


def test_against_head_and_tail():
    """Test gives the same results as grouped head and tail"""

    n_groups = 100
    n_rows_per_group = 30

    data = {
        "group": [f"group {g}" for j in range(n_rows_per_group) for g in range(n_groups)],
        "value": [
            random.randint(0, 10000) / 100
            for j in range(n_rows_per_group)
            for g in range(n_groups)
        ]
    }
    df = pd.DataFrame(data)
    grouped = df.groupby("group")

    for i in [1, 5, 29, 30, 31, 1000]:
        result = grouped.iloc[:i, :]
        expected = grouped.head(i)

        tm.assert_frame_equal(result, expected)

        result = grouped.iloc[-i:, :]
        expected = grouped.tail(i)

        tm.assert_frame_equal(result, expected)


def test_against_df_iloc():
    """Test that a single group gives the same results as DataFrame.iloc"""

    n_rows_per_group = 30

    data = {
        "group": ["group 0" for j in range(n_rows_per_group)],
        "value": [random.randint(0, 10000) / 100 for j in range(n_rows_per_group)]
    }
    df = pd.DataFrame(data)
    grouped = df.groupby("group")

    for start in [None, 0, 1, 10, 29, 30, 1000, -1, -10, -29, -30, -1000]:
        for stop in [None, 0, 1, 10, 29, 30, 1000, -1, -10, -29, -30, -1000]:
            for step in [None, 1, 2, 3, 10, 29, 30, 100]:
                result = grouped.iloc[start:stop:step, :]
                expected = df.iloc[start:stop:step, :]

                tm.assert_frame_equal(result, expected)


def test_series():
    """Test grouped Series"""

    ser = pd.Series([1, 2, 3, 4, 5], index=["a", "a", "a", "b", "b"])
    grouped = ser.groupby(level=0)
    result = grouped.iloc[1:2]
    expected = pd.Series([2, 5], index=["a", "b"])

    tm.assert_series_equal(result, expected)


def test_step():
    """Test grouped slice with step"""

    data = [["x", f"x{i}"] for i in range(5)]
    data += [["y", f"y{i}"] for i in range(4)]
    data += [["z", f"z{i}"] for i in range(3)]
    df = pd.DataFrame(data, columns=["A", "B"])

    grouped = df.groupby("A")

    for step in [1, 2, 3, 4, 5]:
        result = grouped.iloc[::step, :]

        data = [["x", f"x{i}"] for i in range(0, 5, step)]
        data += [["y", f"y{i}"] for i in range(0, 4, step)]
        data += [["z", f"z{i}"] for i in range(0, 3, step)]

        index = [0 + i for i in range(0, 5, step)]
        index += [5 + i for i in range(0, 4, step)]
        index += [9 + i for i in range(0, 3, step)]

        expected = pd.DataFrame(data, columns=["A", "B"], index=index)

        tm.assert_frame_equal(result, expected)
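
The tests above use steps only with a `None` start across several groups. The negative-start-with-step branch of `_handle_slice` in groupbyindexing.py can also be checked in isolation for a single group with plain NumPy. This is a sketch under stated assumptions: `negative_start_step_mask` is a hypothetical name, and `np.arange` stands in for the per-group counts that the PR obtains from `_cumcount_array()`.

```python
import numpy as np


def negative_start_step_mask(n, start, step):
    """Hypothetical single-group version of the negative-start branch.

    n is the group length, start < 0 and step >= 1. Returns a boolean mask
    marking the positions that slice(start, None, step) would select.
    """
    asc = np.arange(n)   # position from the top of the group
    desc = asc[::-1]     # position from the bottom of the group

    mask = desc < -start
    # Distance back from the slice's first row; zero at the start row.
    offset = desc + start + 1
    # asc + desc + 1 == n, so this is true when -start exceeds the group length.
    too_long = (asc + desc + (start + 1)) < 0
    # In that case the slice is clamped to the first row, so count from the top.
    offset = np.where(too_long, asc, offset)
    return mask & (offset % step == 0)


# Compare with ordinary Python slicing on a range of the same length.
print(np.flatnonzero(negative_start_step_mask(5, -3, 2)))  # positions 2 and 4
print(list(range(5))[-3::2])                               # [2, 4]
```

The comparison against `list(range(n))[start::step]` is the single-group analogue of what `test_against_df_iloc` does through `DataFrame.iloc`.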