ENH: A new GroupBy method to slice rows preserving index and order #42947

Merged on Oct 15, 2021 (119 commits)
Commits
119 commits
72fd66d
ENH: A new GroupBy method to slice rows preserving index and order
johnzangwill Aug 9, 2021
d0ebbeb
Formatting
johnzangwill Aug 9, 2021
33d7992
Formatting
johnzangwill Aug 9, 2021
78e9ced
Formatting
johnzangwill Aug 9, 2021
4d098cd
Formatting
johnzangwill Aug 9, 2021
f84c365
Formatting
johnzangwill Aug 9, 2021
d937757
Add iloc to test_tab_completion
johnzangwill Aug 9, 2021
e206912
Add iloc to groupby/base.py
johnzangwill Aug 9, 2021
1788f1b
Documentation
johnzangwill Aug 10, 2021
f6977fa
Cosmetics to make pre-commit happy
johnzangwill Aug 10, 2021
bca4fdd
Improve docstring
johnzangwill Aug 11, 2021
66536b1
Delete a.md
johnzangwill Aug 11, 2021
d075c67
Add to doc and improve test
johnzangwill Aug 19, 2021
df1a767
Tidy-up for pre-commit
johnzangwill Aug 19, 2021
f2e9f79
Update groupbyindexing.py
johnzangwill Aug 19, 2021
a9f9848
Split a long line
johnzangwill Aug 20, 2021
e42c86d
GroupBy.rows implementation
johnzangwill Sep 2, 2021
bab88c9
Add rows to rst file
johnzangwill Sep 2, 2021
a74bd33
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 2, 2021
c77de1d
Change iloc to rows in test_allowlist.py
johnzangwill Sep 2, 2021
0d750bb
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 2, 2021
e952c25
Add to base.py
johnzangwill Sep 2, 2021
2a6aafc
Tidy some whitespace for pep8speaks
johnzangwill Sep 2, 2021
b7f8bfe
Tidied mask code
johnzangwill Sep 4, 2021
86e0c2e
test_rows.py formatting
johnzangwill Sep 4, 2021
6f75502
Correct docstring bullet format
johnzangwill Sep 4, 2021
8de5ff2
Update test_rows.py
johnzangwill Sep 5, 2021
f51fa88
Remove blank line at end of docstring
johnzangwill Sep 6, 2021
3063f3a
Small change to force rebuild
johnzangwill Sep 6, 2021
4228251
Make rows 100% compatible with nth
johnzangwill Sep 8, 2021
41b1c73
Temporarily reroute nth list and slice to rows
johnzangwill Sep 8, 2021
ce36210
Rows for all non-dropna calls + types and tests
johnzangwill Sep 9, 2021
70dcdb5
Merge branch 'master' into groupby_iloc
johnzangwill Sep 9, 2021
c024e41
Changes for flake8
johnzangwill Sep 9, 2021
8abcac3
just one more comma...
johnzangwill Sep 9, 2021
add5727
Add type hints
johnzangwill Sep 10, 2021
bcd1dd9
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 10, 2021
25459f7
Delete my build.cmd. Accidental commit
johnzangwill Sep 12, 2021
fa6b86c
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 12, 2021
fefbacf
jreback 12 Sep requested changes
johnzangwill Sep 13, 2021
fa9f7e3
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 13, 2021
b589420
remove white-space
johnzangwill Sep 13, 2021
89deee3
Get rid of np.int test
johnzangwill Sep 13, 2021
e28cdfb
Revert "Get rid of np.int test"
johnzangwill Sep 13, 2021
424ab14
Try again...
johnzangwill Sep 13, 2021
258530d
More jreback requested changes
johnzangwill Sep 13, 2021
d49e48f
More tweaks
johnzangwill Sep 13, 2021
1dd6258
Whitespace
johnzangwill Sep 13, 2021
f84f5c0
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 13, 2021
c068162
Remove blank lines in conditionals
johnzangwill Sep 14, 2021
536298e
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 14, 2021
0e73278
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 14, 2021
4cfde7b
Mainly variable changes and some formatting
johnzangwill Sep 14, 2021
6343c9f
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 14, 2021
33a2225
Make group_selection_context a private GroupBy class method
johnzangwill Sep 14, 2021
6ca80c2
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 14, 2021
e94d4a8
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 14, 2021
0d91dca
Add conditional typing for groupby import
johnzangwill Sep 14, 2021
acc3993
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 14, 2021
df52694
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 14, 2021
f42ae41
Delete Example section
johnzangwill Sep 15, 2021
898fad4
Changes for @rhshadrach.
johnzangwill Sep 17, 2021
ffaaf25
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 17, 2021
02ec03c
Remove more docstrings from tests
johnzangwill Sep 17, 2021
0691f99
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 17, 2021
7cad2c0
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 17, 2021
44120e1
Don't need to check for None anymore
johnzangwill Sep 17, 2021
88b8ac5
Speed up by checking dropna
johnzangwill Sep 18, 2021
0ee53cd
Implement head, tail. column axis, change _rows to _middle and remove…
johnzangwill Sep 20, 2021
945a482
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 20, 2021
9412e3e
Change _middle to _body
johnzangwill Sep 21, 2021
138b791
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 21, 2021
6b29c82
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 21, 2021
ea45bc6
Change class name to match
johnzangwill Sep 21, 2021
179912e
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 21, 2021
19edf00
Add negative values to test_body.py/test_against_head_and_tail()
johnzangwill Sep 22, 2021
94f6e99
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 22, 2021
ae21059
Add _body docstring
johnzangwill Sep 25, 2021
5b8142b
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 25, 2021
c8e0950
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 25, 2021
6ce90c4
Make nth a link
johnzangwill Sep 25, 2021
4f6cbe1
Improve doc
johnzangwill Sep 26, 2021
7d92c79
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 26, 2021
19b21bb
Simplify examples
johnzangwill Sep 26, 2021
1a055e4
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 26, 2021
ca164cf
Fix FrameOrSeries typing problem
johnzangwill Sep 26, 2021
337b15c
Fix more new typing problems
johnzangwill Sep 26, 2021
69d8956
More typing problems
johnzangwill Sep 26, 2021
cecc674
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 26, 2021
98a9460
More typing woes
johnzangwill Sep 26, 2021
95eb548
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 26, 2021
4c8644b
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 27, 2021
d0e9aa0
Create test_body.py
johnzangwill Sep 28, 2021
10cca16
Merge branch 'master' into groupby_iloc
johnzangwill Sep 28, 2021
9ccebf1
Merge branch 'master' into groupby_iloc
johnzangwill Sep 28, 2021
a3db969
Resolve conflicts
johnzangwill Sep 28, 2021
4c4ba92
Avoid groupby name clash
johnzangwill Sep 28, 2021
13ff29f
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 28, 2021
acf67b1
Delete duplicated test_body.py
johnzangwill Sep 29, 2021
a3db6d1
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 29, 2021
ee8a86b
Merge branch 'master' into groupby_iloc
johnzangwill Sep 29, 2021
f4b24b0
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 30, 2021
82360f5
Rename test_body.py to test_indexing.py
johnzangwill Oct 4, 2021
ba836dc
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 4, 2021
8abcad7
@jreback suggested renames
johnzangwill Oct 4, 2021
86c8e20
Update whatsnew v1.4.0
johnzangwill Oct 4, 2021
ee33df0
Correct typo in doc
johnzangwill Oct 4, 2021
4a1aac9
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 4, 2021
d9671a6
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 5, 2021
a6dbc61
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 6, 2021
f65093c
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 8, 2021
534ea54
Resolve with another branch
johnzangwill Oct 9, 2021
511c8fd
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 9, 2021
97c3ac0
NDFrameT cannot be used like that
johnzangwill Oct 9, 2021
90a4cb8
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 10, 2021
b58b235
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 12, 2021
f5ed6bf
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 12, 2021
21b3637
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 13, 2021
88613a9
Merge branch 'master' into groupby_iloc
johnzangwill Oct 13, 2021
24 changes: 24 additions & 0 deletions doc/source/whatsnew/v1.4.0.rst
@@ -110,6 +110,30 @@ Example:

s.rolling(3).rank(method="max")

.. _whatsnew_140.enhancements.groupby_indexing:

Groupby positional indexing
^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is now possible to specify positional ranges relative to the ends of each group.

Negative arguments for :meth:`.GroupBy.head` and :meth:`.GroupBy.tail` now work correctly and result in ranges relative to the end and start of each group, respectively.
Previously, negative arguments returned empty frames.

.. ipython:: python

df = pd.DataFrame([["g", "g0"], ["g", "g1"], ["g", "g2"], ["g", "g3"],
["h", "h0"], ["h", "h1"]], columns=["A", "B"])
df.groupby("A").head(-1)


:meth:`.GroupBy.nth` now accepts a slice or list of integers and slices.

.. ipython:: python

df.groupby("A").nth(slice(1, -1))
df.groupby("A").nth([slice(None, 1), slice(-1, None)])
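
The selection rule described above can be modelled without pandas. The following is a minimal pure-Python sketch, not the pandas implementation: `select_positions` is a hypothetical helper that applies ordinary Python indexing inside each group and then restores the original row order. It only models which rows are picked, not the index of the object pandas returns.

```python
from collections import defaultdict


def select_positions(keys, indexers):
    """Return original row numbers whose in-group position matches any indexer.

    `keys` holds the group label of each row. `indexers` is a list of ints
    and slices, read with ordinary Python indexing rules inside each group.
    (Hypothetical helper for illustration only.)
    """
    # Collect the original row numbers of every group, in row order.
    groups = defaultdict(list)
    for row, key in enumerate(keys):
        groups[key].append(row)

    selected = set()
    for rows in groups.values():
        for indexer in indexers:
            if isinstance(indexer, slice):
                selected.update(rows[indexer])
            elif -len(rows) <= indexer < len(rows):
                # An int that falls outside the group is skipped, not an error.
                selected.add(rows[indexer])

    # Sorting the row numbers restores the original row order.
    return sorted(selected)


keys = ["g", "g", "g", "g", "h", "h"]
values = ["g0", "g1", "g2", "g3", "h0", "h1"]

head_minus_1 = select_positions(keys, [slice(None, -1)])
middle = select_positions(keys, [slice(1, -1)])
ends = select_positions(keys, [slice(None, 1), slice(-1, None)])

print([values[i] for i in head_minus_1])  # ['g0', 'g1', 'g2', 'h0']
print([values[i] for i in middle])        # ['g1', 'g2']
print([values[i] for i in ends])          # ['g0', 'g3', 'h0', 'h1']
```

In this model `head(-1)` corresponds to `slice(None, -1)`, so the last row of each group is dropped, and a list of slices selects the union of the rows each slice matches.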

.. _whatsnew_140.enhancements.other:

Other enhancements
87 changes: 49 additions & 38 deletions pandas/core/groupby/groupby.py
@@ -46,6 +46,7 @@ class providing the base-class of operations.
ArrayLike,
IndexLabel,
NDFrameT,
PositionalIndexer,
RandomState,
Scalar,
T,
@@ -65,6 +66,7 @@ class providing the base-class of operations.
is_bool_dtype,
is_datetime64_dtype,
is_float_dtype,
is_integer,
is_integer_dtype,
is_numeric_dtype,
is_object_dtype,
@@ -97,6 +99,7 @@ class providing the base-class of operations.
numba_,
ops,
)
from pandas.core.groupby.indexing import GroupByIndexingMixin
from pandas.core.indexes.api import (
CategoricalIndex,
Index,
@@ -555,7 +558,7 @@ def f(self):
]


class BaseGroupBy(PandasObject, SelectionMixin[NDFrameT]):
class BaseGroupBy(PandasObject, SelectionMixin[NDFrameT], GroupByIndexingMixin):
_group_selection: IndexLabel | None = None
_apply_allowlist: frozenset[str] = frozenset()
_hidden_attrs = PandasObject._hidden_attrs | {
@@ -2445,23 +2448,28 @@ def backfill(self, limit=None):
@Substitution(name="groupby")
@Substitution(see_also=_common_see_also)
def nth(
self, n: int | list[int], dropna: Literal["any", "all", None] = None
self,
n: PositionalIndexer | tuple,
dropna: Literal["any", "all", None] = None,
) -> NDFrameT:
"""
Take the nth row from each group if n is an int, or a subset of rows
if n is a list of ints.
Take the nth row from each group if n is an int, otherwise a subset of rows.

If dropna, will take the nth non-null row, dropna is either
'all' or 'any'; this is equivalent to calling dropna(how=dropna)
before the groupby.

Parameters
----------
n : int or list of ints
A single nth value for the row or a list of nth values.
n : int, slice or list of ints and slices
A single nth value for the row or a list of nth values or slices.

.. versionchanged:: 1.4.0
Added slice and lists containing slices.

dropna : {'any', 'all', None}, default None
Apply the specified dropna operation before counting which row is
the nth row.
the nth row. Only supported if n is an int.

Returns
-------
@@ -2496,6 +2504,12 @@ def nth(
1 2.0
2 3.0
2 5.0
>>> g.nth(slice(None, -1))
B
A
1 NaN
1 2.0
2 3.0

Specifying `dropna` allows counting while ignoring ``NaN`` values

@@ -2520,33 +2534,16 @@ def nth(
1 1 2.0
4 2 5.0
"""
valid_containers = (set, list, tuple)
if not isinstance(n, (valid_containers, int)):
raise TypeError("n needs to be an int or a list/set/tuple of ints")

if not dropna:

if isinstance(n, int):
nth_values = [n]
elif isinstance(n, valid_containers):
nth_values = list(set(n))

nth_array = np.array(nth_values, dtype=np.intp)
with self._group_selection_context():

mask_left = np.in1d(self._cumcount_array(), nth_array)
mask_right = np.in1d(
self._cumcount_array(ascending=False) + 1, -nth_array
)
mask = mask_left | mask_right
mask = self._make_mask_from_positional_indexer(n)

ids, _, _ = self.grouper.group_info

# Drop NA values in grouping
mask = mask & (ids != -1)

out = self._mask_selected_obj(mask)

if not self.as_index:
return out

@@ -2563,19 +2560,20 @@ def nth(
return out.sort_index(axis=self.axis) if self.sort else out

# dropna is truthy
if isinstance(n, valid_containers):
raise ValueError("dropna option with a list of nth values is not supported")
if not is_integer(n):
raise ValueError("dropna option only supported for an integer argument")

if dropna not in ["any", "all"]:
# Note: when agg-ing picker doesn't raise this, just returns NaN
raise ValueError(
"For a DataFrame groupby, dropna must be "
"For a DataFrame or Series groupby.nth, dropna must be "
"either None, 'any' or 'all', "
f"(was passed {dropna})."
)

# old behaviour, but with all and any support for DataFrames.
# modified in GH 7559 to have better perf
n = cast(int, n)
max_len = n if n >= 0 else -1 - n
dropped = self.obj.dropna(how=dropna, axis=self.axis)

@@ -3301,11 +3299,16 @@ def head(self, n=5):
from the original DataFrame with original index and order preserved
(``as_index`` flag is ignored).

Does not work for negative values of `n`.
Parameters
----------
n : int
If positive: number of entries to include from start of each group.
If negative: number of entries to exclude from end of each group.

Returns
-------
Series or DataFrame
Subset of original Series or DataFrame as determined by n.
%(see_also)s
Examples
--------
@@ -3317,12 +3320,11 @@ def head(self, n=5):
0 1 2
2 5 6
>>> df.groupby('A').head(-1)
Empty DataFrame
Columns: [A, B]
Index: []
A B
0 1 2
"""
self._reset_group_selection()
mask = self._cumcount_array() < n
mask = self._make_mask_from_positional_indexer(slice(None, n))
return self._mask_selected_obj(mask)
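
The removed line `mask = self._cumcount_array() < n` accounts for the old behaviour: cumulative counts are never negative, so a negative `n` matched no row. The replacement helper `_make_mask_from_positional_indexer` is presumably supplied by `GroupByIndexingMixin`, whose source is not among the hunks shown here. The following pure-Python sketch is therefore an assumption about how such a mask could be built from ascending and descending cumulative counts, not the code in the PR. `cumcounts` and `mask_from_slice` are hypothetical names, and only a step of 1 is handled.

```python
def cumcounts(keys):
    """Return (from_start, from_end): 0-based position of each row in its group."""
    totals = {}
    from_start = []
    for key in keys:
        from_start.append(totals.get(key, 0))
        totals[key] = totals.get(key, 0) + 1
    from_end = [totals[key] - 1 - pos for key, pos in zip(keys, from_start)]
    return from_start, from_end


def mask_from_slice(keys, start=None, stop=None):
    """Boolean mask for slice(start, stop) applied inside every group.

    Hypothetical sketch: assumes a step of 1 and plain Python lists.
    """
    from_start, from_end = cumcounts(keys)
    mask = []
    for asc, desc in zip(from_start, from_end):
        keep = True
        if start is not None:
            # A non-negative start counts from the front, a negative one from the back.
            keep = keep and ((asc >= start) if start >= 0 else (desc < -start))
        if stop is not None:
            keep = keep and ((asc < stop) if stop >= 0 else (desc >= -stop))
        mask.append(keep)
    return mask


keys = ["g", "g", "g", "g", "h", "h"]

print(mask_from_slice(keys, None, -1))  # [True, True, True, False, True, False]
print(mask_from_slice(keys, 1, -1))     # [False, True, True, False, False, False]
print(mask_from_slice(keys, -1, None))  # [False, False, False, True, False, True]
```

The first call corresponds to `head(-1)`: the descending count lets the mask exclude the last row of each group without knowing the group sizes in advance.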

@final
@@ -3336,11 +3338,16 @@ def tail(self, n=5):
from the original DataFrame with original index and order preserved
(``as_index`` flag is ignored).

Does not work for negative values of `n`.
Parameters
----------
n : int
If positive: number of entries to include from end of each group.
If negative: number of entries to exclude from start of each group.

Returns
-------
Series or DataFrame
Subset of original Series or DataFrame as determined by n.
%(see_also)s
Examples
--------
@@ -3352,12 +3359,16 @@ def tail(self, n=5):
1 a 2
3 b 2
>>> df.groupby('A').tail(-1)
Empty DataFrame
Columns: [A, B]
Index: []
A B
1 a 2
3 b 2
"""
self._reset_group_selection()
mask = self._cumcount_array(ascending=False) < n
if n:
mask = self._make_mask_from_positional_indexer(slice(-n, None))
else:
mask = self._make_mask_from_positional_indexer([])

return self._mask_selected_obj(mask)
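
The `if n:` branch in `tail` is needed because `-0 == 0`: `slice(-0, None)` selects every row, whereas `tail(0)` should select none. A minimal pure-Python sketch of that pitfall follows. `tail_indexer` and `take` are hypothetical helpers that mirror the branch in the diff, not pandas functions.

```python
def tail_indexer(n):
    """Return the per-group indexer for tail(n), following the branch in the diff."""
    # -0 == 0, so slice(-0, None) would mean "everything", not "nothing".
    return slice(-n, None) if n else []


def take(rows, indexer):
    """Apply a slice or a list of integer positions to one group's rows."""
    if isinstance(indexer, slice):
        return rows[indexer]
    return [rows[i] for i in indexer]


group = ["g0", "g1", "g2", "g3"]

print(group[slice(-0, None)])         # ['g0', 'g1', 'g2', 'g3'], the pitfall
print(take(group, tail_indexer(0)))   # []
print(take(group, tail_indexer(2)))   # ['g2', 'g3']
print(take(group, tail_indexer(-1)))  # ['g1', 'g2', 'g3']
```

`head` does not need the same guard, since `slice(None, 0)` is already empty.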

@final