ENH: A new GroupBy method to slice rows preserving index and order #42947

Merged
merged 119 commits into from
Oct 15, 2021
Changes from 46 commits
72fd66d
ENH: A new GroupBy method to slice rows preserving index and order
johnzangwill Aug 9, 2021
d0ebbeb
Formatting
johnzangwill Aug 9, 2021
33d7992
Formatting
johnzangwill Aug 9, 2021
78e9ced
Formatting
johnzangwill Aug 9, 2021
4d098cd
Formatting
johnzangwill Aug 9, 2021
f84c365
Formatting
johnzangwill Aug 9, 2021
d937757
Add iloc to test_tab_completion
johnzangwill Aug 9, 2021
e206912
Add iloc to groupby/base.py
johnzangwill Aug 9, 2021
1788f1b
Documentation
johnzangwill Aug 10, 2021
f6977fa
Cosmetics to make pre-commit happy
johnzangwill Aug 10, 2021
bca4fdd
Improve docstring
johnzangwill Aug 11, 2021
66536b1
Delete a.md
johnzangwill Aug 11, 2021
d075c67
Add to doc and improve test
johnzangwill Aug 19, 2021
df1a767
Tidy-up for pre-commit
johnzangwill Aug 19, 2021
f2e9f79
Update groupbyindexing.py
johnzangwill Aug 19, 2021
a9f9848
Split a long line
johnzangwill Aug 20, 2021
e42c86d
GroupBy.rows implementation
johnzangwill Sep 2, 2021
bab88c9
Add rows to rst file
johnzangwill Sep 2, 2021
a74bd33
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 2, 2021
c77de1d
Change iloc to rows in test_allowlist.py
johnzangwill Sep 2, 2021
0d750bb
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 2, 2021
e952c25
Add to base.py
johnzangwill Sep 2, 2021
2a6aafc
Tidy some whitespace for pep8speaks
johnzangwill Sep 2, 2021
b7f8bfe
Tidied mask code
johnzangwill Sep 4, 2021
86e0c2e
test_rows.py formatting
johnzangwill Sep 4, 2021
6f75502
Correct docstring bullet format
johnzangwill Sep 4, 2021
8de5ff2
Update test_rows.py
johnzangwill Sep 5, 2021
f51fa88
Remove blank line at end of docstring
johnzangwill Sep 6, 2021
3063f3a
Small change to force rebuild
johnzangwill Sep 6, 2021
4228251
Make rows 100% compatible with nth
johnzangwill Sep 8, 2021
41b1c73
Temporarily reroute nth list and slice to rows
johnzangwill Sep 8, 2021
ce36210
Rows for all non-dropna calls + types and tests
johnzangwill Sep 9, 2021
70dcdb5
Merge branch 'master' into groupby_iloc
johnzangwill Sep 9, 2021
c024e41
Changes for flake8
johnzangwill Sep 9, 2021
8abcac3
just one more comma...
johnzangwill Sep 9, 2021
add5727
Add type hints
johnzangwill Sep 10, 2021
bcd1dd9
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 10, 2021
25459f7
Delete my build.cmd. Accidental commit
johnzangwill Sep 12, 2021
fa6b86c
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 12, 2021
fefbacf
jreback 12 Sep requested changes
johnzangwill Sep 13, 2021
fa9f7e3
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 13, 2021
b589420
remove white-space
johnzangwill Sep 13, 2021
89deee3
Get rid of np.int test
johnzangwill Sep 13, 2021
e28cdfb
Revert "Get rid of np.int test"
johnzangwill Sep 13, 2021
424ab14
Try again...
johnzangwill Sep 13, 2021
258530d
More jreback requested changes
johnzangwill Sep 13, 2021
d49e48f
More tweaks
johnzangwill Sep 13, 2021
1dd6258
Whitespace
johnzangwill Sep 13, 2021
f84f5c0
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 13, 2021
c068162
Remove blank lines in conditionals
johnzangwill Sep 14, 2021
536298e
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 14, 2021
0e73278
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 14, 2021
4cfde7b
Mainly variable changes and some formatting
johnzangwill Sep 14, 2021
6343c9f
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 14, 2021
33a2225
Make group_selection_context a private GroupBy class method
johnzangwill Sep 14, 2021
6ca80c2
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 14, 2021
e94d4a8
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 14, 2021
0d91dca
Add conditional typing for groupby import
johnzangwill Sep 14, 2021
acc3993
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 14, 2021
df52694
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 14, 2021
f42ae41
Delete Example section
johnzangwill Sep 15, 2021
898fad4
Changes for @rhshadrach.
johnzangwill Sep 17, 2021
ffaaf25
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 17, 2021
02ec03c
Remove more docstrings from tests
johnzangwill Sep 17, 2021
0691f99
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 17, 2021
7cad2c0
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 17, 2021
44120e1
Don't need to check for None anymore
johnzangwill Sep 17, 2021
88b8ac5
Speed up by checking dropna
johnzangwill Sep 18, 2021
0ee53cd
Implement head, tail. column axis, change _rows to _middle and remove…
johnzangwill Sep 20, 2021
945a482
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 20, 2021
9412e3e
Change _middle to _body
johnzangwill Sep 21, 2021
138b791
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 21, 2021
6b29c82
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 21, 2021
ea45bc6
Change class name to match
johnzangwill Sep 21, 2021
179912e
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 21, 2021
19edf00
Add negative values to test_body.py/test_against_head_and_tail()
johnzangwill Sep 22, 2021
94f6e99
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 22, 2021
ae21059
Add _body docstring
johnzangwill Sep 25, 2021
5b8142b
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 25, 2021
c8e0950
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 25, 2021
6ce90c4
Make nth a link
johnzangwill Sep 25, 2021
4f6cbe1
Improve doc
johnzangwill Sep 26, 2021
7d92c79
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 26, 2021
19b21bb
Simplify examples
johnzangwill Sep 26, 2021
1a055e4
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 26, 2021
ca164cf
Fix FrameOrSeries typing problem
johnzangwill Sep 26, 2021
337b15c
Fix more new typing problems
johnzangwill Sep 26, 2021
69d8956
More typing problems
johnzangwill Sep 26, 2021
cecc674
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 26, 2021
98a9460
More typing woes
johnzangwill Sep 26, 2021
95eb548
Merge branch 'groupby_iloc' of https://github.com/johnzangwill/pandas…
johnzangwill Sep 26, 2021
4c8644b
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 27, 2021
d0e9aa0
Create test_body.py
johnzangwill Sep 28, 2021
10cca16
Merge branch 'master' into groupby_iloc
johnzangwill Sep 28, 2021
9ccebf1
Merge branch 'master' into groupby_iloc
johnzangwill Sep 28, 2021
a3db969
Resolve conflicts
johnzangwill Sep 28, 2021
4c4ba92
Avoid groupby name clash
johnzangwill Sep 28, 2021
13ff29f
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 28, 2021
acf67b1
Delete duplicated test_body.py
johnzangwill Sep 29, 2021
a3db6d1
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 29, 2021
ee8a86b
Merge branch 'master' into groupby_iloc
johnzangwill Sep 29, 2021
f4b24b0
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Sep 30, 2021
82360f5
Rename test_body.py to test_indexing.py
johnzangwill Oct 4, 2021
ba836dc
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 4, 2021
8abcad7
@jreback suggested renames
johnzangwill Oct 4, 2021
86c8e20
Update whatsnew v1.4.0
johnzangwill Oct 4, 2021
ee33df0
Correct typo in doc
johnzangwill Oct 4, 2021
4a1aac9
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 4, 2021
d9671a6
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 5, 2021
a6dbc61
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 6, 2021
f65093c
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 8, 2021
534ea54
Resolve with another branch
johnzangwill Oct 9, 2021
511c8fd
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 9, 2021
97c3ac0
NDFrameT cannot be used like that
johnzangwill Oct 9, 2021
90a4cb8
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 10, 2021
b58b235
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 12, 2021
f5ed6bf
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 12, 2021
21b3637
Merge branch 'pandas-dev:master' into groupby_iloc
johnzangwill Oct 13, 2021
88613a9
Merge branch 'master' into groupby_iloc
johnzangwill Oct 13, 2021
73 changes: 28 additions & 45 deletions pandas/core/groupby/groupby.py
@@ -47,6 +47,7 @@ class providing the base-class of operations.
F,
FrameOrSeries,
IndexLabel,
PositionalIndexer,
RandomState,
Scalar,
T,
@@ -66,6 +67,7 @@ class providing the base-class of operations.
is_bool_dtype,
is_datetime64_dtype,
is_float_dtype,
is_integer,
is_integer_dtype,
is_numeric_dtype,
is_object_dtype,
@@ -98,6 +100,7 @@ class providing the base-class of operations.
numba_,
ops,
)
from pandas.core.groupby.indexing import GroupByIndexingMixin
from pandas.core.indexes.api import (
CategoricalIndex,
Index,
@@ -568,7 +571,7 @@ def group_selection_context(groupby: GroupBy) -> Iterator[GroupBy]:
]


class BaseGroupBy(PandasObject, SelectionMixin[FrameOrSeries]):
class BaseGroupBy(PandasObject, SelectionMixin[FrameOrSeries], GroupByIndexingMixin):
_group_selection: IndexLabel | None = None
_apply_allowlist: frozenset[str] = frozenset()
_hidden_attrs = PandasObject._hidden_attrs | {
@@ -2373,20 +2376,25 @@ def backfill(self, limit=None):
@Substitution(name="groupby")
@Substitution(see_also=_common_see_also)
def nth(
self, n: int | list[int], dropna: Literal["any", "all", None] = None
) -> DataFrame:
self,
arg: PositionalIndexer | tuple,
dropna: Literal["any", "all", None] = None,
) -> FrameOrSeries:
"""
Take the nth row from each group if n is an int, or a subset of rows
if n is a list of ints.
Take the nth row from each group if n is an int, otherwise a subset of rows.

If dropna, will take the nth non-null row, dropna is either
'all' or 'any'; this is equivalent to calling dropna(how=dropna)
before the groupby.

Parameters
----------
n : int or list of ints
A single nth value for the row or a list of nth values.
n : int, slice or list of ints and slices
A single nth value for the row or a list of nth values or slices.

.. versionchanged:: 1.4.0
Added slice and lists containing slices

dropna : {'any', 'all', None}, default None
Apply the specified dropna operation before counting which row is
the nth row.
@@ -2424,6 +2432,12 @@ def nth(
1 2.0
2 3.0
2 5.0
>>> g.nth(slice(None, -1))
B
A
1 NaN
1 2.0
2 3.0

Specifying `dropna` allows count ignoring ``NaN``

@@ -2448,58 +2462,27 @@ def nth(
1 1 2.0
4 2 5.0
"""
valid_containers = (set, list, tuple)
if not isinstance(n, (valid_containers, int)):
raise TypeError("n needs to be an int or a list/set/tuple of ints")

if not dropna:
if isinstance(arg, Iterable):
return self._rows[tuple(arg)]

if isinstance(n, int):
nth_values = [n]
elif isinstance(n, valid_containers):
nth_values = list(set(n))

nth_array = np.array(nth_values, dtype=np.intp)
with group_selection_context(self):

mask_left = np.in1d(self._cumcount_array(), nth_array)
mask_right = np.in1d(
self._cumcount_array(ascending=False) + 1, -nth_array
)
mask = mask_left | mask_right

ids, _, _ = self.grouper.group_info

# Drop NA values in grouping
mask = mask & (ids != -1)

out = self._selected_obj[mask]
if not self.as_index:
return out

result_index = self.grouper.result_index
out.index = result_index[ids[mask]]

if not self.observed and isinstance(result_index, CategoricalIndex):
out = out.reindex(result_index)

out = self._reindex_output(out)
return out.sort_index() if self.sort else out
return self._rows[arg]

# dropna is truthy
if isinstance(n, valid_containers):
raise ValueError("dropna option with a list of nth values is not supported")
if not is_integer(arg):
raise ValueError("dropna option only supported for an integer argument")

if dropna not in ["any", "all"]:
# Note: when agg-ing picker doesn't raise this, just returns NaN
raise ValueError(
"For a DataFrame groupby, dropna must be "
"For a DataFrame groupby.nth, dropna must be "
"either None, 'any' or 'all', "
f"(was passed {dropna})."
)

# old behaviour, but with all and any support for DataFrames.
# modified in GH 7559 to have better perf
n = cast(int, arg)
max_len = n if n >= 0 else -1 - n
dropped = self.obj.dropna(how=dropna, axis=self.axis)

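To make the extended `nth` semantics concrete, here is a minimal pure-Python sketch of what selecting by int, list or slice within each group means while keeping the original row order. It is not how pandas does it (the diff builds boolean masks); `nth_rows` is a hypothetical helper, and the group labels are an assumption chosen to be consistent with the `g.nth(slice(None, -1))` output in the docstring, since the input frame is not shown in this excerpt.

```python
from collections import defaultdict


def nth_rows(keys, selector):
    """Return the original row positions kept when `selector` is applied per group.

    Hypothetical helper for illustration only.
    """
    rows_by_group = defaultdict(list)
    for row, key in enumerate(keys):
        rows_by_group[key].append(row)

    kept = set()
    for rows in rows_by_group.values():
        if isinstance(selector, slice):
            # Ordinary Python slicing within the group.
            kept.update(rows[selector])
            continue
        wanted = [selector] if isinstance(selector, int) else selector
        for n in wanted:
            # Out-of-range positions select nothing rather than raising.
            if -len(rows) <= n < len(rows):
                kept.add(rows[n])

    # Sorting by original position preserves the input order of the rows.
    return sorted(kept)


keys = [1, 1, 2, 1, 2]  # assumed group label of each row
print(nth_rows(keys, slice(None, -1)))  # [0, 1, 2]: all but the last row per group
print(nth_rows(keys, [0, -1]))          # [0, 2, 3, 4]: first and last row per group
```

Rows 0, 1 and 2 carry the labels 1, 1 and 2, which matches the index of the three-row result in the docstring example.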
238 changes: 238 additions & 0 deletions pandas/core/groupby/indexing.py
@@ -0,0 +1,238 @@
from __future__ import annotations

from typing import (
    Iterable,
    cast,
)

import numpy as np

from pandas._typing import (
    FrameOrSeries,
    PositionalIndexer,
)
from pandas.util._decorators import (
    cache_readonly,
    doc,
)

from pandas.core.dtypes.common import (
    is_integer,
    is_list_like,
)

from pandas.core.groupby import groupby
Contributor

how are you importing this? (when this file itself is imported by group)

Member

yah this looks like a circular import that is going to be really fragile

Contributor Author

I only did this as a mixin because I was copying the style of df.iloc, which does not attempt to type itself.
In fact, I do need access to GroupBy attributes, so I had to use a cast to get it past the tests.
The simple solution would be to abandon trying to statically type this.

Contributor

ok this is just for typing, so if you put it in a TYPE_CHECKING block it will be ok (e.g. as the groupby.GroupBy should be separated out to avoid deps)

Contributor Author

I will wait until my group_selection_context change is tested and reviewed before sorting out this import and the type checking.

Contributor Author

It tested OK, so I have added the conditional type checking.
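For readers unfamiliar with the pattern discussed in this thread, the following is a minimal sketch of a typing-only import guarded by `TYPE_CHECKING`. It is not the code from this PR: `Decimal` merely stands in for `groupby.GroupBy`, and `HalvingMixin` is a hypothetical class.

```python
from typing import TYPE_CHECKING, cast

if TYPE_CHECKING:
    # Evaluated only by static type checkers, never at runtime, so this
    # import cannot take part in an import cycle.
    from decimal import Decimal  # stand-in for groupby.GroupBy


class HalvingMixin:
    """Hypothetical mixin that relies on behaviour of its host class."""

    def half(self) -> "Decimal":
        # The string form of cast() means the name is not needed at runtime.
        host = cast("Decimal", self)
        return host / 2


print(TYPE_CHECKING)                      # False
print(HalvingMixin.half.__annotations__)  # {'return': 'Decimal'}
```

This only helps for names that appear in annotations and casts. Anything that is actually called at runtime still needs a real import or has to be reachable some other way, for example through the object itself.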

from pandas.core.indexes.api import CategoricalIndex


class GroupByIndexingMixin:
    """
    Mixin for adding .rows to GroupBy.
    """

    @property
    def _rows(self) -> _rowsGroupByIndexer:
        return _rowsGroupByIndexer(cast(groupby.GroupBy, self))


@doc(GroupByIndexingMixin._rows)
class _rowsGroupByIndexer:
Contributor

its ok and preferred to use capitalcase, e.g. RowsGroupByIndexer)

Contributor Author

> its ok and preferred to use capitalcase, e.g. RowsGroupByIndexer

Done

> ok this is just for typing, so if you put it in a TYPE_CHECKING block will be ok (e.g. as the groupby.GroupBy should be separated out to avoid deps)

Unfortunately, I need groupby because I need groupby.group_selection_context(self.groupby_object).

Contributor Author

@jreback @jbrockmendel

> ok this is just for typing, so if you put it in a TYPE_CHECKING block will be ok (e.g. as the groupby.GroupBy should be separated out to avoid deps)

Unfortunately, I need groupby because I need groupby.group_selection_context(self.groupby_object)

OK, I have tested making group_selection_context a method of the GroupBy class, rather than a module scoped method. As far as I can see everything works OK. But it is a bit drastic. My little project seems to be creeping in scope...

Unless you say otherwise, I will make that change. Of course, it affects the entire system, but I suppose we could always back it out...

Contributor Author (johnzangwill, Sep 14, 2021)

> @jreback @jbrockmendel
>
> > ok this is just for typing, so if you put it in a TYPE_CHECKING block will be ok (e.g. as the groupby.GroupBy should be separated out to avoid deps)
>
> Unfortunately, I need groupby because I need groupby.group_selection_context(self.groupby_object)
>
> OK, I have tested making group_selection_context a method of the GroupBy class, rather than a module scoped method. As far as I can see everything works OK. But it is a bit drastic. My little project seems to be creeping in scope...
>
> Unless you say otherwise, I will make that change. Of course, it affects the entire system, but I suppose we could always back it out...

I've committed it. Fingers crossed...
Now it has passed all the tests.
My only reservation is that groupby.group_selection_context() is sort-of public, if undocumented. So this would break anyone's application code that used it. Do we care? You tell me!

Contributor

no we don't care, this is an internal routine (it's not on a groupby, rather a grouper, which is private)
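The change discussed above, turning a module-level context manager into a method of the class it operates on, can be sketched as follows. This is an illustration under assumptions rather than the pandas code: `Grouped`, `Indexer`, `_selection_context` and `selection_active` are all hypothetical names.

```python
from contextlib import contextmanager


class Grouped:
    """Hypothetical stand-in for a GroupBy-like object."""

    def __init__(self):
        self.selection_active = False

    @contextmanager
    def _selection_context(self):
        # Set temporary state on entry and always undo it on exit.
        self.selection_active = True
        try:
            yield self
        finally:
            self.selection_active = False


class Indexer:
    """Hypothetical indexer; it needs only the object, not its defining module."""

    def __init__(self, grouped):
        self.grouped = grouped

    def peek(self):
        with self.grouped._selection_context():
            return self.grouped.selection_active


g = Grouped()
print(Indexer(g).peek())   # True: the state was set inside the block
print(g.selection_active)  # False: the state was reset on exit
```

Because the indexer reaches the context manager through the object it already holds, it no longer has to import the module that defines the class at runtime.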

    def __init__(self, grouped: groupby.GroupBy):
        self.grouped = grouped
Member

can the name here be more obvious? I had to double-check that this didn't correspond to grouped.obj

Contributor Author

I really struggled with that! I have renamed it groupByObject, which hopefully sums it up...

Contributor Author

Now it is renamed groupby_object.


    def __getitem__(self, arg: PositionalIndexer | tuple) -> FrameOrSeries:
        """
        Positional index for selection by integer location per group.

        Used to implement GroupBy._rows which is used to implement GroupBy.nth
        when keyword dropna is None or absent.
        The behaviour extends GroupBy.nth and handles DataFrame.groupby()
        keyword parameters such as as_index and dropna in a compatible way.

        The additions to nth(arg) are:
        - Handles iterables such as range.
        - Handles slice(start, stop, step) with
            start: positive, negative or None.
            stop: positive, negative or None.
            step: positive or None.

        Parameters
        ----------
        arg : PositionalIndexer | tuple
            Allowed values are:
            - Integer
            - Integer values iterable such as list or range
            - Slice
            - Comma separated list of integers and slices

        Returns
        -------
        Series
            The filtered subset of the original groupby Series.
        DataFrame
            The filtered subset of the original groupby DataFrame.

        See Also
        --------
        DataFrame.iloc : Purely integer-location based indexing for selection by
            position.
        GroupBy.head : Return first n rows of each group.
        GroupBy.tail : Return last n rows of each group.
        GroupBy.nth : Take the nth row from each group if n is an int, or a
            subset of rows, if n is a list of ints.

        Examples
        --------
        >>> df = pd.DataFrame([["a", 1], ["a", 2], ["a", 3], ["b", 4], ["b", 5]],
        ...                   columns=["A", "B"])
        >>> df.groupby("A", as_index=False)._rows[1:2]
           A  B
        1  a  2
        4  b  5

        >>> df.groupby("A", as_index=False)._rows[1, -1]
           A  B
        1  a  2
        2  a  3
        4  b  5
        """
        with groupby.group_selection_context(self.grouped):
Contributor

if this is the reason you need groupby then we have to move that somewhere else or the imports are convoluted.

Contributor Author

I also needed it in line 35: return _rowsGroupByIndexer(cast(groupby.GroupBy, self))
The class is a mixin, but I had to cast it to the "mixed-in" class since I needed access to GroupBy attributes and methods.
I copied the code style from DataFrame.iloc, but since that does not use types, it does not have the problem:
class _iLocIndexer(_LocationIndexer):

Member

for the cast bits you can put the import inside a if TYPE_CHECKING: block

if isinstance(arg, tuple):
if all(is_integer(i) for i in arg):
mask = self._handle_list(arg)

Contributor

this is a weird extra line

Contributor Author

Do you want me to get rid of all the blank lines in if-elif-else?

else:
mask = self._handle_tuple(arg)

elif isinstance(arg, slice):
mask = self._handle_slice(arg)

elif is_integer(arg):
mask = self._handle_int(cast(int, arg))

elif is_list_like(arg):
mask = self._handle_list(cast(Iterable[int], arg))

else:
raise TypeError(
f"Invalid index {type(arg)}. "
"Must be integer, list-like, slice or a tuple of "
"integers and slices"
)

ids, _, _ = self.grouped.grouper.group_info

# Drop NA values in grouping
mask &= ids != -1

if mask is None or mask is True:
result = self.grouped._selected_obj[:]
Member

does _selected_obj vs. _obj_with_exclusions make a difference here?

Contributor Author

I don't know. This is the method used by head, tail and nth.

Contributor Author

> does _selected_obj vs. _obj_with_exclusions make a difference here?

There is no _obj_with_exclusions on groupby objects. _obj_with_exclusions and _selected_obj are defined on frames and series, but just _selected_obj for groupby.

Member

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(df.groupby('a')._obj_with_exclusions)

produces

   b
0  3
1  4

Contributor Author

You are right. It comes in from SelectionMixin and seems to lose the grouped column.
But I think that _selected_obj gives the best behaviour, especially to emulate head and tail, which essentially give subsets of the original df.


else:
result = self.grouped._selected_obj[mask]

if self.grouped.as_index:
result_index = self.grouped.grouper.result_index
result.index = result_index[ids[mask]]

if not self.grouped.observed and isinstance(
result_index, CategoricalIndex
):
result = result.reindex(result_index)

result = self.grouped._reindex_output(result)
if self.grouped.sort:
result = result.sort_index()

return result

    def _handle_int(self, arg: int) -> bool | np.ndarray:
        if arg >= 0:
            return self._ascending_count == arg

        else:
            return self._descending_count == (-arg - 1)

    def _handle_list(self, args: Iterable[int]) -> bool | np.ndarray:
        positive = [arg for arg in args if arg >= 0]
        negative = [-arg - 1 for arg in args if arg < 0]

        mask: bool | np.ndarray = False

        if positive:
            mask |= np.isin(self._ascending_count, positive)

        if negative:
            mask |= np.isin(self._descending_count, negative)

        return mask

    def _handle_tuple(self, args: tuple) -> bool | np.ndarray:
        mask: bool | np.ndarray = False

        for arg in args:
            if is_integer(arg):
                mask |= self._handle_int(cast(int, arg))

            elif isinstance(arg, slice):
                mask |= self._handle_slice(arg)

            else:
                raise ValueError(
                    f"Invalid argument {type(arg)}. Should be int or slice."
                )

        return mask

    def _handle_slice(self, arg: slice) -> bool | np.ndarray:
        start = arg.start
        stop = arg.stop
        step = arg.step

        if step is not None and step < 0:
            raise ValueError(f"Invalid step {step}. Must be non-negative")

        mask: bool | np.ndarray = True

        if step is None:
            step = 1

        if start is None:
            if step > 1:
                mask &= self._ascending_count % step == 0

        elif start >= 0:
            mask &= self._ascending_count >= start

            if step > 1:
                mask &= (self._ascending_count - start) % step == 0

        else:
            mask &= self._descending_count < -start

            offset_array = self._descending_count + start + 1
            limit_array = (
                self._ascending_count + self._descending_count + (start + 1)
            ) < 0
            offset_array = np.where(
                limit_array, self._ascending_count, offset_array
            )

            mask &= offset_array % step == 0

        if stop is not None:
            if stop >= 0:
                mask &= self._ascending_count < stop

Contributor

extra line

Contributor Author

Don't understand. All my if-then-elses have blank lines. Do you want them all removed?

Contributor Author

> extra line

I have removed most of the blank lines before elif and else

            else:
                mask &= self._descending_count >= -stop

        return mask

    @cache_readonly
    def _ascending_count(self) -> np.ndarray:
        return self.grouped._cumcount_array()

    @cache_readonly
    def _descending_count(self) -> np.ndarray:
        return self.grouped._cumcount_array(ascending=False)
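The slice handling above rests on two per-row counters: the position of a row within its group counted from the front and the same position counted from the back. The following pure-Python sketch mirrors that arithmetic one row at a time so it can be checked against ordinary Python slicing. `cumcounts`, `in_slice` and `slice_mask` are hypothetical names, lists replace the NumPy arrays used in the diff, and rejecting a step of zero is a simplification of this sketch, not something the diff does.

```python
from collections import Counter


def cumcounts(keys):
    """Position of each row within its group, counted from the front and the back."""
    sizes = Counter(keys)
    seen = Counter()
    ascending, descending = [], []
    for key in keys:
        ascending.append(seen[key])
        descending.append(sizes[key] - 1 - seen[key])
        seen[key] += 1
    return ascending, descending


def in_slice(asc, desc, s):
    """Scalar version of the slice test for a single row."""
    step = 1 if s.step is None else s.step
    if step < 1:
        # Simplification for this sketch: zero is rejected as well.
        raise ValueError(f"Invalid step {step}")

    keep = True
    if s.start is None:
        keep = keep and asc % step == 0
    elif s.start >= 0:
        keep = keep and asc >= s.start and (asc - s.start) % step == 0
    else:
        # A negative start is measured from the end of the group; if it points
        # before the first row, stepping begins at row 0.
        starts_before_group = asc + desc + s.start + 1 < 0
        offset = asc if starts_before_group else desc + s.start + 1
        keep = keep and desc < -s.start and offset % step == 0

    if s.stop is not None:
        keep = keep and (asc < s.stop if s.stop >= 0 else desc >= -s.stop)
    return keep


def slice_mask(keys, s):
    """Boolean mask over all rows, one entry per row."""
    ascending, descending = cumcounts(keys)
    return [in_slice(a, d, s) for a, d in zip(ascending, descending)]


keys = ["a", "a", "a", "b", "b"]
print(cumcounts(keys))                # ([0, 1, 2, 0, 1], [2, 1, 0, 1, 0])
print(slice_mask(keys, slice(1, 2)))  # [False, True, False, False, True]
```

The mask for `slice(1, 2)` keeps rows 1 and 4, the same rows as the `_rows[1:2]` example in the docstring.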