ENH: A new GroupBy method to slice rows preserving index and order #42947

johnzangwill · 2021-08-09T12:01:35Z

New indexable attribute GroupBy.iloc to perform most grouped slices analagous to DataFrame.iloc

Row indexes can be integers or slices with +ve step
Row slices preserve the DataFrame order and grouping and extend the existing GroupBy head() and tail() methods
All column indexes are handled as for DataFrame.iloc

-ve slice step is not currently handled, and would have to reverse the order.
There is no plan to implement Integer lists since these do not preserve the order at all and cannot be easily vectorized.

Note

Neither GroupBy.nth() nor GroupBy.take() take a slice argument and neither of them preserve the original DataFrame order and index. They are both slow for large integer lists and take() is very slow for large group counts.

Example use case

Supose that we have a multi-indexed DataFrame with a large primary index and a secondary sorted
in a different order for each primary index.
To reduce the DataFrame to a middle slice of each secondary, we can groupby the primary index and then use iloc.
This preserves the original DataFrame's order and indexing.
(See tests/groupby/test_groupby_iloc)

GH#42864 Groupby.iloc is added to perform most grouped slices All the slices preserve the initial order and grouping -ve slice step is not currently handled, and would have to reverse the order. There is no plan to implement Integer lists since these do not preserve the order at all and cannot be easily vectorized.

pep8speaks · 2021-08-09T12:01:38Z

Hello @johnzangwill! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-10-13 21:39:55 UTC

jreback

this will need review as its a fairly large change. pls really try to address the usecases, I am not sure this function is necessary (e.g. why can't nth be modified instead).

pandas/core/groupby/groupbyindexing.py

johnzangwill · 2021-08-11T10:33:17Z

I have updated the docstring with more detail and made some corrections. I haven't had much luck building the documentation locally so far, so it is dificult for me to see how it renders... Help would be appreciated!
I have a lot more information regarding outputs and timings of the various alternatives, but I don't feel that the docstring is the right place for that. I am happy to share anything that will help review this proposed addition.

johnzangwill · 2021-08-19T16:00:41Z

I added iloc to groupby,rst so that it appears in the documentation.
I improved the multiindex test to make iloc's value clearer: g.iloc[3:-3] cannot be provided by any of the other
methods , especially in the case of a varying group size:

Also, even though iloc is pure Python, it is much faster than nth() for large slices:

I thought to add the multiindex test to the docstring, but I feel that it is too long. Please explain how I should document this better to get it successfully reviewed.
Thanks!

jreback · 2021-08-19T16:11:31Z

so my main question here is why can't we just fix .nth itself? we could in theory deprecate it too in favor of this. but i'd rather not have 2 methods which do about the same thing.

Docstring changes for Sphinx

johnzangwill · 2021-08-20T09:50:31Z

I agree that the current situation is quite tragic!

For positional indexing:

DataFrame has head, tail, iat, iloc, first, last and take.
GroupBy has head, tail, first, last, nth and take.

Confusingly, GroupBy.first and last are completely different from DataFrame.first and last.
GroupBy.nth and take have the same main argument, but handle the list orders differently.

My GroupBy.iloc is a simple extension of head and tail and does something that none of the other methods can, namely take a slice relative to the end of end of each group. (e.g. grouped.iloc[3: -3])
Therefore, I think that iloc is needed. In fact, it is a response to a specific request from a user in quantitative finance.
It is lightweight and efficient and we don't lose anything by adding it.

Alternatively, you could add, say, grouped.nth(slice(3, -3), preserve_index=True), but iloc[3: -3] is the way that DataFrame would do it. GroupBy.nth is already complicated and includes a filtering function (dropna). iloc is just a simple slice.

Resolving first, last, nth and take is a separate question...

jreback · 2021-08-20T12:29:27Z

@johnzangwill my main questions is can we deprecate nth and simply use .iloc as a complete replacement (yes I guess we'd have to remove the dropna= arg as iloc is an indexer and not a function. We would also want to implement .head/.tail as well.
.take doesn't matter as its pseudo private.

jreback

pls look to nth / head / tail for the types of tests we need here. these are big tests, pls make them smaller, more focused. parameteraize and use fixtures.

pandas/core/groupby/groupbyindexing.py

pandas/tests/groupby/test_groupby_iloc.py

johnzangwill · 2021-08-20T15:30:26Z

Tests

I will cut the tests down as you request (but when I get back from vacation).
The multiindex test was written primarily to demonstrate the proposed functionality and performance. Is there another place for such content or shall I just keep it to myself?

Replacing nth

I wanted to avoid implementing iloc[[1, 3, 2]] since changing the order is not easy to make efficient, and my goal with iloc was always to preserve the underlying df order. However, the existing grouped.nth([1, 3, 2]) does does not honor the list order and preserves the original df order (the list is treated as a set), whereas df.iloc[[1, 3, 2]] (and grouped.take([1, 3, 2]) for that matter) returns the rows in the order 1, 3, 2.

Maybe calling it iloc is only making problems. If I called it "rows", say, (i.e. grouped.rows[3: -3]) then there would be no expectation that it was the same as DataFrame.iloc and I would be free to implement lists as does nth.

Assuming that, then I propose adding grouped.rows[[1, 3, 2]] (and grouped.rows[{1, 3, 2}]) which will behave the same way as nth, except for retaining the DataFrame's index structure and being much faster.

jreback

ok looks ready to merge

change naming as suggested above
can you add a whatsnew sub section in 1.4 which shows the before / after for the changes in .head/.tail; also can show the .nth slice changes if you'd like

jreback · 2021-10-03T22:42:41Z

pandas/core/groupby/indexing.py

+    """
+
+    @cache_readonly
+    def _body(self) -> BodyGroupByIndexer:


can we rename to something else, maybe _positional_selector (e.g. just something more descriptive)

Done.
Although we could still have a discussion about the future of this in version 2.
If you are keeping head() and tail() then I would argue that body[] (and maybe body() as well), makes logical sense.
Alternatively, you could finally get rid of head() and tail() and fold them into first(), last() and nth().
I may try to write up some of the inconsistencies and redundancies that I can see in this area, if you think that it would be worthwhile.

jreback · 2021-10-03T22:43:12Z

pandas/core/groupby/indexing.py

+        --------
+            >>> df = pd.DataFrame([["a", 1], ["a", 2], ["a", 3], ["b", 4], ["b", 5]],
+            ...                   columns=["A", "B"])
+            >>> df.groupby("A")._body[1:2]


can you line these up with Examples (we currently don't validate this but should line up)

jreback · 2021-10-03T22:43:48Z

pandas/core/groupby/indexing.py

+
+        return BodyGroupByIndexer(groupby_self)
+
+    def _make_mask(self, arg: PositionalIndexer | tuple) -> np.ndarray:


can you make these a bit more verbose naming, maybe _make_positional_indexer_mask (i know its long but if you aren't looking here its totally non-obvious)

It is long, but I have changed them all, and _apply_mask() as well. Please confirm that the new names are OK.

jreback · 2021-10-03T22:44:03Z

pandas/core/groupby/indexing.py

+
+        return cast(np.ndarray, mask)
+
+    def _handle_int(self, arg: int) -> np.ndarray:


same with all of these these methods

jreback · 2021-10-03T22:44:42Z

pandas/tests/groupby/test_nth.py

@@ -517,11 +517,11 @@ def test_nth_multi_index_as_expected():
 @pytest.mark.parametrize(
    "op, n, expected_rows",
    [
-        ("head", -1, []),
+        ("head", -1, [0]),


these are essentially bug fixes?

Essentially. Although the docstring says "Does not work for negative values of n." and shows the empty frame returned in the example. So you could argue that the old behaviour was by design and the new one is therefore an enhancement...

johnzangwill · 2021-10-04T17:20:52Z

can you add a whatsnew sub section in 1.4 which shows the before / after for the changes in .head/.tail; also can show the .nth slice changes if you'd like

I have added a section to v1.4.0 Enhancements

jreback · 2021-10-15T13:28:56Z

thanks @johnzangwill really nice (and you have been great about rebasing!)

jreback · 2021-10-15T13:29:10Z

Did we have a list of follows? (maybe not any)?

johnzangwill added 5 commits August 9, 2021 13:44

Formatting

d0ebbeb

Formatting

33d7992

Formatting

78e9ced

Formatting

4d098cd

Formatting

f84c365

johnzangwill changed the title ~~ENH: A new GroupBy method to slice rows preserving index and order~~ ENH: A new GroupBy iloc method to slice rows preserving index and order Aug 9, 2021

johnzangwill added 2 commits August 9, 2021 17:09

Add iloc to test_tab_completion

d937757

Add iloc to groupby/base.py

e206912

johnzangwill mentioned this pull request Aug 10, 2021

ENH: A new GroupBy method to slice rows preserving index and order #42864

Closed

johnzangwill added 2 commits August 10, 2021 12:33

Documentation

1788f1b

Cosmetics to make pre-commit happy

f6977fa

jreback requested changes Aug 10, 2021

View reviewed changes

pandas/core/groupby/groupbyindexing.py Outdated Show resolved Hide resolved

pandas/core/groupby/groupbyindexing.py Outdated Show resolved Hide resolved

pandas/core/groupby/groupbyindexing.py Outdated Show resolved Hide resolved

johnzangwill added 2 commits August 11, 2021 11:25

Improve docstring

bca4fdd

Delete a.md

66536b1

Add to doc and improve test

d075c67

johnzangwill requested a review from jreback August 19, 2021 16:05

jreback added Enhancement Groupby labels Aug 19, 2021

johnzangwill added 2 commits August 19, 2021 17:46

Tidy-up for pre-commit

df1a767

Update groupbyindexing.py

f2e9f79

Docstring changes for Sphinx

Split a long line

a9f9848

jreback requested changes Aug 20, 2021

View reviewed changes

johnzangwill mentioned this pull request Sep 29, 2021

Move group_selection_context from module to class #43796

Merged

johnzangwill added 2 commits September 29, 2021 14:39

Merge branch 'master' into groupby_iloc

ee8a86b

Merge branch 'pandas-dev:master' into groupby_iloc

f4b24b0

jreback requested changes Oct 3, 2021

View reviewed changes

johnzangwill added 4 commits October 4, 2021 13:54

Rename test_body.py to test_indexing.py

82360f5

Merge branch 'pandas-dev:master' into groupby_iloc

ba836dc

@jreback suggested renames

8abcad7

Update whatsnew v1.4.0

86c8e20

johnzangwill added 4 commits October 4, 2021 18:23

Correct typo in doc

ee33df0

Merge branch 'pandas-dev:master' into groupby_iloc

4a1aac9

Merge branch 'pandas-dev:master' into groupby_iloc

d9671a6

Merge branch 'pandas-dev:master' into groupby_iloc

a6dbc61

johnzangwill mentioned this pull request Oct 8, 2021

BUG: Fix groupby nth with axis=1 #43926

Merged

johnzangwill added 9 commits October 8, 2021 14:52

Merge branch 'pandas-dev:master' into groupby_iloc

f65093c

Resolve with another branch

534ea54

Merge branch 'pandas-dev:master' into groupby_iloc

511c8fd

NDFrameT cannot be used like that

97c3ac0

Merge branch 'pandas-dev:master' into groupby_iloc

90a4cb8

Merge branch 'pandas-dev:master' into groupby_iloc

b58b235

Merge branch 'pandas-dev:master' into groupby_iloc

f5ed6bf

Merge branch 'pandas-dev:master' into groupby_iloc

21b3637

Merge branch 'master' into groupby_iloc

88613a9

jreback approved these changes Oct 15, 2021

View reviewed changes

jreback merged commit 33a53f1 into pandas-dev:master Oct 15, 2021

johnzangwill deleted the groupby_iloc branch October 18, 2021 19:32

johnzangwill mentioned this pull request Nov 23, 2021

ENH: add indexing syntax to GroupBy.nth() #44589

Closed

ian-r-rose mentioned this pull request Apr 20, 2022

BUG: Runtime warning with groupby/tail when None appears in group column #46814

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: A new GroupBy method to slice rows preserving index and order #42947

ENH: A new GroupBy method to slice rows preserving index and order #42947

johnzangwill commented Aug 9, 2021 •

edited

Loading

pep8speaks commented Aug 9, 2021 •

edited

Loading

jreback left a comment

johnzangwill commented Aug 11, 2021

johnzangwill commented Aug 19, 2021 •

edited

Loading

jreback commented Aug 19, 2021

johnzangwill commented Aug 20, 2021

jreback commented Aug 20, 2021

jreback left a comment

johnzangwill commented Aug 20, 2021 •

edited

Loading

jreback left a comment

jreback Oct 3, 2021

johnzangwill Oct 4, 2021

jreback Oct 3, 2021

johnzangwill Oct 4, 2021

jreback Oct 3, 2021

johnzangwill Oct 4, 2021

jreback Oct 3, 2021

johnzangwill Oct 4, 2021

jreback Oct 3, 2021

johnzangwill Oct 4, 2021

johnzangwill commented Oct 4, 2021

jreback commented Oct 15, 2021

jreback commented Oct 15, 2021


		return BodyGroupByIndexer(groupby_self)

		def _make_mask(self, arg: PositionalIndexer \| tuple) -> np.ndarray:


		return cast(np.ndarray, mask)

		def _handle_int(self, arg: int) -> np.ndarray:

ENH: A new GroupBy method to slice rows preserving index and order #42947

ENH: A new GroupBy method to slice rows preserving index and order #42947

Conversation

johnzangwill commented Aug 9, 2021 • edited Loading

Note

Example use case

pep8speaks commented Aug 9, 2021 • edited Loading

Comment last updated at 2021-10-13 21:39:55 UTC

jreback left a comment

Choose a reason for hiding this comment

johnzangwill commented Aug 11, 2021

johnzangwill commented Aug 19, 2021 • edited Loading

jreback commented Aug 19, 2021

johnzangwill commented Aug 20, 2021

jreback commented Aug 20, 2021

jreback left a comment

Choose a reason for hiding this comment

johnzangwill commented Aug 20, 2021 • edited Loading

Tests

Replacing nth

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

johnzangwill commented Oct 4, 2021

jreback commented Oct 15, 2021

jreback commented Oct 15, 2021

johnzangwill commented Aug 9, 2021 •

edited

Loading

pep8speaks commented Aug 9, 2021 •

edited

Loading

johnzangwill commented Aug 19, 2021 •

edited

Loading

johnzangwill commented Aug 20, 2021 •

edited

Loading