-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: A new GroupBy method to slice rows preserving index and order #42947
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
GH#42864 Groupby.iloc is added to perform most grouped slices All the slices preserve the initial order and grouping -ve slice step is not currently handled, and would have to reverse the order. There is no plan to implement Integer lists since these do not preserve the order at all and cannot be easily vectorized.
Hello @johnzangwill! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-10-13 21:39:55 UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will need review as its a fairly large change. pls really try to address the usecases, I am not sure this function is necessary (e.g. why can't nth be modified instead).
I have updated the docstring with more detail and made some corrections. I haven't had much luck building the documentation locally so far, so it is dificult for me to see how it renders... Help would be appreciated! |
so my main question here is why can't we just fix |
Docstring changes for Sphinx
I agree that the current situation is quite tragic! For positional indexing:
Confusingly, GroupBy.first and last are completely different from DataFrame.first and last. My GroupBy.iloc is a simple extension of head and tail and does something that none of the other methods can, namely take a slice relative to the end of end of each group. (e.g. grouped.iloc[3: -3]) Alternatively, you could add, say, grouped.nth(slice(3, -3), preserve_index=True), but iloc[3: -3] is the way that DataFrame would do it. GroupBy.nth is already complicated and includes a filtering function (dropna). iloc is just a simple slice. Resolving first, last, nth and take is a separate question... |
@johnzangwill my main questions is can we deprecate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls look to nth / head / tail for the types of tests we need here. these are big tests, pls make them smaller, more focused. parameteraize and use fixtures.
TestsI will cut the tests down as you request (but when I get back from vacation). Replacing nthI wanted to avoid implementing iloc[[1, 3, 2]] since changing the order is not easy to make efficient, and my goal with iloc was always to preserve the underlying df order. However, the existing grouped.nth([1, 3, 2]) does does not honor the list order and preserves the original df order (the list is treated as a set), whereas df.iloc[[1, 3, 2]] (and grouped.take([1, 3, 2]) for that matter) returns the rows in the order 1, 3, 2. Maybe calling it iloc is only making problems. If I called it "rows", say, (i.e. grouped.rows[3: -3]) then there would be no expectation that it was the same as DataFrame.iloc and I would be free to implement lists as does nth. Assuming that, then I propose adding grouped.rows[[1, 3, 2]] (and grouped.rows[{1, 3, 2}]) which will behave the same way as nth, except for retaining the DataFrame's index structure and being much faster. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok looks ready to merge
- change naming as suggested above
- can you add a whatsnew sub section in 1.4 which shows the before / after for the changes in .head/.tail; also can show the .nth slice changes if you'd like
pandas/core/groupby/indexing.py
Outdated
""" | ||
|
||
@cache_readonly | ||
def _body(self) -> BodyGroupByIndexer: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we rename to something else, maybe _positional_selector (e.g. just something more descriptive)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Although we could still have a discussion about the future of this in version 2.
If you are keeping head()
and tail()
then I would argue that body[]
(and maybe body()
as well), makes logical sense.
Alternatively, you could finally get rid of head()
and tail()
and fold them into first()
, last()
and nth()
.
I may try to write up some of the inconsistencies and redundancies that I can see in this area, if you think that it would be worthwhile.
pandas/core/groupby/indexing.py
Outdated
-------- | ||
>>> df = pd.DataFrame([["a", 1], ["a", 2], ["a", 3], ["b", 4], ["b", 5]], | ||
... columns=["A", "B"]) | ||
>>> df.groupby("A")._body[1:2] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you line these up with Examples (we currently don't validate this but should line up)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pandas/core/groupby/indexing.py
Outdated
|
||
return BodyGroupByIndexer(groupby_self) | ||
|
||
def _make_mask(self, arg: PositionalIndexer | tuple) -> np.ndarray: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make these a bit more verbose naming, maybe _make_positional_indexer_mask
(i know its long but if you aren't looking here its totally non-obvious)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is long, but I have changed them all, and _apply_mask()
as well. Please confirm that the new names are OK.
pandas/core/groupby/indexing.py
Outdated
|
||
return cast(np.ndarray, mask) | ||
|
||
def _handle_int(self, arg: int) -> np.ndarray: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same with all of these these methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -517,11 +517,11 @@ def test_nth_multi_index_as_expected(): | |||
@pytest.mark.parametrize( | |||
"op, n, expected_rows", | |||
[ | |||
("head", -1, []), | |||
("head", -1, [0]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are essentially bug fixes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Essentially. Although the docstring says "Does not work for negative values of n." and shows the empty frame returned in the example. So you could argue that the old behaviour was by design and the new one is therefore an enhancement...
I have added a section to v1.4.0 Enhancements |
thanks @johnzangwill really nice (and you have been great about rebasing!) |
Did we have a list of follows? (maybe not any)? |
Closes #42864
New indexable attribute GroupBy.iloc to perform most grouped slices analagous to DataFrame.iloc
Row indexes can be integers or slices with +ve step
Row slices preserve the DataFrame order and grouping and extend the existing GroupBy head() and tail() methods
All column indexes are handled as for DataFrame.iloc
-ve slice step is not currently handled, and would have to reverse the order.
There is no plan to implement Integer lists since these do not preserve the order at all and cannot be easily vectorized.
Note
Neither GroupBy.nth() nor GroupBy.take() take a slice argument and neither of them preserve the original DataFrame order and index. They are both slow for large integer lists and take() is very slow for large group counts.
Example use case
Supose that we have a multi-indexed DataFrame with a large primary index and a secondary sorted
in a different order for each primary index.
To reduce the DataFrame to a middle slice of each secondary, we can groupby the primary index and then use iloc.
This preserves the original DataFrame's order and indexing.
(See tests/groupby/test_groupby_iloc)