Skip to content

ENH: A new GroupBy method to slice rows preserving index and order #42947

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 119 commits into from
Oct 15, 2021

Conversation

johnzangwill
Copy link
Contributor

@johnzangwill johnzangwill commented Aug 9, 2021

Closes #42864

New indexable attribute GroupBy.iloc to perform most grouped slices analagous to DataFrame.iloc

Row indexes can be integers or slices with +ve step
Row slices preserve the DataFrame order and grouping and extend the existing GroupBy head() and tail() methods
All column indexes are handled as for DataFrame.iloc

-ve slice step is not currently handled, and would have to reverse the order.
There is no plan to implement Integer lists since these do not preserve the order at all and cannot be easily vectorized.

Note

Neither GroupBy.nth() nor GroupBy.take() take a slice argument and neither of them preserve the original DataFrame order and index. They are both slow for large integer lists and take() is very slow for large group counts.

Example use case

Supose that we have a multi-indexed DataFrame with a large primary index and a secondary sorted
in a different order for each primary index.
To reduce the DataFrame to a middle slice of each secondary, we can groupby the primary index and then use iloc.
This preserves the original DataFrame's order and indexing.
(See tests/groupby/test_groupby_iloc)

GH#42864 Groupby.iloc is added to perform most grouped slices

All the slices preserve the initial order and grouping

-ve slice step is not currently handled, and would have to reverse the order.

There is no plan to implement Integer lists since these do not preserve the order at all and cannot be easily vectorized.
@pep8speaks
Copy link

pep8speaks commented Aug 9, 2021

Hello @johnzangwill! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-10-13 21:39:55 UTC

@johnzangwill johnzangwill changed the title ENH: A new GroupBy method to slice rows preserving index and order ENH: A new GroupBy iloc method to slice rows preserving index and order Aug 9, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will need review as its a fairly large change. pls really try to address the usecases, I am not sure this function is necessary (e.g. why can't nth be modified instead).

@johnzangwill
Copy link
Contributor Author

I have updated the docstring with more detail and made some corrections. I haven't had much luck building the documentation locally so far, so it is dificult for me to see how it renders... Help would be appreciated!
I have a lot more information regarding outputs and timings of the various alternatives, but I don't feel that the docstring is the right place for that. I am happy to share anything that will help review this proposed addition.

@johnzangwill
Copy link
Contributor Author

johnzangwill commented Aug 19, 2021

I added iloc to groupby,rst so that it appears in the documentation.
I improved the multiindex test to make iloc's value clearer: g.iloc[3:-3] cannot be provided by any of the other
methods , especially in the case of a varying group size:
image

Also, even though iloc is pure Python, it is much faster than nth() for large slices:
image

I thought to add the multiindex test to the docstring, but I feel that it is too long. Please explain how I should document this better to get it successfully reviewed.
Thanks!

@johnzangwill johnzangwill requested a review from jreback August 19, 2021 16:05
@jreback
Copy link
Contributor

jreback commented Aug 19, 2021

so my main question here is why can't we just fix .nth itself? we could in theory deprecate it too in favor of this. but i'd rather not have 2 methods which do about the same thing.

@johnzangwill
Copy link
Contributor Author

I agree that the current situation is quite tragic!

For positional indexing:

  • DataFrame has head, tail, iat, iloc, first, last and take.
  • GroupBy has head, tail, first, last, nth and take.

Confusingly, GroupBy.first and last are completely different from DataFrame.first and last.
GroupBy.nth and take have the same main argument, but handle the list orders differently.

My GroupBy.iloc is a simple extension of head and tail and does something that none of the other methods can, namely take a slice relative to the end of end of each group. (e.g. grouped.iloc[3: -3])
Therefore, I think that iloc is needed. In fact, it is a response to a specific request from a user in quantitative finance.
It is lightweight and efficient and we don't lose anything by adding it.

Alternatively, you could add, say, grouped.nth(slice(3, -3), preserve_index=True), but iloc[3: -3] is the way that DataFrame would do it. GroupBy.nth is already complicated and includes a filtering function (dropna). iloc is just a simple slice.

Resolving first, last, nth and take is a separate question...

@jreback
Copy link
Contributor

jreback commented Aug 20, 2021

@johnzangwill my main questions is can we deprecate nth and simply use .iloc as a complete replacement (yes I guess we'd have to remove the dropna= arg as iloc is an indexer and not a function. We would also want to implement .head/.tail as well.
.take doesn't matter as its pseudo private.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls look to nth / head / tail for the types of tests we need here. these are big tests, pls make them smaller, more focused. parameteraize and use fixtures.

@johnzangwill
Copy link
Contributor Author

johnzangwill commented Aug 20, 2021

Tests

I will cut the tests down as you request (but when I get back from vacation).
The multiindex test was written primarily to demonstrate the proposed functionality and performance. Is there another place for such content or shall I just keep it to myself?

Replacing nth

I wanted to avoid implementing iloc[[1, 3, 2]] since changing the order is not easy to make efficient, and my goal with iloc was always to preserve the underlying df order. However, the existing grouped.nth([1, 3, 2]) does does not honor the list order and preserves the original df order (the list is treated as a set), whereas df.iloc[[1, 3, 2]] (and grouped.take([1, 3, 2]) for that matter) returns the rows in the order 1, 3, 2.

Maybe calling it iloc is only making problems. If I called it "rows", say, (i.e. grouped.rows[3: -3]) then there would be no expectation that it was the same as DataFrame.iloc and I would be free to implement lists as does nth.

Assuming that, then I propose adding grouped.rows[[1, 3, 2]] (and grouped.rows[{1, 3, 2}]) which will behave the same way as nth, except for retaining the DataFrame's index structure and being much faster.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok looks ready to merge

  • change naming as suggested above
  • can you add a whatsnew sub section in 1.4 which shows the before / after for the changes in .head/.tail; also can show the .nth slice changes if you'd like

"""

@cache_readonly
def _body(self) -> BodyGroupByIndexer:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we rename to something else, maybe _positional_selector (e.g. just something more descriptive)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.
Although we could still have a discussion about the future of this in version 2.
If you are keeping head() and tail() then I would argue that body[] (and maybe body() as well), makes logical sense.
Alternatively, you could finally get rid of head() and tail() and fold them into first(), last() and nth().
I may try to write up some of the inconsistencies and redundancies that I can see in this area, if you think that it would be worthwhile.

--------
>>> df = pd.DataFrame([["a", 1], ["a", 2], ["a", 3], ["b", 4], ["b", 5]],
... columns=["A", "B"])
>>> df.groupby("A")._body[1:2]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you line these up with Examples (we currently don't validate this but should line up)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


return BodyGroupByIndexer(groupby_self)

def _make_mask(self, arg: PositionalIndexer | tuple) -> np.ndarray:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make these a bit more verbose naming, maybe _make_positional_indexer_mask (i know its long but if you aren't looking here its totally non-obvious)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is long, but I have changed them all, and _apply_mask() as well. Please confirm that the new names are OK.


return cast(np.ndarray, mask)

def _handle_int(self, arg: int) -> np.ndarray:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same with all of these these methods

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -517,11 +517,11 @@ def test_nth_multi_index_as_expected():
@pytest.mark.parametrize(
"op, n, expected_rows",
[
("head", -1, []),
("head", -1, [0]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are essentially bug fixes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially. Although the docstring says "Does not work for negative values of n." and shows the empty frame returned in the example. So you could argue that the old behaviour was by design and the new one is therefore an enhancement...

@johnzangwill
Copy link
Contributor Author

  • can you add a whatsnew sub section in 1.4 which shows the before / after for the changes in .head/.tail; also can show the .nth slice changes if you'd like

I have added a section to v1.4.0 Enhancements

@jreback jreback merged commit 33a53f1 into pandas-dev:master Oct 15, 2021
@jreback
Copy link
Contributor

jreback commented Oct 15, 2021

thanks @johnzangwill really nice (and you have been great about rebasing!)

@jreback
Copy link
Contributor

jreback commented Oct 15, 2021

Did we have a list of follows? (maybe not any)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: A new GroupBy method to slice rows preserving index and order
5 participants