ENH: A new GroupBy method to slice rows preserving index and order #42864

Closed
johnzangwill opened this issue Aug 3, 2021 · 7 comments · Fixed by #42947
Labels: Enhancement, Groupby, Indexing, Needs Discussion

Comments

@johnzangwill
Contributor

johnzangwill commented Aug 3, 2021

Is your feature request related to a problem?

Pandas provides DataFrameGroupBy.head() and tail(), which efficiently slice the beginning and end of each group while preserving the order and index. I would like to be able to do a general row slice with the same properties. DataFrame has head(), tail() and iloc that behave in a compatible way. There is no corresponding DataFrameGroupBy.iloc.
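
As a small illustration of the behaviour described here, the sketch below shows head() and tail() on a grouped frame returning rows in their original order with their original index labels. The frame, its column names and its index values are made up for the example.

```python
import pandas as pd

# Illustrative data: two interleaved groups with a non-default index.
df = pd.DataFrame(
    {"key": ["a", "b", "a", "b", "a", "b"], "val": [10, 20, 30, 40, 50, 60]},
    index=[100, 101, 102, 103, 104, 105],
)
grouped = df.groupby("key")

first_two = grouped.head(2)  # first two rows of each group
last_one = grouped.tail(1)   # last row of each group

# Rows come back in the original row order, carrying the original index labels.
print(first_two.index.tolist())
print(last_one.index.tolist())
```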

Describe the solution you'd like

Provide a new DataFrameGroupBy method to slice rows per group

API breaking implications

None

Describe alternatives you've considered

The following are existing ways to extract, say, the second and third entry of each group, assuming that there are a large number of rows in each group (~10000):

  1. grouped.apply(lambda x: x.iloc[1:3, :]) - Extremely slow. Does not preserve the order or indexing.
  2. grouped.take([1, 2]) - Extremely slow. Does not preserve the order or indexing.
  3. grouped.nth([1, 2]) - Quite fast for a small list. Does not preserve the order or indexing.
  4. grouped.head(3).groupby('...').tail(2) - Quite fast. Does preserve index and ordering.
  5. grouped._selected_obj[mask] where mask is built from grouped.cumcount() - Very fast. Does preserve index and ordering. But it uses a private attribute of DataFrameGroupBy and takes several lines of code.
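
Alternatives 4 and 5 can be sketched as follows. This is only an illustration on made-up data: the column name "key" stands in for the elided groupby argument, and the mask is applied to df directly rather than to the private grouped._selected_obj, which is an assumption that holds here because no column selection was made on the groupby.

```python
import pandas as pd

# Illustrative data: two interleaved groups of four rows each.
df = pd.DataFrame(
    {"key": ["a", "b", "a", "b", "a", "b", "a", "b"], "val": range(8)},
    index=list("stuvwxyz"),
)
grouped = df.groupby("key")

# Alternative 4: chain head and tail.
# Note: this matches the slice 1:3 only when every group has at least 3 rows.
alt4 = grouped.head(3).groupby("key").tail(2)

# Alternative 5 (variant): build a boolean mask from each row's position
# within its group, then filter the original frame with it.
position = grouped.cumcount()
mask = (position >= 1) & (position < 3)
alt5 = df[mask]

print(alt5)
```

Both results keep the original index labels and row order.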

Additional context

There are three options:

  1. Add an option to an existing method to force it to preserve index and order. But take() is very slow and nth() is quite slow. Neither accepts a slice argument, so a range list has to be provided.
  2. Easiest: Add a new method taking a slice as an argument and implementing it as in 5 above.
  3. Most logical and complete: Add a new iloc attribute analogous to DataFrame.iloc
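
To make option 2 concrete, here is a rough sketch of what a slice-taking function built on the cumcount mask might look like. The name slice_rows_per_group and its signature are hypothetical and are not part of pandas or of the linked pull request. It is written as a free function over public API and, to stay short, only supports non-negative bounds with a step of 1.

```python
import pandas as pd


def slice_rows_per_group(df, by, rows):
    """Hypothetical helper: keep the rows whose position within their group
    falls inside the slice `rows`, preserving original index and order."""
    if rows.step not in (None, 1):
        raise ValueError("only a step of 1 is supported in this sketch")
    start = 0 if rows.start is None else rows.start
    stop = rows.stop
    if start < 0 or (stop is not None and stop < 0):
        raise ValueError("negative bounds are not supported in this sketch")

    # Zero-based position of each row within its group.
    position = df.groupby(by).cumcount()
    mask = position >= start
    if stop is not None:
        mask = mask & (position < stop)
    return df[mask]


# Illustrative data: two interleaved groups of three rows each.
df = pd.DataFrame({"key": ["a", "b", "a", "b", "a", "b"], "val": range(6)})
second_and_third = slice_rows_per_group(df, "key", slice(1, 3))
print(second_and_third)
```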
@johnzangwill added the Enhancement and Needs Triage labels on Aug 3, 2021
@jreback
Contributor

jreback commented Aug 3, 2021

.nth already does this

@johnzangwill
Contributor Author

johnzangwill commented Aug 3, 2021

As I pointed out, grouped.nth() does not preserve the order or index of the original df, particularly if it has a multiindex. So it is not compatible with head() or tail(), and it takes a list rather than a slice argument.

Also, the speed of grouped.nth(list_of_ints) grows with the length of the list. It is over 10 times slower than alternative 5 when working with a large slice.

@johnzangwill changed the title from "ENH: A new DataFrameGroupBy method to slice rows preserving index and order" to "ENH: A new GroupBy method to slice rows preserving index and order" on Aug 4, 2021
@johnzangwill
Contributor Author

I appreciate that this has not yet been triaged, but I can propose a solution for GroupBy.iloc that addresses this issue. So I would like to take this.

@jreback
Contributor

jreback commented Aug 5, 2021

OK, I would be fine with .iloc as long as it's clear how this differs from nth, head and tail, e.g. the use cases are clear in the docs and API.

@johnzangwill
Contributor Author

take

@johnzangwill
Contributor Author

I have implemented #42947 and submitted a pull request. I'm not sure what happens next (this is my first contribution...)

@mroeschke added the Indexing and Needs Discussion labels and removed the Needs Triage label on Aug 21, 2021
@johnzangwill
Contributor Author

Update to my #42947. I decided that the syntax and behaviour of my index was too different from DataFrame.iloc to use the same name. I implemented it as GroupBy.rows. I do understand that we are trying to reduce attributes rather than add to them, but I believe that my code adds useful functionality that is not otherwise available. It also resolves multiple requests for GroupBy.head and tail to handle negative arguments.
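
The thread does not spell out what a negative argument to GroupBy.head or tail should mean. Assuming it would mirror DataFrame.head(-n), which returns everything except the last n rows, the per-group equivalent can be sketched with public API by counting positions from the end of each group. The data and variable names below are illustrative only.

```python
import pandas as pd

# Illustrative data: group "a" has three rows, group "b" has two.
df = pd.DataFrame({"key": ["a", "b", "a", "b", "a"], "val": [1, 2, 3, 4, 5]})
grouped = df.groupby("key")

# Distance of each row from the end of its group (0 for the last row).
from_end = grouped.cumcount(ascending=False)

# Assumed meaning of a per-group head(-1): drop the last row of every group.
all_but_last = df[from_end >= 1]
print(all_but_last)
```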

@jreback added this to the 1.4 milestone on Oct 15, 2021
4 participants