DOC: Order of groups in groupby and head method #17775

Open
jmcarcell opened this issue Oct 4, 2017 · 5 comments

@jmcarcell

When sort=True is passed to groupby (which is the default), the groups are in sorted order. If you loop through them they are in sorted order, and if you compute the mean, std, etc. the results are in sorted order, but if you use the head method they are NOT in sorted order.


import pandas as pd

df = pd.DataFrame([[2, 100], [2, 200], [2, 300], [1, 400], [1, 500], [1, 600]], columns = ['A', 'B'])

grouped = df.groupby(df['A'], sort = True)
for name, group in grouped:
    print(group)
print(grouped.mean())
print(grouped.head(1))

Is this expected? I have not found this behaviour documented.
I think it is confusing, and it caused me a headache: I was combining the output of the mean and head methods in the same DataFrame, and since the data was not sorted beforehand, the results were getting mixed up because of this ordering difference. I am using pandas 0.20.3.

@jreback
Contributor

jreback commented Oct 4, 2017

see the docs http://pandas-docs.github.io/pandas-docs-travis/groupby.html#filtration.

.head() gives you the rows as ordered in the frame. sort=True on the groupby only applies to the actual ordering of the groups, not the elements within a group. This is as expected.

I suppose the fact that manual iteration is in sorted order could be better documented. Can you submit a PR to that effect (you may use your example in a warning or note box in http://pandas-docs.github.io/pandas-docs-travis/groupby.html#iterating-through-groups )? I agree it could be slightly unexpected.
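
A minimal sketch of the distinction described above, reusing the frame from the original report. It assumes a reasonably recent pandas; the variable names (`mean_keys`, `head_index`, `head_keys`) are only illustrative. A reducer such as mean is indexed by the sorted group keys, while head returns rows of the original frame with their original index labels and relative order.

```python
import pandas as pd

df = pd.DataFrame(
    [[2, 100], [2, 200], [2, 300], [1, 400], [1, 500], [1, 600]],
    columns=["A", "B"],
)
grouped = df.groupby("A", sort=True)

# A reducer returns one row per group, indexed by the sorted group keys.
mean_keys = grouped["B"].mean().index.tolist()

# head() returns rows of the original frame: the index labels and the
# relative row order are those of df, not of the sorted keys.
head_rows = grouped.head(1)
head_index = head_rows.index.tolist()
head_keys = head_rows["A"].tolist()

print(mean_keys)   # [1, 2]
print(head_index)  # [0, 3]
print(head_keys)   # [2, 1]
```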

@jreback jreback added this to the Next Major Release milestone Oct 4, 2017
@jreback jreback changed the title Order of groups in groupby and head method DOC: Order of groups in groupby and head method Oct 4, 2017
@jmcarcell
Author

jmcarcell commented Oct 4, 2017

> sort=True on the groupby only applies to the actual ordering of the groups, not the elements within a group. This is as expected.

This is true and is well documented.

The problem I find is not with iterating through groups but with .head() itself. I tested, and all of the following methods give a result ordered by group key: .all, .any, .count, .cov, .describe, .get_group, .max, .mean, .median, .nth, etc. However, .head() returns the first rows of each group in the order they appear in the DataFrame, despite being a method of the GroupBy object (which should be ordered because of sort=True!). In my example, every other method yields a result where the block with A=1 comes first and A=2 second, while head yields the rows of block A=2 first and then A=1. If you don't know this and combine the result of head() with that of any other method, your computations will be wrong, because you will be mixing an ordered result with an unordered one. Silly example:

import pandas as pd

df = pd.DataFrame([[2, 100], [2, 200], [2, 300], [1, 400], [1, 500], [1, 600]], columns = ['A', 'B'])

grouped = df.groupby(df['A'], sort = True)
df1 = grouped.sum()
df2 = grouped.head(1)
print(df1['B'].values - df2['B'].values)

If you subtract the first value from the sum you should get the sum of the other two, but since the order is different you do not.
I think the expected behaviour should be that head returns the first rows in the new (sorted) order, as all the other methods do.
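
One possible way around the mismatch in the subtraction example, offered as a sketch rather than as the recommended pandas idiom: index both results by the group key and let pandas align on labels instead of subtracting positional `.values`. The names `sums`, `firsts`, `positional` and `rest` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[2, 100], [2, 200], [2, 300], [1, 400], [1, 500], [1, 600]],
    columns=["A", "B"],
)
grouped = df.groupby("A", sort=True)

sums = grouped["B"].sum()                     # indexed by key, order 1, 2
firsts = grouped.head(1).set_index("A")["B"]  # indexed by key, order 2, 1

# Positional subtraction pairs up rows that belong to different groups.
positional = (sums.to_numpy() - grouped.head(1)["B"].to_numpy()).tolist()

# Series subtraction aligns on index labels, so row order no longer matters.
rest = (sums - firsts).sort_index()

print(positional)      # [1400, 200]
print(rest.to_dict())  # {1: 1100, 2: 500}
```

Here 1100 is 500 + 600 and 500 is 200 + 300, which is the "sum of the other two" the example was after.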

Edit:
In http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.head.html
it says that it is essentially equivalent to .apply(lambda x: x.head(n)), yet when you use this expression you obtain, again, a result ordered by group key, and when using .head() you don't.

@mattayes
Contributor

Hey @Ifnister, so to be clear:

# Setup
import pandas as pd

data = [
    {'species': 'setosa', 'sepal_length': 5.1},
    {'species': 'versicolar', 'sepal_length': 5.6},
    {'species': 'virginica', 'sepal_length': 5.7},
]
df = pd.DataFrame(data).sort_values('species', ascending=False)
df
   sepal_length     species
0           5.7   virginica
1           5.6  versicolar
2           5.1      setosa

This is what you currently get using head() (the order of species matches the original order):

df.groupby('species').head()
   sepal_length     species
0           5.7   virginica
1           5.6  versicolar
2           5.1      setosa

However, this is what you get if you use sum() or most other methods (species is sorted and doesn't match the original order):

df.groupby('species').sepal_length.sum()
species
setosa        5.1
versicolar    5.6
virginica     5.7
Name: sepal_length, dtype: float64

And you expect the output of head() to look like this:

   sepal_length     species
0           5.1      setosa
1           5.6  versicolar
2           5.7   virginica

Is that right?

@tdpetrou
Contributor

tdpetrou commented Nov 15, 2017

There is a big problem with the docstrings here for DataFrameGroupBy.head. They say:

Essentially equivalent to ``.apply(lambda x: x.head(n))``,
except ignores as_index flag.

DataFrameGroupBy.head keeps the original ordering of the DataFrame. It doesn't even order by the keys. Using .apply(lambda x: x.head(n)) puts the group keys in the index and sorts them.

Edit: I see that @Ifnister already pointed this out. I think it would make a lot more sense for head to actually do .apply(lambda x: x.head(n)).
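
A small sketch of the difference between the two spellings, under a couple of assumptions: column B is selected first so that apply operates on a Series (this sidesteps version-specific handling of the grouping column in DataFrame apply), and group_keys=True is passed explicitly. The names `filtered` and `applied` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[2, 100], [2, 200], [2, 300], [1, 400], [1, 500], [1, 600]],
    columns=["A", "B"],
)
by_a = df.groupby("A", sort=True, group_keys=True)["B"]

# head(): keeps the original index and the original row order.
filtered = by_a.head(1)

# apply(...head...): group keys become the outer index level, in sorted order.
applied = by_a.apply(lambda s: s.head(1))

print(filtered.index.tolist())                    # [0, 3]
print(filtered.tolist())                          # [100, 400]
print(applied.index.get_level_values(0).tolist()) # [1, 2]
print(applied.tolist())                           # [400, 100]
```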

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@rhshadrach
Member

Head is a filter; sort is only applied to reducers within groupby. To my knowledge this isn't documented, but I haven't checked. I think documentation on this should be added to the Series.groupby and DataFrame.groupby API docs as well as the User Guide.
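
To illustrate the filter versus reducer point, here is a hedged sketch that toggles the sort flag on the frame from the original report. The helper `summarize` is hypothetical and exists only for this example. The reducer's index changes with sort, while the rows returned by head stay in frame order either way.

```python
import pandas as pd

df = pd.DataFrame(
    [[2, 100], [2, 200], [2, 300], [1, 400], [1, 500], [1, 600]],
    columns=["A", "B"],
)

def summarize(sort):
    """Return (group keys of a reducer, row labels of head) for a given sort flag."""
    g = df.groupby("A", sort=sort)["B"]
    return g.sum().index.tolist(), g.head(1).index.tolist()

sorted_result = summarize(True)
unsorted_result = summarize(False)

print(sorted_result)    # ([1, 2], [0, 3])
print(unsorted_result)  # ([2, 1], [0, 3])
```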
