v0.14 group.head() not really grouped #7287

Closed
michaelaye opened this issue May 30, 2014 · 9 comments
Labels
Docs, Duplicate Report (duplicate issue or pull request), Groupby

Comments

@michaelaye
Contributor

This might not necessarily be new in 0.14, but I find the output of group.head() inappropriate for a grouping:

df = pd.DataFrame(np.random.randn(6,2))
df['A'] = [1,2,2,1,1,2]
df
          0         1  A
0 -0.047101  0.828542  1
1  1.617815  0.362700  2
2  1.366453 -1.116897  2
3  0.086743 -0.611371  1
4  1.918440 -1.230909  1
5 -1.003828 -0.592541  2
g = df.groupby('A')
g.head(2)
          0         1  A
0 -0.047101  0.828542  1
1  1.617815  0.362700  2
2  1.366453 -1.116897  2
3  0.086743 -0.611371  1

I would have expected the previous output to be:

          0         1  A
0 -0.047101  0.828542  1
3  0.086743 -0.611371  1
1  1.617815  0.362700  2
2  1.366453 -1.116897  2

because, after all, this is the result of a grouping, so things should be displayed grouped, shouldn't they?

@hayd
Contributor

hayd commented May 30, 2014

http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#groupby-api-changes

The previous output had A as the index (unlike a filter); see #5755 and #6533.

Filter is another example which doesn't sort by the group keys... IMO they shouldn't be sorted.

@michaelaye
Contributor Author

I saw the docs. But the problem (for me) does not appear with head(1), it only starts to be confusing (in terms of expecting a 'grouped' result) with head(n) and n > 1.
Could we not have the sorted behavior like from g.apply(lambda x: x.head(n)) via a kwarg? Or is it available in other ways?
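The difference between head(1) and head(n) with n > 1 can be reproduced with fixed data. This is a minimal sketch, assuming a recent pandas release; the column name `x` and its values are made up for illustration and replace the random columns from the original report.

```python
import pandas as pd

# Illustrative data: same group labels as the report, fixed values instead of random ones.
df = pd.DataFrame({"x": [10, 20, 30, 40, 50, 60], "A": [1, 2, 2, 1, 1, 2]})
g = df.groupby("A")

# One row per group: the rows come back in original row order,
# which here happens to look sorted by group.
head1 = g.head(1)

# Two rows per group: still original row order, so the groups interleave.
head2 = g.head(2)

print(head1["A"].tolist())
print(head2["A"].tolist())
```

With head(1) the result only looks grouped because the first row of each group happens to appear in key order in this frame.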

@jreback
Contributor

jreback commented May 30, 2014

So you clearly want this:

In [22]: df.groupby('A').apply(lambda x: x.head(2))
Out[22]: 
            0         1  A
A                         
1 0 -0.643972  0.928889  1
  3  0.223195 -0.035344  1
2 1 -1.787060 -0.024865  2
  2 -0.942621 -0.763324  2

However, I think that approach could easily be fooled, so head returns the rows in index order. Here the order of the grouper column is reversed:

In [21]: df.sort('A',ascending=False).reset_index(drop=True).groupby('A').head(2)
Out[21]: 
          0         1  A
0 -1.787060 -0.024865  2
1 -0.942621 -0.763324  2
3 -0.643972  0.928889  1
4  0.223195 -0.035344  1

You can also do this

In [25]: df.groupby('A').head(2).sort('A')
Out[25]: 
          0         1  A
0 -0.643972  0.928889  1
3  0.223195 -0.035344  1
1 -1.787060 -0.024865  2
2 -0.942621 -0.763324  2

.head() is a filter, so it is just a pass-through; but you want to sort in lexicographic group order.

There is a sort=True keyword in .groupby(...), but it is currently not used here.
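For readers on later pandas releases, where DataFrame.sort was replaced by sort_values, the "head, then sort" workaround might look like the sketch below. The data and the column name `x` are illustrative assumptions, and `kind="mergesort"` is chosen here only to get a stable sort so that rows keep their original relative order within each group.

```python
import pandas as pd

# Illustrative data with the same group labels as the original report.
df = pd.DataFrame({"x": [10, 20, 30, 40, 50, 60], "A": [1, 2, 2, 1, 1, 2]})

# head(2) keeps the first two rows of each group in original row order;
# a stable sort on the group key then places each group's rows together.
grouped_view = df.groupby("A").head(2).sort_values("A", kind="mergesort")

print(grouped_view)
```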

@hayd
Contributor

hayd commented May 31, 2014

filter also doesn't respect sort... surprisingly, the default is sort=True, so I don't think it's reasonable for this behaviour to change.
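The point about filter can be checked with a small sketch, assuming a recent pandas release. The data, the column name `x`, and the "at least two rows" condition are illustrative assumptions.

```python
import pandas as pd

# Illustrative data: group 2 has a single row, groups 1 and 3 have several.
df = pd.DataFrame({"x": [10, 20, 30, 40, 50, 60], "A": [3, 1, 1, 3, 2, 1]})

# Keep only groups with at least two rows. sort=True is the groupby default,
# yet the surviving rows come back in their original row order.
kept = df.groupby("A", sort=True).filter(lambda grp: len(grp) >= 2)

print(kept["A"].tolist())
```

The single-row group is dropped and the remaining rows are not reordered by key.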

@jreback jreback added this to the 0.14.1 milestone Jun 2, 2014
@jreback
Contributor

jreback commented Jun 2, 2014

@hayd ok...let's think about this and see what, if anything, should be changed

@jreback
Contributor

jreback commented Jun 28, 2014

@hayd see my PR #7580

I think I finally got to the bottom of this. Basically, cumcount doesn't sort, while standard groupby does (as sort=True is the default). This should be fixed for 0.15.0. (You need to use `group_sorter`, like in `DataSplitter`, to compute the group orderings.)

So right now head/tail/nth don't sort, while everything else does. In some sense that's OK, because these are filters, but it should be consistent.

So groupby(..).nth(0)==groupby(..,sort=False).first()
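The contrast between an aggregation and a filter-like method can be sketched as follows, assuming a recent pandas release. The data and the column name `x` are illustrative. nth is left out on purpose, since its return shape has differed between pandas versions.

```python
import pandas as pd

# Illustrative data: key 2 appears before key 1 in the frame.
df = pd.DataFrame({"x": [10, 20, 30, 40, 50, 60], "A": [2, 1, 1, 2, 2, 1]})

# Aggregation: sort=True (the default) orders the result by group key,
# sort=False keeps the order in which the keys first appear.
sorted_first = df.groupby("A").first()
unsorted_first = df.groupby("A", sort=False).first()

# Filter-like method: rows come back in original row order even with sort=True.
head_rows = df.groupby("A", sort=True).head(1)

print(sorted_first.index.tolist())
print(unsorted_first.index.tolist())
print(head_rows["A"].tolist())
```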

@hayd
Copy link
Contributor

hayd commented Jun 28, 2014

My opinion is that sort should just not be an option for filter-like groupby methods...definitely don't think it should ever be the default. Perhaps I'm missing a compelling argument for this API change... but I thought this was raised in the larger "consistent groupby" issue.

Maybe the answer here is to change the docstring of groupby's sort argument to say "for aggregated output" (just like as_index) ?

If a user wants this behaviour, they can groupby and then sort...?

@michaelaye
Contributor Author

My biggest problem is the mental disconnect with what I expect as the outcome of a 'groupby' (but maybe that's where I fail?). The user, or at least I, would expect that everything after a groupby is grouped. Having part of a group split off and shown after a different group is just not what I would expect after a groupby operation; that's how I stumbled over it. It just feels like a disconnect in my head, but as I said, that might be based on a misunderstanding of groupby.

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@datapythonista datapythonista modified the milestones: Contributions Welcome, Someday Jul 8, 2018
@mroeschke mroeschke added Docs and removed API Design labels Apr 11, 2021
@mroeschke mroeschke removed this from the Someday milestone Oct 13, 2022
@rhshadrach
Member

Closing as a duplicate of #17775

@rhshadrach rhshadrach added the Duplicate Report Duplicate issue or pull request label Oct 20, 2022
6 participants