groupby.nth lost multiindex #11830

Closed
jesrael opened this issue Dec 12, 2015 · 9 comments

jesrael commented Dec 12, 2015

Is this a bug or not? With the functions mean and first the result is OK (the MultiIndex is kept).
link

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': ['b', 'b', 'b', 'a'], 'c': [1, 2, 3, 4]})
print(df)
#   a  b  c
#0  1  b  1
#1  1  b  2
#2  2  b  3
#3  2  a  4

# nth on the selected column: the MultiIndex is lost
print(df.groupby(['a', 'b']).c.nth(0))
#0    1
#2    3
#3    4
#Name: c, dtype: int64

# mean, first, and nth before column selection all keep the MultiIndex
print(df.groupby(['a', 'b']).c.mean())
#a  b
#1  b    1.5
#2  a    4.0
#   b    3.0
#Name: c, dtype: float64
print(df.groupby(['a', 'b']).c.first())
#a  b
#1  b    1
#2  a    4
#   b    3
#Name: c, dtype: int64
print(df.groupby(['a', 'b']).nth(0).c)
#a  b
#1  b    1
#2  a    4
#   b    3
#Name: c, dtype: int64
Contributor

jreback commented Dec 12, 2015

Looks like a bug, but this is quite a complicated area; see here.

If you'd like to dig in, that would be great.

jreback added this to the Next Major Release milestone Dec 12, 2015
Contributor

pwaller commented Jan 9, 2016

By the way, I just hit this I think, and you don't need a multi-index to cause it:

df.groupby(df.b).a.nth(0), df.groupby(df.b).a.first()

gives:

(0    1
 3    2
 Name: a, dtype: int64,
 b
 a    2
 b    1
 Name: a, dtype: int64)

(I would expect the two to be equal.) It seems that, for whatever reason, nth is losing its index.

Contributor

pwaller commented Jan 9, 2016

I just built several past versions to see if this was a regression. It doesn't look like a regression. I've tested as far back as 0.12.

Contributor

pwaller commented Jan 9, 2016

Sorry. I think my test may have been depending on a feature which didn't exist in older versions (the .name property of indices). It now looks like it gives the correct result in v0.13.0. This is my test:

$ python3 -c 'import pandas as P; raise SystemExit(not P.DataFrame({"a": [1, 2, 3, 4], "b": [9, 9, 9, 9]}).groupby("a").a.nth(0).index.equals([1, 2, 3, 4]))'
$ echo $?
0

The latest version gives a non-zero exit status.

Going to try an automated git bisect. Any hints on how to make pandas build faster in development mode?

Contributor

jreback commented Jan 9, 2016

http://pandas.pydata.org/pandas-docs/stable/groupby.html#groupby-nth

nth actually has slightly different semantics than .first. We have been discussing this in a couple of issues: #7569, #11038, and #11039
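
One difference described in the linked docs section is that nth picks a row purely by position, while first skips missing values. The following is a toy model in plain Python, not pandas code: a list stands in for one group's column, None stands in for NaN, and both function names are hypothetical.

```python
# Toy model only: a plain list stands in for one group's column and
# None plays the role of a missing value (NaN). This is not pandas code.

def nth_by_position(values, n):
    """Return the value at position n, missing or not (nth-like)."""
    if -len(values) <= n < len(values):
        return values[n]
    return None  # position does not exist in this group


def first_non_missing(values):
    """Return the first value that is not missing (first-like)."""
    return next((v for v in values if v is not None), None)


group = [None, 5, 6]  # hypothetical group whose first row is missing

print(nth_by_position(group, 0))  # None
print(first_non_missing(group))   # 5
```

So even when the indexes agree, the two can return different values for groups that start with a missing entry.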

Contributor

pwaller commented Jan 9, 2016

OK. Not that it matters but I was able to bisect the behaviour change to c444c73.

FWIW, for me this wasn't very clearly explained in the documentation, for example here. I guess it comes down to how you read the word "rows". For me, that just meant that it was picking out a row within the group, but I still expected it to have the index for that group, just as first() and last() do. It would be very useful if the docs made this behaviour difference clear.

Another thing that is a surprise to me:

>>> P.DataFrame({"a": [1, 2, 3, 4], "b": [9, 9, 9, 9]}).groupby("a").b.nth(0)
0    9
1    9
2    9
3    9
Name: b, dtype: int64

Is different from (flipped b.nth(0) to nth(0).b):

>>> P.DataFrame({"a": [1, 2, 3, 4], "b": [9, 9, 9, 9]}).groupby("a").nth(0).b
a
1    9
2    9
3    9
4    9
Name: b, dtype: int64

(Edited: I accidentally pressed submit prematurely!)
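
For readers who want the group keys as the index whichever nth behaviour their pandas version has, one possible workaround is to avoid nth for the "first row per group" case. This is only a sketch that nobody in this thread proposed; it assumes the frame from the opening comment and uses drop_duplicates, which keeps the first row per key combination by default.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2], 'b': ['b', 'b', 'b', 'a'], 'c': [1, 2, 3, 4]})

# Keep the first row of every (a, b) combination, then promote the keys
# to the index so the result is labelled by group rather than by row.
first_rows = (
    df.drop_duplicates(subset=['a', 'b'])
      .set_index(['a', 'b'])['c']
      .sort_index()
)

print(first_rows)
```

Unlike first(), this takes the first row as it is and does not skip missing values, which is closer to what nth(0) selects.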

Contributor

jreback commented Jan 9, 2016

The issue is that we need a bunch more tests, as there is a bug in the results. Can you do a PR with some more tests? That would help move this along.

Contributor

adneu commented May 24, 2016

I believe this was fixed with 445d1c6. The discussion for #11039 mentions that the commit now makes Groupby.nth reducing for a Series (keeping the original index) vs. the former behavior, which was filtering (position based index). Also that commit added these tests which I think address this bug.

+        assert_series_equal(g.B.nth(0), df.set_index('A').B.iloc[[0, 2]])
+        assert_series_equal(g.B.nth(1), df.set_index('A').B.iloc[[1]])
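
To make the "filtering" versus "reducing" distinction concrete, here is a toy model in plain Python. It is not pandas' implementation: the rows are the ones from the opening comment, the function names are hypothetical, dictionaries stand in for an indexed Series, and only non-negative n is handled.

```python
# Rows from the opening comment as (row label, a, b, c).
ROWS = [
    (0, 1, 'b', 1),
    (1, 1, 'b', 2),
    (2, 2, 'b', 3),
    (3, 2, 'a', 4),
]


def _nth_rows(rows, n):
    """Yield (row_label, group_key, value) for the n-th row of each (a, b) group."""
    seen = {}  # group key -> number of rows seen so far
    for label, a, b, c in rows:
        key = (a, b)
        position = seen.get(key, 0)
        seen[key] = position + 1
        if position == n:
            yield label, key, c


def nth_filtering(rows, n):
    """Filtering: selected rows keep their original row labels."""
    return {label: value for label, _, value in _nth_rows(rows, n)}


def nth_reducing(rows, n):
    """Reducing: selected rows are labelled by their sorted group keys."""
    return dict(sorted((key, value) for _, key, value in _nth_rows(rows, n)))


print(nth_filtering(ROWS, 0))  # {0: 1, 2: 3, 3: 4}
print(nth_reducing(ROWS, 0))   # {(1, 'b'): 1, (2, 'a'): 4, (2, 'b'): 3}
```

The filtering result matches the positional index reported at the top of this issue, and the reducing result matches the group-keyed output that first() gives there.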

Contributor

jreback commented May 24, 2016

OK, #12839 closed this, but see the comment on the actual PR (we ended up re-opening that one) because the state was not being preserved.

You're welcome to address that issue.

thanks!

jreback closed this as completed May 24, 2016
jreback modified the milestones: 0.18.2, Next Major Release May 24, 2016