groupby().first() skips NaN values #6732

Closed
jwkvam opened this issue Mar 29, 2014 · 12 comments · Fixed by #7044
Comments

@jwkvam
Contributor

jwkvam commented Mar 29, 2014

I do this

>>> df = pd.DataFrame([[1,np.nan,0],[1,1,1],[2,2,2],[2,3,3]], columns=list('abc'))
>>> print(df)
   a   b  c
0  1 NaN  0
1  1   1  1
2  2   2  2
3  2   3  3

[4 rows x 3 columns]
>>> print(df.groupby('a').first())
   b  c
a      
1  1  0
2  2  2

[2 rows x 2 columns]

but I expected this

>>> print(df.groupby('a').first())
     b  c
a      
1  NaN  0
2    2  2

[2 rows x 2 columns]

Is it possible to achieve my expected output? I get the same output in master and 0.13.1.

@hayd
Contributor

hayd commented Mar 29, 2014

Yup, you can use nth (for the moment skipping NaN is a feature of first/last):

In [5]: df.groupby('a').nth(0)
Out[5]:
   a   b  c
0  1 NaN  0
2  2   2  2

@jreback Now I recall why I haven't changed first/last yet: I need to give them a way to get this old (weird?) behaviour, similar to Series nth.
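
A self-contained sketch of the difference discussed so far, using the frame from the original report. The names `skipping`, `positional`, and `first_rows` are illustrative only, and the `head(1).set_index("a")` variant is an alternative not mentioned in the thread. The shape of the `nth(0)` result (whether `a` is a column or the index) has varied between pandas versions, so the sketch only relies on its values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, np.nan, 0], [1, 1, 1], [2, 2, 2], [2, 3, 3]],
    columns=list("abc"),
)

# first() picks the first non-null value in each column, per group,
# so the NaN in column 'b' of group 1 is skipped.
skipping = df.groupby("a").first()

# nth(0) picks the first row of each group as-is, NaN included.
positional = df.groupby("a").nth(0)

# head(1) is also positional; set_index puts 'a' back as the index,
# which gives the layout the original report expected.
first_rows = df.groupby("a").head(1).set_index("a")

print(skipping)
print(positional)
print(first_rows)
```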

@jwkvam
Contributor Author

jwkvam commented Mar 29, 2014

@hayd Thanks, if this is intended behavior then you can close this if you like. I'll leave it up to you.

@jreback
Contributor

jreback commented Mar 29, 2014

@hayd I think first/last should pass dropna=False, so it is really the first/last (and preserves backward compat)

In [4]: df.groupby('a').nth(0,dropna=False)
Out[4]: 
   a   b  c
0  1 NaN  0
2  2   2  2

[2 rows x 3 columns]

In [5]: df.groupby('a').first()
Out[5]: 
   b  c
a      
1  1  0
2  2  2

[2 rows x 2 columns]

jreback added this to the 0.14.0 milestone Mar 29, 2014

@jreback
Contributor

jreback commented Apr 3, 2014

@hayd we need to change first/last to use nth, right? (That will fix the perf issue as well.)
see here: http://stackoverflow.com/questions/22845856/performance-issues-with-groupbys-last-in-pandas/22846215?noredirect=1#comment34855219_22846215

@hayd
Contributor

hayd commented Apr 3, 2014

Yeah, perf of dropping NaN will still be an issue. (The easiest solution is to leave the current implementation "as is" when using dropna=True / the default.)

@jreback
Contributor

jreback commented Apr 21, 2014

@hayd you are covering this one I believe as well?

@hayd
Contributor

hayd commented Apr 21, 2014

yes

@jreback
Contributor

jreback commented May 5, 2014

@hayd ping!

@jreback
Contributor

jreback commented May 5, 2014

@hayd

might want to add dropna kw to first/last to emulate this behavior (but can put that off to 0.14.1 if you want).

However, I think that _set_selection_from_grouper needs to be called in nth, right?
(as this should have 'a' as the index, right?)

In [13]: df.groupby('a').nth(0,dropna=False)
Out[13]: 
   a   b  c
0  1 NaN  0
2  2   2  2

[2 rows x 3 columns]

In [8]: df.groupby('a').nth(0,dropna='any')
Out[8]: 
   b  c
a      
1  1  0
2  2  2

[2 rows x 2 columns]

In [4]: df.groupby('a').nth(0)
Out[4]: 
   a   b  c
0  1 NaN  0
2  2   2  2

[2 rows x 3 columns]
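
On the frame above, nth(0, dropna='any') and first() happen to return the same values, but the two are not equivalent in general: first() takes the first non-null value column by column, while nth with dropna='any' drops every row containing a NaN and then takes the first remaining row. A minimal sketch on a hypothetical frame `df2` (not from the thread) where the two differ; the comparison reads values by position because the index layout of the nth result depends on the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 1 has NaN in a different column on each of its
# first two rows, and only its third row is complete.
df2 = pd.DataFrame(
    {
        "a": [1, 1, 1, 2],
        "b": [np.nan, 1, 5, 2],
        "c": [0, np.nan, 5, 2],
    }
)

# Column-wise: first non-null 'b' is 1 (row 1), first non-null 'c' is 0 (row 0).
per_column = df2.groupby("a").first()

# Row-wise: rows 0 and 1 are dropped, so group 1 yields row 2 (b=5, c=5).
whole_row = df2.groupby("a").nth(0, dropna="any")

print(per_column)
print(whole_row)
```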

@alvarouc

I can confirm first keeps ignoring NaNs in pandas 0.25

@Rabeez

Rabeez commented Mar 9, 2020

I can confirm first keeps ignoring NaNs in pandas 1.0.1

@maximlt

maximlt commented Oct 4, 2020

If it's really how it works (which was a nice surprise to me) then it could be mentioned in the docs :)
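
For readers who land here wanting the NaN-preserving result, a hedged sketch follows. It assumes that newer pandas releases (2.2.1 and later, to the best of my knowledge) accept a `skipna` keyword on groupby `first()`; check the documentation for your installed version. The helper name `first_row_per_group` is hypothetical, and on older releases it falls back to `head(1)`, which keeps groups in order of appearance rather than sorted by key.

```python
import numpy as np
import pandas as pd


def first_row_per_group(frame, key):
    """Hypothetical helper: first row of each group, NaN kept, `key` as index."""
    try:
        # Assumption: this pandas version supports first(skipna=...).
        return frame.groupby(key).first(skipna=False)
    except TypeError:
        # Older pandas rejects the keyword; take the positional first row.
        return frame.groupby(key).head(1).set_index(key)


df = pd.DataFrame(
    [[1, np.nan, 0], [1, 1, 1], [2, 2, 2], [2, 3, 3]],
    columns=list("abc"),
)
result = first_row_per_group(df, "a")
print(result)
```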
