groupby().first() skips NaN values #6732

jwkvam · 2014-03-29T03:50:09Z

I do this

>>> df = pd.DataFrame([[1,np.nan,0],[1,1,1],[2,2,2],[2,3,3]], columns=list('abc'))
>>> print(df)
   a   b  c
0  1 NaN  0
1  1   1  1
2  2   2  2
3  2   3  3

[4 rows x 3 columns]
>>> print(df.groupby('a').first())
   b  c
a      
1  1  0
2  2  2

[2 rows x 2 columns]

but I expected this

>>> print(df.groupby('a').first())
     b  c
a      
1  NaN  0
2    2  2

[2 rows x 2 columns]

Is it possible to achieve my expected output? I get the same output in master and 0.13.1.

hayd · 2014-03-29T06:14:27Z

Yup, you can use nth (for the moment skipping NaN is a feature of first/last):

In [5]: df.groupby('a').nth(0)
Out[5]:
   a   b  c
0  1 NaN  0
2  2   2  2
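For anyone reading this on a newer pandas, here is a minimal sketch of the difference, using the frame from the original report. It avoids nth on purpose, since the shape of its result has changed between releases (compare the outputs quoted in this thread), and uses head(1) and a per-column agg lambda instead. Those two substitutes are an assumption about what is stable across versions, not something proposed in this thread.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, np.nan, 0], [1, 1, 1], [2, 2, 2], [2, 3, 3]],
    columns=list("abc"),
)

# first() returns the first non-null value of each column within a group,
# so the NaN in column b of group 1 is skipped.
skipping = df.groupby("a").first()

# Taking position 0 of each column keeps whatever is there, NaN included.
positional = df.groupby("a").agg(lambda s: s.iloc[0])

# head(1) keeps the first row of each group and the original row index.
first_rows = df.groupby("a").head(1)

print(skipping)
print(positional)
print(first_rows)
```

The agg lambda gives the grouped layout the original report expected (a as the index, NaN preserved), while head(1) returns the untouched first rows.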

@jreback Now I recall why I haven't changed first/last yet: I need to give them a way to keep this old (weird?) behaviour, similar to Series nth.

jwkvam · 2014-03-29T06:49:49Z

@hayd Thanks, if this is intended behavior then you can close this if you like. I'll leave it up to you.

jreback · 2014-03-29T13:32:04Z

@hayd I think first/last should pass dropna=False, so it is really the first/last (and preserves backward compat)

In [4]: df.groupby('a').nth(0,dropna=False)
Out[4]: 
   a   b  c
0  1 NaN  0
2  2   2  2

[2 rows x 3 columns]

In [5]: df.groupby('a').first()
Out[5]: 
   b  c
a      
1  1  0
2  2  2

[2 rows x 2 columns]

jreback · 2014-04-03T20:52:26Z

@hayd we need to change first/last to use nth right? (and will fix the perf issue as well)
see here: http://stackoverflow.com/questions/22845856/performance-issues-with-groupbys-last-in-pandas/22846215?noredirect=1#comment34855219_22846215

hayd · 2014-04-03T20:56:09Z

Yeah, perf of dropping NaN will still be an issue. (Easiest solution is just to leave the current implementation "as is" when using dropna=True / default.)

jreback · 2014-04-21T19:45:54Z

@hayd you are covering this one I believe as well?

hayd · 2014-04-21T20:37:25Z

yes

jreback · 2014-05-05T00:15:54Z

@hayd ping!

jreback · 2014-05-05T00:20:03Z

@hayd

might want to add dropna kw to first/last to emulate this behavior (but can put that off to 0.14.1 if you want).

However, I think that _set_selection_from_grouper needs to be called in nth, right?
(as this should have 'a' as the index, right?)

In [13]: df.groupby('a').nth(0,dropna=False)
Out[13]: 
   a   b  c
0  1 NaN  0
2  2   2  2

[2 rows x 3 columns]

In [8]: df.groupby('a').nth(0,dropna='any')
Out[8]: 
   b  c
a      
1  1  0
2  2  2

[2 rows x 2 columns]

In [4]: df.groupby('a').nth(0)
Out[4]: 
   a   b  c
0  1 NaN  0
2  2   2  2

[2 rows x 3 columns]
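One caveat when comparing these outputs: on this particular frame, dropping rows with any NaN and then taking the first row happens to match first(), but the two are not the same operation in general. A small sketch follows, assuming a hypothetical frame where the NaNs sit in different rows. The row-wise variant is written with DataFrame.dropna plus head(1) rather than nth(0, dropna='any'), purely to keep the example independent of the nth signature.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one group, with NaN in a different column on each of
# the first two rows.
df = pd.DataFrame(
    {
        "a": [1, 1, 1],
        "b": [np.nan, 1.0, 2.0],
        "c": [0.0, np.nan, 5.0],
    }
)

# Column-wise: each column independently yields its first non-null value,
# so the result mixes values from row 1 (b) and row 0 (c).
per_column = df.groupby("a").first()

# Row-wise: discard every row containing a NaN, then keep the first
# surviving row of each group (row 2 here).
row_wise = df.dropna(how="any").groupby("a").head(1)

print(per_column)
print(row_wise)
```

Here first() reports b=1.0 and c=0.0, a combination that never occurs together in any single row, whereas the row-wise version returns the actual row with b=2.0 and c=5.0.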

alvarouc · 2019-07-29T14:45:15Z

I can confirm first keeps ignoring NaNs in pandas 0.25

Rabeez · 2020-03-09T13:53:24Z

I can confirm first keeps ignoring NaNs in pandas 1.0.1

maximlt · 2020-10-04T00:39:33Z

If it's really how it works (which was a nice surprise to me) then it could be mentioned in the docs :)

jreback added API Design labels Mar 29, 2014

jreback added this to the 0.14.0 milestone Mar 29, 2014

jreback assigned hayd Mar 29, 2014

jreback mentioned this issue May 5, 2014

API: update nth to use the _set_selection_from_grouper makes first==nth(0) and last==nth(-1) #7044

Merged

jreback closed this as completed in #7044 May 12, 2014

jorisvandenbossche mentioned this issue Oct 1, 2014

BUG: groupby.first/last with nans #8427

Closed

wesm unassigned hayd Oct 12, 2016

sursu mentioned this issue Apr 14, 2018

.last() does not perform as expected #20657

Closed
