-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: groupby.first/last with nans #8427
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
this is designed behavior, and does not depend on types:
|
I think it's incorrect for the frame case (series case ok) |
Related: #6732 |
I think this is confusing at the least. So you have multiple values on the first row, some of which are NaN. I guess |
So the documented behavior of Groupby.first / Groupby.last is the first / last non-NA value? Anything to do here? |
There's some inconsistency between In [10]: df.groupby('id').nth(0)
Out[10]:
val val2
id
a NaN -1 |
@WillAyd - first is designed to work column by column whereas nth selects a single row. So it seems to me this inconsistency is purposeful (but agree it's confusing). To me an ideal solution would allow a user to
If we had this, then it'd be best to align nth/first so the default behavior of nth(0) is the same as first, and similarly with last. |
I think your second bullet feels right, and might have the added bonus of simplifying our code paths |
take |
@rhshadrach , I understand that modifying first and last methods to point to nth would be the way to go. This would cause this behaviour.
Then, we would include another parameter to decide if it is the nth value by row or column-by-column. |
@NumberPiOso - Certainly we should include the parameter first if going that route before changing the default behavior of first/last. However, I do not feel certain that is the correct approach. I would have to guess that first/last are common operations, at least more so than nth. Is changing the default of first to mean "first row" more useful than "first value in each column"? That is very unclear to me. It's at this point that I wonder if changing the behavior here is more disruptive than helpful. As a tangential aside, I've always found treating nans differently in first/last/etc more surprising than beneficial, but perhaps that's just my use case. However this is pretty consistent across pandas and I certainly don't find it sever enough to merit a change to the default behavior (or at the very least, merit me arguing for a change!). |
This is true. |
This is incorrect, as this is applied column by column (as they are different dtypes)
so
_first_compat
should first compute the mask then use it.from SO
The text was updated successfully, but these errors were encountered: