BUG: groupby doesn't distinguish between different kinds of null values #48476
Comments
I had a similar discussion with brock yesterday. In masked arrays, None, np.nan and pd.NA are all treated equally, so if you decide to make changes, you have to consider this and keep the behaviour there as is. For NumPy dtypes this might be a different story.
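A minimal sketch of the coercion the comment above describes: in a masked ("nullable") dtype such as `Float64`, every null sentinel is converted to `pd.NA` at construction, so downstream operations cannot tell them apart.

```python
import numpy as np
import pandas as pd

# All three null sentinels are coerced to pd.NA in a masked dtype.
s = pd.Series([1.0, np.nan, None, pd.NA], dtype="Float64")
print(s)  # the nan, None and NA inputs all display as <NA>
```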
This would be a massive change and will cause quite some issues, I suspect, and will break user code. We need a proposal on how to handle this generally rather than with local patches.
In case anyone is unaware: `is_valid_na_for_dtype` and `is_matching_na` may be relevant. #32265 will be a PITA until the end of time.
Generally agreed with the others' sentiment that if different nulls are distinct in groupby, they should probably be distinct in other ops as well. FWIW, I am generally +1 to making
Just to be clear about the scope of this discussion: I think this is limited to

I agree that we should see this in a more general context, and not just limited to groupby (also things like count, unique, value_counts, duplicated, etc. in a non-groupby context). With pandas 1.3:

With main:

So if we do it there, it might also make sense to follow that in groupby?
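The code outputs referenced above did not survive in this copy, but the comparison can be reproduced with something like the following sketch. The exact results of `unique` and `value_counts` differ between pandas 1.3 and main, which is the point of the comment, so no outputs are shown.

```python
import numpy as np
import pandas as pd

# Object dtype preserves the distinct null sentinels on construction;
# whether unique / value_counts collapse them has varied across versions.
s = pd.Series([np.nan, None, pd.NA], dtype=object)
print(s.unique())
print(s.value_counts(dropna=False))
print(s.duplicated())
```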
I think we should distinguish two different aspects here: 1) how are different null sentinels treated as input to a method or operation (how are they coerced), and 2) how do we treat different null sentinels if they are present in an already typed DataFrame object. For example, using the top post example but forcing a nullable dtype:

The question here is whether the first line is the behaviour we want (i.e. the fact that all of NaN, NA and None are coerced into NA). But once you have that dataframe, the actual groupby step is of course correct to put all NAs in a single group.
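A sketch of the two aspects distinguished above, using a hypothetical frame forced to the nullable `Int64` dtype (the original code block was lost in this copy): coercion happens at construction (aspect 1), so by the time groupby runs there is only one kind of null left and a single NA group is correct (aspect 2).

```python
import numpy as np
import pandas as pd

# Aspect 1: np.nan, None and pd.NA are all coerced to pd.NA on construction.
df = pd.DataFrame(
    {"key": pd.array([np.nan, None, pd.NA, 1], dtype="Int64"),
     "val": [1, 2, 3, 4]}
)
print(df["key"])  # the three null inputs all display as <NA>

# Aspect 2: groupby then correctly sees a single NA group.
res = df.groupby("key", dropna=False)["val"].sum()
print(res)
```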
The three types of null values currently get combined into a single group. There are various places in pandas where different types of null values are identified; e.g. `pd.Series([np.nan, None])` converts `None` to `np.nan`. However, within groupby, I think if the input contains distinct values, then they should remain distinct in the groupby result.

In order to change this, the change will need to be made in `factorize`, which could impact other parts outside of groupby. Our tests didn't catch any such instances (see #48477 for an implementation), but I plan to look into this further.

Assuming we consider the current output undesirable, we need to decide if we are going to call this a bug or if it should go through deprecation. This is tested for in `test_groupby_dropna_multi_index_dataframe_nan_in_two_groups`, which I think might suggest deprecating; however, that test was added in the original PR that introduced `dropna` in groupby (#30584). I went through that PR and did not see any discussion of this behavior. That makes me lean toward calling it a bug, but I could go either way here.

Slightly related: #32265
cc @charlesdong1991 @jorisvandenbossche @jbrockmendel @mroeschke @jreback for any thoughts.