-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REGR: fix case all-NaN/numeric object column in groupby #39655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REGR: fix case all-NaN/numeric object column in groupby #39655
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will have a look but let's leave for 1.2.3
|
||
# unwrap DataFrame to get array | ||
result = mgr.blocks[0].values | ||
return result | ||
mgr = result._mgr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of this can you now just define as_array properly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of simply consolidate first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's already consolidated above, but if you have multiple dtypes, that doesn't help
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, as the diff seems to be larger (because I remove the assert above and moved the comment), but basically the only code addition is this:
if len(mgr.blocks) != 1:
return mgr.as_array()
to handle the case of multiple blocks that need to be converted into an array (while now this raised an AssertionError because of assert len(mgr.blocks) == 1
This is a quite straightforward revert of #35841, a clean-up PR in which was said "we need to identify a test case in which this doesnt work". And so I have added a few tests that actually do result in multiple bocks in that code path. |
we had this discussion before appreciate them but it's too late |
@simonjayhawkins is not releasing today, so there is still a bit of time. Other PRs have been merged today as well. If it's not ready by tomorrow, then fine that it's for 1.2.3, but until then this can be assumed for 1.2.2 IMO (not as a blocker, but as "include if merged by releasing"). |
cc @jbrockmendel the tests added here are example cases of how you can end up with multiple blocks in this code path. Basically if you have multiple object dtype columns (a single block originally), but some column is entirely numeric, after the aggregation we infer the numeric dtype, and so you can end up with columns of different dtypes (and thus multiple blocks). |
dates = pd.date_range("2020-01-01", periods=15, freq="D") | ||
df1 = DataFrame({"key": "A", "date": dates, "col1": range(15), "col_object": "val"}) | ||
df2 = DataFrame({"key": "B", "date": dates, "col1": range(15)}) | ||
df = pd.concat([df1, df2], ignore_index=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does the consolidation status of df
matter here? does it matter that df is not constructed directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't matter in this case. The dataframe that gets resampled (after the first groupby) seems to be consolidated anyway (I assume that creating the sub-dataframe to resample might incur consolidation?). But parametrized the test to cover both cases to be sure.
lets make sure to add this to the running list fo value-dependent behaviors |
Since multiple people reviewed this, I think this can be merged? (and to repeat: this has no impact whatsoever on code that already worked before) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jorisvandenbossche
thanks @jorisvandenbossche |
@meeseeksdev backport 1.2.x |
This comment has been minimized.
This comment has been minimized.
cc @simonjayhawkins @jorisvandenbossche for manual backport |
@jbrockmendel I think this is more a case of "we liberally try to infer object dtype" (although of course those are strictly also a kind of value dependent behaviour, but for object dtype we have plenty of that) |
@jorisvandenbossche the conflict is
are you happy to accept the incoming changes here. will also include changes from #36010, specifically the |
…olumn in groupby
@simonjayhawkins I don't directly see the difference with the diff of this PR, the above seems exactly what I changed, so accepting the incoming changes in that conflict should be fine |
should be more visible on #39677, on 1.2.x we don't have the |
…roupby (#39677) Co-authored-by: Joris Van den Bossche <[email protected]>
Fixes #39329
This basically reverts #35841, although the code itself has changed quite a bit since then, so it's simpler here as the original PR² (not dealing with blocks anymore)
The original code comment that was removed in #35841 "We've split an object block!" was correct, as it can actually happen in those example cases.