REGR: fix case all-NaN/numeric object column in groupby #39655

jorisvandenbossche · 2021-02-07T19:56:10Z

This basically reverts #35841, although the code itself has changed quite a bit since then, so the change is simpler here than in the original PR (we are not dealing with blocks directly anymore).

The original code comment that was removed in #35841 "We've split an object block!" was correct, as it can actually happen in those example cases.

This reverts commit 0e199f3.

jreback

will have a look but let's leave for 1.2.3

jreback · 2021-02-07T20:22:25Z

pandas/core/groupby/generic.py


            # unwrap DataFrame to get array
-            result = mgr.blocks[0].values
-            return result
+            mgr = result._mgr


instead of this can you now just define as_array properly?

or simply consolidate first?

It's already consolidated above, but if you have multiple dtypes, that doesn't help

To be clear: the diff looks larger than it really is (because I removed the assert above and moved the comment), but basically the only code addition is this:

if len(mgr.blocks) != 1: return mgr.as_array()

to handle the case of multiple blocks that need to be converted into a single array (whereas until now this raised an AssertionError because of assert len(mgr.blocks) == 1).
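The branch described above can be sketched with public API only. This is a rough analogue, not the code in pandas/core/groupby/generic.py: unwrap_to_array is a hypothetical helper, and it counts distinct column dtypes as a stand-in for counting blocks (the real code inspects the private BlockManager).

```python
import numpy as np
import pandas as pd


def unwrap_to_array(result: pd.DataFrame) -> np.ndarray:
    """Hypothetical helper: return the frame's values as one 2D array."""
    dtypes = result.dtypes.unique()
    if len(dtypes) != 1:
        # Columns ended up with different dtypes (the "split block" case):
        # fall back to a single array with a common dtype.
        return result.to_numpy()
    # Single dtype: the values can be returned with that dtype unchanged.
    return result.to_numpy(dtype=dtypes[0])


same = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
mixed = pd.DataFrame(
    {"a": [1.0, 2.0], "b": pd.Series(["x", "y"], dtype=object)}
)

same_arr = unwrap_to_array(same)    # float64 array
mixed_arr = unwrap_to_array(mixed)  # object array holding floats and strings
```

The point of the sketch is only that the mixed-dtype case returns a usable array instead of tripping an assertion; the single-dtype path behaves as before.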

jorisvandenbossche · 2021-02-07T20:32:15Z

This is a quite straightforward revert of #35841, a clean-up PR in which it was said "we need to identify a test case in which this doesnt work". So I have added a few tests that actually do result in multiple blocks in that code path.
Also, you can only end up in the new code for cases that would otherwise have led to an AssertionError, so I think this is safe to include for 1.2.2.

jreback · 2021-02-07T20:42:12Z

we had this discussion before
about late PRs

appreciate them but it's too late
and 1.2.3 is appropriate

jorisvandenbossche · 2021-02-07T20:48:15Z

@simonjayhawkins is not releasing today, so there is still a bit of time. Other PRs have been merged today as well. If it's not ready by tomorrow, then fine that it's for 1.2.3, but until then this can be assumed for 1.2.2 IMO (not as a blocker, but as "include if merged by releasing").

jorisvandenbossche · 2021-02-07T20:54:41Z

cc @jbrockmendel the tests added here are example cases of how you can end up with multiple blocks in this code path. Basically if you have multiple object dtype columns (a single block originally), but some column is entirely numeric, after the aggregation we infer the numeric dtype, and so you can end up with columns of different dtypes (and thus multiple blocks).
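A minimal sketch of the situation described above, assuming a small made-up frame (the column names and values are illustrative, not copied from the PR's tests). It uses the public infer_objects as a stand-in for the dtype inference that happens after aggregation, and it checks only the aggregated values, since the resulting dtypes may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Several object-dtype columns; "col2" contains nothing but NaN.
df = pd.DataFrame(
    {"key": ["A", "A", "B", "B"], "col1": list("abcd"), "col2": [np.nan] * 4}
).astype(object)

# Every column starts out as object dtype.
all_object = bool((df.dtypes == object).all())

# Inferring dtypes column by column gives the all-NaN column a float dtype,
# so the two value columns no longer share a single dtype.
inferred = df[["col1", "col2"]].infer_objects()

# The aggregation itself: the kind of call that hit the AssertionError
# in the regression discussed here.
result = df.groupby("key").min()
```

Because the split depends on the values in the column rather than on its declared dtype, the same code path can produce one block or several depending on the data.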

pandas/core/groupby/generic.py

jbrockmendel · 2021-02-08T00:31:11Z

pandas/tests/resample/test_resampler_grouper.py

+    dates = pd.date_range("2020-01-01", periods=15, freq="D")
+    df1 = DataFrame({"key": "A", "date": dates, "col1": range(15), "col_object": "val"})
+    df2 = DataFrame({"key": "B", "date": dates, "col1": range(15)})
+    df = pd.concat([df1, df2], ignore_index=True)


does the consolidation status of df matter here? does it matter that df is not constructed directly?

It doesn't matter in this case. The dataframe that gets resampled (after the first groupby) seems to be consolidated anyway (I assume that creating the sub-dataframe to resample might incur consolidation?). But I parametrized the test to cover both cases to be sure.

jbrockmendel · 2021-02-08T00:32:34Z

but some column is entirely numeric

let's make sure to add this to the running list of value-dependent behaviors

jorisvandenbossche · 2021-02-08T12:59:38Z

Since multiple people reviewed this, I think this can be merged?

(and to repeat: this has no impact whatsoever on code that already worked before)

simonjayhawkins

Thanks @jorisvandenbossche

jreback · 2021-02-08T13:42:35Z

thanks @jorisvandenbossche

jreback · 2021-02-08T13:42:45Z

@meeseeksdev backport 1.2.x

jreback · 2021-02-08T13:44:50Z

cc @simonjayhawkins @jorisvandenbossche for manual backport

jorisvandenbossche · 2021-02-08T13:47:40Z

but some column is entirely numeric

let's make sure to add this to the running list of value-dependent behaviors

@jbrockmendel I think this is more a case of "we liberally try to infer object dtype" (although of course those are strictly also a kind of value dependent behaviour, but for object dtype we have plenty of that)

simonjayhawkins · 2021-02-08T14:31:54Z

@jorisvandenbossche the conflict is

<<<<<<< HEAD
            assert len(result._mgr.blocks) == 1

            # unwrap DataFrame to get array
            result = result._mgr.blocks[0].values
            return result
=======
            mgr = result._mgr
            assert isinstance(mgr, BlockManager)

            # unwrap DataFrame to get array
            if len(mgr.blocks) != 1:
                # We've split an object block! Everything we've assumed
                # about a single block input returning a single block output
                # is a lie. See eg GH-39329
                return mgr.as_array()
            else:
                result = mgr.blocks[0].values
                return result
>>>>>>> e58a193408... REGR: fix case all-NaN/numeric object column in groupby  (#39655)

Are you happy to accept the incoming changes here? That will also include changes from #36010, specifically the assert isinstance(mgr, BlockManager).

jorisvandenbossche · 2021-02-08T14:43:03Z

@simonjayhawkins I don't directly see the difference with the diff of this PR, the above seems exactly what I changed, so accepting the incoming changes in that conflict should be fine

simonjayhawkins · 2021-02-08T14:45:54Z

I don't directly see the difference with the diff of this PR

It should be more visible on #39677. On 1.2.x we don't have the mgr = result._mgr line. I think it is worth having to avoid the duplication, but it also comes with an extra assert that was not on 1.2.x.

jorisvandenbossche added 3 commits February 7, 2021 20:29

Revert "REF: simplify _cython_agg_blocks (pandas-dev#35841)"

6e1514a

This reverts commit 0e199f3.

add test case from issue

62cc348

add simpler groupby-only test case

9d0ea88

jorisvandenbossche added Groupby Regression Functionality that used to work in a prior pandas version labels Feb 7, 2021

jorisvandenbossche changed the title ~~Regr groupby object~~ REGR: fix case all-NaN object column in groupby Feb 7, 2021

jorisvandenbossche added 3 commits February 7, 2021 21:05

add whatsnew

383bf50

add more generic numeric test case

a21c4ed

update whatsnew message

9ae380a

jorisvandenbossche added this to the 1.2.2 milestone Feb 7, 2021

jorisvandenbossche changed the title ~~REGR: fix case all-NaN object column in groupby~~ REGR: fix case all-NaN/numeric object column in groupby Feb 7, 2021

other whatsnew fix

7fb7108

jreback requested changes Feb 7, 2021

jreback modified the milestones: 1.2.2, 1.2.3 Feb 7, 2021

jorisvandenbossche modified the milestones: 1.2.3, 1.2.2 Feb 7, 2021

jreback modified the milestones: 1.2.2, 1.2.3 Feb 7, 2021

jorisvandenbossche modified the milestones: 1.2.3, 1.2.2 Feb 7, 2021

simonjayhawkins reviewed Feb 7, 2021

jbrockmendel reviewed Feb 8, 2021

jorisvandenbossche added 2 commits February 8, 2021 09:08

parametrize with consolidation

baae4a3

remove unnecessary assert

277c0fe

simonjayhawkins approved these changes Feb 8, 2021

jreback approved these changes Feb 8, 2021

jreback merged commit e58a193 into pandas-dev:master Feb 8, 2021

lumberbot-app bot added the Still Needs Manual Backport label Feb 8, 2021

jorisvandenbossche deleted the regr-groupby-object branch February 8, 2021 13:45

simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request Feb 8, 2021

Backport PR pandas-dev#39655: REGR: fix case all-NaN/numeric object c…

2a6dfc3

…olumn in groupby

simonjayhawkins mentioned this pull request Feb 8, 2021

Backport PR #39655: REGR: fix case all-NaN/numeric object column in groupby #39677

Merged

simonjayhawkins removed the Still Needs Manual Backport label Feb 8, 2021

jorisvandenbossche added a commit that referenced this pull request Feb 8, 2021

Backport PR #39655: REGR: fix case all-NaN/numeric object column in g…

1b8a4eb

…roupby (#39677) Co-authored-by: Joris Van den Bossche <[email protected]>
