BUG: DataFrame.groupby(., dropna=True, axis=0) incorrectly throws ShapeError #35751

arw2019 · 2020-08-16T05:04:18Z

closes BUG: DataFrameGroupBy.__getitem__ fails to propagate dropna=True #35612
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This PR makes a few more changes to propagate dropna correctly for sliced groupby objects.

It depends on #35444 for the changes to pandas/core/groupby/generic.py. I put them in by hand for now so that the tests pass but there should be no diff once #35444 is merged.

pandas/core/groupby/groupby.py

mroeschke · 2020-08-16T06:41:35Z

pandas/core/groupby/generic.py

@@ -556,8 +556,9 @@ def _transform_general(
            if common_dtype is result.dtype:
                result = maybe_downcast_numeric(result, self._selected_obj.dtype)

-        result.name = self._selected_obj.name
-        result.index = self._selected_obj.index
+        obj = self._selected_obj.dropna() if self.dropna else self._selected_obj


Do we need to actually drop the data itself? I thought dropna should just control whether we include the NA group (which the factorize call handled)?

Yes, you're right. We don't really need the whole obj here, just the index with NA dropped.

Actually it's fine to delete this*, so it's definitely not needed.

*Modulo a test with empty DataFrame input in pandas/tests/groupby/test_grouping.py. But that only worked for transform and not the other agg functions, so maybe it's OK to delete the test?

Looks like there was a desire for agg/transform/apply to return the same result for an empty frame: #26208 (comment)

Ok! Deleting was maybe heavy-handed. Actually, transform was the odd one out before this change: it returns an empty index now, versus an empty RangeIndex before. Restored and updated the test.
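To illustrate the point raised earlier in this thread, here is a minimal sketch of how dropna only decides whether the NA key forms its own group, without removing rows from the underlying data. The frame and its column names (`key`, `val`) are made up for illustration and do not come from the PR.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: one row has a missing group key.
df = pd.DataFrame({"key": ["a", np.nan, "a", "b"], "val": [1, 2, 3, 4]})

# dropna=True (the default) leaves the NaN key out of the result.
without_na = df.groupby("key", dropna=True)["val"].sum()

# dropna=False keeps the NaN key as a group of its own.
with_na = df.groupby("key", dropna=False)["val"].sum()

print(without_na)
print(with_na)
```

In both cases `df` itself still has all four rows; only the set of groups differs.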

pandas/core/groupby/groupby.py

jreback

let's merge the dependent PR first, and then I can look

arw2019 · 2020-08-21T23:07:39Z

pandas/core/groupby/generic.py

+
+            if self.dropna:
+                result = concatenated.sort_index()
+                if len(result.index) < len(self._selected_obj.index):


This check is here for an edge case. If we have a DatetimeIndex and we're not dropping any NaNs, slicing the DatetimeIndex throws out the frequency information:

In [8]: idx = pd.date_range(start='1/1/2018', end='1/03/2018')
   ...: idx
Out[8]: DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq='D')

In [9]: idx[[0, 1, 2]]
Out[9]: DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq=None)

The if statement avoids the slice in this case
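The guard described above can be sketched roughly as follows. `restore_index` is a hypothetical helper written only for illustration; it is not the code in the PR, which works on `self._selected_obj.index` inside the groupby internals.

```python
import pandas as pd

idx = pd.date_range(start="2018-01-01", end="2018-01-03")

# Selecting with a list of integer positions does not carry the freq over.
taken = idx[[0, 1, 2]]


def restore_index(positions, original):
    """Hypothetical helper: subset `original` only when rows were dropped."""
    if len(positions) < len(original):
        return original[positions]
    # Nothing was dropped, so reuse the original index (and its freq) as-is.
    return original


kept_all = restore_index([0, 1, 2], idx)
dropped_one = restore_index([0, 2], idx)

print(idx.freq, taken.freq, kept_all.freq)
```

When no rows are dropped the original index object is returned untouched, so its frequency survives; the positional subset is taken only when it is actually needed.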

rhshadrach · 2020-08-31T22:37:18Z

@arw2019 #35444 has been merged.

arw2019 · 2020-09-02T03:59:03Z

> @arw2019 #35444 has been merged.

thanks @rhshadrach!

@jreback this is ready for review

jreback

see my comments above; I really don't like special cases all over the place. try to push down the handling so the group.index is correct based on dropna

jreback · 2020-09-02T22:36:59Z

pandas/core/groupby/generic.py

+                if len(result.index) < len(self._selected_obj.index):
+                    result.index = self._selected_obj.index[result.index.asi8]
+                else:
+                    result.index = self._selected_obj.index


this could go in _set_result_index_ordered not here.
i am unclear why this is actually needed

jreback · 2020-09-02T22:37:51Z

pandas/core/groupby/generic.py

-                res = self.obj._constructor(
-                    res, index=group.index, columns=group.columns
-                )
+                indexer = self._get_index(name) if self.dropna else group.index


i think you can arrange to have dropna passed to where the groups are constructed, then you won't need to do this as group.index will be correct.

So the groups are constructed in this BaseGrouper method:

pandas/pandas/core/groupby/ops.py

Lines 116 to 128 in 361166f

def get_iterator(self, data: FrameOrSeries, axis: int = 0):
    """
    Groupby iterator

    Returns
    -------
    Generator yielding sequence of (name, subsetted object)
    for each group
    """
    splitter = self._get_splitter(data, axis=axis)
    keys = self._get_group_keys()
    for key, (i, group) in zip(keys, splitter):
        yield key, group

I think that BaseGrouper doesn't have a dropna attribute - I checked inside this method with hasattr(self, 'dropna') to be sure. Would the idea be to propagate dropna to BaseGrouper so that group.index is correct?

Not sure this is the way to go but thought I'd ask
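As a rough illustration of what "group.index is correct based on dropna" means from the user's side, the sketch below iterates over a groupby through the public API and records each group's index. The data and the `group_rows` name are made up for the example; it does not touch BaseGrouper directly.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 has a missing group key.
df = pd.DataFrame({"key": ["a", np.nan, "b", "a"], "val": [1, 2, 3, 4]})

# Collect the row labels that each group sees when NA keys are dropped.
group_rows = {
    name: list(group.index)
    for name, group in df.groupby("key", dropna=True)
}

print(group_rows)
```

Row 1, whose key is NaN, belongs to no group here, so any result index built from the per-group indexes has fewer rows than the original frame.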

arw2019 · 2020-11-10T17:36:55Z

CI green, responded to comments above

Thanks @rhshadrach for the docs fix!!!

jreback · 2020-12-04T23:51:13Z

if you can merge master will look again

arw2019 · 2020-12-05T04:42:52Z

> if you can merge master will look again

Merged, will ping once CI is green

arw2019 · 2020-12-05T21:05:32Z

Merged master, CI green. Addressed comments from previous reviews

arw2019 · 2020-12-07T23:03:04Z

cc @jreback in case you have time to look

jreback · 2020-12-18T23:13:57Z

ok this looks good, can you merge master and ping on green.

arw2019 · 2020-12-19T02:21:38Z

Ping (merged master + CI green)

jreback · 2020-12-19T02:34:50Z

thanks @arw2019

arw2019 added 2 commits August 16, 2020 04:48

handle dropna=False in _selected_obj, _set_result_index_ordered

0484244

add whatsnew

a335744

mroeschke reviewed Aug 16, 2020

pandas/core/groupby/groupby.py (outdated)

arw2019 added 2 commits August 16, 2020 06:37

rewrote _set_result_index_ordered

640ec38

added dropna=False to tests reliant on that

099e30c

mroeschke reviewed Aug 16, 2020

pandas/core/groupby/groupby.py (outdated)

mroeschke reviewed Aug 16, 2020

delete second reindexing in _transform_general

8a13d06

mroeschke reviewed Aug 16, 2020

pandas/core/groupby/groupby.py (outdated)

arw2019 added 2 commits August 16, 2020 07:36

restore + change test_evaluate_with_empty_groups

394feb6

remove calls to obj.dropna()

e1cafd4

arw2019 force-pushed the GH35612 branch from 5074510 to e1cafd4 Compare August 17, 2020 07:10

arw2019 added 2 commits August 17, 2020 07:29

separate tests for dataframe/series slices

0df329c

fix series indexing

77e7fc7

arw2019 force-pushed the GH35612 branch from bdda2c2 to 77e7fc7 Compare August 21, 2020 06:42

fix DataFrameGroupBy._transform_general

16544ea

arw2019 force-pushed the GH35612 branch from d91ad93 to 16544ea Compare August 21, 2020 08:06

move DataFrameGroupBy._transform_general logic to _set_result_index_ordered

46e5f66

arw2019 force-pushed the GH35612 branch from 2baab0c to 46e5f66 Compare August 21, 2020 19:53

jreback requested changes Aug 21, 2020

jreback added the Groupby and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) labels Aug 21, 2020

handle edge case (datetime index, no NaNs)

6fca785

arw2019 commented Aug 21, 2020

merge with upstream/master

6341d97

jreback requested changes Sep 2, 2020

arw2019 added 2 commits September 4, 2020 16:03

move logic to BaseGrouper and _set_index_ordered

269516b

Merge remote-tracking branch 'upstream/master' into GH35612

6819ac6

REF: restore casing on rows_dropped

a789b6a

Merge branch 'master' of https://github.com/pandas-dev/pandas into GH35612

2ba8b44

arw2019 added the Needs Review label Nov 11, 2020

arw2019 added 2 commits November 23, 2020 13:29

Merge branch 'master' of https://github.com/pandas-dev/pandas into GH35612

95b86ba

merge with master

7881134

Merge branch 'master' of https://github.com/pandas-dev/pandas into GH35612

15aa56e

arw2019 closed this Dec 5, 2020

arw2019 reopened this Dec 5, 2020

Merge branch 'master' of https://github.com/pandas-dev/pandas into GH35612

08f0abd

arw2019 added 7 commits December 9, 2020 20:06

Merge branch 'master' of https://github.com/pandas-dev/pandas into GH35612

8ab9baa

move whatsnew to 1.3

faf6570

Merge branch 'master' of https://github.com/pandas-dev/pandas into GH35612

7bd2a9a

minimize diff

24bb112

Merge branch 'master' of https://github.com/pandas-dev/pandas into GH35612

4377b63

Merge branch 'master' of https://github.com/pandas-dev/pandas into GH35612

de86144

Merge branch 'master' of https://github.com/pandas-dev/pandas into GH35612

9bc9ce4

jreback added this to the 1.3 milestone Dec 18, 2020

Merge branch 'master' of https://github.com/pandas-dev/pandas into GH35612

1ea9d29

jreback approved these changes Dec 19, 2020

jreback merged commit c39773e into pandas-dev:master Dec 19, 2020

rhshadrach mentioned this pull request Jan 10, 2021

BUG: Groupby transform with missing groups #8955

Closed

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021

BUG: DataFrame.groupby(., dropna=True, axis=0) incorrectly throws ShapeError (pandas-dev#35751)

0c1d3c8
