Skip to content

BUG: DataFrame.groupby(., dropna=True, axis=0) incorrectly throws ShapeError #35751

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 88 commits into from
Dec 19, 2020

Conversation

arw2019
Copy link
Member

@arw2019 arw2019 commented Aug 16, 2020

This PR makes a few more changes to propagate dropna correctly for sliced groupby objects.

It depends on #35444 for the changes to pandas/core/groupby/generic.py. I put them in by hand for now so that the tests pass but there should be no diff once #35444 is merged.

@@ -556,8 +556,9 @@ def _transform_general(
if common_dtype is result.dtype:
result = maybe_downcast_numeric(result, self._selected_obj.dtype)

result.name = self._selected_obj.name
result.index = self._selected_obj.index
obj = self._selected_obj.dropna() if self.dropna else self._selected_obj
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to actually drop the data itself? I thought dropna should just include whether we include the NA group (which the factorize call handled)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you're right. We actually don't really need the whole obj here, just the index with NA dropped

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it's fine to delete this* - so definitely not needed

  • modulo a test with empty dataframe input in pandas/tests/groupby/test_grouping.py. But that only worked for transform and not other agg functions so maybe it's ok to delete the test?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there was a desire for agg/transform/apply to return the same result for an empty frame: #26208 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! Deleting was maybe heavy-handed. Actually transform was the odd one out before this change, it's an empty index now vs an empty RangeIndex before. Restored and updated the test

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's merge the dependent first and then can look

@jreback jreback added Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Aug 21, 2020

if self.dropna:
result = concatenated.sort_index()
if len(result.index) < len(self._selected_obj.index):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is here for an edge case. If we have a DatetimeIndex and we're not dropping any NaNs slicing the DatetimeIndex throws out frequency information:

In [8]: idx = pd.date_range(start='1/1/2018', end='1/03/2018') 
   ...: idx                                                                                                                                                                   
Out[8]: DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq='D')

In [9]: idx[[0, 1, 2]]                                                                                                                                                        
Out[9]: DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq=None)

The if statement avoids the slice in this case

@rhshadrach
Copy link
Member

@arw2019 #35444 has been merged.

@arw2019
Copy link
Member Author

arw2019 commented Sep 2, 2020

@arw2019 #35444 has been merged.

thanks @rhshadrach!

@jreback this is ready for review

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comments above; I really don't like special cases all over the place. try to push down the handling so the group.index is correct based on dropna

if len(result.index) < len(self._selected_obj.index):
result.index = self._selected_obj.index[result.index.asi8]
else:
result.index = self._selected_obj.index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could go in _set_result_index_ordered not here.
i am unclear why this is actually needed

res = self.obj._constructor(
res, index=group.index, columns=group.columns
)
indexer = self._get_index(name) if self.dropna else group.index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you can arrange to have dropna passed to where the groups are constructed, then you won't need to do this as group.index will be correct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the groups are constructed in this BaseGrouper method:

def get_iterator(self, data: FrameOrSeries, axis: int = 0):
"""
Groupby iterator
Returns
-------
Generator yielding sequence of (name, subsetted object)
for each group
"""
splitter = self._get_splitter(data, axis=axis)
keys = self._get_group_keys()
for key, (i, group) in zip(keys, splitter):
yield key, group

I think that BaseGrouper doesn't have a dropna attribute - I checked inside this method with hasattr(self, 'dropna') to be sure. Would the idea be to propagate dropna to BaseGrouper so that group.index is correct?

Not sure this is the way to go but thought I'd ask

@arw2019
Copy link
Member Author

arw2019 commented Nov 10, 2020

CI green, responded to comments above

Thanks @rhshadrach for the docs fix!!!

@jreback
Copy link
Contributor

jreback commented Dec 4, 2020

if you can merge master will look again

@arw2019
Copy link
Member Author

arw2019 commented Dec 5, 2020

if you can merge master will look again

Merged, will ping once CI is green

@arw2019 arw2019 closed this Dec 5, 2020
@arw2019 arw2019 reopened this Dec 5, 2020
@arw2019
Copy link
Member Author

arw2019 commented Dec 5, 2020

Merged master, CI green. Addressed comments from previous reviews

@arw2019
Copy link
Member Author

arw2019 commented Dec 7, 2020

cc @jreback in case you have time to look

@jreback jreback added this to the 1.3 milestone Dec 18, 2020
@jreback
Copy link
Contributor

jreback commented Dec 18, 2020

ok this looks good, can you merge master and ping on green.

@arw2019
Copy link
Member Author

arw2019 commented Dec 19, 2020

Ping (merged master + CI green)

@jreback jreback merged commit c39773e into pandas-dev:master Dec 19, 2020
@jreback
Copy link
Contributor

jreback commented Dec 19, 2020

thanks @arw2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: DataFrameGroupBy.__getitem__ fails to propagate dropna=True
4 participants