-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: DataFrame.groupby(., dropna=True, axis=0) incorrectly throws ShapeError #35751
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/groupby/generic.py
Outdated
@@ -556,8 +556,9 @@ def _transform_general( | |||
if common_dtype is result.dtype: | |||
result = maybe_downcast_numeric(result, self._selected_obj.dtype) | |||
|
|||
result.name = self._selected_obj.name | |||
result.index = self._selected_obj.index | |||
obj = self._selected_obj.dropna() if self.dropna else self._selected_obj |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to actually drop the data itself? I thought dropna should just include whether we include the NA group (which the factorize call handled)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you're right. We actually don't really need the whole obj
here, just the index with NA dropped
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it's fine to delete this* - so definitely not needed
- modulo a test with empty dataframe input in
pandas/tests/groupby/test_grouping.py
. But that only worked fortransform
and not other agg functions so maybe it's ok to delete the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there was a desire for agg/transform/apply
to return the same result for an empty frame: #26208 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok! Deleting was maybe heavy-handed. Actually transform
was the odd one out before this change, it's an empty index now vs an empty RangeIndex
before. Restored and updated the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's merge the dependent first and then can look
pandas/core/groupby/generic.py
Outdated
|
||
if self.dropna: | ||
result = concatenated.sort_index() | ||
if len(result.index) < len(self._selected_obj.index): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is here for an edge case. If we have a DatetimeIndex
and we're not dropping any NaN
s slicing the DatetimeIndex
throws out frequency information:
In [8]: idx = pd.date_range(start='1/1/2018', end='1/03/2018')
...: idx
Out[8]: DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq='D')
In [9]: idx[[0, 1, 2]]
Out[9]: DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'], dtype='datetime64[ns]', freq=None)
The if
statement avoids the slice in this case
thanks @rhshadrach! @jreback this is ready for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my comments above; I really don't like special cases all over the place. try to push down the handling so the group.index is correct based on dropna
pandas/core/groupby/generic.py
Outdated
if len(result.index) < len(self._selected_obj.index): | ||
result.index = self._selected_obj.index[result.index.asi8] | ||
else: | ||
result.index = self._selected_obj.index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could go in _set_result_index_ordered not here.
i am unclear why this is actually needed
pandas/core/groupby/generic.py
Outdated
res = self.obj._constructor( | ||
res, index=group.index, columns=group.columns | ||
) | ||
indexer = self._get_index(name) if self.dropna else group.index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think you can arrange to have dropna passed to where the groups are constructed, then you won't need to do this as group.index will be correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the groups are constructed in this BaseGrouper
method:
pandas/pandas/core/groupby/ops.py
Lines 116 to 128 in 361166f
def get_iterator(self, data: FrameOrSeries, axis: int = 0): | |
""" | |
Groupby iterator | |
Returns | |
------- | |
Generator yielding sequence of (name, subsetted object) | |
for each group | |
""" | |
splitter = self._get_splitter(data, axis=axis) | |
keys = self._get_group_keys() | |
for key, (i, group) in zip(keys, splitter): | |
yield key, group |
I think that
BaseGrouper
doesn't have a dropna
attribute - I checked inside this method with hasattr(self, 'dropna')
to be sure. Would the idea be to propagate dropna
to BaseGrouper
so that group.index
is correct?
Not sure this is the way to go but thought I'd ask
CI green, responded to comments above Thanks @rhshadrach for the docs fix!!! |
if you can merge master will look again |
Merged, will ping once CI is green |
Merged master, CI green. Addressed comments from previous reviews |
cc @jreback in case you have time to look |
ok this looks good, can you merge master and ping on green. |
Ping (merged master + CI green) |
thanks @arw2019 |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
This PR makes a few more changes to propagate
dropna
correctly for sliced groupby objects.It depends on #35444 for the changes to
pandas/core/groupby/generic.py
. I put them in by hand for now so that the tests pass but there should be no diff once #35444 is merged.