-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
[BUG]: Groupy and Resample miscalculated aggregation #36198
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 9 commits
e4d6735
03d3050
bc127f2
fcc11e7
d76f1c5
d4d02d8
087c682
1e43de6
f08d56e
19569c4
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -314,6 +314,7 @@ Groupby/resample/rolling | |
- Bug in :meth:`DataFrameGroupby.tshift` failing to raise ``ValueError`` when a frequency cannot be inferred for the index of a group (:issue:`35937`) | ||
- Bug in :meth:`DataFrame.groupby` does not always maintain column index name for ``any``, ``all``, ``bfill``, ``ffill``, ``shift`` (:issue:`29764`) | ||
- Bug in :meth:`DataFrameGroupBy.apply` raising error with ``np.nan`` group(s) when ``dropna=False`` (:issue:`35889`) | ||
- Bug when combining methods :meth:`DataFrame.groupby` with :meth:`DataFrame.resample` and restricting to `Series` or using `agg` did miscalculate the aggregation (:issue:`27343`, :issue:`33548`, :issue:`35275`). | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would write out There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wrote out the two methods but I would like to keep the rest that way, because
produces the same error |
||
- | ||
|
||
Reshaping | ||
|
Original file line number | Diff line number | Diff line change | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
|
@@ -311,7 +311,12 @@ def _get_grouper(self, obj, validate: bool = True): | |||||||||
) | ||||||||||
return self.binner, self.grouper, self.obj | ||||||||||
|
||||||||||
def _set_grouper(self, obj: FrameOrSeries, sort: bool = False): | ||||||||||
def _set_grouper( | ||||||||||
self, | ||||||||||
obj: FrameOrSeries, | ||||||||||
sort: bool = False, | ||||||||||
group_indices: Optional[Dict] = None, | ||||||||||
): | ||||||||||
""" | ||||||||||
given an object and the specifications, setup the internal grouper | ||||||||||
for this particular specification | ||||||||||
|
@@ -327,9 +332,10 @@ def _set_grouper(self, obj: FrameOrSeries, sort: bool = False): | |||||||||
if self.key is not None and self.level is not None: | ||||||||||
raise ValueError("The Grouper cannot specify both a key and a level!") | ||||||||||
|
||||||||||
# Keep self.grouper value before overriding | ||||||||||
# Keep self.grouper and self.indexer value before overriding | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We have to keep the indexer too, to have the new order of the Index. |
||||||||||
if self._grouper is None: | ||||||||||
self._grouper = self.grouper | ||||||||||
self._indexer = self.indexer | ||||||||||
|
||||||||||
# the key must be a valid info item | ||||||||||
if self.key is not None: | ||||||||||
|
@@ -338,7 +344,14 @@ def _set_grouper(self, obj: FrameOrSeries, sort: bool = False): | |||||||||
if getattr(self.grouper, "name", None) == key and isinstance( | ||||||||||
obj, ABCSeries | ||||||||||
): | ||||||||||
ax = self._grouper.take(obj.index) | ||||||||||
if group_indices is None: | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. you can simply access There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Self._grouper is and DatetimeIndex in this case. It does not have the attribute indices unfortunately. The information about the groups are all lost. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am not that happy about the convolution too. But I wanted to avoid making group_indices required, so we have to catch the None case. Also if the index There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. something is not correct here, you need to really try and trace this; .indices a method on the grouper itself which always exists (there are different flavors, BinGroup and so on); its possible its not being passed thru correct. We need to fix at the correct level and avoid convolution. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Actually that is probably my fault. I added group_indices as input only in the resample trace. That was the only part, which run in there. Wasn't that clever. I will add it everywhere this method gets called. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jreback I think I got the issue now. We have to different grouper parent classes. The class BaseGrouper from https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/ops.py which has the indices method and the different parent class Grouper from https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/grouper.py which has no method indices. The Groupings class from the same file has the method indices, but at the function above we have no Groupings object. Should I implement a indices method for the Grouper class? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am sorry for the confusion, but currently I can not see a way how to implement a indices method for a Grouper. The Grouper class does not hold the necessary information. When we want to get rid of the convolution, we will have to pass the indices object from the underlying BaseGrouper into the function. This seems pretty ugly, because we only need the indices in the Resample case, which is currently implemented and we would have to pass the indices object through a lot of layers to get it into this function at different places. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. maybe i don't understand what the actual problem is, but groupby.resample is just a case of using the resampler grouper, which is generally a BinGrouper. I would encourage you to step into this line-by-line and really follow the functions (this is not trivial but provides lots of insight). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the answer. Maybe I miss something, then I would like to apologize in advance. But the way I understand this is the following: In this specific resample case, we get a TimeGrouper, which inherits from Grouper. pandas/pandas/core/groupby/grouper.py Line 35 in e67220d
This file does have a indices method, but this method belongs to the Groupings class ( pandas/pandas/core/groupby/grouper.py Line 388 in e67220d
indices definition: pandas/pandas/core/groupby/grouper.py Line 554 in e67220d
) Groupings and Grouper have no relation with each other, so this does not help in this case. We also have the BaseGrouper class, but this class also has no relation with Grouper or TimeGrouper. I would really aprreciate it, if you could point out, where I am wrong. I stepped through the code, but the SeriesGroupBy class information with the indices method attached is lost here: pandas/pandas/core/resample.py Line 986 in 1f49b76
It was stored under self._groupby, but this information is not handed over |
||||||||||
ax = self._grouper.take(obj.index) | ||||||||||
else: | ||||||||||
indices = group_indices.get(obj.name) | ||||||||||
if self._indexer is not None: | ||||||||||
ax = self._grouper.take(self._indexer.argsort()).take(indices) | ||||||||||
else: | ||||||||||
ax = self._grouper.take(indices) | ||||||||||
else: | ||||||||||
if key not in obj._info_axis: | ||||||||||
raise KeyError(f"The grouper name {key} is not found") | ||||||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you elabortate what 'miscalculate' means here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things can happen here right now: