-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
[BUG]: Groupy and Resample miscalculated aggregation #36198
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… if no RangeIndex was used
� Conflicts: � doc/source/whatsnew/v1.2.0.rst
@@ -327,9 +327,10 @@ def _set_grouper(self, obj: FrameOrSeries, sort: bool = False): | |||
if self.key is not None and self.level is not None: | |||
raise ValueError("The Grouper cannot specify both a key and a level!") | |||
|
|||
# Keep self.grouper value before overriding | |||
# Keep self.grouper and self.indexer value before overriding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to keep the indexer too, to have the new order of the Index.
pandas/core/groupby/grouper.py
Outdated
@@ -311,7 +311,7 @@ def _get_grouper(self, obj, validate: bool = True): | |||
) | |||
return self.binner, self.grouper, self.obj | |||
|
|||
def _set_grouper(self, obj: FrameOrSeries, sort: bool = False): | |||
def _set_grouper(self, obj: FrameOrSeries, sort: bool = False, group_indices: Dict = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to hand the group_indices over, otherwise we can not know with which index positions our series corresponds. The connection would be lost otherwise.
pandas/core/groupby/grouper.py
Outdated
if self._indexer is not None: | ||
ax = self._grouper.take(self._indexer.argsort()).take(indices) | ||
else: | ||
ax = self._grouper.take(indices) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the df was not resorted, indexer is None. We have to use indices, because a for example a string index would kill obj.index
@jreback would you mind looking over my changes? I passed the indices of the single groups through now. This information is lost otherwise. If there is a better way piping this through to this function please let me know. We could sort self._grouper immediately after storing it. Did not know, which would be preferred. The issue is matching a Series without the date column with the corresponding DatetimeIndex, which was resorted (probably). |
doc/source/whatsnew/v1.2.0.rst
Outdated
@@ -314,6 +314,7 @@ Groupby/resample/rolling | |||
- Bug in :meth:`DataFrameGroupby.tshift` failing to raise ``ValueError`` when a frequency cannot be inferred for the index of a group (:issue:`35937`) | |||
- Bug in :meth:`DataFrame.groupby` does not always maintain column index name for ``any``, ``all``, ``bfill``, ``ffill``, ``shift`` (:issue:`29764`) | |||
- Bug in :meth:`DataFrameGroupBy.apply` raising error with ``np.nan`` group(s) when ``dropna=False`` (:issue:`35889`) | |||
- Bug when combining methods :meth:`DataFrame.groupby` with :meth:`DataFrame.resample` and restricting to `Series` or using `agg` did miscalculate the aggregation (:issue:`27343`, :issue:`33548`, :issue:`35275`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you elabortate what 'miscalculate' means here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things can happen here right now:
- If the Index of the Input df has any index except an RangeIndex starting at 0, it crashes (DateIndex, Index of type object, doesn't matter)
- If the index is a RangeIndex, the obj.index keeps the previous index labels. These labels are used to extract Index components of self._grouper. But self._grouper is resorted most of the times in resample. So we select random index components which may or may not be correct. Most of the times this results in values getting assigned to different dates as they were in the input df (for example BUG: groupby resample different results with .agg() vs .mean() #33548)
doc/source/whatsnew/v1.2.0.rst
Outdated
@@ -314,6 +314,7 @@ Groupby/resample/rolling | |||
- Bug in :meth:`DataFrameGroupby.tshift` failing to raise ``ValueError`` when a frequency cannot be inferred for the index of a group (:issue:`35937`) | |||
- Bug in :meth:`DataFrame.groupby` does not always maintain column index name for ``any``, ``all``, ``bfill``, ``ffill``, ``shift`` (:issue:`29764`) | |||
- Bug in :meth:`DataFrameGroupBy.apply` raising error with ``np.nan`` group(s) when ``dropna=False`` (:issue:`35889`) | |||
- Bug when combining methods :meth:`DataFrame.groupby` with :meth:`DataFrame.resample` and restricting to `Series` or using `agg` did miscalculate the aggregation (:issue:`27343`, :issue:`33548`, :issue:`35275`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would write out
df.groupby(...).resample().agg(...)
instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote out the two methods but I would like to keep the rest that way, because
df.groupby(...).resample(...)['...'].mean()
produces the same error
@@ -338,7 +344,14 @@ def _set_grouper(self, obj: FrameOrSeries, sort: bool = False): | |||
if getattr(self.grouper, "name", None) == key and isinstance( | |||
obj, ABCSeries | |||
): | |||
ax = self._grouper.take(obj.index) | |||
if group_indices is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can simply access self._grouper.indices
no? why all of the convulated logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Self._grouper is and DatetimeIndex in this case. It does not have the attribute indices unfortunately. The information about the groups are all lost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not that happy about the convolution too. But I wanted to avoid making group_indices required, so we have to catch the None case. Also if the index self._grouper
was not resorted by the resample method, the indexer is None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something is not correct here, you need to really try and trace this; .indices a method on the grouper itself which always exists (there are different flavors, BinGroup and so on); its possible its not being passed thru correct. We need to fix at the correct level and avoid convolution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually that is probably my fault. I added group_indices as input only in the resample trace. That was the only part, which run in there. Wasn't that clever. I will add it everywhere this method gets called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback I think I got the issue now.
We have to different grouper parent classes. The class BaseGrouper from https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/ops.py which has the indices method and the different parent class Grouper from
https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/grouper.py
which has no method indices. The Groupings class from the same file has the method indices, but at the function above we have no Groupings object. Should I implement a indices method for the Grouper class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sorry for the confusion, but currently I can not see a way how to implement a indices method for a Grouper. The Grouper class does not hold the necessary information. When we want to get rid of the convolution, we will have to pass the indices object from the underlying BaseGrouper into the function. This seems pretty ugly, because we only need the indices in the Resample case, which is currently implemented and we would have to pass the indices object through a lot of layers to get it into this function at different places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe i don't understand what the actual problem is, but .indices
is available on all groupers, (see pandas/core/groupby/grouper.py and search for def indices
.)
groupby.resample is just a case of using the resampler grouper, which is generally a BinGrouper.
I would encourage you to step into this line-by-line and really follow the functions (this is not trivial but provides lots of insight).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the answer. Maybe I miss something, then I would like to apologize in advance. But the way I understand this is the following:
In this specific resample case, we get a TimeGrouper, which inherits from Grouper.
Grouper does not have a indices Method. (
pandas/pandas/core/groupby/grouper.py
Line 35 in e67220d
class Grouper: |
This file does have a indices method, but this method belongs to the Groupings class (
Class definition:
pandas/pandas/core/groupby/grouper.py
Line 388 in e67220d
class Grouping: |
indices definition:
pandas/pandas/core/groupby/grouper.py
Line 554 in e67220d
def indices(self): |
)
Groupings and Grouper have no relation with each other, so this does not help in this case.
We also have the BaseGrouper class, but this class also has no relation with Grouper or TimeGrouper.
I would really aprreciate it, if you could point out, where I am wrong. I stepped through the code, but the SeriesGroupBy class information with the indices method attached is lost here:
pandas/pandas/core/resample.py
Line 986 in 1f49b76
x = self._shallow_copy(x, groupby=self.groupby) |
It was stored under self._groupby, but this information is not handed over
@@ -338,7 +344,14 @@ def _set_grouper(self, obj: FrameOrSeries, sort: bool = False): | |||
if getattr(self.grouper, "name", None) == key and isinstance( | |||
obj, ABCSeries | |||
): | |||
ax = self._grouper.take(obj.index) | |||
if group_indices is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something is not correct here, you need to really try and trace this; .indices a method on the grouper itself which always exists (there are different flavors, BinGroup and so on); its possible its not being passed thru correct. We need to fix at the correct level and avoid convolution.
@@ -338,7 +344,14 @@ def _set_grouper(self, obj: FrameOrSeries, sort: bool = False): | |||
if getattr(self.grouper, "name", None) == key and isinstance( | |||
obj, ABCSeries | |||
): | |||
ax = self._grouper.take(obj.index) | |||
if group_indices is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe i don't understand what the actual problem is, but .indices
is available on all groupers, (see pandas/core/groupby/grouper.py and search for def indices
.)
groupby.resample is just a case of using the resampler grouper, which is generally a BinGrouper.
I would encourage you to step into this line-by-line and really follow the functions (this is not trivial but provides lots of insight).
the .grouper property holds a subclass of a Grouper (BinGrouper in this case) which has all of the info TimeGrouper is just a simple class that is used to create the others pls step thru and look at object at each stage |
self.grouper or self._grouper respectively is a DateTimeIndex not a grouper in this case. This does not contain any indices. I look through tomorrow anyway, but I already looked into self.grouper, I could not find the necessary information. |
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
closing in favor of #37905 |
on
keyword does not produce the same output as on DateTimeIndex #27343black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
I think I found a solution for this issue. We used
obj.index
to get the Index values. In case of resampleobj.index
had the original index. This index was not necessarily a RangeIndex of course -> resulted in a IndexError. In case of a RangeIndex the objectself._grouper
may have been resorted -> Wrong elements were selected. The objectself.obj
stored the original DataFrame, but already sorted in the first go around (is overwritten later). I stored the mapping from original index to a new synthetic RangeIndex (sorted after the grouping column). This allows us to replace the Index of obj in a way, that the take function works.