[BUG]: Groupy and Resample miscalculated aggregation #36198

phofl · 2020-09-07T19:30:14Z

closes DataFrameGroupby.resample with the on keyword does not produce the same output as on DateTimeIndex #27343
closes BUG: groupby resample different results with .agg() vs .mean() #33548
closes BUG: agg on groups with different sizes fails with out of bounds IndexError #35275
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

I think I found a solution for this issue. We used obj.index to get the Index values. In case of resample obj.index had the original index. This index was not necessarily a RangeIndex of course -> resulted in a IndexError. In case of a RangeIndex the object self._grouper may have been resorted -> Wrong elements were selected. The object self.obj stored the original DataFrame, but already sorted in the first go around (is overwritten later). I stored the mapping from original index to a new synthetic RangeIndex (sorted after the grouping column). This allows us to replace the Index of obj in a way, that the take function works.

… if no RangeIndex was used

� Conflicts: � doc/source/whatsnew/v1.2.0.rst

pep8speaks · 2020-09-07T19:30:18Z

Hello @phofl! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-09-09 07:42:50 UTC

pandas/core/groupby/grouper.py

phofl · 2020-09-08T16:39:15Z

pandas/core/groupby/grouper.py

@@ -327,9 +327,10 @@ def _set_grouper(self, obj: FrameOrSeries, sort: bool = False):
        if self.key is not None and self.level is not None:
            raise ValueError("The Grouper cannot specify both a key and a level!")

-        # Keep self.grouper value before overriding
+        # Keep self.grouper and self.indexer value before overriding


We have to keep the indexer too, to have the new order of the Index.

phofl · 2020-09-08T16:39:55Z

pandas/core/groupby/grouper.py

@@ -311,7 +311,7 @@ def _get_grouper(self, obj, validate: bool = True):
        )
        return self.binner, self.grouper, self.obj

-    def _set_grouper(self, obj: FrameOrSeries, sort: bool = False):
+    def _set_grouper(self, obj: FrameOrSeries, sort: bool = False, group_indices: Dict = None):


We have to hand the group_indices over, otherwise we can not know with which index positions our series corresponds. The connection would be lost otherwise.

phofl · 2020-09-08T16:40:40Z

pandas/core/groupby/grouper.py

+                if self._indexer is not None:
+                    ax = self._grouper.take(self._indexer.argsort()).take(indices)
+                else:
+                    ax = self._grouper.take(indices)


If the df was not resorted, indexer is None. We have to use indices, because a for example a string index would kill obj.index

phofl · 2020-09-08T16:47:50Z

@jreback would you mind looking over my changes?

I passed the indices of the single groups through now. This information is lost otherwise. If there is a better way piping this through to this function please let me know.

We could sort self._grouper immediately after storing it. Did not know, which would be preferred.

The issue is matching a Series without the date column with the corresponding DatetimeIndex, which was resorted (probably).

jreback · 2020-09-08T22:25:33Z

doc/source/whatsnew/v1.2.0.rst

@@ -314,6 +314,7 @@ Groupby/resample/rolling
 - Bug in :meth:`DataFrameGroupby.tshift` failing to raise ``ValueError`` when a frequency cannot be inferred for the index of a group (:issue:`35937`)
 - Bug in :meth:`DataFrame.groupby` does not always maintain column index name for ``any``, ``all``, ``bfill``, ``ffill``, ``shift`` (:issue:`29764`)
 - Bug in :meth:`DataFrameGroupBy.apply` raising error with ``np.nan`` group(s) when ``dropna=False`` (:issue:`35889`)
+- Bug when combining methods :meth:`DataFrame.groupby` with :meth:`DataFrame.resample` and restricting to `Series` or using `agg` did miscalculate the aggregation (:issue:`27343`, :issue:`33548`, :issue:`35275`).


can you elabortate what 'miscalculate' means here

Two things can happen here right now:

If the Index of the Input df has any index except an RangeIndex starting at 0, it crashes (DateIndex, Index of type object, doesn't matter)

If the index is a RangeIndex, the obj.index keeps the previous index labels. These labels are used to extract Index components of self._grouper. But self._grouper is resorted most of the times in resample. So we select random index components which may or may not be correct. Most of the times this results in values getting assigned to different dates as they were in the input df (for example BUG: groupby resample different results with .agg() vs .mean() #33548)

jreback · 2020-09-08T22:26:10Z

doc/source/whatsnew/v1.2.0.rst

@@ -314,6 +314,7 @@ Groupby/resample/rolling
 - Bug in :meth:`DataFrameGroupby.tshift` failing to raise ``ValueError`` when a frequency cannot be inferred for the index of a group (:issue:`35937`)
 - Bug in :meth:`DataFrame.groupby` does not always maintain column index name for ``any``, ``all``, ``bfill``, ``ffill``, ``shift`` (:issue:`29764`)
 - Bug in :meth:`DataFrameGroupBy.apply` raising error with ``np.nan`` group(s) when ``dropna=False`` (:issue:`35889`)
+- Bug when combining methods :meth:`DataFrame.groupby` with :meth:`DataFrame.resample` and restricting to `Series` or using `agg` did miscalculate the aggregation (:issue:`27343`, :issue:`33548`, :issue:`35275`).


I would write out
df.groupby(...).resample().agg(...) instead

I wrote out the two methods but I would like to keep the rest that way, because

df.groupby(...).resample(...)['...'].mean()

produces the same error

jreback · 2020-09-08T22:27:41Z

pandas/core/groupby/grouper.py

@@ -338,7 +344,14 @@ def _set_grouper(self, obj: FrameOrSeries, sort: bool = False):
            if getattr(self.grouper, "name", None) == key and isinstance(
                obj, ABCSeries
            ):
-                ax = self._grouper.take(obj.index)
+                if group_indices is None:


you can simply access self._grouper.indices no? why all of the convulated logic.

Self._grouper is and DatetimeIndex in this case. It does not have the attribute indices unfortunately. The information about the groups are all lost.

I am not that happy about the convolution too. But I wanted to avoid making group_indices required, so we have to catch the None case. Also if the index self._grouper was not resorted by the resample method, the indexer is None.

something is not correct here, you need to really try and trace this; .indices a method on the grouper itself which always exists (there are different flavors, BinGroup and so on); its possible its not being passed thru correct. We need to fix at the correct level and avoid convolution.

Actually that is probably my fault. I added group_indices as input only in the resample trace. That was the only part, which run in there. Wasn't that clever. I will add it everywhere this method gets called.

@jreback I think I got the issue now.

We have to different grouper parent classes. The class BaseGrouper from https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/ops.py which has the indices method and the different parent class Grouper from

https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/grouper.py

which has no method indices. The Groupings class from the same file has the method indices, but at the function above we have no Groupings object. Should I implement a indices method for the Grouper class?

I am sorry for the confusion, but currently I can not see a way how to implement a indices method for a Grouper. The Grouper class does not hold the necessary information. When we want to get rid of the convolution, we will have to pass the indices object from the underlying BaseGrouper into the function. This seems pretty ugly, because we only need the indices in the Resample case, which is currently implemented and we would have to pass the indices object through a lot of layers to get it into this function at different places.

maybe i don't understand what the actual problem is, but .indices is available on all groupers, (see pandas/core/groupby/grouper.py and search for def indices.)

groupby.resample is just a case of using the resampler grouper, which is generally a BinGrouper.

I would encourage you to step into this line-by-line and really follow the functions (this is not trivial but provides lots of insight).

Thanks for the answer. Maybe I miss something, then I would like to apologize in advance. But the way I understand this is the following:

In this specific resample case, we get a TimeGrouper, which inherits from Grouper.
Grouper does not have a indices Method. (

pandas/pandas/core/groupby/grouper.py

Line 35 in e67220d

class Grouper:

)

This file does have a indices method, but this method belongs to the Groupings class (
Class definition:

pandas/pandas/core/groupby/grouper.py

Line 388 in e67220d

class Grouping:

indices definition:

pandas/pandas/core/groupby/grouper.py

Line 554 in e67220d

def indices(self):

)

Groupings and Grouper have no relation with each other, so this does not help in this case.

We also have the BaseGrouper class, but this class also has no relation with Grouper or TimeGrouper.

I would really aprreciate it, if you could point out, where I am wrong. I stepped through the code, but the SeriesGroupBy class information with the indices method attached is lost here:

pandas/pandas/core/resample.py

Line 986 in 1f49b76

x = self._shallow_copy(x, groupby=self.groupby)

It was stored under self._groupby, but this information is not handed over

jreback · 2020-09-11T17:36:49Z

pandas/core/groupby/grouper.py

@@ -338,7 +344,14 @@ def _set_grouper(self, obj: FrameOrSeries, sort: bool = False):
            if getattr(self.grouper, "name", None) == key and isinstance(
                obj, ABCSeries
            ):
-                ax = self._grouper.take(obj.index)
+                if group_indices is None:


something is not correct here, you need to really try and trace this; .indices a method on the grouper itself which always exists (there are different flavors, BinGroup and so on); its possible its not being passed thru correct. We need to fix at the correct level and avoid convolution.

jreback · 2020-09-13T23:01:54Z

pandas/core/groupby/grouper.py

@@ -338,7 +344,14 @@ def _set_grouper(self, obj: FrameOrSeries, sort: bool = False):
            if getattr(self.grouper, "name", None) == key and isinstance(
                obj, ABCSeries
            ):
-                ax = self._grouper.take(obj.index)
+                if group_indices is None:


maybe i don't understand what the actual problem is, but .indices is available on all groupers, (see pandas/core/groupby/grouper.py and search for def indices.)

groupby.resample is just a case of using the resampler grouper, which is generally a BinGrouper.

I would encourage you to step into this line-by-line and really follow the functions (this is not trivial but provides lots of insight).

jreback · 2020-09-14T01:15:44Z

the .grouper property holds a subclass of a Grouper (BinGrouper in this case) which has all of the info

TimeGrouper is just a simple class that is used to create the others

pls step thru and look at object at each stage

phofl · 2020-09-14T01:24:12Z

self.grouper or self._grouper respectively is a DateTimeIndex not a grouper in this case. This does not contain any indices.

I look through tomorrow anyway, but I already looked into self.grouper, I could not find the necessary information.

github-actions · 2020-10-15T00:15:52Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

jreback · 2020-11-26T19:03:29Z

closing in favor of #37905

phofl added 2 commits September 7, 2020 21:23

Bug in groupby and resample miscalculatet aggregation or raised error…

e4d6735

… if no RangeIndex was used

Merge branch 'master' of https://github.com/pandas-dev/pandas

03d3050

� Conflicts: � doc/source/whatsnew/v1.2.0.rst

phofl added 2 commits September 7, 2020 21:32

Fix Pep8 issues

bc127f2

Add gh numbers to tests

fcc11e7

jreback requested changes Sep 7, 2020

View reviewed changes

pandas/core/groupby/grouper.py Outdated Show resolved Hide resolved

jreback added Groupby Resample resample method labels Sep 7, 2020

phofl added 2 commits September 8, 2020 16:18

Use take on indexer based on group indices

d76f1c5

Refactor code and parametrize test

d4d02d8

phofl commented Sep 8, 2020

View reviewed changes

Run black pas

087c682

phofl added 2 commits September 8, 2020 19:11

Fix wrong Type hint

1e43de6

Catch group_indices None case

f08d56e

jreback requested changes Sep 8, 2020

View reviewed changes

Change whats new entry

19569c4

jreback requested changes Sep 11, 2020

View reviewed changes

jreback requested changes Sep 13, 2020

View reviewed changes

github-actions bot added the Stale label Oct 15, 2020

jreback mentioned this pull request Nov 19, 2020

API/BUG: DatetimeIndex.argsort does not match DatetimeArray.argsort #37863

Closed

jreback closed this Nov 26, 2020

phofl deleted the 35275 branch April 27, 2023 19:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG]: Groupy and Resample miscalculated aggregation #36198

[BUG]: Groupy and Resample miscalculated aggregation #36198

phofl commented Sep 7, 2020

pep8speaks commented Sep 7, 2020 •

edited

Loading

phofl Sep 8, 2020

phofl Sep 8, 2020

phofl Sep 8, 2020

phofl commented Sep 8, 2020

jreback Sep 8, 2020

phofl Sep 9, 2020 •

edited

Loading

jreback Sep 8, 2020

phofl Sep 9, 2020 •

edited

Loading

jreback Sep 8, 2020

phofl Sep 9, 2020

phofl Sep 9, 2020

jreback Sep 11, 2020

phofl Sep 11, 2020

phofl Sep 11, 2020

phofl Sep 11, 2020

jreback Sep 13, 2020

phofl Sep 13, 2020

jreback Sep 11, 2020

jreback Sep 13, 2020

jreback commented Sep 14, 2020

phofl commented Sep 14, 2020

github-actions bot commented Oct 15, 2020

jreback commented Nov 26, 2020

[BUG]: Groupy and Resample miscalculated aggregation #36198

[BUG]: Groupy and Resample miscalculated aggregation #36198

Conversation

phofl commented Sep 7, 2020

pep8speaks commented Sep 7, 2020 • edited Loading

Comment last updated at 2020-09-09 07:42:50 UTC

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

phofl commented Sep 8, 2020

Choose a reason for hiding this comment

phofl Sep 9, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

phofl Sep 9, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Sep 14, 2020

phofl commented Sep 14, 2020

github-actions bot commented Oct 15, 2020

jreback commented Nov 26, 2020

pep8speaks commented Sep 7, 2020 •

edited

Loading

phofl Sep 9, 2020 •

edited

Loading

phofl Sep 9, 2020 •

edited

Loading