Skip to content

[BUG]: Groupy and Resample miscalculated aggregation #36198

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from

Conversation

phofl
Copy link
Member

@phofl phofl commented Sep 7, 2020

I think I found a solution for this issue. We used obj.index to get the Index values. In case of resample obj.index had the original index. This index was not necessarily a RangeIndex of course -> resulted in a IndexError. In case of a RangeIndex the object self._grouper may have been resorted -> Wrong elements were selected. The object self.obj stored the original DataFrame, but already sorted in the first go around (is overwritten later). I stored the mapping from original index to a new synthetic RangeIndex (sorted after the grouping column). This allows us to replace the Index of obj in a way, that the take function works.

@pep8speaks
Copy link

pep8speaks commented Sep 7, 2020

Hello @phofl! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-09-09 07:42:50 UTC

@jreback jreback added Groupby Resample resample method labels Sep 7, 2020
@@ -327,9 +327,10 @@ def _set_grouper(self, obj: FrameOrSeries, sort: bool = False):
if self.key is not None and self.level is not None:
raise ValueError("The Grouper cannot specify both a key and a level!")

# Keep self.grouper value before overriding
# Keep self.grouper and self.indexer value before overriding
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to keep the indexer too, to have the new order of the Index.

@@ -311,7 +311,7 @@ def _get_grouper(self, obj, validate: bool = True):
)
return self.binner, self.grouper, self.obj

def _set_grouper(self, obj: FrameOrSeries, sort: bool = False):
def _set_grouper(self, obj: FrameOrSeries, sort: bool = False, group_indices: Dict = None):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to hand the group_indices over, otherwise we can not know with which index positions our series corresponds. The connection would be lost otherwise.

if self._indexer is not None:
ax = self._grouper.take(self._indexer.argsort()).take(indices)
else:
ax = self._grouper.take(indices)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the df was not resorted, indexer is None. We have to use indices, because a for example a string index would kill obj.index

@phofl
Copy link
Member Author

phofl commented Sep 8, 2020

@jreback would you mind looking over my changes?

I passed the indices of the single groups through now. This information is lost otherwise. If there is a better way piping this through to this function please let me know.

We could sort self._grouper immediately after storing it. Did not know, which would be preferred.

The issue is matching a Series without the date column with the corresponding DatetimeIndex, which was resorted (probably).

@@ -314,6 +314,7 @@ Groupby/resample/rolling
- Bug in :meth:`DataFrameGroupby.tshift` failing to raise ``ValueError`` when a frequency cannot be inferred for the index of a group (:issue:`35937`)
- Bug in :meth:`DataFrame.groupby` does not always maintain column index name for ``any``, ``all``, ``bfill``, ``ffill``, ``shift`` (:issue:`29764`)
- Bug in :meth:`DataFrameGroupBy.apply` raising error with ``np.nan`` group(s) when ``dropna=False`` (:issue:`35889`)
- Bug when combining methods :meth:`DataFrame.groupby` with :meth:`DataFrame.resample` and restricting to `Series` or using `agg` did miscalculate the aggregation (:issue:`27343`, :issue:`33548`, :issue:`35275`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you elabortate what 'miscalculate' means here

Copy link
Member Author

@phofl phofl Sep 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things can happen here right now:

  • If the Index of the Input df has any index except an RangeIndex starting at 0, it crashes (DateIndex, Index of type object, doesn't matter)
  • If the index is a RangeIndex, the obj.index keeps the previous index labels. These labels are used to extract Index components of self._grouper. But self._grouper is resorted most of the times in resample. So we select random index components which may or may not be correct. Most of the times this results in values getting assigned to different dates as they were in the input df (for example BUG: groupby resample different results with .agg() vs .mean() #33548)

@@ -314,6 +314,7 @@ Groupby/resample/rolling
- Bug in :meth:`DataFrameGroupby.tshift` failing to raise ``ValueError`` when a frequency cannot be inferred for the index of a group (:issue:`35937`)
- Bug in :meth:`DataFrame.groupby` does not always maintain column index name for ``any``, ``all``, ``bfill``, ``ffill``, ``shift`` (:issue:`29764`)
- Bug in :meth:`DataFrameGroupBy.apply` raising error with ``np.nan`` group(s) when ``dropna=False`` (:issue:`35889`)
- Bug when combining methods :meth:`DataFrame.groupby` with :meth:`DataFrame.resample` and restricting to `Series` or using `agg` did miscalculate the aggregation (:issue:`27343`, :issue:`33548`, :issue:`35275`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would write out
df.groupby(...).resample().agg(...) instead

Copy link
Member Author

@phofl phofl Sep 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote out the two methods but I would like to keep the rest that way, because

df.groupby(...).resample(...)['...'].mean()

produces the same error

@@ -338,7 +344,14 @@ def _set_grouper(self, obj: FrameOrSeries, sort: bool = False):
if getattr(self.grouper, "name", None) == key and isinstance(
obj, ABCSeries
):
ax = self._grouper.take(obj.index)
if group_indices is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can simply access self._grouper.indices no? why all of the convulated logic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self._grouper is and DatetimeIndex in this case. It does not have the attribute indices unfortunately. The information about the groups are all lost.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not that happy about the convolution too. But I wanted to avoid making group_indices required, so we have to catch the None case. Also if the index self._grouper was not resorted by the resample method, the indexer is None.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something is not correct here, you need to really try and trace this; .indices a method on the grouper itself which always exists (there are different flavors, BinGroup and so on); its possible its not being passed thru correct. We need to fix at the correct level and avoid convolution.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually that is probably my fault. I added group_indices as input only in the resample trace. That was the only part, which run in there. Wasn't that clever. I will add it everywhere this method gets called.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I think I got the issue now.

We have to different grouper parent classes. The class BaseGrouper from https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/ops.py which has the indices method and the different parent class Grouper from

https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/grouper.py

which has no method indices. The Groupings class from the same file has the method indices, but at the function above we have no Groupings object. Should I implement a indices method for the Grouper class?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry for the confusion, but currently I can not see a way how to implement a indices method for a Grouper. The Grouper class does not hold the necessary information. When we want to get rid of the convolution, we will have to pass the indices object from the underlying BaseGrouper into the function. This seems pretty ugly, because we only need the indices in the Resample case, which is currently implemented and we would have to pass the indices object through a lot of layers to get it into this function at different places.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe i don't understand what the actual problem is, but .indices is available on all groupers, (see pandas/core/groupby/grouper.py and search for def indices.)

groupby.resample is just a case of using the resampler grouper, which is generally a BinGrouper.

I would encourage you to step into this line-by-line and really follow the functions (this is not trivial but provides lots of insight).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the answer. Maybe I miss something, then I would like to apologize in advance. But the way I understand this is the following:

In this specific resample case, we get a TimeGrouper, which inherits from Grouper.
Grouper does not have a indices Method. (

)

This file does have a indices method, but this method belongs to the Groupings class (
Class definition:

class Grouping:

indices definition:
def indices(self):

)

Groupings and Grouper have no relation with each other, so this does not help in this case.

We also have the BaseGrouper class, but this class also has no relation with Grouper or TimeGrouper.

I would really aprreciate it, if you could point out, where I am wrong. I stepped through the code, but the SeriesGroupBy class information with the indices method attached is lost here:

x = self._shallow_copy(x, groupby=self.groupby)

It was stored under self._groupby, but this information is not handed over

@@ -338,7 +344,14 @@ def _set_grouper(self, obj: FrameOrSeries, sort: bool = False):
if getattr(self.grouper, "name", None) == key and isinstance(
obj, ABCSeries
):
ax = self._grouper.take(obj.index)
if group_indices is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something is not correct here, you need to really try and trace this; .indices a method on the grouper itself which always exists (there are different flavors, BinGroup and so on); its possible its not being passed thru correct. We need to fix at the correct level and avoid convolution.

@@ -338,7 +344,14 @@ def _set_grouper(self, obj: FrameOrSeries, sort: bool = False):
if getattr(self.grouper, "name", None) == key and isinstance(
obj, ABCSeries
):
ax = self._grouper.take(obj.index)
if group_indices is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe i don't understand what the actual problem is, but .indices is available on all groupers, (see pandas/core/groupby/grouper.py and search for def indices.)

groupby.resample is just a case of using the resampler grouper, which is generally a BinGrouper.

I would encourage you to step into this line-by-line and really follow the functions (this is not trivial but provides lots of insight).

@jreback
Copy link
Contributor

jreback commented Sep 14, 2020

the .grouper property holds a subclass of a Grouper (BinGrouper in this case) which has all of the info

TimeGrouper is just a simple class that is used to create the others

pls step thru and look at object at each stage

@phofl
Copy link
Member Author

phofl commented Sep 14, 2020

self.grouper or self._grouper respectively is a DateTimeIndex not a grouper in this case. This does not contain any indices.

I look through tomorrow anyway, but I already looked into self.grouper, I could not find the necessary information.

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@jreback
Copy link
Contributor

jreback commented Nov 26, 2020

closing in favor of #37905

@jreback jreback closed this Nov 26, 2020
@phofl phofl deleted the 35275 branch April 27, 2023 19:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants