Skip to content

[BUG]: Groupy and Resample miscalculated aggregation #36198

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -314,6 +314,7 @@ Groupby/resample/rolling
- Bug in :meth:`DataFrameGroupby.tshift` failing to raise ``ValueError`` when a frequency cannot be inferred for the index of a group (:issue:`35937`)
- Bug in :meth:`DataFrame.groupby` does not always maintain column index name for ``any``, ``all``, ``bfill``, ``ffill``, ``shift`` (:issue:`29764`)
- Bug in :meth:`DataFrameGroupBy.apply` raising error with ``np.nan`` group(s) when ``dropna=False`` (:issue:`35889`)
- Bug in :meth:`DataFrame.groupby(...).resample(...)` when restricting to `Series` or using `agg` did miscalculate the aggregation (:issue:`27343`, :issue:`33548`, :issue:`35275`).
-

Reshaping
Expand Down
19 changes: 16 additions & 3 deletions pandas/core/groupby/grouper.py
Original file line number Diff line number Diff line change
Expand Up @@ -311,7 +311,12 @@ def _get_grouper(self, obj, validate: bool = True):
)
return self.binner, self.grouper, self.obj

def _set_grouper(self, obj: FrameOrSeries, sort: bool = False):
def _set_grouper(
self,
obj: FrameOrSeries,
sort: bool = False,
group_indices: Optional[Dict] = None,
):
"""
given an object and the specifications, setup the internal grouper
for this particular specification
Expand All @@ -327,9 +332,10 @@ def _set_grouper(self, obj: FrameOrSeries, sort: bool = False):
if self.key is not None and self.level is not None:
raise ValueError("The Grouper cannot specify both a key and a level!")

# Keep self.grouper value before overriding
# Keep self.grouper and self.indexer value before overriding
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to keep the indexer too, to have the new order of the Index.

if self._grouper is None:
self._grouper = self.grouper
self._indexer = self.indexer

# the key must be a valid info item
if self.key is not None:
Expand All @@ -338,7 +344,14 @@ def _set_grouper(self, obj: FrameOrSeries, sort: bool = False):
if getattr(self.grouper, "name", None) == key and isinstance(
obj, ABCSeries
):
ax = self._grouper.take(obj.index)
if group_indices is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can simply access self._grouper.indices no? why all of the convulated logic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self._grouper is and DatetimeIndex in this case. It does not have the attribute indices unfortunately. The information about the groups are all lost.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not that happy about the convolution too. But I wanted to avoid making group_indices required, so we have to catch the None case. Also if the index self._grouper was not resorted by the resample method, the indexer is None.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something is not correct here, you need to really try and trace this; .indices a method on the grouper itself which always exists (there are different flavors, BinGroup and so on); its possible its not being passed thru correct. We need to fix at the correct level and avoid convolution.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually that is probably my fault. I added group_indices as input only in the resample trace. That was the only part, which run in there. Wasn't that clever. I will add it everywhere this method gets called.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I think I got the issue now.

We have to different grouper parent classes. The class BaseGrouper from https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/ops.py which has the indices method and the different parent class Grouper from

https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/grouper.py

which has no method indices. The Groupings class from the same file has the method indices, but at the function above we have no Groupings object. Should I implement a indices method for the Grouper class?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sorry for the confusion, but currently I can not see a way how to implement a indices method for a Grouper. The Grouper class does not hold the necessary information. When we want to get rid of the convolution, we will have to pass the indices object from the underlying BaseGrouper into the function. This seems pretty ugly, because we only need the indices in the Resample case, which is currently implemented and we would have to pass the indices object through a lot of layers to get it into this function at different places.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe i don't understand what the actual problem is, but .indices is available on all groupers, (see pandas/core/groupby/grouper.py and search for def indices.)

groupby.resample is just a case of using the resampler grouper, which is generally a BinGrouper.

I would encourage you to step into this line-by-line and really follow the functions (this is not trivial but provides lots of insight).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the answer. Maybe I miss something, then I would like to apologize in advance. But the way I understand this is the following:

In this specific resample case, we get a TimeGrouper, which inherits from Grouper.
Grouper does not have a indices Method. (

)

This file does have a indices method, but this method belongs to the Groupings class (
Class definition:

class Grouping:

indices definition:
def indices(self):

)

Groupings and Grouper have no relation with each other, so this does not help in this case.

We also have the BaseGrouper class, but this class also has no relation with Grouper or TimeGrouper.

I would really aprreciate it, if you could point out, where I am wrong. I stepped through the code, but the SeriesGroupBy class information with the indices method attached is lost here:

x = self._shallow_copy(x, groupby=self.groupby)

It was stored under self._groupby, but this information is not handed over

ax = self._grouper.take(obj.index)
else:
indices = group_indices.get(obj.name)
if self._indexer is not None:
ax = self._grouper.take(self._indexer.argsort()).take(indices)
else:
ax = self._grouper.take(indices)
else:
if key not in obj._info_axis:
raise KeyError(f"The grouper name {key} is not found")
Expand Down
10 changes: 8 additions & 2 deletions pandas/core/resample.py
Original file line number Diff line number Diff line change
Expand Up @@ -91,7 +91,11 @@ def __init__(self, obj, groupby=None, axis=0, kind=None, **kwargs):
self.grouper = None

if self.groupby is not None:
self.groupby._set_grouper(self._convert_obj(obj), sort=True)
self.groupby._set_grouper(
self._convert_obj(obj),
sort=True,
group_indices=kwargs.get("group_indices"),
)

def __str__(self) -> str:
"""
Expand Down Expand Up @@ -980,7 +984,9 @@ def _apply(self, f, grouper=None, *args, **kwargs):
"""

def func(x):
x = self._shallow_copy(x, groupby=self.groupby)
x = self._shallow_copy(
x, groupby=self.groupby, group_indices=self._groupby.indices
)

if isinstance(f, str):
return getattr(x, f)(**kwargs)
Expand Down
70 changes: 68 additions & 2 deletions pandas/tests/resample/test_resampler_grouper.py
Original file line number Diff line number Diff line change
Expand Up @@ -124,10 +124,12 @@ def test_getitem_multiple():
tm.assert_series_equal(result, expected)


def test_groupby_resample_on_api_with_getitem():
@pytest.mark.parametrize("index_values", [[0, 1, 2, 3, 4], ["a", "b", "c", "d", "e"]])
def test_groupby_resample_on_api_with_getitem(index_values):
# GH 17813
df = pd.DataFrame(
{"id": list("aabbb"), "date": pd.date_range("1-1-2016", periods=5), "data": 1}
{"id": list("aabbb"), "date": pd.date_range("1-1-2016", periods=5), "data": 1},
index=pd.Index(index_values),
)
exp = df.set_index("date").groupby("id").resample("2D")["data"].sum()
result = df.groupby("id").resample("2D", on="date")["data"].sum()
Expand Down Expand Up @@ -347,3 +349,67 @@ def test_median_duplicate_columns():
result = df.resample("5s").median()
expected.columns = result.columns
tm.assert_frame_equal(result, expected)


def test_resample_different_result_with_agg():
# GH: 35275 and 33548
data = pd.DataFrame(
{
"cat": ["cat1", "cat1", "cat2", "cat1", "cat2", "cat1", "cat2", "cat1"],
"num": [5, 20, 22, 3, 4, 30, 10, 50],
"date": [
"2019-2-1",
"2018-02-03",
"2020-3-11",
"2019-2-2",
"2019-2-2",
"2018-12-4",
"2020-3-11",
"2020-12-12",
],
}
)
data["date"] = pd.to_datetime(data["date"])

resampled = data.groupby("cat").resample("Y", on="date")

index = pd.MultiIndex.from_tuples(
[
("cat1", "2018-12-31"),
("cat1", "2019-12-31"),
("cat1", "2020-12-31"),
("cat2", "2019-12-31"),
("cat2", "2020-12-31"),
],
names=["cat", "date"],
)
index = index.set_levels([index.levels[0], pd.to_datetime(index.levels[1])])
expected = DataFrame([25, 4, 50, 4, 16], columns=pd.Index(["num"]), index=index)
result = resampled.agg({"num": "mean"})
tm.assert_frame_equal(result, expected)
result = resampled["num"].mean()
tm.assert_series_equal(result, expected["num"])
result = resampled.mean()
tm.assert_frame_equal(result, expected)


def test_resample_agg_different_results_on_keyword():
# GH: 27343
df = pd.DataFrame.from_records(
{
"ref": ["a", "a", "a", "b", "b"],
"time": [
"2014-12-31",
"2015-12-31",
"2016-12-31",
"2012-12-31",
"2014-12-31",
],
"value": 5 * [1],
}
)
df["time"] = pd.to_datetime(df["time"])

expected = df.set_index("time").groupby("ref").resample(rule="M")["value"].sum()
result = df.groupby("ref").resample(rule="M", on="time")["value"].sum()
tm.assert_series_equal(result, expected)