[BUG]: Groupby and Resample miscalculated aggregation #36198

Closed
wants to merge 10 commits into from
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.2.0.rst
@@ -314,6 +314,7 @@ Groupby/resample/rolling
- Bug in :meth:`DataFrameGroupby.tshift` failing to raise ``ValueError`` when a frequency cannot be inferred for the index of a group (:issue:`35937`)
- Bug in :meth:`DataFrame.groupby` does not always maintain column index name for ``any``, ``all``, ``bfill``, ``ffill``, ``shift`` (:issue:`29764`)
- Bug in :meth:`DataFrameGroupBy.apply` raising error with ``np.nan`` group(s) when ``dropna=False`` (:issue:`35889`)
- Bug when combining methods :meth:`DataFrame.groupby` with :meth:`DataFrame.resample` and then restricting to a ``Series`` or using ``agg``, where the aggregation was miscalculated (:issue:`27343`, :issue:`33548`, :issue:`35275`)
Contributor

Can you elaborate on what 'miscalculate' means here?

Member Author

@phofl Sep 9, 2020

Two things can happen here right now:

  • If the input df has any index other than a RangeIndex starting at 0, it crashes (a DatetimeIndex or an Index of dtype object, it doesn't matter).
  • If the index is a RangeIndex, obj.index keeps the previous index labels. These labels are used to select entries of self._grouper by position. But self._grouper is re-sorted most of the time in resample, so we select arbitrary entries that may or may not be correct. Most of the time this results in values being assigned to different dates than they had in the input df (for example, BUG: groupby resample different results with .agg() vs .mean() #33548).
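
The second case can be reproduced in isolation. The following is only a minimal sketch of the mechanism, not the code path inside pandas: the dates and the variable names are made up for illustration, and the only library calls used are `DatetimeIndex.sort_values` and `Index.take`.

```python
import pandas as pd

# Hypothetical grouper values, in the row order of the input frame.
grouper_in_row_order = pd.DatetimeIndex(["2019-02-01", "2018-02-03", "2020-03-11"])
# Resample typically works on a sorted copy of the grouper.
grouper_sorted = grouper_in_row_order.sort_values()

# Labels left over from the original RangeIndex for rows 0 and 2.
stale_labels = [0, 2]

# What the labels were meant to select, and what they select after sorting.
intended = grouper_in_row_order.take(stale_labels)
picked = grouper_sorted.take(stale_labels)

print(intended.tolist())
print(picked.tolist())
```

Row 0 ends up paired with a different date once the grouper is sorted, while row 2 happens to keep its date, which is why the results are only sometimes wrong.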

Contributor

I would write out
df.groupby(...).resample().agg(...) instead

Member Author

@phofl Sep 9, 2020

I wrote out the two methods, but I would like to keep the rest as it is, because

df.groupby(...).resample(...)['...'].mean()

produces the same error
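
One way to check whether an installed pandas version is affected is to compare the two spellings directly. This is a hedged sketch: the frame, its column names and the daily rule `"D"` are assumptions chosen for illustration (the daily rule only because its alias is the same across pandas versions), and no particular outcome is claimed for any given version.

```python
import pandas as pd

# Hypothetical input: rows deliberately not sorted by date.
df = pd.DataFrame(
    {
        "cat": ["a", "b", "a", "b"],
        "num": [1.0, 2.0, 3.0, 4.0],
        "date": pd.to_datetime(
            ["2020-01-03", "2020-01-02", "2020-01-01", "2020-01-01"]
        ),
    }
)

resampled = df.groupby("cat").resample("D", on="date")

# The two call forms discussed in this thread.
via_agg = resampled.agg({"num": "mean"})["num"]
via_getitem = resampled["num"].mean()

# True when both spellings agree on index and values.
same = bool(via_agg.equals(via_getitem))
print(same)
```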

-

Reshaping
14 changes: 14 additions & 0 deletions pandas/core/groupby/grouper.py
@@ -338,6 +338,20 @@ def _set_grouper(self, obj: FrameOrSeries, sort: bool = False):
            if getattr(self.grouper, "name", None) == key and isinstance(
                obj, ABCSeries
            ):
                # In some cases the self._grouper may be sorted differently than obj.
                # self.obj has the right order with the old index in the first go
                # around. We align the index from obj with the self.obj index to
                # select the correct values. Additionally we have to sort
                # obj.index correctly to aggregate correctly.
                if not hasattr(self, "_index_mapping"):
                    self._index_mapping = DataFrame(
                        self.obj.index, columns=["index"]
                    ).sort_values(by="index")
                obj.sort_index(inplace=True)
                obj.index = self._index_mapping.loc[
                    self._index_mapping["index"].isin(obj.index)
                ].index
                obj.sort_index(inplace=True)
                ax = self._grouper.take(obj.index)
            else:
                if key not in obj._info_axis:
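
The label-to-position translation in the hunk above can be tried outside of pandas internals. The sketch below mirrors the same steps on made-up data: `original_index`, `mapping` and `obj` are hypothetical stand-ins for `self.obj.index`, `self._index_mapping` and `obj`. It assumes unique labels and is not the patched method itself.

```python
import pandas as pd

# Hypothetical original index labels, in row order (positions 0, 1, 2).
original_index = pd.Index(["c", "a", "b"])

# Sorting by label keeps each label's original position in the frame's index.
mapping = pd.DataFrame({"index": original_index}).sort_values(by="index")

# Hypothetical subset of rows that still carries the old labels.
obj = pd.Series([10, 30], index=["c", "b"])

# Sort by label so the rows line up with the label-sorted mapping.
obj = obj.sort_index()
# Replace each label with the position it had in the original index.
obj.index = mapping.loc[mapping["index"].isin(obj.index)].index
# Restore the original row order.
obj = obj.sort_index()

print(obj)
```

After these steps the index of `obj` holds positions, so it can be passed to a positional `take` on something that is in the original row order.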
64 changes: 64 additions & 0 deletions pandas/tests/resample/test_resampler_grouper.py
@@ -347,3 +347,67 @@ def test_median_duplicate_columns():
    result = df.resample("5s").median()
    expected.columns = result.columns
    tm.assert_frame_equal(result, expected)


def test_resample_different_result_with_agg():
    # GH: 35275 and 33548
    data = pd.DataFrame(
        {
            "cat": ["cat1", "cat1", "cat2", "cat1", "cat2", "cat1", "cat2", "cat1"],
            "num": [5, 20, 22, 3, 4, 30, 10, 50],
            "date": [
                "2019-2-1",
                "2018-02-03",
                "2020-3-11",
                "2019-2-2",
                "2019-2-2",
                "2018-12-4",
                "2020-3-11",
                "2020-12-12",
            ],
        }
    )
    data["date"] = pd.to_datetime(data["date"])

    resampled = data.groupby("cat").resample("Y", on="date")

    index = pd.MultiIndex.from_tuples(
        [
            ("cat1", "2018-12-31"),
            ("cat1", "2019-12-31"),
            ("cat1", "2020-12-31"),
            ("cat2", "2019-12-31"),
            ("cat2", "2020-12-31"),
        ],
        names=["cat", "date"],
    )
    index = index.set_levels([index.levels[0], pd.to_datetime(index.levels[1])])
    expected = DataFrame([25, 4, 50, 4, 16], columns=pd.Index(["num"]), index=index)
    result = resampled.agg({"num": "mean"})
    tm.assert_frame_equal(result, expected)
    result = resampled["num"].mean()
    tm.assert_series_equal(result, expected["num"])
    result = resampled.mean()
    tm.assert_frame_equal(result, expected)


def test_resample_agg_different_results_on_keyword():
    # GH: 27343
    df = pd.DataFrame.from_records(
        {
            "ref": ["a", "a", "a", "b", "b"],
            "time": [
                "2014-12-31",
                "2015-12-31",
                "2016-12-31",
                "2012-12-31",
                "2014-12-31",
            ],
            "value": 5 * [1],
        }
    )
    df["time"] = pd.to_datetime(df["time"])

    expected = df.set_index("time").groupby("ref").resample(rule="M")["value"].sum()
    result = df.groupby("ref").resample(rule="M", on="time")["value"].sum()
    tm.assert_series_equal(result, expected)
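
The expected values `[25, 4, 50, 4, 16]` in `test_resample_different_result_with_agg` can be cross-checked without going through `resample` at all. The sketch below is not part of the PR: it groups the same numbers by category and calendar year with a plain `groupby`, and it writes the dates zero-padded, which is an editorial choice that does not change their values.

```python
import pandas as pd

# Same categories, numbers and dates as in the test, dates zero-padded.
data = pd.DataFrame(
    {
        "cat": ["cat1", "cat1", "cat2", "cat1", "cat2", "cat1", "cat2", "cat1"],
        "num": [5, 20, 22, 3, 4, 30, 10, 50],
        "date": pd.to_datetime(
            [
                "2019-02-01",
                "2018-02-03",
                "2020-03-11",
                "2019-02-02",
                "2019-02-02",
                "2018-12-04",
                "2020-03-11",
                "2020-12-12",
            ]
        ),
    }
)

# Yearly mean per category, computed independently of resample.
yearly_mean = data.groupby(["cat", data["date"].dt.year])["num"].mean()

print(yearly_mean.tolist())  # [25.0, 4.0, 50.0, 4.0, 16.0]
```

This only confirms the aggregated values. Unlike the resampled result in the test, the second index level here holds plain years rather than year-end timestamps.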