
BUG: Unexpected side effects within agg function #44813

Open
2 of 3 tasks
matteosantama opened this issue Dec 8, 2021 · 2 comments
Labels
Apply (Apply, Aggregate, Transform, Map), Bug, Resample (resample method)

Comments

@matteosantama
Contributor

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

index = pd.period_range(start="2021-01-01", end="2021-01-03", freq="T")
n = index.shape[0]

df = pd.DataFrame({"A": range(n), "B": range(n)}, index=index)

print(df.head())
#                   A  B
# 2021-01-01 00:00  0  0
# 2021-01-01 00:01  1  1
# 2021-01-01 00:02  2  2
# 2021-01-01 00:03  3  3
# 2021-01-01 00:04  4  4


def f(grp: pd.DataFrame) -> int:
    return 2

print(df.resample("D").agg(lambda x: f(x)))
#             A  B
# 2021-01-01  2  2
# 2021-01-02  2  2
# 2021-01-03  2  2


def g(grp: pd.DataFrame) -> int:
    # the addition of this line alters the result
    x = grp["A"].sum()
    return 2

print(df.resample("D").agg(lambda x: g(x)))
# 2021-01-01    2
# 2021-01-02    2
# 2021-01-03    2
# Freq: D, dtype: int64

Issue Description

The code should be fairly self-explanatory: I've defined two functions, f and g. Both take a DataFrame and return the number 2, but g first sums the DataFrame's A column and discards the result. This inexplicably alters the output of df.resample("D").agg. When we don't sum the A column, we get back a full DataFrame with both A and B columns. When we do sum the A column, we get back a Series.

Expected Behavior

The output of df.resample("D").agg(lambda x: f(x)) and df.resample("D").agg(lambda x: g(x)) should be exactly the same, since both functions return the same value.

Installed Versions

INSTALLED VERSIONS
------------------
commit : 945c9ed
python : 3.9.7.final.0
python-bits : 64
OS : Darwin
OS-release : 21.1.0
Version : Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:01 PDT 2021; root:xnu-8019.41.5~1/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 1.3.4
numpy : 1.21.4
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 57.4.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.30.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.5.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 6.0.1
pyxlsb : None
s3fs : None
scipy : 1.7.3
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
matteosantama added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels on Dec 8, 2021
@sappersapper

sappersapper commented Dec 8, 2021

It seems that pandas tries passing both a Series and a DataFrame to the function given to agg. For most single user-defined functions (but not for functions such as np.min or np.std), Resampler._groupby_and_aggregate() is invoked. It first tries to apply the function column by column; if that fails, it applies the function to the DataFrame of each group.

In your example, f() works on a Series as well as on a DataFrame, and because of the priority in _groupby_and_aggregate() it is actually invoked column by column (the type annotation has no effect). g(), on the other hand, only accepts a DataFrame as valid input, so the output is a single column filled with 2.
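A quick way to check which object the function actually receives is to record the type of every argument. This is only a sketch: probe and seen_types are hypothetical names, and a DatetimeIndex is assumed in place of the PeriodIndex from the report. What gets printed may differ between pandas versions.

```python
import pandas as pd

# Assumption: a DatetimeIndex stands in for the PeriodIndex of the original report.
index = pd.date_range(start="2021-01-01", end="2021-01-03", freq="min")
df = pd.DataFrame({"A": range(len(index)), "B": range(len(index))}, index=index)

seen_types = []  # hypothetical helper list, for illustration only


def probe(obj):
    # Record what pandas passes in, then behave like f() from the report.
    seen_types.append(type(obj).__name__)
    return 2


result = df.resample("D").agg(probe)

print(sorted(set(seen_types)))  # container types the function received
print(type(result).__name__)    # whether the result is a Series or a DataFrame
```

If only Series shows up in the first line, the function was applied column by column and never saw a whole group.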

def _groupby_and_aggregate(self, how, *args, **kwargs):
    """
    Re-evaluate the obj with a groupby aggregation.
    """
    grouper = self.grouper
    obj = self._selected_obj
    grouped = get_groupby(obj, by=None, grouper=grouper, axis=self.axis)
    try:
        if isinstance(obj, ABCDataFrame) and callable(how):
            # Check if the function is reducing or not.
            result = grouped._aggregate_item_by_item(how, *args, **kwargs)
        else:
            result = grouped.aggregate(how, *args, **kwargs)
    except DataError:
        # got TypeErrors on aggregation
        result = grouped.apply(how, *args, **kwargs)
    except (AttributeError, KeyError):
        # we have a non-reducing function; try to evaluate
        # alternatively we want to evaluate only a column of the input
        # test_apply_to_one_column_of_df the function being applied references
        # a DataFrame column, but aggregate_item_by_item operates column-wise
        # on Series, raising AttributeError or KeyError
        # (depending on whether the column lookup uses getattr/__getitem__)
        result = grouped.apply(how, *args, **kwargs)
    except ValueError as err:
        if "Must produce aggregated value" in str(err):
            # raised in _aggregate_named
            # see test_apply_without_aggregation, test_apply_with_mutated_index
            pass
        else:
            raise
        # we have a non-reducing function
        # try to evaluate
        result = grouped.apply(how, *args, **kwargs)
    result = self._apply_loffset(result)
    return self._wrap_result(result)
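The column-first, group-second order can be mimicked in a few lines of plain Python. The sketch below is a simplified model, not pandas' implementation: aggregate_with_fallback is a hypothetical helper, and it only reproduces the AttributeError/KeyError branch of the method above.

```python
import pandas as pd


def aggregate_with_fallback(grouped, func):
    """Hypothetical, simplified model of the dispatch; not pandas' own code."""
    try:
        # First attempt: call func on every column (a Series) of every group.
        return {
            key: {col: func(frame[col]) for col in frame.columns}
            for key, frame in grouped
        }
    except (AttributeError, KeyError):
        # func needs a DataFrame (it looks up a column), so call it once per group.
        return {key: func(frame) for key, frame in grouped}


def f(grp):
    return 2


def g(grp):
    grp["A"].sum()  # raises KeyError when grp is a Series without an "A" label
    return 2


df = pd.DataFrame(
    {"A": [0, 1, 2, 3], "B": [0, 1, 2, 3]},
    index=["d1", "d1", "d2", "d2"],
)

per_column = aggregate_with_fallback(df.groupby(level=0), f)
per_group = aggregate_with_fallback(df.groupby(level=0), g)

print(per_column)  # {'d1': {'A': 2, 'B': 2}, 'd2': {'A': 2, 'B': 2}}
print(per_group)   # {'d1': 2, 'd2': 2}
```

In this model f yields one value per column while g yields one value per group, which mirrors the DataFrame-versus-Series difference in the report.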

Maybe it would be useful to adjust the dispatch logic of agg so that it takes the type annotation into account.
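In the meantime, one possible way to make sure the function always receives the whole DataFrame of each group is to group with pd.Grouper and call apply. This is a sketch rather than a recommendation from the maintainers, and it assumes a DatetimeIndex instead of the PeriodIndex used in the report.

```python
import pandas as pd

# Assumption: a DatetimeIndex stands in for the PeriodIndex of the original report.
index = pd.date_range(start="2021-01-01", end="2021-01-03", freq="min")
df = pd.DataFrame({"A": range(len(index)), "B": range(len(index))}, index=index)


def g(grp: pd.DataFrame) -> int:
    # Needs a DataFrame: the column lookup would fail on a single column.
    _ = grp["A"].sum()
    return 2


# groupby(...).apply hands each daily group to g as a DataFrame.
result = df.groupby(pd.Grouper(freq="D")).apply(g)
print(result)
```

Because g returns a scalar per group, the result is a Series with one entry per day.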

@matteosantama
Contributor Author

Using type annotations to inform the execution is a decent option, although I believe it would be a first in the pandas API. Perhaps a better idea is an axis or result_type-style keyword argument to Resampler.agg (as found in DataFrame.apply).
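For reference, this is roughly how the existing axis and result_type keywords behave on DataFrame.apply. The small frame and the lambdas are made up for illustration; Resampler.agg accepts no such keyword as of this issue.

```python
import pandas as pd

small = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# axis=0 (the default): the function receives one column (a Series) at a time.
by_col = small.apply(lambda col: col.sum())

# axis=1: the function receives one row (a Series) at a time.
by_row = small.apply(lambda row: row.sum(), axis=1)

# result_type="expand": list-like results are spread across columns.
expanded = small.apply(
    lambda row: [row["A"], row["B"] * 10], axis=1, result_type="expand"
)

print(by_col.tolist())           # [3, 7]
print(by_row.tolist())           # [4, 6]
print(expanded.values.tolist())  # [[1, 30], [2, 40]]
```

A comparable keyword on Resampler.agg would let the caller state explicitly whether the function expects a column or a whole group, instead of relying on which call happens to raise.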

mroeschke added the Apply (Apply, Aggregate, Transform, Map) and Resample (resample method) labels and removed the Needs Triage (issue that has not been reviewed by a pandas team member) label on Dec 27, 2021
3 participants