Skip to content

BUG: GH45912 breaks important functionality in rolling apply #47494

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
3 tasks done
DRudel opened this issue Jun 24, 2022 · 7 comments
Open
3 tasks done

BUG: GH45912 breaks important functionality in rolling apply #47494

DRudel opened this issue Jun 24, 2022 · 7 comments
Labels
Apply Apply, Aggregate, Transform, Map Docs Window rolling, ewma, expanding

Comments

@DRudel
Copy link
Contributor

DRudel commented Jun 24, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

s = pd.Series(['3/11/2000', '3/12/2000', '3/13/2000', '3/13/2000', '3/13/2000'])
my_df = pd.DataFrame({
    'date': s,
    'values': [10, 3, 11, 12, 3]
}, index=list(range(len(s))))

my_df['date'] = pd.to_datetime(my_df['date'])

def my_custom_function(ser: pd.Series):
    print(ser.index)
    return ser.index[0]

results = my_df.rolling('2d', on='date').apply(my_custom_function)
print(results)

Issue Description

Starting in 1.4.1 and after, the code throws a No Numeric Types to Aggregate error because an undocumented change was made to rolling in which using on=col parameter in rolling() causes a later apply() to re-index to the col column, making the index of the series sent to the function the wrong index.

The specific code doing this re-indexing is in RollingExpandingMixin:

        def apply_func(values, begin, end, min_periods, raw=raw):
            if not raw:
                # GH 45912
                values = Series(values, index=self._on)
            return window_func(values, begin, end, min_periods)

        return apply_func

This loses important functionality because currently rolling().apply() can only process one column at a time, so if you want to be able to recover what the windows were in able to do some operation spanning all columns, you need to notate the values of the indexes for the windows passed in [using a closure or global variable] and use those indexes to reconstruct later what the windows were. With the possibility of repeated values, that may mean creating an integer "primary key" index to use as the index and using on=my_date_column to specify the windowing.

The current code assumes that the user wants to use the on column for indexing, but if that were the case, the user could have simply re-indexed in the calling code.

Expected Behavior

In 1.3.1 the code above produces expected behavior: a 2-column table with numeric index and the the final column indicating the first value of that index that starts each of the 5 windows [0, 0, 1, 1, 1]


    date  values

0 2000-03-11 0.0
1 2000-03-12 0.0
2 2000-03-13 1.0
3 2000-03-13 1.0
4 2000-03-13 1.0


Installed Versions

1.4.3

@DRudel DRudel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 24, 2022
@DRudel
Copy link
Contributor Author

DRudel commented Jun 24, 2022

I think the key functionality I'm trying to preserve can still be obtained in the current release by creating a single-column version of the dataframe with primary_key column and not using on, and pulling the numbers from the series rather than the index.

But I think there should at least be a change to the documentation of the on parameter to alert users that the windows will be re-indexed. Ideally there is a parameter that can be used to prevent the re-indexing for those who need the original index information. One can imagine that this may be useful as the index is the one column that will be available for all transformation operations when using rolling().apply().

@simonjayhawkins
Copy link
Member

Thanks @DRudel for the report.

Starting in 1.4.1 and after, the code throws a No Numeric Types to Aggregate error because an undocumented change was made to rolling in which using on=col parameter in rolling() causes a later apply() to re-index to the col column, making the index of the series sent to the function the wrong index.

this was an intentional change in #45925 that I think is considered a bug-fix not a change in behavior. (the release note Bug in apply() with axis=1 raising an erroneous ValueError ([GH45912](https://github.com/pandas-dev/pandas/issues/45912)) on hindsight could perhaps have been clearer.

The series passed to the apply function is now consistent with iterating over the grouper and does not lose the grouped on values which makes sense to me. @mroeschke ?

list(my_df.rolling("2d", on="date"))
# [            values
#  date              
#  2000-03-11      10,
#              values
#  date              
#  2000-03-11      10
#  2000-03-12       3,
#              values
#  date              
#  2000-03-12       3
#  2000-03-13      11,
#              values
#  date              
#  2000-03-12       3
#  2000-03-13      11
#  2000-03-13      12,
#              values
#  date              
#  2000-03-12       3
#  2000-03-13      11
#  2000-03-13      12
#  2000-03-13       3]

@DRudel
Copy link
Contributor Author

DRudel commented Jun 24, 2022

Thanks @simonjayhawkins for the quick reply.
My view is that if the user is interested in having access to the grouped-on series, why is he using the on parameter at all?

In the 1.3.1 implementation, the use of on provides the user the unique functionality of using one column to control the windowing but exposing a different vector of information (the index) to the service code [the function called via apply()].

Things would be different is there was an easy facility to have rolling().apply() consume all the columns at once [as in groupby().apply()], but in a world where rolling().apply() acts on each column separately, the on value provides a facility to have two separate windowed series to work with in the service code. That strikes me as an important affordance that is lost in the present implementation.

@simonjayhawkins
Copy link
Member

sounds reasonable. will label as a regression for now, pending feedback from @mroeschke

@simonjayhawkins simonjayhawkins added Regression Functionality that used to work in a prior pandas version Window rolling, ewma, expanding and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 24, 2022
@simonjayhawkins simonjayhawkins added this to the 1.4.4 milestone Jun 24, 2022
@mroeschke
Copy link
Member

Yeah this is a case I don't think was ever codified or thought of when this was implemented.

When first implemented, the Series passed to apply was originally just hardcoded to self.obj.index

https://github.com/pandas-dev/pandas/pull/20584/files#diff-7b4606d826cb66e7a87fc3cac8f7b4c97bf2a2e55512f975731e8caf0b7c4c55R997

self.obj.index is implicit "on" when on=None & axis=0

So when #45912 was reported (on=None & axis=1), of course this failed since the implicit "on" should be self.obj.columns

When generalizing this fix, I figured the passed Series's index should have the final determined "on" (whether implicitly or explicitly defined)

if self.on is None:
if self.axis == 0:
self._on = self.obj.index
else:
# i.e. self.axis == 1
self._on = self.obj.columns
elif isinstance(self.on, Index):
self._on = self.on
elif isinstance(self.obj, ABCDataFrame) and self.on in self.obj.columns:
self._on = Index(self.obj[self.on])
else:
raise ValueError(
f"invalid on specified as {self.on}, "
"must be a column (of DataFrame), an Index or None"
)

@mroeschke
Copy link
Member

The series passed to the apply function is now consistent with iterating over the grouper and does not lose the grouped on values which makes sense to me. @mroeschke ?

This is another argument why the final "on" is the index of the passed Series. We test this in test_iter_rolling_on_dataframe

@mroeschke
Copy link
Member

Based on the above explanation, I think it would be good to document this case.

@mroeschke mroeschke removed this from the 1.4.4 milestone Aug 22, 2022
@mroeschke mroeschke added Docs Apply Apply, Aggregate, Transform, Map and removed Bug Regression Functionality that used to work in a prior pandas version labels Aug 22, 2022
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Aug 23, 2022
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map Docs Window rolling, ewma, expanding
Projects
None yet
Development

No branches or pull requests

3 participants