BUG: GH45912 breaks important functionality in rolling apply #47494

DRudel · 2022-06-24T11:41:22Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

s = pd.Series(['3/11/2000', '3/12/2000', '3/13/2000', '3/13/2000', '3/13/2000'])
my_df = pd.DataFrame({
    'date': s,
    'values': [10, 3, 11, 12, 3]
}, index=list(range(len(s))))

my_df['date'] = pd.to_datetime(my_df['date'])

def my_custom_function(ser: pd.Series):
    print(ser.index)
    return ser.index[0]

results = my_df.rolling('2d', on='date').apply(my_custom_function)
print(results)

Issue Description

Starting in 1.4.1 and after, the code throws a No Numeric Types to Aggregate error because an undocumented change was made to rolling in which using on=col parameter in rolling() causes a later apply() to re-index to the col column, making the index of the series sent to the function the wrong index.

The specific code doing this re-indexing is in RollingExpandingMixin:

        def apply_func(values, begin, end, min_periods, raw=raw):
            if not raw:
                # GH 45912
                values = Series(values, index=self._on)
            return window_func(values, begin, end, min_periods)

        return apply_func

This loses important functionality because currently rolling().apply() can only process one column at a time, so if you want to be able to recover what the windows were in able to do some operation spanning all columns, you need to notate the values of the indexes for the windows passed in [using a closure or global variable] and use those indexes to reconstruct later what the windows were. With the possibility of repeated values, that may mean creating an integer "primary key" index to use as the index and using on=my_date_column to specify the windowing.

The current code assumes that the user wants to use the on column for indexing, but if that were the case, the user could have simply re-indexed in the calling code.

Expected Behavior

In 1.3.1 the code above produces expected behavior: a 2-column table with numeric index and the the final column indicating the first value of that index that starts each of the 5 windows [0, 0, 1, 1, 1]

    date  values

0 2000-03-11 0.0
1 2000-03-12 0.0
2 2000-03-13 1.0
3 2000-03-13 1.0
4 2000-03-13 1.0

Installed Versions

1.4.3

The text was updated successfully, but these errors were encountered:

DRudel · 2022-06-24T12:08:57Z

I think the key functionality I'm trying to preserve can still be obtained in the current release by creating a single-column version of the dataframe with primary_key column and not using on, and pulling the numbers from the series rather than the index.

But I think there should at least be a change to the documentation of the on parameter to alert users that the windows will be re-indexed. Ideally there is a parameter that can be used to prevent the re-indexing for those who need the original index information. One can imagine that this may be useful as the index is the one column that will be available for all transformation operations when using rolling().apply().

simonjayhawkins · 2022-06-24T13:05:19Z

Thanks @DRudel for the report.

Starting in 1.4.1 and after, the code throws a No Numeric Types to Aggregate error because an undocumented change was made to rolling in which using on=col parameter in rolling() causes a later apply() to re-index to the col column, making the index of the series sent to the function the wrong index.

this was an intentional change in #45925 that I think is considered a bug-fix not a change in behavior. (the release note Bug in apply() with axis=1 raising an erroneous ValueError ([GH45912](https://github.com/pandas-dev/pandas/issues/45912)) on hindsight could perhaps have been clearer.

The series passed to the apply function is now consistent with iterating over the grouper and does not lose the grouped on values which makes sense to me. @mroeschke ?

list(my_df.rolling("2d", on="date"))
# [            values
#  date              
#  2000-03-11      10,
#              values
#  date              
#  2000-03-11      10
#  2000-03-12       3,
#              values
#  date              
#  2000-03-12       3
#  2000-03-13      11,
#              values
#  date              
#  2000-03-12       3
#  2000-03-13      11
#  2000-03-13      12,
#              values
#  date              
#  2000-03-12       3
#  2000-03-13      11
#  2000-03-13      12
#  2000-03-13       3]

DRudel · 2022-06-24T13:34:00Z

Thanks @simonjayhawkins for the quick reply.
My view is that if the user is interested in having access to the grouped-on series, why is he using the on parameter at all?

In the 1.3.1 implementation, the use of on provides the user the unique functionality of using one column to control the windowing but exposing a different vector of information (the index) to the service code [the function called via apply()].

Things would be different is there was an easy facility to have rolling().apply() consume all the columns at once [as in groupby().apply()], but in a world where rolling().apply() acts on each column separately, the on value provides a facility to have two separate windowed series to work with in the service code. That strikes me as an important affordance that is lost in the present implementation.

simonjayhawkins · 2022-06-24T13:38:52Z

sounds reasonable. will label as a regression for now, pending feedback from @mroeschke

mroeschke · 2022-06-24T19:42:31Z

Yeah this is a case I don't think was ever codified or thought of when this was implemented.

When first implemented, the Series passed to apply was originally just hardcoded to self.obj.index

https://github.com/pandas-dev/pandas/pull/20584/files#diff-7b4606d826cb66e7a87fc3cac8f7b4c97bf2a2e55512f975731e8caf0b7c4c55R997

self.obj.index is implicit "on" when on=None & axis=0

So when #45912 was reported (on=None & axis=1), of course this failed since the implicit "on" should be self.obj.columns

When generalizing this fix, I figured the passed Series's index should have the final determined "on" (whether implicitly or explicitly defined)

pandas/pandas/core/window/rolling.py

Lines 147 to 161 in dc36ce1

    
           if self.on is None: 
        
               if self.axis == 0: 
        
                   self._on = self.obj.index 
        
               else: 
        
                   # i.e. self.axis == 1 
        
                   self._on = self.obj.columns 
        
           elif isinstance(self.on, Index): 
        
               self._on = self.on 
        
           elif isinstance(self.obj, ABCDataFrame) and self.on in self.obj.columns: 
        
               self._on = Index(self.obj[self.on]) 
        
           else: 
        
               raise ValueError( 
        
                   f"invalid on specified as {self.on}, " 
        
                   "must be a column (of DataFrame), an Index or None" 
        
               )

mroeschke · 2022-06-24T19:51:18Z

The series passed to the apply function is now consistent with iterating over the grouper and does not lose the grouped on values which makes sense to me. @mroeschke ?

This is another argument why the final "on" is the index of the passed Series. We test this in test_iter_rolling_on_dataframe

mroeschke · 2022-08-22T18:20:02Z

Based on the above explanation, I think it would be good to document this case.

DRudel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 24, 2022

simonjayhawkins added Regression Functionality that used to work in a prior pandas version Window rolling, ewma, expanding and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 24, 2022

simonjayhawkins added this to the 1.4.4 milestone Jun 24, 2022

mroeschke removed this from the 1.4.4 milestone Aug 22, 2022

mroeschke added Docs Apply Apply, Aggregate, Transform, Map and removed Bug Regression Functionality that used to work in a prior pandas version labels Aug 22, 2022

simonjayhawkins added this to the Contributions Welcome milestone Aug 23, 2022

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

fakio mentioned this issue Nov 27, 2022

BUG: groupby rolling apply giving strange index #49878

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: GH45912 breaks important functionality in rolling apply #47494

BUG: GH45912 breaks important functionality in rolling apply #47494

DRudel commented Jun 24, 2022

DRudel commented Jun 24, 2022

simonjayhawkins commented Jun 24, 2022

DRudel commented Jun 24, 2022

simonjayhawkins commented Jun 24, 2022

mroeschke commented Jun 24, 2022

mroeschke commented Jun 24, 2022

mroeschke commented Aug 22, 2022

BUG: GH45912 breaks important functionality in rolling apply #47494

BUG: GH45912 breaks important functionality in rolling apply #47494

Comments

DRudel commented Jun 24, 2022

Pandas version checks

Reproducible Example

Issue Description

Expected Behavior

Installed Versions

DRudel commented Jun 24, 2022

simonjayhawkins commented Jun 24, 2022

DRudel commented Jun 24, 2022

simonjayhawkins commented Jun 24, 2022

mroeschke commented Jun 24, 2022

mroeschke commented Jun 24, 2022

mroeschke commented Aug 22, 2022