-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: GH45912 breaks important functionality in rolling apply #47494
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I think the key functionality I'm trying to preserve can still be obtained in the current release by creating a single-column version of the dataframe with primary_key column and not using But I think there should at least be a change to the documentation of the |
Thanks @DRudel for the report.
this was an intentional change in #45925 that I think is considered a bug-fix not a change in behavior. (the release note The series passed to the apply function is now consistent with iterating over the grouper and does not lose the grouped on values which makes sense to me. @mroeschke ? list(my_df.rolling("2d", on="date"))
# [ values
# date
# 2000-03-11 10,
# values
# date
# 2000-03-11 10
# 2000-03-12 3,
# values
# date
# 2000-03-12 3
# 2000-03-13 11,
# values
# date
# 2000-03-12 3
# 2000-03-13 11
# 2000-03-13 12,
# values
# date
# 2000-03-12 3
# 2000-03-13 11
# 2000-03-13 12
# 2000-03-13 3] |
Thanks @simonjayhawkins for the quick reply. In the 1.3.1 implementation, the use of Things would be different is there was an easy facility to have rolling().apply() consume all the columns at once [as in groupby().apply()], but in a world where rolling().apply() acts on each column separately, the |
sounds reasonable. will label as a regression for now, pending feedback from @mroeschke |
Yeah this is a case I don't think was ever codified or thought of when this was implemented. When first implemented, the
So when #45912 was reported ( When generalizing this fix, I figured the passed pandas/pandas/core/window/rolling.py Lines 147 to 161 in dc36ce1
|
This is another argument why the final "on" is the index of the passed Series. We test this in |
Based on the above explanation, I think it would be good to document this case. |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Starting in 1.4.1 and after, the code throws a
No Numeric Types to Aggregate
error because an undocumented change was made to rolling in which usingon=col
parameter inrolling()
causes a laterapply()
to re-index to thecol
column, making the index of the series sent to the function the wrong index.The specific code doing this re-indexing is in
RollingExpandingMixin
:This loses important functionality because currently
rolling().apply()
can only process one column at a time, so if you want to be able to recover what the windows were in able to do some operation spanning all columns, you need to notate the values of the indexes for the windows passed in [using a closure or global variable] and use those indexes to reconstruct later what the windows were. With the possibility of repeated values, that may mean creating an integer "primary key" index to use as the index and usingon=my_date_column
to specify the windowing.The current code assumes that the user wants to use the
on
column for indexing, but if that were the case, the user could have simply re-indexed in the calling code.Expected Behavior
In 1.3.1 the code above produces expected behavior: a 2-column table with numeric index and the the final column indicating the first value of that index that starts each of the 5 windows [0, 0, 1, 1, 1]
0 2000-03-11 0.0
1 2000-03-12 0.0
2 2000-03-13 1.0
3 2000-03-13 1.0
4 2000-03-13 1.0
Installed Versions
1.4.3
The text was updated successfully, but these errors were encountered: