BUG: Unexpected behaviour of rolling with apply on DataFrame #34965

Closed
2 of 3 tasks
oXwvdrbbj8S4wo9k8lSN opened this issue Jun 24, 2020 · 1 comment
Labels: Bug, Duplicate Report, Needs Triage

oXwvdrbbj8S4wo9k8lSN commented Jun 24, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


When executed on a DataFrame, rolling seems to process only certain columns. To demonstrate, I created a DataFrame with three columns (A, B, and C): the first contains timedeltas and the other two contain numbers. With rolling followed by, e.g., sum, only the numeric columns appear in the result.
Even stranger, in combination with apply, the function receives only the first numeric column as a Series, whereas I would have expected the corresponding slice of the DataFrame.

Code Sample, a copy-pastable example

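import pandas as pd

columns = ["A", "B", "C"]
index = list(range(10))
# 10**10 is interpreted as nanoseconds, i.e. 10 seconds, when converted to a timedelta
data = [[10**10, 2, 3]] * len(index)
df = pd.DataFrame(columns=columns, index=index, data=data)
# Convert column A to timedeltas; B and C remain integer columns
df["A"] = pd.to_timedelta(df["A"])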

The resulting df will look like this:

         A  B  C
0 00:00:10  2  3
1 00:00:10  2  3
2 00:00:10  2  3
3 00:00:10  2  3
4 00:00:10  2  3
5 00:00:10  2  3
6 00:00:10  2  3
7 00:00:10  2  3
8 00:00:10  2  3
9 00:00:10  2  3
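As a quick sanity check on this setup, the column dtypes can be inspected directly. The following is a minimal sketch that rebuilds the same frame and looks at the NumPy dtype kind of each column ("m" for timedelta, "i" for signed integer); the variable names `a_kind`, `b_kind`, and `c_kind` are only illustrative.

```python
import pandas as pd

# Rebuild the example frame from the issue
df = pd.DataFrame(
    columns=["A", "B", "C"],
    index=list(range(10)),
    data=[[10**10, 2, 3]] * 10,
)
df["A"] = pd.to_timedelta(df["A"])

# NumPy dtype kinds: "m" is timedelta, "i" is signed integer
a_kind = df["A"].dtype.kind
b_kind = df["B"].dtype.kind
c_kind = df["C"].dtype.kind
print(df.dtypes)
```

So A is the only non-numeric column, and B and C are integers that rolling converts to floats in its result.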

Applying rolling with sum like this:

df.rolling(window=2).sum()

will result in the following output, in which the first column is missing:

     B    C
0  NaN  NaN
1  4.0  6.0
2  4.0  6.0
3  4.0  6.0
4  4.0  6.0
5  4.0  6.0
6  4.0  6.0
7  4.0  6.0
8  4.0  6.0
9  4.0  6.0

To demonstrate the problem with apply, I created a custom function that simply returns the number of columns (since I expected a DataFrame to be passed to the function):

def get_num_columns(sub_df):
    # Print the window to show what rolling actually passes in
    print(sub_df)
    return len(sub_df.columns)

df.rolling(window=2).apply(get_num_columns, raw=False)

This produces the exception "AttributeError: 'Series' object has no attribute 'columns'" and the following printout:

0    2.0
1    2.0
dtype: float64

Problem description

I would expect in both cases that the windowed DataFrame with all columns is used within the function (either sum or get_num_columns).

Expected Output

In the case of sum, I would expect either an exception telling the user that only numeric columns are accepted or, preferably, the following output:

         A    B    C
0      NaT  NaN  NaN
1 00:00:20  4.0  6.0
2 00:00:20  4.0  6.0
3 00:00:20  4.0  6.0
4 00:00:20  4.0  6.0
5 00:00:20  4.0  6.0
6 00:00:20  4.0  6.0
7 00:00:20  4.0  6.0
8 00:00:20  4.0  6.0
9 00:00:20  4.0  6.0
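A possible workaround that produces this table is sketched below. It is not a fix in pandas itself: it assumes that representing the timedeltas as float seconds before rolling, and converting back afterwards, is acceptable for the data at hand. The intermediate names `numeric` and `rolled` are illustrative.

```python
import pandas as pd

# Same data as in the issue: A is 10 seconds per row, B and C are integers
df = pd.DataFrame({"A": pd.to_timedelta([10**10] * 10), "B": 2, "C": 3})

# Represent the timedeltas as float seconds so that rolling treats A as numeric
numeric = df.assign(A=df["A"].dt.total_seconds())
rolled = numeric.rolling(window=2).sum()

# Convert the summed seconds back to timedeltas; NaN becomes NaT
rolled["A"] = pd.to_timedelta(rolled["A"], unit="s")
print(rolled)
```

The round trip through floats can lose sub-microsecond precision for large values, so this is only a sketch for cases where that matters little.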

In the case of apply, I would have expected a DataFrame as input to the function. Therefore, the output of the function (without the prints) should be:

0    3
1    3
2    3
3    3
4    3
5    3
6    3
7    3
8    3
9    3
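One way to approximate this behaviour with 1d rolling is to roll over row positions and slice the DataFrame inside the function. This is a hedged sketch, not a table-wide rolling.apply: the helper `apply_on_rows` and the `row_positions` Series are hypothetical names introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": pd.to_timedelta([10**10] * 10), "B": 2, "C": 3})


def get_num_columns(sub_df):
    return len(sub_df.columns)


def apply_on_rows(positions):
    # With raw=True, positions is a float ndarray of row positions in the window
    return get_num_columns(df.iloc[positions.astype(int)])


# Roll over row positions rather than over the frame itself
row_positions = pd.Series(range(len(df)), index=df.index)
result = row_positions.rolling(window=2).apply(apply_on_rows, raw=True)
print(result)
```

With the default min_periods the first row is NaN because its window is incomplete, and the remaining values are floats (3.0) since rolling.apply returns float results; passing min_periods=1 would also evaluate the first, shorter window.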

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.76-linuxkit
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.5
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.3.1.post20200616
Cython : 0.29.20
pytest : 5.4.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.1
html5lib : None
pymysql : 0.9.3
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.1
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pytest : 5.4.3
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : 3.6.1
tabulate : None
xarray : 0.15.1
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : 0.48.0

oXwvdrbbj8S4wo9k8lSN added the Bug and Needs Triage labels Jun 24, 2020
@TomAugspurger
Contributor

I think that rolling.apply is defined to take 1d input (ndarray or Series). We don't have a table-wide rolling.apply (#15095).

We have #23002 for non-numeric data in rolling, which I think covers all your issues.
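To illustrate the 1d behaviour described above, here is a small sketch that records the type of object the function receives under both raw settings. It uses an all-float frame so that no column is excluded; `make_recorder` and `seen_types` are hypothetical helpers for this example only.

```python
import pandas as pd

df = pd.DataFrame({"B": [2.0] * 4, "C": [3.0] * 4})

# Collect the type names of the windows passed to the function
seen_types = {"raw=False": set(), "raw=True": set()}


def make_recorder(label):
    def record(window):
        seen_types[label].add(type(window).__name__)
        return window.sum()

    return record


# The function is called once per column and window, never with a DataFrame
as_series = df.rolling(window=2).apply(make_recorder("raw=False"), raw=False)
as_ndarray = df.rolling(window=2).apply(make_recorder("raw=True"), raw=True)
print(seen_types)
```

In both cases the function only ever sees one column's window at a time, which is why len(sub_df.columns) fails in the original example.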

TomAugspurger added the Duplicate Report label Sep 4, 2020