DataFrame.rolling does nothing when values are in a list #18129

tdpetrou · 2017-11-06T02:06:47Z

This issue is based on data from this Stack Overflow post.

First, I get all the values for each day into a list using resample. I then try and apply a five-day rolling function and the original DataFrame is returned. No calculation happens.

>>> d = {'favorable': [0.48, 0.51, 0.56, 0.51, 0.48, 0.46, 0.48, 0.49, 0.53, 0.51, 0.49, 0.47, 0.49, 0.53, 0.47, 0.49, 0.52, 0.5, 0.51, 0.51], 
         'unfavorable': [0.49, 0.48, 0.4, 0.47, 0.49, 0.46, 0.49, 0.48, 0.45, 0.45, 0.49, 0.47, 0.45, 0.39, 0.44, 0.48, 0.46, 0.47, 0.46, 0.41], 
         'other': [0.03, 0.02, 0.04, 0.02, 0.04, 0.09, 0.03, 0.03, 0.02, 0.04, 0.03, 0.05, 0.06, 0.0, 0.08, 0.03, 0.01, 0.03, 0.02, 0.0]}
>>> index = pd.DatetimeIndex(['2012-10-25', '2012-10-25', '2012-10-26', '2012-10-27', '2012-10-28', '2012-10-28', '2012-10-28', '2012-10-28', '2012-10-30', '2012-11-01', '2012-11-01', '2012-11-01', '2012-11-03', '2012-11-04', '2012-11-04', '2012-11-04', '2012-11-04', '2012-11-04', '2012-11-05', '2012-11-07'])
>>> df = pd.DataFrame(d, index=index)
>>> df

            favorable  other  unfavorable
2012-10-25       0.48   0.03         0.49
2012-10-25       0.51   0.02         0.48
2012-10-26       0.56   0.04         0.40
2012-10-27       0.51   0.02         0.47
2012-10-28       0.48   0.04         0.49
2012-10-28       0.46   0.09         0.46
2012-10-28       0.48   0.03         0.49
2012-10-28       0.49   0.03         0.48
2012-10-30       0.53   0.02         0.45
2012-11-01       0.51   0.04         0.45
2012-11-01       0.49   0.03         0.49
2012-11-01       0.47   0.05         0.47
2012-11-03       0.49   0.06         0.45
2012-11-04       0.53   0.00         0.39
2012-11-04       0.47   0.08         0.44
2012-11-04       0.49   0.03         0.48
2012-11-04       0.52   0.01         0.46
2012-11-04       0.50   0.03         0.47
2012-11-05       0.51   0.02         0.46
2012-11-07       0.51   0.00         0.41

>>> df1 = df.resample('D').apply(lambda x: x.tolist())

                                favorable                          other                     unfavorable
2012-10-25                   [0.48, 0.51]                   [0.03, 0.02]                    [0.49, 0.48]
2012-10-26                         [0.56]                         [0.04]                           [0.4]
2012-10-27                         [0.51]                         [0.02]                          [0.47] 
2012-10-28       [0.48, 0.46, 0.48, 0.49]       [0.04, 0.09, 0.03, 0.03]        [0.49, 0.46, 0.49, 0.48]  
2012-10-29                             []                             []                              []
2012-10-30                         [0.53]                         [0.02]                          [0.45]
2012-10-31                             []                             []                              []
2012-11-01             [0.51, 0.49, 0.47]             [0.04, 0.03, 0.05]              [0.45, 0.49, 0.47] 
2012-11-02                             []                             []                              []
2012-11-03                         [0.49]                         [0.06]                          [0.45]
2012-11-04  [0.53, 0.47, 0.49, 0.52, 0.5]  [0.0, 0.08, 0.03, 0.01, 0.03]  [0.39, 0.44, 0.48, 0.46, 0.47]
2012-11-05                         [0.51]                         [0.02]                          [0.46]
2012-11-06                             []                             []                              []
2012-11-07                         [0.51]                          [0.0]                          [0.41]

>>> df1.rolling('5D').count().equals(df1)
True

>>> df1.rolling('5D').sum().equals(df1)
True

>>> df1.rolling('5D').apply(lambda x: x+5).equals(df1)
True

Problem description

To give more context, I wanted to find the five-day rolling average, but the rolling method does not include all the rows for the current date if there are multiple rows with the same date. See this output:

>>> df.rolling('5D').count()
            favorable  other  unfavorable
2012-10-25        1.0    1.0          1.0
2012-10-25        2.0    2.0          2.0
2012-10-26        3.0    3.0          3.0
2012-10-27        4.0    4.0          4.0
2012-10-28        5.0    5.0          5.0
2012-10-28        6.0    6.0          6.0
2012-10-28        7.0    7.0          7.0
2012-10-28        8.0    8.0          8.0
2012-10-30        7.0    7.0          7.0
2012-11-01        6.0    6.0          6.0
2012-11-01        7.0    7.0          7.0
2012-11-01        8.0    8.0          8.0
2012-11-03        5.0    5.0          5.0
2012-11-04        5.0    5.0          5.0
2012-11-04        6.0    6.0          6.0
2012-11-04        7.0    7.0          7.0
2012-11-04        8.0    8.0          8.0
2012-11-04        9.0    9.0          9.0
2012-11-05       10.0   10.0         10.0
2012-11-07        8.0    8.0          8.0

The first row for 10-25-2012 has a count of 1 and the second has a count of 2. You can see this pattern continue for all rows that have the same date. Because of this, I decided to group all the values of the same day into a list with resample and then use rolling on that frame to get the desired result. Strangely, the original DataFrame is being returned when there are lists as values.

Also, I think rolling has lots of room for improvement. I think it would be great to have the following:

Inclusive of all rows with the same exact timestamp
Customize window size in either direction for any amount of time. For instance, it could find a rolling average for the 4 previous days along with the following 7 days
Make it equivalent to groupby and resample - have same methods (there is no size method) and have it pass in pandas objects to the agg/apply methods. Currently, it passes in numpy arrays.

I believe SAS has the capability to do the custom window size.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

gfyoung · 2017-11-06T04:38:12Z

@tdpetrou : Thanks for reporting this! Wow, this is a lot to unpack here:

The output of the df.rolling('5D').count() is correct. The window is right-edged. We don't have an implementation yet for left-edged windows. If you would like to add that, you're more than welcome!
The "following 7 days" is essentially asking for left-edged windows, so my previous point applies.
You can change inclusion by passing in the closed parameter.
Regarding df1, yikes! Yeah, something looks buggy there IINM. Have a look around and see if there's anything funny in our code that jumps out.

jreback · 2017-11-06T21:48:58Z

lists are not first class as elements; there is almost no support in rolling for what you are trying to do; nor likely to add as this is non idiomatic and non performant

jreback · 2017-11-06T21:51:25Z

this is also a duplicate issue for much of this ; @gfyoung if you would find would be appreciated

tdpetrou · 2017-11-06T22:33:40Z

@jreback There is no other easy way to do what I am trying to do. rolling needs a lot of improvement to be anywhere near capable of what SAS has to offer.

jreback · 2017-11-06T22:37:49Z

and why doesn’t rolling.mean do it? that seems exactly what you are trying to do

tdpetrou · 2017-11-06T22:49:28Z

rolling.mean is not close to as functional as I need it. It only uses the current row and not all the rows with the current date whenever there are multiple rows with the same date.

df.rolling('5D').mean()

            favorable     other  unfavorable
2012-10-25   0.480000  0.030000     0.490000
2012-10-25   0.495000  0.025000     0.485000
2012-10-26   0.516667  0.030000     0.456667
2012-10-27   0.515000  0.027500     0.460000
2012-10-28   0.508000  0.030000     0.466000
2012-10-28   0.500000  0.040000     0.465000
2012-10-28   0.497143  0.038571     0.468571
2012-10-28   0.496250  0.037500     0.470000
2012-10-30   0.501429  0.038571     0.462857
2012-11-01   0.491667  0.041667     0.470000
2012-11-01   0.491429  0.040000     0.472857
2012-11-01   0.488750  0.041250     0.472500
2012-11-03   0.498000  0.040000     0.462000
2012-11-04   0.498000  0.036000     0.450000
2012-11-04   0.493333  0.043333     0.448333
2012-11-04   0.492857  0.041429     0.452857
2012-11-04   0.496250  0.037500     0.453750
2012-11-04   0.496667  0.036667     0.455556
2012-11-05   0.498000  0.035000     0.456000
2012-11-07   0.502500  0.028750     0.445000

You see how all the days that have the same date have different numbers? There are four rows for October 28 and they each have a different mean. I can't specify to use all the days of the current date. I also can't specify to look up at the next n days and simultaneously the previous m days. Nor is there a size method, and apply gets sent a one dimensional array. Seems very broken to me.

jreback · 2017-11-06T22:58:58Z

of course this is how rolling works
you should simply resample first to a daily basis

tdpetrou · 2017-11-06T23:07:44Z

This wouldn't work if you wanted an evenly-weighted mean, which is why I wanted to collect all the values together first in a list. You could do it in a roundabout way with df.resample('D').agg(['sum', 'count']) and then do it with rolling. Although, you would not be able to use apply because it only accepts a single column at a time as a numpy array (very strange). You would have to precalculate some total sums and weights. It would be a mess. df.rolling('5D').mean() would be way, way nicer.

letalvoj · 2017-11-29T09:07:37Z

I have a simmilar problem, but with sets:

import pandas as pd
pd.Series(data=[{1},{2},{3},{4}], index=[1,2,3,4]).rolling(2).apply(list)

# yields:
# 1    {1}
# 2    {2}
# 3    {3}
# 4    {4}
# dtype: object

# yet I should be something like:
# 1    None
# 2    [{1},{2}]
# 3    [{2},{3}]
# 4    [{3},{4}]
# dtype: object

@jreback if rolling over non first class elements is not supported pandas should either throw warning of fail with an error. IMO returning an unexpected invalid result is not correct behaviour. I spend a lot of time trying to find out what am I doing wrong that the rolling functions does not return what I expect Do you agree?

JoranDox · 2018-02-15T14:52:34Z

Having the same issue with string data, no mention of what is wrong anywhere, no mention of unsupported datatypes in the docs, and no error, just wrong output.

I disagree very much with the "resample" tag and request the "bug" tag be added again.

tdpetrou · 2018-02-15T15:14:51Z

@NightFantomJ2 I am building a similar library to pandas called dexplo that will allow you to specify a window size in either direction for all rolling operations. For instance, you could want a window that was from 3 to the left to 5 to the right, or from 5 to the right to 10 to the right. It's about 1 month from official release.

ron819 · 2019-01-20T10:14:10Z

@tdpetrou was there an official release?

tdpetrou · 2019-01-21T16:34:05Z

Not yet. Should happen middle of 2019.

…

On Sun, Jan 20, 2019 at 5:14 AM ron819 ***@***.***> wrote: @tdpetrou <https://github.com/tdpetrou> was there an official release? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#18129 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AG5FfxHLpJYFkg6muh0hltzvqSYFA8V2ks5vFEGDgaJpZM4QSpSo> .

gfyoung added Enhancement Resample resample method labels Nov 6, 2017

gfyoung added the Bug label Nov 6, 2017

jreback removed the Bug label Nov 6, 2017

jorisvandenbossche mentioned this issue Mar 17, 2018

Confusing - rolling min( ) function "accepting" objects #20244

Closed

WillAyd added the Window rolling, ewma, expanding label Oct 5, 2018

WillAyd mentioned this issue Oct 5, 2018

ENH: allow rolling with non-numerical (eg string) data #23002

Open

mroeschke removed the Resample resample method label May 8, 2020

mroeschke added the Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). label Jun 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DataFrame.rolling does nothing when values are in a list #18129

DataFrame.rolling does nothing when values are in a list #18129

tdpetrou commented Nov 6, 2017

INSTALLED VERSIONS

gfyoung commented Nov 6, 2017

jreback commented Nov 6, 2017

jreback commented Nov 6, 2017

tdpetrou commented Nov 6, 2017

jreback commented Nov 6, 2017

tdpetrou commented Nov 6, 2017

jreback commented Nov 6, 2017

tdpetrou commented Nov 6, 2017

letalvoj commented Nov 29, 2017

JoranDox commented Feb 15, 2018 •

edited

Loading

tdpetrou commented Feb 15, 2018

ron819 commented Jan 20, 2019

tdpetrou commented Jan 21, 2019 via email

DataFrame.rolling does nothing when values are in a list #18129

DataFrame.rolling does nothing when values are in a list #18129

Comments

tdpetrou commented Nov 6, 2017

Problem description

Output of pd.show_versions()

INSTALLED VERSIONS

gfyoung commented Nov 6, 2017

jreback commented Nov 6, 2017

jreback commented Nov 6, 2017

tdpetrou commented Nov 6, 2017

jreback commented Nov 6, 2017

tdpetrou commented Nov 6, 2017

jreback commented Nov 6, 2017

tdpetrou commented Nov 6, 2017

letalvoj commented Nov 29, 2017

JoranDox commented Feb 15, 2018 • edited Loading

tdpetrou commented Feb 15, 2018

ron819 commented Jan 20, 2019

tdpetrou commented Jan 21, 2019 via email

Output of `pd.show_versions()`

JoranDox commented Feb 15, 2018 •

edited

Loading