Skip to content

DataFrame.rolling does nothing when values are in a list #18129

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
tdpetrou opened this issue Nov 6, 2017 · 13 comments
Open

DataFrame.rolling does nothing when values are in a list #18129

tdpetrou opened this issue Nov 6, 2017 · 13 comments
Labels
Enhancement Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). Window rolling, ewma, expanding

Comments

@tdpetrou
Copy link
Contributor

tdpetrou commented Nov 6, 2017

This issue is based on data from this Stack Overflow post.

First, I get all the values for each day into a list using resample. I then try and apply a five-day rolling function and the original DataFrame is returned. No calculation happens.

>>> d = {'favorable': [0.48, 0.51, 0.56, 0.51, 0.48, 0.46, 0.48, 0.49, 0.53, 0.51, 0.49, 0.47, 0.49, 0.53, 0.47, 0.49, 0.52, 0.5, 0.51, 0.51], 
         'unfavorable': [0.49, 0.48, 0.4, 0.47, 0.49, 0.46, 0.49, 0.48, 0.45, 0.45, 0.49, 0.47, 0.45, 0.39, 0.44, 0.48, 0.46, 0.47, 0.46, 0.41], 
         'other': [0.03, 0.02, 0.04, 0.02, 0.04, 0.09, 0.03, 0.03, 0.02, 0.04, 0.03, 0.05, 0.06, 0.0, 0.08, 0.03, 0.01, 0.03, 0.02, 0.0]}
>>> index = pd.DatetimeIndex(['2012-10-25', '2012-10-25', '2012-10-26', '2012-10-27', '2012-10-28', '2012-10-28', '2012-10-28', '2012-10-28', '2012-10-30', '2012-11-01', '2012-11-01', '2012-11-01', '2012-11-03', '2012-11-04', '2012-11-04', '2012-11-04', '2012-11-04', '2012-11-04', '2012-11-05', '2012-11-07'])
>>> df = pd.DataFrame(d, index=index)
>>> df

            favorable  other  unfavorable
2012-10-25       0.48   0.03         0.49
2012-10-25       0.51   0.02         0.48
2012-10-26       0.56   0.04         0.40
2012-10-27       0.51   0.02         0.47
2012-10-28       0.48   0.04         0.49
2012-10-28       0.46   0.09         0.46
2012-10-28       0.48   0.03         0.49
2012-10-28       0.49   0.03         0.48
2012-10-30       0.53   0.02         0.45
2012-11-01       0.51   0.04         0.45
2012-11-01       0.49   0.03         0.49
2012-11-01       0.47   0.05         0.47
2012-11-03       0.49   0.06         0.45
2012-11-04       0.53   0.00         0.39
2012-11-04       0.47   0.08         0.44
2012-11-04       0.49   0.03         0.48
2012-11-04       0.52   0.01         0.46
2012-11-04       0.50   0.03         0.47
2012-11-05       0.51   0.02         0.46
2012-11-07       0.51   0.00         0.41

>>> df1 = df.resample('D').apply(lambda x: x.tolist())

                                favorable                          other                     unfavorable
2012-10-25                   [0.48, 0.51]                   [0.03, 0.02]                    [0.49, 0.48]
2012-10-26                         [0.56]                         [0.04]                           [0.4]
2012-10-27                         [0.51]                         [0.02]                          [0.47] 
2012-10-28       [0.48, 0.46, 0.48, 0.49]       [0.04, 0.09, 0.03, 0.03]        [0.49, 0.46, 0.49, 0.48]  
2012-10-29                             []                             []                              []
2012-10-30                         [0.53]                         [0.02]                          [0.45]
2012-10-31                             []                             []                              []
2012-11-01             [0.51, 0.49, 0.47]             [0.04, 0.03, 0.05]              [0.45, 0.49, 0.47] 
2012-11-02                             []                             []                              []
2012-11-03                         [0.49]                         [0.06]                          [0.45]
2012-11-04  [0.53, 0.47, 0.49, 0.52, 0.5]  [0.0, 0.08, 0.03, 0.01, 0.03]  [0.39, 0.44, 0.48, 0.46, 0.47]
2012-11-05                         [0.51]                         [0.02]                          [0.46]
2012-11-06                             []                             []                              []
2012-11-07                         [0.51]                          [0.0]                          [0.41]

>>> df1.rolling('5D').count().equals(df1)
True

>>> df1.rolling('5D').sum().equals(df1)
True

>>> df1.rolling('5D').apply(lambda x: x+5).equals(df1)
True

Problem description

To give more context, I wanted to find the five-day rolling average, but the rolling method does not include all the rows for the current date if there are multiple rows with the same date. See this output:

>>> df.rolling('5D').count()
            favorable  other  unfavorable
2012-10-25        1.0    1.0          1.0
2012-10-25        2.0    2.0          2.0
2012-10-26        3.0    3.0          3.0
2012-10-27        4.0    4.0          4.0
2012-10-28        5.0    5.0          5.0
2012-10-28        6.0    6.0          6.0
2012-10-28        7.0    7.0          7.0
2012-10-28        8.0    8.0          8.0
2012-10-30        7.0    7.0          7.0
2012-11-01        6.0    6.0          6.0
2012-11-01        7.0    7.0          7.0
2012-11-01        8.0    8.0          8.0
2012-11-03        5.0    5.0          5.0
2012-11-04        5.0    5.0          5.0
2012-11-04        6.0    6.0          6.0
2012-11-04        7.0    7.0          7.0
2012-11-04        8.0    8.0          8.0
2012-11-04        9.0    9.0          9.0
2012-11-05       10.0   10.0         10.0
2012-11-07        8.0    8.0          8.0

The first row for 10-25-2012 has a count of 1 and the second has a count of 2. You can see this pattern continue for all rows that have the same date. Because of this, I decided to group all the values of the same day into a list with resample and then use rolling on that frame to get the desired result. Strangely, the original DataFrame is being returned when there are lists as values.

Also, I think rolling has lots of room for improvement. I think it would be great to have the following:

  • Inclusive of all rows with the same exact timestamp
  • Customize window size in either direction for any amount of time. For instance, it could find a rolling average for the 4 previous days along with the following 7 days
  • Make it equivalent to groupby and resample - have same methods (there is no size method) and have it pass in pandas objects to the agg/apply methods. Currently, it passes in numpy arrays.

I believe SAS has the capability to do the custom window size.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added Enhancement Resample resample method labels Nov 6, 2017
@gfyoung
Copy link
Member

gfyoung commented Nov 6, 2017

@tdpetrou : Thanks for reporting this! Wow, this is a lot to unpack here:

  • The output of the df.rolling('5D').count() is correct. The window is right-edged. We don't have an implementation yet for left-edged windows. If you would like to add that, you're more than welcome!

  • The "following 7 days" is essentially asking for left-edged windows, so my previous point applies.

  • You can change inclusion by passing in the closed parameter.

  • Regarding df1, yikes! Yeah, something looks buggy there IINM. Have a look around and see if there's anything funny in our code that jumps out.

@gfyoung gfyoung added the Bug label Nov 6, 2017
@jreback
Copy link
Contributor

jreback commented Nov 6, 2017

lists are not first class as elements; there is almost no support in rolling for what you are trying to do; nor likely to add as this is non idiomatic and non performant

@jreback jreback removed the Bug label Nov 6, 2017
@jreback
Copy link
Contributor

jreback commented Nov 6, 2017

this is also a duplicate issue for much of this ; @gfyoung if you would find would be appreciated

@tdpetrou
Copy link
Contributor Author

tdpetrou commented Nov 6, 2017

@jreback There is no other easy way to do what I am trying to do. rolling needs a lot of improvement to be anywhere near capable of what SAS has to offer.

@jreback
Copy link
Contributor

jreback commented Nov 6, 2017

and why doesn’t rolling.mean do it? that seems exactly what you are trying to do

@tdpetrou
Copy link
Contributor Author

tdpetrou commented Nov 6, 2017

rolling.mean is not close to as functional as I need it. It only uses the current row and not all the rows with the current date whenever there are multiple rows with the same date.

df.rolling('5D').mean()

            favorable     other  unfavorable
2012-10-25   0.480000  0.030000     0.490000
2012-10-25   0.495000  0.025000     0.485000
2012-10-26   0.516667  0.030000     0.456667
2012-10-27   0.515000  0.027500     0.460000
2012-10-28   0.508000  0.030000     0.466000
2012-10-28   0.500000  0.040000     0.465000
2012-10-28   0.497143  0.038571     0.468571
2012-10-28   0.496250  0.037500     0.470000
2012-10-30   0.501429  0.038571     0.462857
2012-11-01   0.491667  0.041667     0.470000
2012-11-01   0.491429  0.040000     0.472857
2012-11-01   0.488750  0.041250     0.472500
2012-11-03   0.498000  0.040000     0.462000
2012-11-04   0.498000  0.036000     0.450000
2012-11-04   0.493333  0.043333     0.448333
2012-11-04   0.492857  0.041429     0.452857
2012-11-04   0.496250  0.037500     0.453750
2012-11-04   0.496667  0.036667     0.455556
2012-11-05   0.498000  0.035000     0.456000
2012-11-07   0.502500  0.028750     0.445000

You see how all the days that have the same date have different numbers? There are four rows for October 28 and they each have a different mean. I can't specify to use all the days of the current date. I also can't specify to look up at the next n days and simultaneously the previous m days. Nor is there a size method, and apply gets sent a one dimensional array. Seems very broken to me.

@jreback
Copy link
Contributor

jreback commented Nov 6, 2017

of course this is how rolling works
you should simply resample first to a daily basis

@tdpetrou
Copy link
Contributor Author

tdpetrou commented Nov 6, 2017

This wouldn't work if you wanted an evenly-weighted mean, which is why I wanted to collect all the values together first in a list. You could do it in a roundabout way with df.resample('D').agg(['sum', 'count']) and then do it with rolling. Although, you would not be able to use apply because it only accepts a single column at a time as a numpy array (very strange). You would have to precalculate some total sums and weights. It would be a mess. df.rolling('5D').mean() would be way, way nicer.

@letalvoj
Copy link

I have a simmilar problem, but with sets:

import pandas as pd
pd.Series(data=[{1},{2},{3},{4}], index=[1,2,3,4]).rolling(2).apply(list)

# yields:
# 1    {1}
# 2    {2}
# 3    {3}
# 4    {4}
# dtype: object

# yet I should be something like:
# 1    None
# 2    [{1},{2}]
# 3    [{2},{3}]
# 4    [{3},{4}]
# dtype: object

@jreback if rolling over non first class elements is not supported pandas should either throw warning of fail with an error. IMO returning an unexpected invalid result is not correct behaviour. I spend a lot of time trying to find out what am I doing wrong that the rolling functions does not return what I expect Do you agree?

@JoranDox
Copy link

JoranDox commented Feb 15, 2018

Having the same issue with string data, no mention of what is wrong anywhere, no mention of unsupported datatypes in the docs, and no error, just wrong output.

I disagree very much with the "resample" tag and request the "bug" tag be added again.

@tdpetrou
Copy link
Contributor Author

@NightFantomJ2 I am building a similar library to pandas called dexplo that will allow you to specify a window size in either direction for all rolling operations. For instance, you could want a window that was from 3 to the left to 5 to the right, or from 5 to the right to 10 to the right. It's about 1 month from official release.

@ron819
Copy link

ron819 commented Jan 20, 2019

@tdpetrou was there an official release?

@tdpetrou
Copy link
Contributor Author

tdpetrou commented Jan 21, 2019 via email

@mroeschke mroeschke removed the Resample resample method label May 8, 2020
@mroeschke mroeschke added the Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). label Jun 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). Window rolling, ewma, expanding
Projects
None yet
Development

No branches or pull requests

8 participants