Improve poor performance of groupby resampling #14079


Closed
spillz opened this issue Aug 24, 2016 · 5 comments
Labels: Groupby, Performance (Memory or execution speed performance), Resample (resample method)



spillz commented Aug 24, 2016

When resampling transaction data where there are infrequent transactions for a large number of people, I get horrible performance. Compare the function annualize with the clunkier but much faster annualize2 below. Ideally, resample should be able to handle MultiIndex data and resample on one of the levels without needing to resort to groupby.

Code Sample, a copy-pastable example if possible

import pandas
import numpy as np

def annualize(df):
    def resample(df):
        df = df.set_index('DATE')
        df = df.resample('A').ffill()
        df = df.reset_index()
        return df
    df = df.groupby('ID', as_index = False).apply(resample)
    df['YEAR'] = df['DATE'].dt.year
    df = df.reset_index(drop = True)
    return df

def annualize2(df):
    #DF must be sorted by ID then DATE
    df = df.sort_values(['ID','DATE']).reset_index(drop = True)
    #Extract the year from the date and keep last observation from each year
    df['YEAR'] = df['DATE'].dt.year
    df = df.drop_duplicates(subset = ['ID','YEAR'], keep = 'last')
    #Get the YEAR and ID of the next record down
    ndf = df[['ID', 'YEAR']].shift(-1)
    #Want to duplicate records where the IDs of adjacent record pairs match and the spacing is more than a year
    match = (ndf['ID'] == df['ID'])
    reps = np.array(np.maximum(match*(ndf['YEAR'] - df['YEAR']),1), dtype = int)
    reps[-1] = 1 # the shifted YEAR is NaN in the last row, so force a single copy
    #Do the duplication
    df = df.loc[np.repeat(df.index, reps)]
    #Now we need to fix the year field
    cs = np.ones(len(df))
    cs[np.cumsum(reps[:-1])] -= reps[:-1]
    df['YEAR'] = df['YEAR'] + np.cumsum(cs)-1
    df = df.reset_index(drop = True)
    return df

if __name__ == '__main__':
    import StringIO
    import timeit
    #Generate some transaction record data
    data='''ID,DATE,STATUS
    1,3/1/2005,A
    1,4/30/2005,B
    1,6/30/2010,C
    1,8/20/2010,D
    '''
    df = pandas.read_csv(StringIO.StringIO(data))
    df['DATE'] = pandas.to_datetime(df['DATE'])
    print(annualize(df))
    print(annualize2(df))

    #then repeat the rows of the table to get a more realistic sample size for measuring performance
    df = df.loc[np.tile(df.index.values, 10000)]
    df['ID'] = np.repeat(np.arange(10000), 4)

    print('annualize')
    print(timeit.timeit('annualize(df)', setup = 'from __main__ import annualize, df', number = 1))
    print('annualize2')
    print(timeit.timeit('annualize2(df)', setup = 'from __main__ import annualize2, df', number = 1))

Expected Output

        DATE  ID STATUS  YEAR
0 2005-12-31   1      B  2005
1 2006-12-31   1      B  2006
2 2007-12-31   1      B  2007
3 2008-12-31   1      B  2008
4 2009-12-31   1      B  2009
5 2010-12-31   1      D  2010
   ID       DATE STATUS    YEAR
0   1 2005-04-30      B  2005.0
1   1 2005-04-30      B  2006.0
2   1 2005-04-30      B  2007.0
3   1 2005-04-30      B  2008.0
4   1 2005-04-30      B  2009.0
5   1 2010-08-20      D  2010.0
annualize
26.4788940619
annualize2
0.0899452309903

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 20.3
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None

jorisvandenbossche (Member) commented:

I didn't look at your example in detail, but note that the two functions return different results, so they do not seem to be equivalent.

jreback (Contributor) commented Aug 24, 2016

duplicate of #11296

pull-requests are welcome!

jreback closed this as completed Aug 24, 2016
jreback added the Groupby, Performance, and Resample labels Aug 24, 2016
jreback added this to the No action milestone Aug 24, 2016
jreback (Contributor) commented Aug 24, 2016

The actual grouping / resample is not what bounds performance here at all; rather, it is the fact that .fillna (and cousins) are not implemented in Cython, meaning this is basically a Python loop.
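
[Editor's note: a minimal timing sketch, not from the thread, that rebuilds the synthetic frame from the example and separates the cost of forming the groups from the cost of driving resample/ffill once per group from Python. It uses the same 'A' annual alias as the thread (newer pandas spells it 'YE').]

import numpy as np
import pandas as pd
import timeit

# Rebuild the synthetic frame from the example: 10,000 IDs, 4 rows each.
dates = pd.to_datetime(['2005-03-01', '2005-04-30', '2010-06-30', '2010-08-20'])
df = pd.DataFrame({'ID': np.repeat(np.arange(10000), 4),
                   'DATE': np.tile(dates, 10000),
                   'STATUS': np.tile(list('ABCD'), 10000)})

# Forming (and fully materializing) the groups is cheap.
print(timeit.timeit(lambda: list(df.groupby('ID')), number=1))

# Running the resample/ffill once per group from Python is what dominates.
print(timeit.timeit(
    lambda: df.set_index('DATE').groupby('ID').resample('A').ffill(),
    number=1))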

spillz (Author) commented Aug 26, 2016

Yeah, to be clear, I wasn't proposing the other code as a replacement, just using it to show the performance issue (which gets worse on even bigger data sets). I didn't think it was worth the extra fiddling to make the column ordering and dtypes match. The general application is converting transaction records into an unbalanced panel data set, which I can then run standard regression methods on.

I had assumed that it was mostly the groupby that was causing the slowness. Is that wrong?

Based on that assumption, I thought one solution for my use case would be for resample to support pad operations on non-unique indexes (or more complete support for resampling MultiIndexes, which is essentially what my DATE and ID pairs represent). Currently, if I do:

import pandas
import numpy as np

if __name__ == '__main__':
    import StringIO
    import timeit
    #Generate some transaction record data
    data='''ID,DATE,STATUS
    1,3/1/2005,A
    1,4/30/2005,B
    1,6/30/2010,C
    1,8/20/2010,D
    '''
    df = pandas.read_csv(StringIO.StringIO(data))
    df['DATE'] = pandas.to_datetime(df['DATE'])

    #then repeat the rows of the table to get a more realistic sample size for measuring performance
    df = df.loc[np.tile(df.index.values, 10000)]
    df['ID'] = np.repeat(np.arange(10000), 4)

    df = df.set_index('DATE')
    df = df.resample('A').ffill()
    print(df)

I get:

Traceback (most recent call last):
  File "C:\Users\#####\Source\PANDAS~2.PY", line 56, in <module>
    df = df.resample('A').ffill()
  File "c:\anaconda2\lib\site-packages\pandas\tseries\resample.py", line 424, in pad
    return self._upsample('pad', limit=limit)
  File "c:\anaconda2\lib\site-packages\pandas\tseries\resample.py", line 704, in _upsample
    limit=limit)
  File "c:\anaconda2\lib\site-packages\pandas\core\frame.py", line 2744, in reindex
    **kwargs)
  File "c:\anaconda2\lib\site-packages\pandas\core\generic.py", line 2229, in reindex
    fill_value, copy).__finalize__(self)
  File "c:\anaconda2\lib\site-packages\pandas\core\frame.py", line 2690, in _reindex_axes
    fill_value, limit, tolerance)
  File "c:\anaconda2\lib\site-packages\pandas\core\frame.py", line 2698, in _reindex_index
    tolerance=tolerance)
  File "c:\anaconda2\lib\site-packages\pandas\indexes\base.py", line 2342, in reindex
    raise ValueError("cannot reindex a non-unique index "
ValueError: cannot reindex a non-unique index with a method or limit

Whereas it would be nice if it were able to handle non-unique indexes by treating the non-ascending dates as boundaries (a rough sketch of that idea follows the example output below) and return something like:

         ID STATUS  YEAR
2005   1      B  2005
2006   1      B  2006
2007   1      B  2007
2008   1      B  2008
2009   1      B  2009
2010   1      D  2010
2005   2      B  2005
2006   2      B  2006
2007   2      B  2007
2008   2      B  2008
2009   2      B  2009
2010   2      D  2010
2005   3      B  2005
2006   3      B  2006
2007   3      B  2007
2008   3      B  2008
2009   3      B  2009
2010   3      D  2010
...
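
[Editor's note: a rough sketch of that boundary idea, added for illustration; this is not an existing pandas API. With the frame sorted by ID then DATE, any row whose DATE fails to increase relative to the previous row must start a new entity's block, so the blocks can be labelled without a groupby.]

import numpy as np
import pandas as pd

# Two entities sharing the same transaction dates, sorted by ID then DATE.
dates = pd.to_datetime(['2005-03-01', '2005-04-30', '2010-06-30', '2010-08-20'])
df = pd.DataFrame({'ID': np.repeat([1, 2], 4),
                   'DATE': np.tile(dates, 2),
                   'STATUS': list('ABCD') * 2})

# A DATE that does not increase cannot continue the same ascending run,
# so it opens a new block; cumsum turns the flags into block labels.
new_block = df['DATE'].diff() <= pd.Timedelta(0)
block_id = new_block.cumsum()
print(block_id.values)  # [0 0 0 0 1 1 1 1]

# Each block could then be resampled independently; equal dates straddling
# a boundary would still need ID to disambiguate, so this is only a sketch.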

But I guess that's moot if the problem is that reindex filling relies on looping.

As for pull requests, I found the code that does the reindexing / resampling to be pretty opaque, so I would need a few pointers to get started.


eromoe commented May 25, 2021

Number of rows: 960,000. It has already cost half an hour and is still hanging. What's wrong with groupby resampling?
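
[Editor's note: for the pure downsampling case, a commonly suggested workaround, added here and not from this thread, is to group by the entity and a pd.Grouper in a single pass, which keeps the aggregation in Cython instead of looping over groups in Python. It does not cover the ffill/upsampling pattern discussed above.]

import numpy as np
import pandas as pd

# Same synthetic shape as the original example: 10,000 IDs, 4 rows each.
dates = pd.to_datetime(['2005-03-01', '2005-04-30', '2010-06-30', '2010-08-20'])
df = pd.DataFrame({'ID': np.repeat(np.arange(10000), 4),
                   'DATE': np.tile(dates, 10000),
                   'STATUS': np.tile(list('ABCD'), 10000)})

# Last observation per (ID, year) in one vectorized groupby: no Python
# loop over the 10,000 IDs.
out = df.groupby(['ID', pd.Grouper(key='DATE', freq='A')]).last()
print(out.head())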
