Improve poor performance of groupby resampling #14079


Closed
spillz opened this issue Aug 24, 2016 · 5 comments
Labels: Groupby, Performance (Memory or execution speed performance), Resample (resample method)



spillz commented Aug 24, 2016

When resampling transaction data where there are infrequent transactions for a large number of people, I get horrible performance. Compare the function annualize with the clunkier but much faster annualize2 below. Ideally, resample should be able to handle MultiIndex data and resample on one of the levels without needing to resort to groupby.

Code Sample, a copy-pastable example if possible

import pandas
import numpy as np

def annualize(df):
    def resample(df):
        df = df.set_index('DATE')
        df = df.resample('A').ffill()
        df = df.reset_index()
        return df
    df = df.groupby('ID', as_index = False).apply(resample)
    df['YEAR'] = df['DATE'].dt.year
    df = df.reset_index(drop = True)
    return df

def annualize2(df):
    #DF must be sorted by ID then DATE
    df = df.sort_values(['ID','DATE']).reset_index(drop = True)
    #Extract the year from the date and keep last observation from each year
    df['YEAR'] = df['DATE'].dt.year
    df = df.drop_duplicates(subset = ['ID','YEAR'], keep = 'last')
    #Get the YEAR and ID of the next record down
    ndf = df[['ID', 'YEAR']].shift(-1)
    #Want to duplicate records where the IDs of adjacent record pairs match and the spacing is more than a year
    match = (ndf['ID'] == df['ID'])
    reps = np.array(np.maximum(match*(ndf['YEAR'] - df['YEAR']),1), dtype = int)
    reps[-1] = 1 # the shifted YEAR is NaN in the last row, so force a single copy
    #Do the duplication
    df = df.loc[np.repeat(df.index, reps)]
    #Now we need to fix the year field
    cs = np.ones(len(df))
    cs[np.cumsum(reps[:-1])] -= reps[:-1]
    df['YEAR'] = df['YEAR'] + np.cumsum(cs)-1
    df = df.reset_index(drop = True)
    return df

if __name__ == '__main__':
    import StringIO
    import timeit
    #Generate some transaction record data
    data='''ID,DATE,STATUS
    1,3/1/2005,A
    1,4/30/2005,B
    1,6/30/2010,C
    1,8/20/2010,D
    '''
    df = pandas.read_csv(StringIO.StringIO(data))
    df['DATE'] = pandas.to_datetime(df['DATE'])
    print(annualize(df))
    print(annualize2(df))

    #then repeat the rows of the table to get a more realistic sample size for measuring performance
    df = df.loc[np.tile(df.index.values, 10000)]
    df['ID'] = np.repeat(np.arange(10000), 4)

    print('annualize')
    print(timeit.timeit('annualize(df)', setup = 'from __main__ import annualize, df', number = 1))
    print('annualize2')
    print(timeit.timeit('annualize2(df)', setup = 'from __main__ import annualize2, df', number = 1))

Expected Output

        DATE  ID STATUS  YEAR
0 2005-12-31   1      B  2005
1 2006-12-31   1      B  2006
2 2007-12-31   1      B  2007
3 2008-12-31   1      B  2008
4 2009-12-31   1      B  2009
5 2010-12-31   1      D  2010
   ID       DATE STATUS    YEAR
0   1 2005-04-30      B  2005.0
1   1 2005-04-30      B  2006.0
2   1 2005-04-30      B  2007.0
3   1 2005-04-30      B  2008.0
4   1 2005-04-30      B  2009.0
5   1 2010-08-20      D  2010.0
annualize
26.4788940619
annualize2
0.0899452309903

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 20.3
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None

jorisvandenbossche (Member) commented:

I didn't look at your example in detail, but note that the two functions return different results, so they do not seem to be equivalent.

jreback (Contributor) commented Aug 24, 2016

duplicate of #11296

pull-requests are welcome!

jreback closed this as completed Aug 24, 2016
jreback added the Groupby, Performance, and Resample labels Aug 24, 2016
jreback added this to the No action milestone Aug 24, 2016
jreback (Contributor) commented Aug 24, 2016

The actual grouping / resample is not what bounds performance here at all; rather, it is the fact that .fillna (and cousins) are not implemented in Cython, meaning this is basically a Python loop.
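
[Editor's note: a minimal timing sketch, not from the thread, that rebuilds the synthetic frame from the example and separates the cost of forming the groups from the cost of driving resample/ffill once per group from Python. It uses the same 'A' annual alias as the thread (newer pandas spells it 'YE').]

import numpy as np
import pandas as pd
import timeit

# Rebuild the synthetic frame from the example: 10,000 IDs, 4 rows each.
dates = pd.to_datetime(['2005-03-01', '2005-04-30', '2010-06-30', '2010-08-20'])
df = pd.DataFrame({'ID': np.repeat(np.arange(10000), 4),
                   'DATE': np.tile(dates, 10000),
                   'STATUS': np.tile(list('ABCD'), 10000)})

# Forming (and fully materializing) the groups is cheap.
print(timeit.timeit(lambda: list(df.groupby('ID')), number=1))

# Running the resample/ffill once per group from Python is what dominates.
print(timeit.timeit(
    lambda: df.set_index('DATE').groupby('ID').resample('A').ffill(),
    number=1))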

spillz (Author) commented Aug 26, 2016

Yeah, to be clear, I wasn't proposing the other code as a replacement, just using it to show the performance issue (which gets worse on even bigger data sets). I didn't think it was worth the extra fiddling to make the column ordering and dtypes match. The general application is converting transaction records into an unbalanced panel data set, which I can then run standard regression methods on.

I had assumed that it was mostly the groupby that was causing the slowness. Is that wrong?

Based on that assumption, I thought one solution for my use case would be for resample to support pad operations on non-unique indexes (or more complete support for resampling MultiIndexes, which is essentially what my DATE and ID pairs represent). Currently, if I do:

import pandas
import numpy as np

if __name__ == '__main__':
    import StringIO
    import timeit
    #Generate some transaction record data
    data='''ID,DATE,STATUS
    1,3/1/2005,A
    1,4/30/2005,B
    1,6/30/2010,C
    1,8/20/2010,D
    '''
    df = pandas.read_csv(StringIO.StringIO(data))
    df['DATE'] = pandas.to_datetime(df['DATE'])

    #then repeat the rows of the table to get a more realistic sample size for measuring performance
    df = df.loc[np.tile(df.index.values, 10000)]
    df['ID'] = np.repeat(np.arange(10000), 4)

    df = df.set_index('DATE')
    df = df.resample('A').ffill()
    print(df)

I get:

Traceback (most recent call last):
  File "C:\Users\#####\Source\PANDAS~2.PY", line 56, in <module>
    df = df.resample('A').ffill()
  File "c:\anaconda2\lib\site-packages\pandas\tseries\resample.py", line 424, in pad
    return self._upsample('pad', limit=limit)
  File "c:\anaconda2\lib\site-packages\pandas\tseries\resample.py", line 704, in _upsample
    limit=limit)
  File "c:\anaconda2\lib\site-packages\pandas\core\frame.py", line 2744, in reindex
    **kwargs)
  File "c:\anaconda2\lib\site-packages\pandas\core\generic.py", line 2229, in reindex
    fill_value, copy).__finalize__(self)
  File "c:\anaconda2\lib\site-packages\pandas\core\frame.py", line 2690, in _reindex_axes
    fill_value, limit, tolerance)
  File "c:\anaconda2\lib\site-packages\pandas\core\frame.py", line 2698, in _reindex_index
    tolerance=tolerance)
  File "c:\anaconda2\lib\site-packages\pandas\indexes\base.py", line 2342, in reindex
    raise ValueError("cannot reindex a non-unique index "
ValueError: cannot reindex a non-unique index with a method or limit

Whereas it would be nice if it were able to handle non-unique indexes by treating the non-ascending dates as boundaries (a rough sketch of that idea follows the example output below) and return something like:

         ID STATUS  YEAR
2005   1      B  2005
2006   1      B  2006
2007   1      B  2007
2008   1      B  2008
2009   1      B  2009
2010   1      D  2010
2005   2      B  2005
2006   2      B  2006
2007   2      B  2007
2008   2      B  2008
2009   2      B  2009
2010   2      D  2010
2005   3      B  2005
2006   3      B  2006
2007   3      B  2007
2008   3      B  2008
2009   3      B  2009
2010   3      D  2010
...
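
[Editor's note: a rough sketch of that boundary idea, added for illustration; this is not an existing pandas API. With the frame sorted by ID then DATE, any row whose DATE fails to increase relative to the previous row must start a new entity's block, so the blocks can be labelled without a groupby.]

import numpy as np
import pandas as pd

# Two entities sharing the same transaction dates, sorted by ID then DATE.
dates = pd.to_datetime(['2005-03-01', '2005-04-30', '2010-06-30', '2010-08-20'])
df = pd.DataFrame({'ID': np.repeat([1, 2], 4),
                   'DATE': np.tile(dates, 2),
                   'STATUS': list('ABCD') * 2})

# A DATE that does not increase cannot continue the same ascending run,
# so it opens a new block; cumsum turns the flags into block labels.
new_block = df['DATE'].diff() <= pd.Timedelta(0)
block_id = new_block.cumsum()
print(block_id.values)  # [0 0 0 0 1 1 1 1]

# Each block could then be resampled independently; equal dates straddling
# a boundary would still need ID to disambiguate, so this is only a sketch.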

But I guess that's moot if the problem is that reindex filling relies on looping.

As for pull requests, I found the code that does the reindexing / resampling to be pretty opaque, so I would need a few pointers to get started.


eromoe commented May 25, 2021

Number of rows: 960,000. It has already cost half an hour and is still hanging. What's wrong with groupby resampling?
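
[Editor's note: for the pure downsampling case, a commonly suggested workaround, added here and not from this thread, is to group by the entity and a pd.Grouper in a single pass, which keeps the aggregation in Cython instead of looping over groups in Python. It does not cover the ffill/upsampling pattern discussed above.]

import numpy as np
import pandas as pd

# Same synthetic shape as the original example: 10,000 IDs, 4 rows each.
dates = pd.to_datetime(['2005-03-01', '2005-04-30', '2010-06-30', '2010-08-20'])
df = pd.DataFrame({'ID': np.repeat(np.arange(10000), 4),
                   'DATE': np.tile(dates, 10000),
                   'STATUS': np.tile(list('ABCD'), 10000)})

# Last observation per (ID, year) in one vectorized groupby: no Python
# loop over the 10,000 IDs.
out = df.groupby(['ID', pd.Grouper(key='DATE', freq='A')]).last()
print(out.head())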
