sparse resampling not working with dictionary of columns? #15386

Closed
randomgambit opened this issue Feb 13, 2017 · 6 comments

@randomgambit

Hello there,

Have I said that Pandas is awesome? yes, many times ;-)

I have a question. I am working with a very large dataframe of trades, timestamped at millisecond precision. I am on the latest pandas, 0.19.2.

I need to resample the dataframe every 200 ms, but given that my data spans several years and I am only interested in resampling data between 10:00 am and 12:00 am every day (handled by between_time()), using a plain resample will crash and burn my machine.

Instead, I tried the sparse resampling shown in the docs (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#sparse-resampling), but it fails when I provide it with a dictionary of columns.

Is that expected? Is it a bug?

import pandas as pd
import numpy as np

rng = pd.date_range('2014-1-1', periods=100, freq='D') + pd.Timedelta('1s')
ts = pd.DataFrame({'value' : range(100)}, index=rng)


from functools import partial
from pandas.tseries.frequencies import to_offset

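# Note: this name shadows the builtin round(); kept as posted.
def round(t, freq):
    # Floor timestamp t to a multiple of freq via integer division on nanoseconds.
    freq = to_offset(freq)
    return pd.Timestamp((t.value // freq.delta.value) * freq.delta.value)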

# works
ts.groupby(partial(round, freq='3T')).value.sum()

# does not work
ts.groupby(partial(round, freq='3T')).apply({'value' : 'sum'})

ts.groupby(partial(round, freq='3T')).apply({'value' : 'sum'})
Traceback (most recent call last):

  File "<ipython-input-104-6004b307a469>", line 1, in <module>
    ts.groupby(partial(round, freq='3T')).apply({'value' : 'sum'})

  File "C:\Users\m1hxb02\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\groupby.py", line 674, in apply
    func = self._is_builtin_func(func)

  File "C:\Users\m1hxb02\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\base.py", line 644, in _is_builtin_func
    return self._builtin_table.get(arg, arg)

TypeError: unhashable type: 'dict'
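
The last frame of the traceback is a plain `dict.get(arg, arg)` call, and a dict cannot be used as a dictionary key, so the lookup itself raises before any aggregation happens. A minimal sketch of that mechanism in pure Python follows; `builtin_table` and `lookup` are hypothetical stand-ins, not the actual pandas internals.

```python
# Hypothetical stand-in for the internal lookup table seen in the traceback;
# the real pandas table has different contents.
builtin_table = {sum: 'sum', max: 'max', min: 'min'}

def lookup(func):
    # Same shape as the failing call: dict.get(arg, arg) must hash arg first.
    return builtin_table.get(func, func)

print(lookup(sum))  # a hashable callable is looked up without trouble

try:
    lookup({'value': 'sum'})  # a dict is unhashable, so .get() raises
except TypeError as exc:
    print('TypeError:', exc)
```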

The problem is that I need to resample several columns of my dataframe at once, possibly using a different function (sum, mean, max) for each column. Is anything wrong here?

Thanks~

@chris-b1
Contributor

You want to use .agg here, e.g.:

ts.groupby(partial(round, freq='3T')).agg({'value' : ['sum', 'mean']})
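
For the multi-column case asked about above, the same `.agg` call can take one entry per column. This is a sketch under assumptions: the `other` column is made up for illustration, the grouping key uses `Timestamp.floor` as a stand-in for the `round` helper from the question, and the `'3min'` alias is used instead of `'3T'` because newer pandas versions deprecate `'T'`.

```python
import pandas as pd

rng = pd.date_range('2014-01-01', periods=100, freq='D') + pd.Timedelta('1s')
# 'other' is a hypothetical second column, added only for illustration.
ts = pd.DataFrame({'value': range(100), 'other': range(100, 200)}, index=rng)

# Group on each index label floored to 3 minutes, then aggregate per column.
# A list of functions yields one output column per function.
out = ts.groupby(lambda t: t.floor('3min')).agg(
    {'value': ['sum', 'mean'], 'other': 'max'}
)

print(out.head())
print(list(out.columns))  # two-level column labels: (column, function)
```

Because at least one column maps to a list, the result has two-level column labels, so a single aggregate is selected with a tuple such as `out[('value', 'sum')]`.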

To repurpose this issue: I'm not sure when it was added, but DatetimeIndex now has a vectorized round method, which is significantly faster, so the doc example should be updated.

In [149]: %timeit ts.groupby(partial(round, freq='3T')).agg({'value' : 'sum'})
100 loops, best of 3: 6.56 ms per loop

In [150]: %timeit ts.groupby(ts.index.round('3T')).agg({'value' : 'sum'})
1000 loops, best of 3: 1.83 ms per loop
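
One caveat worth checking before swapping the helper for `ts.index.round`: the helper in the question uses integer division, which floors, while `DatetimeIndex.round` goes to the nearest multiple. `DatetimeIndex.floor` is the closer vectorized match. The sketch below compares the two; `floor_to` is a hypothetical re-implementation of the helper that uses `pd.Timedelta` rather than `freq.delta`, and `'3min'` replaces `'3T'`.

```python
import pandas as pd

rng = pd.date_range('2014-01-01', periods=100, freq='D') + pd.Timedelta('1s')
ts = pd.DataFrame({'value': range(100)}, index=rng)

def floor_to(t, freq):
    # Hypothetical helper: same integer-division idea as the thread's round().
    step = pd.Timedelta(freq).value  # bin width in nanoseconds
    return pd.Timestamp((t.value // step) * step)

# Per-label Python function versus the vectorized index method.
by_function = ts.groupby(lambda t: floor_to(t, '3min'))['value'].sum()
by_floor = ts.groupby(ts.index.floor('3min'))['value'].sum()
print(list(by_function.index) == list(by_floor.index))

# round() and floor() disagree once a timestamp is past the middle of a bin.
probe = pd.DatetimeIndex(['2014-01-01 00:02:00'])
print(probe.floor('3min')[0])  # start of the bin containing the timestamp
print(probe.round('3min')[0])  # nearest bin edge, here the next one
```

On the sample data in this thread every timestamp sits one second past midnight, so `round` and `floor` happen to give the same keys; on real trade data they generally will not.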

@randomgambit
Author

@chris-b1 thanks! But the syntax for a regular resample uses apply, right?

ts.resample('5Min').apply({'value' : 'sum'})

seems to work correctly

@chris-b1
Contributor

To be honest I had no idea that worked, I think .agg would also be the idiomatic way with resample. @jreback ?

@randomgambit
Author

randomgambit commented Feb 13, 2017

@chris-b1 summoning the great master @jreback
in my experience, pandas is smart enough (most of the time) to guess what apply is doing. That is, an agg or a transform. But Jeff knows better here

@jreback
Contributor

jreback commented Feb 14, 2017

this will be handled in #14668

.apply does not accept a dictionary, see #14464

@jreback jreback closed this as completed Feb 14, 2017
@jreback jreback added this to the No action milestone Feb 14, 2017
@randomgambit
Author

randomgambit commented Feb 14, 2017

@chris-b1 @jreback nice. it DOES appear to work, though, in the case of resample

ts.resample('5Min').apply({'value' : 'sum'}) gives the same output as
ts.resample('5Min').agg({'value' : 'sum'})
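
To tie this back to the original use case (200 ms bins inside a daily window), here is a rough sketch that contrasts the dense `resample` result with the sparse `groupby` on a floored index, both using a dictionary with `.agg`. The trade data, the column names `price` and `size`, and the 10:00 to 12:00 window are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical trades over two days, timestamped to the millisecond.
idx = pd.to_datetime([
    '2017-02-13 10:00:00.050',
    '2017-02-13 10:00:00.120',
    '2017-02-13 10:00:00.450',
    '2017-02-13 13:30:00.000',  # outside the window, dropped by between_time
    '2017-02-14 10:00:00.010',
    '2017-02-14 11:59:59.900',
])
trades = pd.DataFrame(
    {'price': [10.0, 10.2, 10.1, 9.9, 10.4, 10.3],
     'size': [100, 50, 75, 20, 10, 30]},
    index=idx,
)

window = trades.between_time('10:00', '12:00')
spec = {'price': 'mean', 'size': 'sum'}

# Sparse: one row per 200 ms bin that actually contains trades.
sparse = window.groupby(window.index.floor('200ms')).agg(spec)

# Dense: one row per 200 ms bin between the first and last timestamp,
# including the overnight gap, which is what blows up on years of data.
dense = window.resample('200ms').agg(spec)

print(sparse)
print(len(sparse), len(dense))
```

The sparse result only holds the occupied bins, while the dense one also materializes every empty bin between the two days.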
