resample() should allow a "null" fill_method #11217


Closed
jd opened this issue Oct 2, 2015 · 25 comments
Labels
API Design · Datetime (Datetime data dtype) · Resample (resample method)

Comments

@jd (Contributor) commented Oct 2, 2015

I'm using resample() to aggregate data over a time frame.

>>> s = pandas.Series((3, 4), (pandas.Timestamp("2014-1-1"), pandas.Timestamp("2015-1-1")))
>>> x = s.resample("1s")
>>> len(x)
31536001

When doing such a call, resample fills all of the (31536001 - 2) nonexistent values with NaN, which ends up creating millions of points and makes Python use 500 MB+ of RAM. The thing is, I don't care about the NaN points, so I would like to avoid filling them into the Series and using so much memory. AFAICS, resample does not offer such a fill_method.

@jreback (Contributor) commented Oct 2, 2015

this is trivially done with

s.resample('1s').dropna()
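On modern pandas (0.18+), `resample()` returns a deferred `Resampler` object, so the chained `.dropna()` above no longer works directly; a rough present-day equivalent (a sketch using a shortened time span, since `.asfreq()` still materializes every empty bin first) would be:

```python
import pandas as pd

# Shortened version of the example above: two points ten seconds apart.
s = pd.Series(
    [3, 4],
    index=[pd.Timestamp("2014-01-01 00:00:00"),
           pd.Timestamp("2014-01-01 00:00:10")],
)

# .asfreq() materializes the full 1-second grid (NaN for the empty bins),
# and .dropna() then discards those empty bins again.
out = s.resample("1s").asfreq().dropna()
```

This keeps only the bins that had data, but memory usage is still proportional to the full grid, which is exactly the complaint in this issue.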

@jd (Contributor, Author) commented Oct 2, 2015

@jreback Sure, but that still makes pandas use 500 MB+ of RAM.

Things are actually worse for something like this:

>>> s = pandas.Series(range(1000), pandas.date_range('2014-1-1', periods=1000))
>>> s.resample('1s', how='median')

The result is pretty obvious and should take less than one second to compute. But since pandas resamples with millions of points filled with NaN, it takes 10+ minutes to compute and uses 1.5 GB+ of RAM.

@jreback (Contributor) commented Oct 2, 2015

you need to show what you are going to do with this

why are you resampling such a large period? what's the point

@jd (Contributor, Author) commented Oct 2, 2015

I don't understand why the reason is important here.
I treat statistics over a year with fine grained resolution up to the second.

@jreback (Contributor) commented Oct 2, 2015

A use case/example of what you are doing would be helpful to understand the goal. E.g., are you merely snapping things to a close freq? Or are you doing an actual resample, but with sparse data?

the point of resample IS to give you a full-rank set for every point in time.

jreback added the API Design and Resample labels Oct 2, 2015
@TomAugspurger (Contributor) commented:

@jd does snap cover your needs?

@jd (Contributor, Author) commented Oct 2, 2015

@jreback Yeah, I'm doing resampling (aggregation) on sparse data, but I'm not only snapping – there might be several values in a time bucket.

@TomAugspurger It does not, though I was looking for such a function anyway so thanks for the pointer :)

@jreback (Contributor) commented Oct 2, 2015

@jd

so the API for this could be one of the following:

  • Series.resample(...., fill_method='sparse') (which could return a sparse Series)
  • Series.resample(...., fill_method='drop') (which could return a Series, but with no NaN)
  • SparseSeries(....).resample(....) (which would by definition return a sparse Series)

none of these are implemented, but potentially useful.

want to take a stab at this?

@jd (Contributor, Author) commented Oct 3, 2015

@jreback Hmm, I'm having trouble figuring out the difference between sparse and drop, but I'm sure I would need one of those.

I don't think I know enough about pandas internals to implement that right now, but I'd be happy to test it and report back if you write some code. :)

Thanks a lot!

@jreback (Contributor) commented Oct 3, 2015

Here's a way to do a sparse resample. A resample is just a groupby. So we are grouping by a combination of the date, and a second (e.g. to give us a freq of 's'). This only groups by the representative elements in the input index (e.g. your sparse points), so this at most generates n groups where n is the length of the set.

In [24]: pd.options.display.max_rows=12

In [25]: np.random.seed(1234)

In [26]: i = date_range('20140101',periods=100000000,freq='s')
In [27]: s = Series(range(1000),index=i.take(np.random.randint(0,len(i),size=1000)))

In [28]: s
Out[28]: 
2014-07-19 19:23:27      0
2014-09-23 09:00:37      1
2014-01-12 07:42:04      2
2016-07-01 18:35:32      3
2014-07-24 05:43:52      4
2016-11-27 19:22:33      5
                      ... 
2015-01-04 08:11:55    994
2016-09-26 02:24:18    995
2016-03-01 18:48:41    996
2014-03-15 01:18:10    997
2015-07-14 06:45:46    998
2016-03-09 08:47:23    999
dtype: int64

In [29]: s.groupby([s.index.date,s.index.second]).sum()
Out[29]: 
2014-01-03  9      52
            50    414
2014-01-05  31    535
2014-01-07  16    937
2014-01-08  28    164
2014-01-12  4       2
                 ... 
2017-02-27  39    174
2017-03-01  5     536
            26    760
            31    208
            51    488
2017-03-03  7     810
dtype: int64
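A smaller, self-contained sketch of the same technique (hypothetical sizes, scaled down from the session above), showing that the number of groups is bounded by the number of observations:

```python
import numpy as np
import pandas as pd

np.random.seed(1234)

# Sparse series: 50 observations scattered across ~11 days of 1-second slots.
i = pd.date_range("2014-01-01", periods=1_000_000, freq="s")
s = pd.Series(range(50), index=i.take(np.random.randint(0, len(i), size=50)))

# Grouping by (date, second) creates only the keys that actually occur,
# unlike resample, which materializes every empty bin in between.
out = s.groupby([s.index.date, s.index.second]).sum()
```

Since only observed (date, second) pairs become groups, memory stays proportional to the data, not to the span of the index.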

jreback added this to the Next Major Release milestone Oct 3, 2015
@jd (Contributor, Author) commented Oct 3, 2015

That really looks like a good way of approaching the solution. I probably lack knowledge of pandas usage to understand how to map the groupby result to something closer to the output of resample, but it does look like it. I see the result has an index and two columns; I'm not sure what the first column is for. Also, the index contains only dates, not timestamps. I guess it's just a matter of accessing and presenting the data the same way resample does afterwards.

Also, the by argument for groupby is really not obvious to me; does it make it possible to group by, e.g., 3 minutes?

@jreback (Contributor) commented Oct 3, 2015

Yes, this is a multi-groupby. It's a bit tricky, as the default way of doing this will create all of the groups, but here you only want some. So this could be built into resample as I proposed above. The idea is that you map your index (even using a function is enough).

You are essentially rounding the value to whatever interval you want, so here's sort of a trivial way to do this:

xref #4314 (e.g. we should simply define this on a Timestamp)

In [25]: def round(t, freq):
    ...:     # floor a Timestamp to a multiple of the specified freq
    ...:     # (integer division, so this also works under Python 3)
    ...:     return pd.Timestamp((t.value // freq.delta.value) * freq.delta.value)

In [26]: from functools import partial

In [27]: s.groupby(partial(round,freq=pd.offsets.Minute(3))).sum()
Out[27]: 
2014-01-03 02:12:00     52
2014-01-03 11:00:00    414
2014-01-05 20:39:00    535
2014-01-07 07:03:00    937
2014-01-08 03:09:00    164
2014-01-12 07:42:00      2
                      ... 
2017-02-27 20:27:00    214
2017-03-01 04:51:00    488
2017-03-01 12:06:00    536
2017-03-01 15:45:00    208
2017-03-01 19:06:00    760
2017-03-03 00:06:00    810
dtype: int64

In [28]: s.groupby(partial(round,freq=pd.offsets.Minute(5))).sum()
Out[28]: 
2014-01-03 02:10:00     52
2014-01-03 11:00:00    414
2014-01-05 20:40:00    535
2014-01-07 07:00:00    937
2014-01-08 03:10:00    164
2014-01-12 07:40:00      2
                      ... 
2017-02-27 20:25:00    214
2017-03-01 04:50:00    488
2017-03-01 12:05:00    536
2017-03-01 15:45:00    208
2017-03-01 19:05:00    760
2017-03-03 00:05:00    810
dtype: int64
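For reference, the rounding discussed in #4314 did eventually land: later pandas versions define `floor`/`round`/`ceil` on `Timestamp` and `DatetimeIndex`, so the manual helper above can be replaced with something like this sketch (made-up data):

```python
import pandas as pd

s = pd.Series(
    [1, 2, 3],
    index=pd.to_datetime([
        "2014-01-03 02:12:40",  # falls in the 3-minute bucket 02:12:00
        "2014-01-03 02:13:10",  # same 02:12:00 bucket
        "2014-01-03 11:01:05",  # bucket 11:00:00
    ]),
)

# Floor each timestamp to its 3-minute bucket, then group: only buckets
# that actually contain data are produced.
out = s.groupby(s.index.floor("3min")).sum()
```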

@jreback (Contributor) commented Oct 3, 2015

Actually, if you would like to add this to the timeseries.rst docs under the resample section, that would be great.

We can add it as an actual method at some point later.

@shoyer (Member) commented Oct 4, 2015

+1 for fill_method='drop'

jreback pushed a commit that referenced this issue Oct 8, 2015
As discussed in #11217, there's another way of doing resampling that is not yet covered by `resample` itself. Let's document that.
openstack-gerrit pushed a commit to openstack-archive/gnocchi that referenced this issue Oct 8, 2015
This changes the resampling method we used to have by not doing any real
resampling like pandas used to. The `resample` method from pandas inserts a
lot of empty points filled with NaN if your timeseries is sparse –
which is a typical case in Carbonara/Gnocchi. This ends up creating timeseries
with millions of empty points, consuming hundreds of MB of memory for nothing.

This method, inspired by Jeff on pandas-dev/pandas#11217,
implements a simpler version of what `resample` does: it groups the samples by
timestamp and then computes an aggregation method on them. This avoids creating
thousands of useless points and ends up being much faster and consuming a *LOT*
less memory.

Benchmarked: for a new timeseries with 10k measures and 10-80k points per
archive, this reduces the memory usage of metricd from 2 GB to 100 MB and the
compute time of the most complicated aggregations, like percentile, from 15 min
to 20 s (45× speed improvement).

Change-Id: I1b8718508bdd4633e7324949b76184efc3718ede
@joddm commented May 13, 2018

Any updates here?

@TomAugspurger (Contributor) commented May 16, 2018 via email

@xpl commented Sep 29, 2018

You can also try this:

def resample(df, column, resolution, how='last'):
    # Use the timestamp column as the index if it isn't one already.
    if not isinstance(df.index, pd.DatetimeIndex):
        df.set_index(column, inplace=True)
    # Floor the index to the target resolution, then aggregate each bucket.
    df.index = df.index.floor(resolution)
    return getattr(df.groupby(level=0), how)()

Examples:

resample (balances, 'timestamp', '1s', 'last')
resample (trades[['timestamp', 'amount']], 'timestamp', '1s', 'sum')
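`balances` and `trades` aren't defined in the thread; a self-contained toy run of the same helper (slightly reworked here to avoid mutating the caller's frame, with made-up data) could look like:

```python
import pandas as pd

def resample(df, column, resolution, how="last"):
    # Use the timestamp column as the index if needed, floor it to the
    # target resolution, then aggregate each bucket with the given method.
    if not isinstance(df.index, pd.DatetimeIndex):
        df = df.set_index(column)
    df.index = df.index.floor(resolution)
    return getattr(df.groupby(level=0), how)()

# Hypothetical trade data: three trades, two of them in the same second.
trades = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-01-01 00:00:00.2",
        "2019-01-01 00:00:00.7",
        "2019-01-01 00:00:02.1",
    ]),
    "amount": [1.0, 2.0, 3.0],
})

out = resample(trades[["timestamp", "amount"]], "timestamp", "1s", "sum")
```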

@eromoe commented May 13, 2019

It would be better to have fill_method='drop', because missing data should not be filled with zero by default. This caused a huge problem in my project.

@Peque (Contributor) commented Jun 10, 2019

@jreback You suggested (long time ago in this thread):

s.resample('1s').dropna()

How would you recommend doing this nowadays? The way you suggested no longer seems to work with the latest pandas version.

I would like to have a series resampled/grouped by '2D' and then .sum() the results to end up with only the dates on which there was some data in the original series (i.e. I would rather have NaN where there is no data to sum, so that I can drop those values later).

An example:

>>> from pandas import date_range
>>> from pandas import Series
>>> dates = date_range('2019-01-01', '2019-01-10', freq='D')[[0, 4, 5, 8]]
>>> dates
DatetimeIndex(['2019-01-01', '2019-01-05', '2019-01-06', '2019-01-09'], dtype='datetime64[ns]', freq=None)
>>> series = Series(index=dates, data=[0, 1, 2, 3])
>>> series
2019-01-01    0
2019-01-05    1
2019-01-06    2
2019-01-09    3
dtype: int64
>>> series.resample('2D').sum()
2019-01-01    0
2019-01-03    0
2019-01-05    3
2019-01-07    0
2019-01-09    3
Freq: 2D, dtype: int64

Instead of the result I got, I was looking for:

2019-01-01    0
2019-01-05    3
2019-01-09    3
Freq: 2D, dtype: int64

Or at least (I don't mind if the full resampled index gets filled in memory):

2019-01-01    0
2019-01-03    NaN
2019-01-05    3
2019-01-07    NaN
2019-01-09    3
Freq: 2D, dtype: int64
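The sparse-groupby idea from earlier in the thread can produce exactly the first desired output by bucketing timestamps into 2-day bins anchored at the first observation (a sketch; it assumes the bins should start at the series' first date, which matches resample's default origin here):

```python
import pandas as pd

dates = pd.date_range("2019-01-01", "2019-01-10", freq="D")[[0, 4, 5, 8]]
series = pd.Series(index=dates, data=[0, 1, 2, 3])

# Map each timestamp onto the start of its 2-day bucket; grouping by these
# keys never creates the empty bins in between.
step = pd.Timedelta("2D")
start = series.index[0]
bins = start + ((series.index - start) // step) * step
out = series.groupby(bins).sum()
```

Unlike the resample-then-dropna approach, this never materializes the full 2-day grid, so memory scales with the number of observations.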

@Peque (Contributor) commented Jun 11, 2019

So I came up with this, using min_count:

>>> series.resample('2D').sum(min_count=1).dropna()
2019-01-01    0.0
2019-01-05    3.0
2019-01-09    3.0
dtype: float64

It still fills the empty spaces with NaNs, but it gets the job done for short series that easily fit in memory.

@maowerner commented:

@TomAugspurger @jreback

Still open if you want to take a shot at fixing it!

Could you give me a few pointers on how to start? I would like to try implementing

Series.resample(...., fill_method='drop') (which could return a Series, but with no NaN)

@maowerner commented:

Hmm, right now I don't really see how to do this cleanly. The Resampler is created in

r = resample(

However, the fill_method parameter only enters in the following function call:
r, how=how, fill_method=fill_method, limit=limit

Furthermore, it is deprecated in favor of

.resample().method().fillmethod()

Making the Resampler dependent on fill_method would only reintroduce the coupling into the code. And fill_method='drop' sounds like a contradiction anyway.

I think it might be cleaner to bypass resample and call round() and groupby as proposed above. It would, however, be nice to have a reference to this in resample.fillna().

Do you share this opinion? Is there somewhere else this might be implemented? Wouldn't some way to group by approximate values, i.e. .groupby(round='1s'), be useful in other cases as well?

@TomAugspurger (Contributor) commented Jul 29, 2019 via email

@maowerner commented:

So I tried this out by simply grouping by the rounded value. This works perfectly fine without intermediate objects. Below is a comparison with the workaround of @Peque. I would like to mention this (or a similar) example in the documentation of resample; I think I am not the only one who was misled into thinking that it is the correct function to group sparse data.

import numpy as np
import pandas as pd
from pandas import Series

np.random.seed(41)

i = pd.date_range('20140101',periods=100000000,freq='s')
s = Series(range(1000),index=i.take(np.random.randint(0,len(i),size=1000)))
%timeit s.\
    resample('1H').\
    sum(min_count=1).\
    dropna()
2.4 ms ± 33.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit s.\
    groupby(s.index.floor('1H')).\
    sum()
1.67 ms ± 18.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit s.\
    groupby(s.index.round('1H')).\
    sum()
1.72 ms ± 9.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@mroeschke (Member) commented:

Since resample no longer has a fill_method, I don't think there's much left to implement directly here. If there's a related need using ffill or bfill, a new issue can be opened for that. Closing.
