resample() should allow a "null" fill_method #11217


Closed
jd opened this issue Oct 2, 2015 · 25 comments
Labels
API Design · Datetime (Datetime data dtype) · Resample (resample method)

Comments

@jd (Contributor) commented Oct 2, 2015

I'm using resample() to aggregate data over a time frame.

>>> s = pandas.Series((3, 4), (pandas.Timestamp("2014-1-1"), pandas.Timestamp("2015-1-1")))
>>> x = s.resample("1s")
>>> len(x)
31536001

When doing such a call, resample fills all of the (31536001 - 2) nonexistent values with NaN, which ends up creating millions of points and makes Python use 500 MB+ of RAM. The thing is, I don't care about the NaN points, so I would like to avoid filling them into the Series and using so much memory. AFAICS, resample does not offer such a fill_method.

@jreback (Contributor) commented Oct 2, 2015

this is trivially done with

s.resample('1s').dropna()
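On modern pandas (0.18+), `resample()` returns a deferred `Resampler` object, so the chained `.dropna()` above no longer works directly; a rough present-day equivalent (a sketch using a shortened time span, since `.asfreq()` still materializes every empty bin first) would be:

```python
import pandas as pd

# Shortened version of the example above: two points ten seconds apart.
s = pd.Series(
    [3, 4],
    index=[pd.Timestamp("2014-01-01 00:00:00"),
           pd.Timestamp("2014-01-01 00:00:10")],
)

# .asfreq() materializes the full 1-second grid (NaN for the empty bins),
# and .dropna() then discards those empty bins again.
out = s.resample("1s").asfreq().dropna()
```

This keeps only the bins that had data, but memory usage is still proportional to the full grid, which is exactly the complaint in this issue.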

@jd (Contributor, Author) commented Oct 2, 2015

@jreback Sure, but that still makes pandas use 500 MB+ of RAM.

Things are actually worse for something like this:

>>> s = pandas.Series(range(1000), pandas.date_range('2014-1-1', periods=1000))
>>> s.resample('1s', how='median')

The result is pretty obvious and should take less than one second to compute. But since pandas resamples with millions of points filled with NaN, it takes 10+ minutes to compute and uses 1.5 GB+ of RAM.

@jreback (Contributor) commented Oct 2, 2015

you need to show what you are going to do with this

why are you resampling such a large period? what's the point

@jd (Contributor, Author) commented Oct 2, 2015

I don't understand why the reason is important here.
I treat statistics over a year with fine grained resolution up to the second.

@jreback (Contributor) commented Oct 2, 2015

A use case/example of what you are doing would be helpful to understand the goal. E.g., are you merely snapping things to a close freq? Or are you doing an actual resample, but with sparse data?

the point of resample IS to give you a full-rank set for every point in time.

jreback added the API Design and Resample labels Oct 2, 2015
@TomAugspurger (Contributor) commented:

@jd does snap cover your needs?

@jd (Contributor, Author) commented Oct 2, 2015

@jreback Yeah, I'm doing resampling (aggregation) on sparse data, but I'm not only snapping – there might be several values in a time bucket.

@TomAugspurger It does not, though I was looking for such a function anyway so thanks for the pointer :)

@jreback (Contributor) commented Oct 2, 2015

@jd

so the API for this could be one of the following:

  • Series.resample(...., fill_method='sparse') (which could return a sparse Series)
  • Series.resample(...., fill_method='drop') (which could return a Series, but with no NaN)
  • SparseSeries(....).resample(....) (which would by definition return a sparse Series)

none of these are implemented, but potentially useful.

want to take a stab at this?

@jd (Contributor, Author) commented Oct 3, 2015

@jreback Hmm, I'm having trouble figuring out the difference between sparse and drop, but I'm sure I would need one of those.

I don't think I know enough about pandas internals to implement that right now, but I'd be happy to test it and report back if you write some code. :)

Thanks a lot!

@jreback (Contributor) commented Oct 3, 2015

Here's a way to do a sparse resample. A resample is just a groupby. So we are grouping by a combination of the date, and a second (e.g. to give us a freq of 's'). This only groups by the representative elements in the input index (e.g. your sparse points), so this at most generates n groups where n is the length of the set.

In [24]: pd.options.display.max_rows=12

In [25]: np.random.seed(1234)

In [26]: i = date_range('20140101',periods=100000000,freq='s')
In [27]: s = Series(range(1000),index=i.take(np.random.randint(0,len(i),size=1000)))

In [28]: s
Out[28]: 
2014-07-19 19:23:27      0
2014-09-23 09:00:37      1
2014-01-12 07:42:04      2
2016-07-01 18:35:32      3
2014-07-24 05:43:52      4
2016-11-27 19:22:33      5
                      ... 
2015-01-04 08:11:55    994
2016-09-26 02:24:18    995
2016-03-01 18:48:41    996
2014-03-15 01:18:10    997
2015-07-14 06:45:46    998
2016-03-09 08:47:23    999
dtype: int64

In [29]: s.groupby([s.index.date,s.index.second]).sum()
Out[29]: 
2014-01-03  9      52
            50    414
2014-01-05  31    535
2014-01-07  16    937
2014-01-08  28    164
2014-01-12  4       2
                 ... 
2017-02-27  39    174
2017-03-01  5     536
            26    760
            31    208
            51    488
2017-03-03  7     810
dtype: int64
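A smaller, self-contained sketch of the same technique (hypothetical sizes, scaled down from the session above), showing that the number of groups is bounded by the number of observations:

```python
import numpy as np
import pandas as pd

np.random.seed(1234)

# Sparse series: 50 observations scattered across ~11 days of 1-second slots.
i = pd.date_range("2014-01-01", periods=1_000_000, freq="s")
s = pd.Series(range(50), index=i.take(np.random.randint(0, len(i), size=50)))

# Grouping by (date, second) creates only the keys that actually occur,
# unlike resample, which materializes every empty bin in between.
out = s.groupby([s.index.date, s.index.second]).sum()
```

Since only observed (date, second) pairs become groups, memory stays proportional to the data, not to the span of the index.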

jreback added this to the Next Major Release milestone Oct 3, 2015
@jd (Contributor, Author) commented Oct 3, 2015

That really looks like a good way of approaching the solution. I probably lack knowledge of pandas usage to understand how to map the groupby result to something closer to the output of resample, but it does look like it. I see the result has an index and two columns; I'm not sure what the first column is for. Also, the index contains only dates, not timestamps. I guess it's just a matter of accessing and presenting the data the same way resample does afterwards.

Also, the by argument for groupby is really not obvious to me; does it make it possible to group by, e.g., 3 minutes?

@jreback (Contributor) commented Oct 3, 2015

Yes, this is a multi-groupby. It's a bit tricky, as the default way of doing this will create all of the groups, but here you only want some. So this could be built into resample as I proposed above. The idea is that you map your index (even using a function is enough).

You are essentially rounding the value to whatever interval you want, so here's sort of a trivial way to do this:

xref #4314 (e.g. we should simply define this on a Timestamp)

In [25]: def round(t, freq):
    ...:     # floor a Timestamp to a multiple of the specified freq
    ...:     # (integer division, so this also works under Python 3)
    ...:     return pd.Timestamp((t.value // freq.delta.value) * freq.delta.value)

In [26]: from functools import partial

In [27]: s.groupby(partial(round,freq=pd.offsets.Minute(3))).sum()
Out[27]: 
2014-01-03 02:12:00     52
2014-01-03 11:00:00    414
2014-01-05 20:39:00    535
2014-01-07 07:03:00    937
2014-01-08 03:09:00    164
2014-01-12 07:42:00      2
                      ... 
2017-02-27 20:27:00    214
2017-03-01 04:51:00    488
2017-03-01 12:06:00    536
2017-03-01 15:45:00    208
2017-03-01 19:06:00    760
2017-03-03 00:06:00    810
dtype: int64

In [28]: s.groupby(partial(round,freq=pd.offsets.Minute(5))).sum()
Out[28]: 
2014-01-03 02:10:00     52
2014-01-03 11:00:00    414
2014-01-05 20:40:00    535
2014-01-07 07:00:00    937
2014-01-08 03:10:00    164
2014-01-12 07:40:00      2
                      ... 
2017-02-27 20:25:00    214
2017-03-01 04:50:00    488
2017-03-01 12:05:00    536
2017-03-01 15:45:00    208
2017-03-01 19:05:00    760
2017-03-03 00:05:00    810
dtype: int64
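For reference, the rounding discussed in #4314 did eventually land: later pandas versions define `floor`/`round`/`ceil` on `Timestamp` and `DatetimeIndex`, so the manual helper above can be replaced with something like this sketch (made-up data):

```python
import pandas as pd

s = pd.Series(
    [1, 2, 3],
    index=pd.to_datetime([
        "2014-01-03 02:12:40",  # falls in the 3-minute bucket 02:12:00
        "2014-01-03 02:13:10",  # same 02:12:00 bucket
        "2014-01-03 11:01:05",  # bucket 11:00:00
    ]),
)

# Floor each timestamp to its 3-minute bucket, then group: only buckets
# that actually contain data are produced.
out = s.groupby(s.index.floor("3min")).sum()
```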

@jreback (Contributor) commented Oct 3, 2015

Actually, if you would like to add this to the timeseries.rst docs under the resample section, that would be great.

We can add it as an actual method at some point later.

@shoyer (Member) commented Oct 4, 2015

+1 for fill_method='drop'

jreback pushed a commit that referenced this issue Oct 8, 2015
As discussed in #11217, there's another way of doing resampling that is not yet covered by `resample` itself. Let's document that.
openstack-gerrit pushed a commit to openstack-archive/gnocchi that referenced this issue Oct 8, 2015
This changes the resampling method we used to have by not doing any real
resampling like pandas used to. The `resample` method from pandas inserts a
lot of empty points filled with NaN if your timeseries is sparse –
which is a typical case in Carbonara/Gnocchi. This ends up creating timeseries
with millions of empty points, consuming hundreds of MB of memory for nothing.

This method, inspired by Jeff on pandas-dev/pandas#11217,
implements a simpler version of what `resample` does: it groups the samples by
timestamp and then computes an aggregation method on them. This avoids creating
thousands of useless points and ends up being much faster and consuming a *LOT*
less memory.

Benchmarked: for a new timeseries with 10k measures and 10-80k points per
archive, this reduces the memory usage of metricd from 2 GB to 100 MB and the
compute time of the most complicated aggregations, like percentile, from 15 min
to 20 s (45× speed improvement).

Change-Id: I1b8718508bdd4633e7324949b76184efc3718ede
@joddm commented May 13, 2018

Any updates here?

@TomAugspurger (Contributor) commented May 16, 2018 via email

@xpl commented Sep 29, 2018

You can also try this:

def resample(df, column, resolution, how='last'):
    # Use the timestamp column as the index if it isn't one already.
    if not isinstance(df.index, pd.DatetimeIndex):
        df.set_index(column, inplace=True)
    # Floor the index to the target resolution, then aggregate each bucket.
    df.index = df.index.floor(resolution)
    return getattr(df.groupby(level=0), how)()

Examples:

resample (balances, 'timestamp', '1s', 'last')
resample (trades[['timestamp', 'amount']], 'timestamp', '1s', 'sum')
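`balances` and `trades` aren't defined in the thread; a self-contained toy run of the same helper (slightly reworked here to avoid mutating the caller's frame, with made-up data) could look like:

```python
import pandas as pd

def resample(df, column, resolution, how="last"):
    # Use the timestamp column as the index if needed, floor it to the
    # target resolution, then aggregate each bucket with the given method.
    if not isinstance(df.index, pd.DatetimeIndex):
        df = df.set_index(column)
    df.index = df.index.floor(resolution)
    return getattr(df.groupby(level=0), how)()

# Hypothetical trade data: three trades, two of them in the same second.
trades = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-01-01 00:00:00.2",
        "2019-01-01 00:00:00.7",
        "2019-01-01 00:00:02.1",
    ]),
    "amount": [1.0, 2.0, 3.0],
})

out = resample(trades[["timestamp", "amount"]], "timestamp", "1s", "sum")
```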

@eromoe commented May 13, 2019

It would be better to have fill_method='drop', because missing data should not be filled with zero by default. This caused a huge problem in my project.

@Peque (Contributor) commented Jun 10, 2019

@jreback You suggested (long time ago in this thread):

s.resample('1s').dropna()

How would you recommend doing this nowadays? The way you suggested no longer seems to work with the latest pandas version.

I would like to have a series resampled/grouped by '2D' and then .sum() the results to end up with only the dates on which there was some data in the original series (i.e. I would rather have NaN where there is no data to sum, so that I can drop those values later).

An example:

>>> from pandas import date_range
>>> from pandas import Series
>>> dates = date_range('2019-01-01', '2019-01-10', freq='D')[[0, 4, 5, 8]]
>>> dates
DatetimeIndex(['2019-01-01', '2019-01-05', '2019-01-06', '2019-01-09'], dtype='datetime64[ns]', freq=None)
>>> series = Series(index=dates, data=[0, 1, 2, 3])
>>> series
2019-01-01    0
2019-01-05    1
2019-01-06    2
2019-01-09    3
dtype: int64
>>> series.resample('2D').sum()
2019-01-01    0
2019-01-03    0
2019-01-05    3
2019-01-07    0
2019-01-09    3
Freq: 2D, dtype: int64

Instead of the result I got, I was looking for:

2019-01-01    0
2019-01-05    3
2019-01-09    3
Freq: 2D, dtype: int64

Or at least (I don't mind if the full resampled index gets filled in memory):

2019-01-01    0
2019-01-03    NaN
2019-01-05    3
2019-01-07    NaN
2019-01-09    3
Freq: 2D, dtype: int64
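The sparse-groupby idea from earlier in the thread can produce exactly the first desired output by bucketing timestamps into 2-day bins anchored at the first observation (a sketch; it assumes the bins should start at the series' first date, which matches resample's default origin here):

```python
import pandas as pd

dates = pd.date_range("2019-01-01", "2019-01-10", freq="D")[[0, 4, 5, 8]]
series = pd.Series(index=dates, data=[0, 1, 2, 3])

# Map each timestamp onto the start of its 2-day bucket; grouping by these
# keys never creates the empty bins in between.
step = pd.Timedelta("2D")
start = series.index[0]
bins = start + ((series.index - start) // step) * step
out = series.groupby(bins).sum()
```

Unlike the resample-then-dropna approach, this never materializes the full 2-day grid, so memory scales with the number of observations.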

@Peque (Contributor) commented Jun 11, 2019

So I came up with this, using min_count:

>>> series.resample('2D').sum(min_count=1).dropna()
2019-01-01    0.0
2019-01-05    3.0
2019-01-09    3.0
dtype: float64

It still fills the empty spaces with NaNs, but it gets the job done for short series that easily fit in memory.

@maowerner commented:

@TomAugspurger @jreback

Still open if you want to take a shot at fixing it!

Could you give me a few pointers on how to start? I would like to try implementing

Series.resample(...., fill_method='drop') (which could return a Series, but with no NaN)

@maowerner commented:

Hmm, right now I don't really see how to do this cleanly. The Resampler is created in

r = resample(

However, the fill_method parameter only enters in the following function call:
r, how=how, fill_method=fill_method, limit=limit

Furthermore, it is deprecated in favor of

.resample().method().fillmethod()

Making the Resampler dependent on fill_method would only reintroduce the coupling into the code. And fill_method='drop' sounds like a contradiction anyway.

I think it might be cleaner to bypass resample and call round() and groupby as proposed above. It would, however, be nice to have a reference to this in resample.fillna().

Do you share this opinion? Is there somewhere else this might be implemented? Wouldn't some way to group by approximate values, i.e. .groupby(round='1s'), be useful in other cases as well?

@TomAugspurger (Contributor) commented Jul 29, 2019 via email

@maowerner commented:

So I tried this out by simply grouping by the rounded value. This works perfectly fine without intermediate objects. Below is a comparison with the workaround of @Peque. I would like to mention this (or a similar) example in the documentation of resample; I think I am not the only one who was misled into thinking that it is the correct function to group sparse data.

import numpy as np
import pandas as pd
from pandas import Series

np.random.seed(41)

i = pd.date_range('20140101',periods=100000000,freq='s')
s = Series(range(1000),index=i.take(np.random.randint(0,len(i),size=1000)))
%timeit s.\
    resample('1H').\
    sum(min_count=1).\
    dropna()
2.4 ms ± 33.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit s.\
    groupby(s.index.floor('1H')).\
    sum()
1.67 ms ± 18.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit s.\
    groupby(s.index.round('1H')).\
    sum()
1.72 ms ± 9.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@mroeschke (Member) commented:

Since resample no longer has a fill_method, I don't think there's much left to implement directly here. If there's a related need using ffill or bfill, a new issue can be opened for that. Closing.
