resample() should allow a "null" fill_method #11217
Comments
this is trivially done with `s.resample('1s').dropna()` |
@jreback Sure, but that still makes Pandas use 500 MB+ of RAM. Things are actually worse for something like this:

```python
>>> s = pandas.Series(range(1000), pandas.date_range('2014-1-1', periods=1000))
>>> s.resample('1s', how='median')
```

The result is pretty obvious and should take less than one second to compute. But since Pandas resamples with thousands of points filled with NaN, it takes 10+ minutes to compute, and uses 1.5 GB+ of RAM. |
You need to show what you are going to do with this. Why are you resampling such a large period? What's the point? |
I don't understand why the reason is important here. |
A use case/example of what you are doing would be helpful to understand the goal. E.g., are you merely snapping things to a close freq? Or are you doing an actual resample, but with sparse data? The point of resample IS to give you a full-rank set for every point in time. |
@jreback Yeah, I'm doing a resampling (aggregation) using sparse data, but I'm not only snapping – there might be several values in a time bucket. @TomAugspurger It does not, though I was looking for such a function anyway so thanks for the pointer :) |
So the API for this could be one of the following:
None of these are implemented, but potentially useful. Want to take a stab at this? |
@jreback Hum, I have trouble figuring out the difference between sparse and drop, but I'm sure I would need one of those. I don't think I know enough of Pandas internals to implement that right now, but I'd be happy to test it and report if you write some code. :) Thanks a lot! |
Here's a way to do a sparse resample. A resample is just a groupby. So we are grouping by a combination of the date, and a second (e.g. to give us a freq of 's'). This only groups by the representative elements in the input index (e.g. your sparse points), so this at most generates n groups where n is the length of the set.
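The code block from this comment did not survive the page capture; the following is a hedged reconstruction of the idea it describes (the exact grouping keys — normalized date plus time-of-day — are my assumption):

```python
import numpy as np
import pandas as pd

# A sparse series: 1000 points scattered across a large second-frequency index.
np.random.seed(0)
i = pd.date_range('2014-01-01', periods=1_000_000, freq='s')
s = pd.Series(range(1000), index=i.take(np.random.randint(0, len(i), size=1000)))

# Group by (date, time-of-day), which identifies each second in the index.
# Only buckets that actually contain points are created, so this generates
# at most len(s) groups, regardless of the span of the index.
result = s.groupby([s.index.date, s.index.time]).median()
```

Because no empty buckets are materialized, memory use scales with the number of observed points rather than with the length of the time span.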
|
That really looks like a good way of approaching the solution. I probably lack knowledge about Pandas usage to understand how to map the Also the |
Yes, this is a multi-groupby. It's a bit tricky as the default way of doing this will create all of the groups, but here you only want some. So this could be built into You are essentially rounding the value to whatever interval you want, so here's sort of a trivial way to do this: xref #4314 (e.g. we should simply define this on a
|
Actually, if you would like to add this to the timeseries.rst docs under the resample section, that would be great. Can add as an actual method at some point later. |
+1 for |
As discussed in #11217, there's another way of doing resampling that is not yet covered by `resample` itself. Let's document that.
This changes the resampling method we used to have by not doing any real resampling like Pandas used to. The `resample` method from Pandas inserts a lot of empty points filled with NaN if your time series is sparse – which is a typical case in Carbonara/Gnocchi. This ends up creating time series with millions of empty points, consuming hundreds of MB of memory for nothing. This method, inspired by Jeff on pandas-dev/pandas#11217, implements a simpler version of what `resample` does: it groups the samples by timestamp, and then computes an aggregation method on them. This avoids creating thousands of useless points and ends up being much faster while consuming a *LOT* less memory. Benchmarked: for a new time series with 10k measures and 10-80k points per archive, this reduces the memory usage of metricd from 2 GB to 100 MB, and the compute time of the most complicated aggregations, like percentile, from 15 min to 20 s (45× speed improvement). Change-Id: I1b8718508bdd4633e7324949b76184efc3718ede
Any updates here? |
Still open if you want to take a shot at fixing it!
|
You can also try this:

```python
def resample(df, column, resolution, how='last'):
    if not isinstance(df.index, pd.DatetimeIndex):
        df.set_index(column, inplace=True)
    df.index = df.index.floor(resolution)
    return getattr(df.groupby(column), how)()
```

Examples:

```python
resample(balances, 'timestamp', '1s', 'last')
resample(trades[['timestamp', 'amount']], 'timestamp', '1s', 'sum')
```
|
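A runnable demo of the helper above with a synthetic `trades` frame (the data is made up for illustration, and the helper is repeated so the block is self-contained):

```python
import pandas as pd

def resample(df, column, resolution, how='last'):
    # Index on the timestamp column if not already datetime-indexed,
    # floor the index to the bucket edge, then aggregate per bucket.
    if not isinstance(df.index, pd.DatetimeIndex):
        df.set_index(column, inplace=True)
    df.index = df.index.floor(resolution)
    return getattr(df.groupby(column), how)()

# Synthetic trades: two in the same second, one five seconds later.
trades = pd.DataFrame({
    'timestamp': pd.to_datetime(['2020-01-01 00:00:00.1',
                                 '2020-01-01 00:00:00.9',
                                 '2020-01-01 00:00:05.0']),
    'amount': [1.0, 2.0, 3.0],
})
out = resample(trades[['timestamp', 'amount']], 'timestamp', '1s', 'sum')
print(out)
# Two buckets: 00:00:00 -> 3.0 and 00:00:05 -> 3.0
```

Note that after `set_index`, the `groupby(column)` call resolves the column name against the index, so only buckets containing rows are created.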
It is better to have |
@jreback You suggested (a long time ago in this thread): `s.resample('1s').dropna()` How would you recommend doing this nowadays? The way you suggested seems not to work with the latest Pandas version. I would like to have a series resampled/grouped by An example:
Instead of the result I got, I was looking for:
Or at least (I don't mind if the full resampled index gets filled in memory):
|
So I came up with this, using `min_count`:

```python
>>> series.resample('2D').sum(min_count=1).dropna()
2019-01-01    0.0
2019-01-05    3.0
2019-01-09    3.0
dtype: float64
```

Still fills the empty spaces with `NaN`, though. |
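The input series was not shown in the comment above; the following series is an assumption of mine that reproduces the quoted output, and demonstrates why `min_count=1` is needed:

```python
import pandas as pd

# Hypothetical sparse input (the original was not shown in the thread).
idx = pd.to_datetime(['2019-01-01', '2019-01-05', '2019-01-06', '2019-01-09'])
series = pd.Series([0.0, 1.0, 2.0, 3.0], index=idx)

# min_count=1 makes sum() return NaN (instead of 0) for empty bins,
# so dropna() can then remove them.
out = series.resample('2D').sum(min_count=1).dropna()
print(out)
```

Without `min_count=1`, empty bins would sum to `0.0` and survive the `dropna()`.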
Could you give me a few pointers on how to start? I would like to try implementing
|
Hmm, right now I don't really see how to do this cleanly. The Resampler is created here:
https://github.com/pandas-dev/pandas/blob/37f29e5bf967d76d76a05e84b5c9b14c9bc66f23/pandas/core/generic.py#L8404
However, the `fill_method` parameter only enters in the following function call:
https://github.com/pandas-dev/pandas/blob/37f29e5bf967d76d76a05e84b5c9b14c9bc66f23/pandas/core/generic.py#L8418
Furthermore, it is deprecated in favor of `.resample().method().fillmethod()`.
Making the Resampler dependent on `fill_method` would only reintroduce the coupling into the code. And `fill_method = 'drop'` sounds like a contradiction anyways. I think it might be cleaner to bypass `resample` and call `round()` and `groupby` as proposed above. It would however be nice to have a reference to this in `resample.fillna()`. Do you share this opinion? Is there somewhere else where this might be implemented? Wouldn't some way to group by approximate values, i.e. `.groupby(round='1s')`, be useful in other cases as well? |
Sorry, I'm not familiar with this section of the code. I don't know ahead of time what the best approach is.
|
So I tried this out by simply grouping by the rounded value. This works perfectly fine without intermediate objects. Below is a comparison with the workaround of @Peque. I would like to mention this (or a similar) example in the documentation of resample. I think I am not the only one who was misled to think that it is the correct function to group sparse data.

```python
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

np.random.seed(41)
i = pd.date_range('20140101', periods=100000000, freq='s')
s = Series(range(1000), index=i.take(np.random.randint(0, len(i), size=1000)))

%timeit s.resample('1H').sum(min_count=1).dropna()
%timeit s.groupby(s.index.floor('1H')).sum()
%timeit s.groupby(s.index.round('1H')).sum()
```
|
Since |
I'm using `resample()` to aggregate data in a timeframe. When doing such a call, `resample` fills with NaN all the (31536001 - 2) inexistent values, which ends up creating thousands of points and making Python use 500 MB+ of RAM. The thing is that I don't care about the NaN points, so I would like to not fill them in the `Series` and not have so much memory used. AFAICS `resample` does not offer such a `fill_method`.
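A minimal reproduction of the blow-up described above. The exact span is my assumption: a one-year range sampled at 1 s, which matches the 31536001 figure (365 × 86400 + 1 seconds):

```python
import pandas as pd

# Two points one year apart; mean() stands in for any aggregation.
s = pd.Series([1.0, 2.0], index=pd.to_datetime(['2014-01-01', '2015-01-01']))

# resample materializes a bin for every second between the first and last
# timestamp: 31536001 bins, all but 2 of them NaN.
out = s.resample('1s').mean()
print(len(out))               # 31536001
print(int(out.notna().sum())) # 2
```

The resulting `Series` holds tens of millions of float64 values plus a matching `DatetimeIndex`, which is where the hundreds of MB of memory go.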