resample becomes non-deterministic, depending on DateTimeIndex values #28675
Comments
PUSH - this issue is causing serious problems for me. Happy about any feedback. |
Your issue has been open for 3 days, and we have 3000 open issues. When or if someone has a chance, they can look. |
If you look at df_1 and df_2 after resampling, you'll see that one starts at 2019-09-30 09:40:59.984 and the other at 2019-09-30 09:41:59.986. Resampling does not start at the first value of the series, but at the start of the day. So when resampling, you are not necessarily grouping the data the way you think you are (it would work for a resampling frequency that is a divisor of one minute in your specific case). Maybe what you want to do is bin the values every 19 samples. This would give what you expect. Or look at the base option of resampling. But resampling is pretty deterministic. |
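The "bin every 19 samples" suggestion can be sketched as follows. This is only an illustration, not code from the thread: `make_frame` and `bin_every_n` are hypothetical helper names, and the grouping is done by row position with `groupby`, so the timestamps play no role in how rows are assigned to bins.

```python
import numpy as np
import pandas as pd


def make_frame(data, start):
    # Hypothetical helper: one sample per millisecond, starting at `start`.
    index = pd.Timestamp(start) + pd.to_timedelta(np.arange(len(data)), unit="ms")
    return pd.DataFrame({"value": data}, index=index)


def bin_every_n(df, n):
    # Hypothetical helper: group rows by position (0..n-1 -> bin 0, n..2n-1 -> bin 1, ...),
    # ignoring the index values entirely.
    groups = np.arange(len(df)) // n
    return df.groupby(groups).mean()


data = np.sin(np.arange(1000) / 30)
df_1 = make_frame(data, "2019-09-30 09:41:00")
df_2 = make_frame(data, "2019-09-30 09:42:00")

binned_1 = bin_every_n(df_1, 19)
binned_2 = bin_every_n(df_2, 19)

# Same data and same positional bins, so the two results should agree.
max_diff = float(np.max(np.abs(binned_1["value"].to_numpy() - binned_2["value"].to_numpy())))
print("max difference:", max_diff)
```

The trade-off is that the result is indexed by bin number rather than by time, so a time label per bin would have to be reattached separately if it is needed.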
Thank you @nrebena for your response.
What is the reason for this behavior? It does not make any sense to me. The
same data should be binned the exact same way, no matter what the start
time is - in my opinion.
Cheers,
Philip
|
The data are definitely binned in the same way, regardless of the first value. You can think of the bins as a fixed discretization of the time axis: they do not depend on the values, which is how it should be. Depending on what you really want to do, you could also look at |
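To make the "fixed discretization" point concrete, here is a small sketch (not from the thread; `first_bin_offset` is a hypothetical helper) that measures how far the first sample sits from the left edge of its bin under the default resample anchoring. Because 19 ms bins are laid out from the start of the day, the two start times land at different positions inside their first bin.

```python
import numpy as np
import pandas as pd


def first_bin_offset(start, freq="19ms", periods=1000):
    # Hypothetical helper: distance between the first sample and the label
    # (left edge) of the first resample bin, using default resample settings.
    index = pd.Timestamp(start) + pd.to_timedelta(np.arange(periods), unit="ms")
    series = pd.Series(np.sin(np.arange(periods) / 30), index=index)
    resampled = series.resample(freq).mean()
    return index[0] - resampled.index[0]


offset_1 = first_bin_offset("2019-09-30 09:41:00")
offset_2 = first_bin_offset("2019-09-30 09:42:00")
print(offset_1, offset_2)
```

With these inputs the offsets should be 16 ms and 14 ms, consistent with the first bin labels 09:40:59.984 and 09:41:59.986 quoted earlier in the thread. The first bin therefore holds a different number of samples in each case, and every later bin is shifted accordingly.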
#31809 should help fix this. I have reproduced your code using the new origin argument:

import numpy as np
import pandas as pd
import datetime as dt


def np_to_df(data, start_time):
    # Build a DatetimeIndex with one sample per millisecond from start_time.
    index = pd.DatetimeIndex(
        [start_time + dt.timedelta(milliseconds=t) for t in range(len(data))])
    df = pd.DataFrame(data, index=index)
    return df


data = np.sin(np.arange(1000) / 30)

# Same data, start times one minute apart.
df_1 = np_to_df(data, dt.datetime(2019, 9, 30, 9, 41))
df_2 = np_to_df(data, dt.datetime(2019, 9, 30, 9, 42))
print("error_1-2:", np.mean(np.abs(df_1.values - df_2.values)))

# origin="start" anchors the bins at the first timestamp of each frame.
df_1_resampled = df_1.resample("19ms", origin="start").mean()
df_2_resampled = df_2.resample("19ms", origin="start").mean()
print("error_1-2:", np.mean(np.abs(df_1_resampled.values - df_2_resampled.values)))

Output:
EDIT: use |
Minimal Example
Output:
Problem description
When you give the exact same data to the resample function, it becomes non-deterministic if the DateTimeIndex has differing values - even though the frequency is the same.
Expected Output
The values of the two DataFrames should be exactly the same.
Output of pd.show_versions()
Happy about any help, @jreback ?