resample becomes non-deterministic, depending on DateTimeIndex values #28675

haeusser · 2019-09-30T00:27:12Z

Minimal Example

import datetime as dt
import numpy as np
import pandas as pd


def np_to_df(data, start_time):
    index = pd.DatetimeIndex(
        [start_time + dt.timedelta(milliseconds=t) for t in range(len(data))])
    df = pd.DataFrame(data, index=index)
    return df


# generate sample data
data = np.sin(np.arange(1000) / 30)

# create DataFrames with DateTimeIndices
df_1 = np_to_df(data, dt.datetime(2019, 9, 30, 9, 41))
df_2 = np_to_df(data, dt.datetime(2019, 9, 30, 9, 42))

# print difference before resampling
print("error_1-2:", np.mean(np.abs(df_1.values - df_2.values)))

# resample
df_1 = df_1.resample("19ms").mean()
df_2 = df_2.resample("19ms").mean()

# print difference after resampling
print("error_1-2:", np.mean(np.abs(df_1.values - df_2.values)))

Output:

error_1-2: 0.0
error_1-2: 0.04119868246404099

Problem description

When you give the exact same data to the resample function, it appears non-deterministic: the result changes if the DatetimeIndex has different values, even though the frequency is the same.

Expected Output

The values of the two DataFrames should be exactly the same.

Output of `pd.show_versions()`

commit : None
python : 3.6.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-51-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.2
dateutil : 2.8.0
pip : 9.0.1
setuptools : 41.0.1
Cython : None
pytest : 4.4.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.3.3
html5lib : 0.999999999
pymysql : None
psycopg2 : 2.7.7 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.1.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.3.3
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.7
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

Happy about any help, @jreback ?


haeusser · 2019-10-02T11:29:59Z

PUSH - this issue is causing serious problems for me. Happy about any feedback.

jreback · 2019-10-02T13:33:33Z

> PUSH - this issue is causing serious problems for me. Happy about any feedback.

Your issue has been open for 3 days.

We have 3000 open issues; when / if someone has a chance, they can look.

nrebena · 2019-10-03T20:56:13Z

If you look at df_1 and df_2 after resampling, you'll see that one starts at 2019-09-30 09:40:59.984 and the other at 2019-09-30 09:41:59.986. Resampling does not start at the first value of the series, but at the start of the day. So when resampling, you are not necessarily grouping the data the way you think (in your specific case, it would work for a resampling frequency that divides one minute evenly).

Maybe what you want is to bin the values every 19 samples. This would give what you expect. Or look at the base option of resample.

But resampling is pretty deterministic.
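
To make the start-of-day anchoring concrete, here is a minimal sketch in plain Python (no pandas) that reproduces the two first-bin edges quoted above. It assumes bins are left-closed and counted from midnight in steps of 19 ms; the helper name `first_bin_edge` is made up for illustration and is not part of pandas.

```python
import datetime as dt

BIN_MS = 19  # resampling frequency from the example, in milliseconds


def first_bin_edge(start: dt.datetime, bin_ms: int = BIN_MS) -> dt.datetime:
    """Left edge of the bin containing `start`, with bins counted from midnight.

    Hypothetical helper for illustration only; it mimics the anchoring
    described above under the assumption of left-closed bins.
    """
    midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    # Whole milliseconds elapsed since the start of the day.
    elapsed_ms = (start - midnight) // dt.timedelta(milliseconds=1)
    # Step back to the nearest multiple of bin_ms at or before `start`.
    return start - dt.timedelta(milliseconds=elapsed_ms % bin_ms)


for start in (dt.datetime(2019, 9, 30, 9, 41), dt.datetime(2019, 9, 30, 9, 42)):
    edge = first_bin_edge(start)
    offset_ms = (start - edge) // dt.timedelta(milliseconds=1)
    # With one sample per millisecond, the first bin only holds the samples
    # that fall between `start` and the next bin edge.
    print(start.time(), "->", edge.time(), "| samples in first bin:", BIN_MS - offset_ms)
```

One minute is 60,000 ms, which is not a multiple of 19, so shifting the start by a minute shifts the data relative to the bin edges: the first bin holds 3 samples in one case and 5 in the other, and every later bin is offset accordingly.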

haeusser · 2019-10-04T05:13:35Z

Thank you @nrebena for your response. What is the reason for this behavior? It does not make any sense to me. The same data should be binned the exact same way, no matter what the start time is - in my opinion. Cheers, Philip


nrebena · 2019-10-04T08:15:20Z

The data are definitely binned in the same way, no matter the first value. You can think of the bins as a fixed discretization of the time axis: they do not depend on the values, and they should not.

Depending on what you really want to do, you could also look at pandas.cut and define your own bins with a DatetimeIndex or something similar.
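
The suggestion of binning every 19 samples can be sketched as follows. This is one possible approach, not the only one: it groups by integer position rather than by timestamp, so the result cannot depend on the start time. The helper names `make_series` and `mean_every_n_samples` are made up for this example.

```python
import datetime as dt

import numpy as np
import pandas as pd


def make_series(data, start_time):
    """Attach a 1 ms spaced DatetimeIndex starting at `start_time` (illustrative helper)."""
    index = pd.date_range(start=start_time, periods=len(data), freq="ms")
    return pd.Series(data, index=index)


def mean_every_n_samples(series, n):
    """Average consecutive blocks of `n` samples, ignoring the timestamps (illustrative helper)."""
    groups = np.arange(len(series)) // n  # 0,0,...,0,1,1,...: one label per block of n rows
    return series.groupby(groups).mean()


data = np.sin(np.arange(1000) / 30)
s_1 = make_series(data, dt.datetime(2019, 9, 30, 9, 41))
s_2 = make_series(data, dt.datetime(2019, 9, 30, 9, 42))

binned_1 = mean_every_n_samples(s_1, 19)
binned_2 = mean_every_n_samples(s_2, 19)

print("max abs difference:", np.max(np.abs(binned_1.values - binned_2.values)))
```

The trade-off is that the result is indexed by block number instead of by time, and the approach only matches a 19 ms resample when the samples are evenly spaced with no gaps.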

hasB4K · 2020-04-11T16:19:12Z

#31809 should help fix this. I have rerun your code using the new origin argument:

import numpy as np
import pandas as pd
import datetime as dt

def np_to_df(data, start_time):
    index = pd.DatetimeIndex(
        [start_time + dt.timedelta(milliseconds=t) for t in range(len(data))])
    df = pd.DataFrame(data, index=index)
    return df

data = np.sin(np.arange(1000) / 30)

df_1 = np_to_df(data, dt.datetime(2019, 9, 30, 9, 41))
df_2 = np_to_df(data, dt.datetime(2019, 9, 30, 9, 42))

print("error_1-2:", np.mean(np.abs(df_1.values - df_2.values)))

df_1_resampled = df_1.resample("19ms", origin="start").mean()
df_2_resampled = df_2.resample("19ms", origin="start").mean()

print("error_1-2:", np.mean(np.abs(df_1_resampled.values - df_2_resampled.values)))

Output:

error_1-2: 0.0
error_1-2: 0.0

EDIT: use start option on origin.
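
As a small follow-up sketch of why `origin="start"` helps, the snippet below compares the first bin label with and without it. It assumes pandas 1.1 or later (where `origin` exists and defaults to `"start_day"`); the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

data = np.sin(np.arange(1000) / 30)
start = pd.Timestamp("2019-09-30 09:41:00")
series = pd.Series(data, index=pd.date_range(start, periods=len(data), freq="ms"))

# Default behaviour: bin edges are counted from midnight of the first day.
default_bins = series.resample("19ms").mean()

# origin="start": bin edges are counted from the first timestamp in the data.
start_bins = series.resample("19ms", origin="start").mean()

print("default first bin:", default_bins.index[0])
print("origin='start' first bin:", start_bins.index[0])
```

With `origin="start"` the first bin begins exactly at the first sample, so two series holding the same values are grouped identically regardless of where their index starts.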

jbrockmendel added the Resample resample method label Oct 16, 2019

hasB4K mentioned this issue Apr 11, 2020

ENH: add 'origin' and 'offset' arguments to 'resample' and 'pd.Grouper' #31809

Merged

9 tasks

jreback added this to the 1.1 milestone May 10, 2020

jreback closed this as completed in #31809 May 10, 2020
