Skip to content

BUG: resample seems to convert hours to 00:00 #34833

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
liverpool1026 opened this issue Jun 16, 2020 · 8 comments
Closed
3 tasks done

BUG: resample seems to convert hours to 00:00 #34833

liverpool1026 opened this issue Jun 16, 2020 · 8 comments
Labels
Bug Resample resample method

Comments

@liverpool1026
Copy link

liverpool1026 commented Jun 16, 2020

  • I have checked that this issue has not already been reported. (As far as I can see by using the search)

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

# Your code here
import datetime as dt
import pandas as pd

start_time = dt.datetime(year=2018, month=6, day=7, hour=18)
end_time = dt.datetime(year=2018, month=6, day=27, hour=18)
time_index_df = pd.date_range(
    start_time,
    end_time,
    freq="12H",
    name="datetime",
).to_frame(index=False)

time_index_df["test"] = 1
time_index_df["name"] = "Name"

time_index_df = time_index_df.resample("2d", convention="end", on="datetime").mean().reset_index()

print(time_index_df["datetime"].values)

time_index_df.to_csv(f"test_resample_{pd.__version__}")

Problem description

On pandas 0.23 the behaviour of resample will keep the correct datetime (18:00 in this case) after resample.
But starting from 0.24, after resample, the datetime is now converted to (00:00).
[this should explain why the current behaviour is a problem and why the expected output is a better solution]

Expected Output

On. 0.23
['2018-06-07T18:00:00.000000000' '2018-06-09T18:00:00.000000000'
'2018-06-11T18:00:00.000000000' '2018-06-13T18:00:00.000000000'
'2018-06-15T18:00:00.000000000' '2018-06-17T18:00:00.000000000'
'2018-06-19T18:00:00.000000000' '2018-06-21T18:00:00.000000000'
'2018-06-23T18:00:00.000000000' '2018-06-25T18:00:00.000000000'
'2018-06-27T18:00:00.000000000']

But Starting from 0.24 it is giving me
['2018-06-07T00:00:00.000000000' '2018-06-09T00:00:00.000000000'
'2018-06-11T00:00:00.000000000' '2018-06-13T00:00:00.000000000'
'2018-06-15T00:00:00.000000000' '2018-06-17T00:00:00.000000000'
'2018-06-19T00:00:00.000000000' '2018-06-21T00:00:00.000000000'
'2018-06-23T00:00:00.000000000' '2018-06-25T00:00:00.000000000'
'2018-06-27T00:00:00.000000000']

Output of pd.show_versions()

0.23
INSTALLED VERSIONS

commit: None
python: 3.6.9.final.0
python-bits: 64
OS: Linux
OS-release: 5.3.0-59-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: en_AU.UTF-8

pandas: 0.23.0
pytest: None
pip: 20.0.2
setuptools: 46.0.0
Cython: None
numpy: 1.18.5
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.1
pytz: 2020.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.8
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.2.19
pymysql: None
psycopg2: 2.8.5 (dt dec pq3 ext lo64)
jinja2: 2.11.2
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
None

0.24
INSTALLED VERSIONS

commit: None
python: 3.6.9.final.0
python-bits: 64
OS: Linux
OS-release: 5.3.0-59-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: en_AU.UTF-8

pandas: 0.24.2
pytest: None
pip: 20.0.2
setuptools: 46.0.0
Cython: None
numpy: 1.18.5
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.1
pytz: 2020.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.8
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.2.19
pymysql: None
psycopg2: 2.8.5 (dt dec pq3 ext lo64)
jinja2: 2.11.2
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
None

@liverpool1026 liverpool1026 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 16, 2020
@liverpool1026 liverpool1026 changed the title BUG: BUG: resample seems to convert hours to 00:00 Jun 16, 2020
@jreback
Copy link
Contributor

jreback commented Jun 16, 2020

you need to try in a much newer version
1.04 and also on master where resample has changed a bit

@liverpool1026
Copy link
Author

Can confirm the same behavior on 1.04 and master

1.04

['2018-06-07T00:00:00.000000000' '2018-06-09T00:00:00.000000000'
'2018-06-11T00:00:00.000000000' '2018-06-13T00:00:00.000000000'
'2018-06-15T00:00:00.000000000' '2018-06-17T00:00:00.000000000'
'2018-06-19T00:00:00.000000000' '2018-06-21T00:00:00.000000000'
'2018-06-23T00:00:00.000000000' '2018-06-25T00:00:00.000000000'
'2018-06-27T00:00:00.000000000']

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-59-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8

pandas : 1.0.4
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

master

['2018-06-07T00:00:00.000000000' '2018-06-09T00:00:00.000000000'
'2018-06-11T00:00:00.000000000' '2018-06-13T00:00:00.000000000'
'2018-06-15T00:00:00.000000000' '2018-06-17T00:00:00.000000000'
'2018-06-19T00:00:00.000000000' '2018-06-21T00:00:00.000000000'
'2018-06-23T00:00:00.000000000' '2018-06-25T00:00:00.000000000'
'2018-06-27T00:00:00.000000000']

INSTALLED VERSIONS

commit : 5fdd6f5
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-59-generic
Version : #53~18.04.1-Ubuntu SMP Thu Jun 4 14:58:26 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8

pandas : 1.1.0.dev0+1887.g5fdd6f50a
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 2.5.8
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.2.19
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 0.9.4
xlwt : None
numba : None

@dsandeep0138
Copy link

This is because the default value of "origin" in resample is "start_day". Please see below:

pandas/pandas/core/generic.py

Lines 7754 to 7768 in 4a267c6

def resample(
self,
rule,
axis=0,
closed: Optional[str] = None,
label: Optional[str] = None,
convention: str = "start",
kind: Optional[str] = None,
loffset=None,
base: Optional[int] = None,
on=None,
level=None,
origin: Union[str, TimestampConvertibleTypes] = "start_day",
offset: Optional[TimedeltaConvertibleTypes] = None,
) -> "Resampler":

The following are the definitions for the 'start' and 'start_day'. Since the default is 'start_day', the first day at midnight is considered. Hence hours you have mentioned is disregarded.

‘start’: origin is the first value of the timeseries
‘start_day’: origin is the first day at midnight of the timeseries

Looks like there was some issue due to which this is changed, and is described here: #31809

The default of "start" is more accurate in my opinion. @jreback @hasB4K @mroeschke Can you please comment further on this?

@jreback
Copy link
Contributor

jreback commented Jun 17, 2020

@dsandeep0138 pls have a read on the issue and the number of issues this patches

this change is much more consistent with respect to the frequency rather than just happens to work

@dsandeep0138
Copy link

@jreback Thanks for the comment. Yes, I understand the change is awesome, and deals with the inconsistencies in frequencies and fixes so many issues :)

Can you please comment on this particular bug if it is expected then? Should we document it, if it is so to avoid confusion? Thanks.

@hasB4K
Copy link
Member

hasB4K commented Jun 17, 2020

@dsandeep0138 I'll reply a bit later to explain everything, no worries 😉

@liverpool1026
Copy link
Author

@deepandas11 That makes sense on master, by setting origin="start" fixes the issue.

However, it will still be an issue in 0.24 and 1.04 given that the origin kwarg was not introduced then.

If I am using 0.24, what's the way to preserve the hours?

@hasB4K
Copy link
Member

hasB4K commented Jun 18, 2020

@liverpool1026 The behavior has been to follow start_day since 2012 as you can see in this commit: 31ca168. The behavior you are facing is actually a bug that has been fixed in v0.2.4 by #24159. Basically your current code is relying on a bug 😕 ...

The behavior of start_day of origin has been preserved as the default value in #31809 to avoid creating breaking changes.

The proof that your code is relying on a bug fixed in v0.2.4:

import datetime as dt
import pandas as pd

def print_resample(example_nb, start, end, resample_freq):
    print(f"\nEXAMPLE {example_nb}: {start} - {end} [{resample_freq}]")
    time_index_df = pd.date_range(start, end, freq="1H", name="datetime").to_frame(index=False)
    time_index_df["test"] = 1
    time_index_df = time_index_df.resample(resample_freq, convention="end", on="datetime").sum().reset_index()
    print(time_index_df)


print_resample(1, "2018-06-07 11:00", "2018-06-10 11:00", "2D")
print_resample(2, "2018-06-07 11:00", "2018-06-10 11:00", "12H")
print_resample(3, "2018-06-07 13:00", "2018-06-10 11:00", "12H")

Outputs (v0.2.3):

EXAMPLE 1: 2018-06-07 11:00 - 2018-06-10 11:00 [2D]
             datetime  test
0 2018-06-07 11:00:00    48
1 2018-06-09 11:00:00    25

EXAMPLE 2: 2018-06-07 11:00 - 2018-06-10 11:00 [12H]
             datetime  test
0 2018-06-07 00:00:00     1
1 2018-06-07 12:00:00    12
2 2018-06-08 00:00:00    12
3 2018-06-08 12:00:00    12
4 2018-06-09 00:00:00    12
5 2018-06-09 12:00:00    12
6 2018-06-10 00:00:00    12

EXAMPLE 3: 2018-06-07 13:00 - 2018-06-10 11:00 [12H]
             datetime  test
0 2018-06-07 12:00:00    11
1 2018-06-08 00:00:00    12
2 2018-06-08 12:00:00    12
3 2018-06-09 00:00:00    12
4 2018-06-09 12:00:00    12
5 2018-06-10 00:00:00    12
  • Example 1 is the buggy one since it uses a frequency bigger than 1D.
  • Example 3 is showing you that the behavior is not to start at midnight of the day of the first timestamp, but we are using the start of the day to align the bins. The argument origin of ENH: add 'origin' and 'offset' arguments to 'resample' and 'pd.Grouper' #31809 is just giving more choice to the user that want to specify on how to align the bins of the resampled data.

Now, that being said... What should you do to align from the start of your timeseries with those constraint before the version 1.1.0? Well first, I would advise to wait a few months the release of 1.1.0... But you could hack a bit around by converting this temporally into Timedeltas and it should work:

import datetime as dt
import pandas as pd

def print_resample_simulate_origin_start(example_nb, start, end, resample_freq):
    print(f"\nEXAMPLE {example_nb}: {start} - {end} [{resample_freq}]")
    time_index_df = pd.date_range(start, end, freq="1H", name="datetime").to_frame(index=False)
    time_index_df["test"] = 1
    
    # hack: transform datetime into timestamps to resample on start of the timeseries
    time_index_df["datetime"] -= pd.Timestamp(0)

    time_index_df = time_index_df.resample(resample_freq, convention="end", on="datetime").sum().reset_index()

    # hack: transform datetime back into timestamps
    time_index_df["datetime"] = pd.to_datetime(time_index_df["datetime"])
    print(time_index_df)


print_resample_simulate_origin_start(1, "2018-06-07 11:00", "2018-06-10 11:00", "2D")
print_resample_simulate_origin_start(2, "2018-06-07 11:00", "2018-06-10 11:00", "12H")
print_resample_simulate_origin_start(3, "2018-06-07 13:00", "2018-06-10 11:00", "12H")

Outputs (v0.2.3):


EXAMPLE 1: 2018-06-07 11:00 - 2018-06-10 11:00 [2D]
             datetime  test
0 2018-06-07 11:00:00    48
1 2018-06-09 11:00:00    25
2 2018-06-11 11:00:00     0

EXAMPLE 2: 2018-06-07 11:00 - 2018-06-10 11:00 [12H]
             datetime  test
0 2018-06-07 11:00:00    12
1 2018-06-07 23:00:00    12
2 2018-06-08 11:00:00    12
3 2018-06-08 23:00:00    12
4 2018-06-09 11:00:00    12
5 2018-06-09 23:00:00    12
6 2018-06-10 11:00:00     1

EXAMPLE 3: 2018-06-07 13:00 - 2018-06-10 11:00 [12H]
             datetime  test
0 2018-06-07 13:00:00    12
1 2018-06-08 01:00:00    12
2 2018-06-08 13:00:00    12
3 2018-06-09 01:00:00    12
4 2018-06-09 13:00:00    12
5 2018-06-10 01:00:00    11

I know this solution is not ideal... But again, this has been fixed in #31809 and it will be in the upcoming 1.1.0. I hope I have answered to your questions/issues @liverpool1026 and @dsandeep0138.

@hasB4K hasB4K added Resample resample method and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 18, 2020
@hasB4K hasB4K closed this as completed Jun 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Resample resample method
Projects
None yet
Development

No branches or pull requests

4 participants