BUG: resample seems to convert hours to 00:00 #34833

liverpool1026 · 2020-06-16T22:47:30Z

I have checked that this issue has not already been reported. (As far as I can see by using the search)
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

# Your code here
import datetime as dt
import pandas as pd

start_time = dt.datetime(year=2018, month=6, day=7, hour=18)
end_time = dt.datetime(year=2018, month=6, day=27, hour=18)
time_index_df = pd.date_range(
    start_time,
    end_time,
    freq="12H",
    name="datetime",
).to_frame(index=False)

time_index_df["test"] = 1
time_index_df["name"] = "Name"

time_index_df = time_index_df.resample("2d", convention="end", on="datetime").mean().reset_index()

print(time_index_df["datetime"].values)

time_index_df.to_csv(f"test_resample_{pd.__version__}")

Problem description

On pandas 0.23 the behaviour of resample will keep the correct datetime (18:00 in this case) after resample.
But starting from 0.24, after resample, the datetime is now converted to (00:00).
[this should explain why the current behaviour is a problem and why the expected output is a better solution]

Expected Output

On. 0.23
['2018-06-07T18:00:00.000000000' '2018-06-09T18:00:00.000000000'
'2018-06-11T18:00:00.000000000' '2018-06-13T18:00:00.000000000'
'2018-06-15T18:00:00.000000000' '2018-06-17T18:00:00.000000000'
'2018-06-19T18:00:00.000000000' '2018-06-21T18:00:00.000000000'
'2018-06-23T18:00:00.000000000' '2018-06-25T18:00:00.000000000'
'2018-06-27T18:00:00.000000000']

But Starting from 0.24 it is giving me
['2018-06-07T00:00:00.000000000' '2018-06-09T00:00:00.000000000'
'2018-06-11T00:00:00.000000000' '2018-06-13T00:00:00.000000000'
'2018-06-15T00:00:00.000000000' '2018-06-17T00:00:00.000000000'
'2018-06-19T00:00:00.000000000' '2018-06-21T00:00:00.000000000'
'2018-06-23T00:00:00.000000000' '2018-06-25T00:00:00.000000000'
'2018-06-27T00:00:00.000000000']

Output of `pd.show_versions()`

0.23
INSTALLED VERSIONS

commit: None
python: 3.6.9.final.0
python-bits: 64
OS: Linux
OS-release: 5.3.0-59-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: en_AU.UTF-8

pandas: 0.23.0
pytest: None
pip: 20.0.2
setuptools: 46.0.0
Cython: None
numpy: 1.18.5
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.1
pytz: 2020.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.8
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.2.19
pymysql: None
psycopg2: 2.8.5 (dt dec pq3 ext lo64)
jinja2: 2.11.2
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
None

0.24
INSTALLED VERSIONS

commit: None
python: 3.6.9.final.0
python-bits: 64
OS: Linux
OS-release: 5.3.0-59-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: en_AU.UTF-8

pandas: 0.24.2
pytest: None
pip: 20.0.2
setuptools: 46.0.0
Cython: None
numpy: 1.18.5
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.1
pytz: 2020.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.8
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.2.19
pymysql: None
psycopg2: 2.8.5 (dt dec pq3 ext lo64)
jinja2: 2.11.2
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
None

The text was updated successfully, but these errors were encountered:

jreback · 2020-06-16T23:12:43Z

you need to try in a much newer version
1.04 and also on master where resample has changed a bit

liverpool1026 · 2020-06-16T23:36:14Z

Can confirm the same behavior on 1.04 and master

1.04

['2018-06-07T00:00:00.000000000' '2018-06-09T00:00:00.000000000'
'2018-06-11T00:00:00.000000000' '2018-06-13T00:00:00.000000000'
'2018-06-15T00:00:00.000000000' '2018-06-17T00:00:00.000000000'
'2018-06-19T00:00:00.000000000' '2018-06-21T00:00:00.000000000'
'2018-06-23T00:00:00.000000000' '2018-06-25T00:00:00.000000000'
'2018-06-27T00:00:00.000000000']

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-59-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8

pandas : 1.0.4
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

master

['2018-06-07T00:00:00.000000000' '2018-06-09T00:00:00.000000000'
'2018-06-11T00:00:00.000000000' '2018-06-13T00:00:00.000000000'
'2018-06-15T00:00:00.000000000' '2018-06-17T00:00:00.000000000'
'2018-06-19T00:00:00.000000000' '2018-06-21T00:00:00.000000000'
'2018-06-23T00:00:00.000000000' '2018-06-25T00:00:00.000000000'
'2018-06-27T00:00:00.000000000']

INSTALLED VERSIONS

commit : 5fdd6f5
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-59-generic
Version : #53~18.04.1-Ubuntu SMP Thu Jun 4 14:58:26 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_AU.UTF-8
LOCALE : en_AU.UTF-8

pandas : 1.1.0.dev0+1887.g5fdd6f50a
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 2.5.8
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.2.19
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 0.9.4
xlwt : None
numba : None

dsandeep0138 · 2020-06-17T20:10:10Z

This is because the default value of "origin" in resample is "start_day". Please see below:

pandas/pandas/core/generic.py

Lines 7754 to 7768 in 4a267c6

    
           def resample( 
        
               self, 
        
               rule, 
        
               axis=0, 
        
               closed: Optional[str] = None, 
        
               label: Optional[str] = None, 
        
               convention: str = "start", 
        
               kind: Optional[str] = None, 
        
               loffset=None, 
        
               base: Optional[int] = None, 
        
               on=None, 
        
               level=None, 
        
               origin: Union[str, TimestampConvertibleTypes] = "start_day", 
        
               offset: Optional[TimedeltaConvertibleTypes] = None, 
        
           ) -> "Resampler":

The following are the definitions for the 'start' and 'start_day'. Since the default is 'start_day', the first day at midnight is considered. Hence hours you have mentioned is disregarded.

‘start’: origin is the first value of the timeseries
‘start_day’: origin is the first day at midnight of the timeseries

Looks like there was some issue due to which this is changed, and is described here: #31809

The default of "start" is more accurate in my opinion. @jreback @hasB4K @mroeschke Can you please comment further on this?

jreback · 2020-06-17T20:48:21Z

@dsandeep0138 pls have a read on the issue and the number of issues this patches

this change is much more consistent with respect to the frequency rather than just happens to work

dsandeep0138 · 2020-06-17T21:09:18Z

@jreback Thanks for the comment. Yes, I understand the change is awesome, and deals with the inconsistencies in frequencies and fixes so many issues :)

Can you please comment on this particular bug if it is expected then? Should we document it, if it is so to avoid confusion? Thanks.

hasB4K · 2020-06-17T21:12:59Z

@dsandeep0138 I'll reply a bit later to explain everything, no worries 😉

liverpool1026 · 2020-06-17T23:07:53Z

@deepandas11 That makes sense on master, by setting origin="start" fixes the issue.

However, it will still be an issue in 0.24 and 1.04 given that the origin kwarg was not introduced then.

If I am using 0.24, what's the way to preserve the hours?

hasB4K · 2020-06-18T00:49:14Z

@liverpool1026 The behavior has been to follow start_day since 2012 as you can see in this commit: 31ca168. The behavior you are facing is actually a bug that has been fixed in v0.2.4 by #24159. Basically your current code is relying on a bug 😕 ...

The behavior of start_day of origin has been preserved as the default value in #31809 to avoid creating breaking changes.

The proof that your code is relying on a bug fixed in v0.2.4:

import datetime as dt
import pandas as pd

def print_resample(example_nb, start, end, resample_freq):
    print(f"\nEXAMPLE {example_nb}: {start} - {end} [{resample_freq}]")
    time_index_df = pd.date_range(start, end, freq="1H", name="datetime").to_frame(index=False)
    time_index_df["test"] = 1
    time_index_df = time_index_df.resample(resample_freq, convention="end", on="datetime").sum().reset_index()
    print(time_index_df)


print_resample(1, "2018-06-07 11:00", "2018-06-10 11:00", "2D")
print_resample(2, "2018-06-07 11:00", "2018-06-10 11:00", "12H")
print_resample(3, "2018-06-07 13:00", "2018-06-10 11:00", "12H")

Outputs (v0.2.3):

EXAMPLE 1: 2018-06-07 11:00 - 2018-06-10 11:00 [2D]
             datetime  test
0 2018-06-07 11:00:00    48
1 2018-06-09 11:00:00    25

EXAMPLE 2: 2018-06-07 11:00 - 2018-06-10 11:00 [12H]
             datetime  test
0 2018-06-07 00:00:00     1
1 2018-06-07 12:00:00    12
2 2018-06-08 00:00:00    12
3 2018-06-08 12:00:00    12
4 2018-06-09 00:00:00    12
5 2018-06-09 12:00:00    12
6 2018-06-10 00:00:00    12

EXAMPLE 3: 2018-06-07 13:00 - 2018-06-10 11:00 [12H]
             datetime  test
0 2018-06-07 12:00:00    11
1 2018-06-08 00:00:00    12
2 2018-06-08 12:00:00    12
3 2018-06-09 00:00:00    12
4 2018-06-09 12:00:00    12
5 2018-06-10 00:00:00    12

Example 1 is the buggy one since it uses a frequency bigger than 1D.
Example 3 is showing you that the behavior is not to start at midnight of the day of the first timestamp, but we are using the start of the day to align the bins. The argument origin of ENH: add 'origin' and 'offset' arguments to 'resample' and 'pd.Grouper' #31809 is just giving more choice to the user that want to specify on how to align the bins of the resampled data.

Now, that being said... What should you do to align from the start of your timeseries with those constraint before the version 1.1.0? Well first, I would advise to wait a few months the release of 1.1.0... But you could hack a bit around by converting this temporally into Timedeltas and it should work:

import datetime as dt
import pandas as pd

def print_resample_simulate_origin_start(example_nb, start, end, resample_freq):
    print(f"\nEXAMPLE {example_nb}: {start} - {end} [{resample_freq}]")
    time_index_df = pd.date_range(start, end, freq="1H", name="datetime").to_frame(index=False)
    time_index_df["test"] = 1
    
    # hack: transform datetime into timestamps to resample on start of the timeseries
    time_index_df["datetime"] -= pd.Timestamp(0)

    time_index_df = time_index_df.resample(resample_freq, convention="end", on="datetime").sum().reset_index()

    # hack: transform datetime back into timestamps
    time_index_df["datetime"] = pd.to_datetime(time_index_df["datetime"])
    print(time_index_df)


print_resample_simulate_origin_start(1, "2018-06-07 11:00", "2018-06-10 11:00", "2D")
print_resample_simulate_origin_start(2, "2018-06-07 11:00", "2018-06-10 11:00", "12H")
print_resample_simulate_origin_start(3, "2018-06-07 13:00", "2018-06-10 11:00", "12H")

Outputs (v0.2.3):


EXAMPLE 1: 2018-06-07 11:00 - 2018-06-10 11:00 [2D]
             datetime  test
0 2018-06-07 11:00:00    48
1 2018-06-09 11:00:00    25
2 2018-06-11 11:00:00     0

EXAMPLE 2: 2018-06-07 11:00 - 2018-06-10 11:00 [12H]
             datetime  test
0 2018-06-07 11:00:00    12
1 2018-06-07 23:00:00    12
2 2018-06-08 11:00:00    12
3 2018-06-08 23:00:00    12
4 2018-06-09 11:00:00    12
5 2018-06-09 23:00:00    12
6 2018-06-10 11:00:00     1

EXAMPLE 3: 2018-06-07 13:00 - 2018-06-10 11:00 [12H]
             datetime  test
0 2018-06-07 13:00:00    12
1 2018-06-08 01:00:00    12
2 2018-06-08 13:00:00    12
3 2018-06-09 01:00:00    12
4 2018-06-09 13:00:00    12
5 2018-06-10 01:00:00    11

I know this solution is not ideal... But again, this has been fixed in #31809 and it will be in the upcoming 1.1.0. I hope I have answered to your questions/issues @liverpool1026 and @dsandeep0138.

liverpool1026 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 16, 2020

liverpool1026 changed the title ~~BUG:~~ BUG: resample seems to convert hours to 00:00 Jun 16, 2020

hasB4K added Resample resample method and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 18, 2020

hasB4K closed this as completed Jun 18, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: resample seems to convert hours to 00:00 #34833

BUG: resample seems to convert hours to 00:00 #34833

liverpool1026 commented Jun 16, 2020 •

edited

Loading

0.23
INSTALLED VERSIONS

0.24
INSTALLED VERSIONS

jreback commented Jun 16, 2020

liverpool1026 commented Jun 16, 2020

dsandeep0138 commented Jun 17, 2020

jreback commented Jun 17, 2020

dsandeep0138 commented Jun 17, 2020

hasB4K commented Jun 17, 2020

liverpool1026 commented Jun 17, 2020

hasB4K commented Jun 18, 2020 •

edited

Loading

BUG: resample seems to convert hours to 00:00 #34833

BUG: resample seems to convert hours to 00:00 #34833

Comments

liverpool1026 commented Jun 16, 2020 • edited Loading

Code Sample, a copy-pastable example

Problem description

Expected Output

Output of pd.show_versions()

0.23 INSTALLED VERSIONS

0.24 INSTALLED VERSIONS

jreback commented Jun 16, 2020

liverpool1026 commented Jun 16, 2020

1.04

INSTALLED VERSIONS

master

INSTALLED VERSIONS

dsandeep0138 commented Jun 17, 2020

jreback commented Jun 17, 2020

dsandeep0138 commented Jun 17, 2020

hasB4K commented Jun 17, 2020

liverpool1026 commented Jun 17, 2020

hasB4K commented Jun 18, 2020 • edited Loading

liverpool1026 commented Jun 16, 2020 •

edited

Loading

Output of `pd.show_versions()`

0.23
INSTALLED VERSIONS

0.24
INSTALLED VERSIONS

hasB4K commented Jun 18, 2020 •

edited

Loading