
BUG: date_range returns wrong output in some cases #35342


Open
2 of 3 tasks
tnnmk opened this issue Jul 19, 2020 · 5 comments
Labels
Bug Datetime Datetime data dtype Frequency DateOffsets

Comments

@tnnmk

tnnmk commented Jul 19, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
print(pd.date_range('2020-02-01 15:00', '2020-05-12 13:00', freq='MS')) # wrong output
print(pd.date_range('2020-02-01 15:00', '2020-05-12 15:00', freq='MS')) # correct output
print(pd.date_range('2020-02-01 15:00', '2020-05-12 16:00', freq='MS')) # correct output
print(pd.date_range('2020-02-01 15:00', '2020-05-01 01:00', freq='MS')) # wrong output
print(pd.date_range('2020-02-01 15:00', '2020-05-01 23:00', freq='MS')) # correct output
print(pd.date_range('2020-01-21 15:00', '2020-05-12 13:00', freq='MS')) # correct output

Currently outputs:

DatetimeIndex(['2020-02-01 15:00:00', '2020-03-01 15:00:00',
               '2020-04-01 15:00:00'],
              dtype='datetime64[ns]', freq='MS')
DatetimeIndex(['2020-02-01 15:00:00', '2020-03-01 15:00:00',
               '2020-04-01 15:00:00', '2020-05-01 15:00:00'],
              dtype='datetime64[ns]', freq='MS')
DatetimeIndex(['2020-02-01 15:00:00', '2020-03-01 15:00:00',
               '2020-04-01 15:00:00', '2020-05-01 15:00:00'],
              dtype='datetime64[ns]', freq='MS')
DatetimeIndex(['2020-02-01 15:00:00', '2020-03-01 15:00:00',
               '2020-04-01 15:00:00'],
              dtype='datetime64[ns]', freq='MS')
DatetimeIndex(['2020-02-01 15:00:00', '2020-03-01 15:00:00',
               '2020-04-01 15:00:00', '2020-05-01 15:00:00'],
              dtype='datetime64[ns]', freq='MS')
DatetimeIndex(['2020-02-01 15:00:00', '2020-03-01 15:00:00',
               '2020-04-01 15:00:00', '2020-05-01 15:00:00'],
              dtype='datetime64[ns]', freq='MS')

Problem description

It seems that when freq='MS', the start date is the first day of a month, and the start's time of day is later than the end's time of day, the last month is missing from the output.

The same problem also seems to occur when freq='M' and the start date is the last day of a month (and the start's time of day is later than the end's time of day), as demonstrated below.

(Perhaps the problem could also occur with other values of the freq parameter. However, I only tested with 'MS' and 'M'.)
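
The condition described above can be written down as a small predicate. This is only a sketch of the hypothesis in this report, not of the pandas internals; the helper name `matches_reported_pattern` is made up for illustration.

```python
import pandas as pd


def matches_reported_pattern(start, end, offset):
    """Return True when (start, end) fits the pattern described in this report.

    Hypothetical helper: it only encodes the reporter's observation that the
    last point goes missing when `start` already lies on the offset and its
    time of day is later than the time of day of `end`.
    """
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    return bool(offset.is_on_offset(start)) and start.time() > end.time()


ms = pd.offsets.MonthBegin()  # the offset behind freq='MS'
cases = [
    ("2020-02-01 15:00", "2020-05-12 13:00"),
    ("2020-02-01 15:00", "2020-05-12 15:00"),
    ("2020-02-01 15:00", "2020-05-12 16:00"),
    ("2020-02-01 15:00", "2020-05-01 01:00"),
    ("2020-02-01 15:00", "2020-05-01 23:00"),
    ("2020-01-21 15:00", "2020-05-12 13:00"),
]
flags = [matches_reported_pattern(s, e, ms) for s, e in cases]
print(flags)
```

The flagged cases line up with the lines marked "wrong output" in the code sample above, which is consistent with the description but does not by itself prove where the bug is.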

Expected Output

I would expect every code line above to produce the following output:

DatetimeIndex(['2020-02-01 15:00:00', '2020-03-01 15:00:00',
               '2020-04-01 15:00:00', '2020-05-01 15:00:00'],
              dtype='datetime64[ns]', freq='MS')
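
One possible workaround, as a minimal sketch: build the range on normalized (midnight) timestamps and add the start's time of day back afterwards. `month_range_keep_time` is a hypothetical helper, not a pandas function, and whether this is acceptable depends on the use case. Note that, like the expected output above, it can return a last point that is later than `end` when `end` falls on the first of the month with an earlier time of day.

```python
import pandas as pd


def month_range_keep_time(start, end, freq="MS"):
    """Hypothetical workaround: generate dates at midnight, then restore the time."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    # Time of day of `start`, as a Timedelta since midnight.
    time_of_day = start - start.normalize()
    # With both bounds at midnight, the time-of-day comparison cannot drop a month.
    days = pd.date_range(start.normalize(), end.normalize(), freq=freq)
    return days + time_of_day


result = month_range_keep_time("2020-02-01 15:00", "2020-05-12 13:00")
print(result)
```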

Another code Sample, a copy-pastable example

Here is an example with freq='M'.

import pandas as pd
print(pd.date_range('2020-01-31 15:00', '2020-05-12 13:00', freq='M')) # wrong output
print(pd.date_range('2020-01-31 17:00', '2020-05-12 17:00', freq='M')) # correct output
print(pd.date_range('2020-01-30 15:00', '2020-05-12 13:00', freq='M')) # correct output

Expected Output

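Here the expected output of each code line above is the following (shown for the 17:00 start; the lines that start at 15:00 would show 15:00:00 instead):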

DatetimeIndex(['2020-01-31 17:00:00', '2020-02-29 17:00:00',
               '2020-03-31 17:00:00', '2020-04-30 17:00:00'],
              dtype='datetime64[ns]', freq='M')

Output of pd.show_versions()

(While the output below reports pandas version 1.0.3, I have also tested with version 1.0.5, with the same results.)

INSTALLED VERSIONS

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-62-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : fi_FI.UTF-8
LOCALE : fi_FI.UTF-8

pandas : 1.0.3
numpy : 1.19.0
pytz : 2019.3
dateutil : 2.8.1
pip : 9.0.1
setuptools : 46.1.3
Cython : 0.29.20
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 0.9.6
lxml.etree : 4.2.1
html5lib : 0.999999999
pymysql : None
psycopg2 : None
jinja2 : 2.10
IPython : 5.5.0
pandas_datareader: None
bs4 : 4.6.0
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.2.1
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.5.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.1.0
xlwt : None
xlsxwriter : 0.9.6
numba : None

@tnnmk tnnmk added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 19, 2020
@zky001

zky001 commented Jul 20, 2020

I will try to fix this problem

@zky001

zky001 commented Jul 20, 2020

I think this problem comes from the generate_range function in the pandas.core.arrays.datetimes.py file.

@zky001

zky001 commented Jul 20, 2020

I think this problem comes from the generate_range function in the pandas.core.arrays.datetimes.py file.

The error occurs at the call to offset.rollback(); that function returns a bad result.

zky001 added a commit to zky001/pandas that referenced this issue Jul 20, 2020
change some issues be metioned at pandas-dev#35342
@simonjayhawkins simonjayhawkins added Frequency DateOffsets Datetime Datetime data dtype and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 20, 2020
zky001 added a commit to zky001/pandas that referenced this issue Jul 23, 2020
fix issues be metioned at pandas-dev#35342
@jbrockmendel
Member

The issue here is in core.arrays.datetimes.generate_range in the lines

    elif end and not offset.is_on_offset(end):
        end = offset.rollback(end)

"2020-05-12 13:00" is getting rolled back to "2020-05-01 13:00"
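
The rollback step can be reproduced in isolation. A small sketch using the public `MonthBegin` offset (the offset behind freq='MS'); the variable names are only for illustration:

```python
import pandas as pd

offset = pd.offsets.MonthBegin()  # the offset behind freq='MS'
end = pd.Timestamp("2020-05-12 13:00")

# `end` is not on the first of a month, so it is not "on offset".
on_offset = offset.is_on_offset(end)

# rollback moves the date to the first of the month but keeps the time of day.
rolled_end = offset.rollback(end)

# The last point the reporter expects in the range.
last_expected = pd.Timestamp("2020-05-01 15:00")

print(on_offset)                   # False
print(rolled_end)                  # 2020-05-01 13:00:00
print(rolled_end < last_expected)  # True
```

Because the rolled-back end keeps its 13:00 time of day, it ends up earlier than "2020-05-01 15:00", so that point falls outside the range and is dropped.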

@olveirap

olveirap commented Feb 11, 2021

I'm having the same issue under a different usage. I'm currently splitting a query constrained by start and end timestamps into smaller chunks, using date_range to get the start and end of each subquery.

For example:

import pandas as pd
date_range = pd.date_range(start=pd.Timestamp("2021-01-01 00:00:00"), end=pd.Timestamp("2021-01-11 00:00:00"), freq="3D")

Returns: DatetimeIndex(['2021-01-01', '2021-01-04', '2021-01-07', '2021-01-10'], dtype='datetime64[ns]', freq='3D')

I was expecting DatetimeIndex(['2021-01-01', '2021-01-04', '2021-01-07', '2021-01-10', '2021-01-11'], dtype='datetime64[ns]', freq='3D'), or for something to be raised about this silent exclusion of the right boundary. I've been querying with 7D and 15D as the frequency and have been missing information for a long time.
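
For the chunking use case, one way to make sure the right boundary is always covered is to append `end` whenever the generated edges stop short of it. A minimal sketch, assuming half-open or closed handling of each chunk is decided by the caller; `chunk_bounds` is a hypothetical helper, not part of pandas:

```python
import pandas as pd


def chunk_bounds(start, end, freq):
    """Return (chunk_start, chunk_end) pairs that together cover [start, end]."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    edges = pd.date_range(start=start, end=end, freq=freq)
    # date_range stops at the last step that is <= end; add the boundary itself
    # when the steps do not land exactly on it.
    if len(edges) == 0 or edges[-1] < end:
        edges = edges.append(pd.DatetimeIndex([end]))
    return list(zip(edges[:-1], edges[1:]))


bounds = chunk_bounds("2021-01-01 00:00:00", "2021-01-11 00:00:00", "3D")
for left, right in bounds:
    print(left, right)
```

With the timestamps from the example above, the last pair runs from 2021-01-10 to 2021-01-11, so the final partial chunk is no longer skipped.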

5 participants