BUG: to_datetime() returns object instead of datetime type or raising exception #42229

Closed
2 of 3 tasks
vlsd opened this issue Jun 25, 2021 · 5 comments · Fixed by #42244

vlsd commented Jun 25, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

# Uniform offsets: each call returns a tz-aware DatetimeIndex
print(pd.to_datetime(['2021-03-14 12:05:45 -0400', '2021-03-14 03:06:00 -0400']))
print(pd.to_datetime(['2021-03-14 01:06:30 -0500', '2021-03-13 23:23:40 -0500']))
# Mixed offsets (-0400 and -0500): the same input under each errors= mode
print(pd.to_datetime(['2021-03-14 03:06:00 -0400', '2021-03-14 01:06:30 -0500'], errors='raise'))
print(pd.to_datetime(['2021-03-14 03:06:00 -0400', '2021-03-14 01:06:30 -0500'], errors='coerce'))
print(pd.to_datetime(['2021-03-14 03:06:00 -0400', '2021-03-14 01:06:30 -0500'], errors='ignore'))

The output is

DatetimeIndex(['2021-03-14 12:05:45-04:00', '2021-03-14 03:06:00-04:00'], dtype='datetime64[ns, pytz.FixedOffset(-240)]', freq=None)
DatetimeIndex(['2021-03-14 01:06:30-05:00', '2021-03-13 23:23:40-05:00'], dtype='datetime64[ns, pytz.FixedOffset(-300)]', freq=None)
Index([2021-03-14 03:06:00-04:00, 2021-03-14 01:06:30-05:00], dtype='object')
Index([2021-03-14 03:06:00-04:00, 2021-03-14 01:06:30-05:00], dtype='object')
Index([2021-03-14 03:06:00-04:00, 2021-03-14 01:06:30-05:00], dtype='object')

Problem description

The first two calls work as expected. The remaining three neither return a DatetimeIndex nor raise an exception: they silently return an Index of dtype object, regardless of the errors setting. This behavior is undocumented and quite hard to debug or track down, since the failure is completely silent. The cause appears to be that the input contains times from two different time zones.

Expected Output

Either a DatetimeIndex object or an exception to be raised.
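
Until the function behaves that way, a caller-side check can approximate the expected behavior. The following is only a minimal sketch: `require_datetime_index` is a hypothetical helper name, not part of pandas, and the object-dtype Index in the failing case is built by hand so the example does not depend on how a given pandas version handles mixed offsets.

```python
import pandas as pd


def require_datetime_index(result):
    """Hypothetical guard: fail loudly if parsing did not yield a DatetimeIndex."""
    if not isinstance(result, pd.DatetimeIndex):
        raise TypeError(
            f"expected DatetimeIndex, got {type(result).__name__} "
            f"with dtype {result.dtype}"
        )
    return result


# Uniform offsets parse to a DatetimeIndex, so the guard passes the result through.
ok = require_datetime_index(
    pd.to_datetime(['2021-03-14 12:05:45 -0400', '2021-03-14 03:06:00 -0400'])
)

# A hand-built object-dtype Index stands in for the silent fallback described above.
try:
    require_datetime_index(pd.Index(["a", "b"], dtype=object))
except TypeError as exc:
    print(exc)
```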

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 7c48ff4
python : 3.9.1.final.0
python-bits : 64
OS : Darwin
OS-release : 20.5.0
Version : Darwin Kernel Version 20.5.0: Sat May 8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.5
numpy : 1.19.5
pytz : 2020.5
dateutil : 2.8.1
pip : 21.1.2
setuptools : 49.2.1
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.22.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.0
sqlalchemy : 1.4.2
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

vlsd added the Bug and Needs Triage labels Jun 25, 2021
@mroeschke
Member

Thanks for the report. We have been testing this behavior for a while, so it's unlikely to change. Documentation was recently added for this behavior (it will be in the 1.3 docs):

https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.to_datetime.html

@vlsd
Author

vlsd commented Jun 25, 2021

Would it be possible to move the notice about the return type sometimes being object (and when that's expected behavior) into the "Returns:" section of the documentation? It's very easy to miss at the bottom of the page.

@mroeschke
Member

Good point. The return type section should specifically call out that an Index of object dtype holding Timestamp objects is returned for mixed timezones.

mroeschke reopened this Jun 25, 2021
mroeschke added the Docs, good first issue, and Datetime labels and removed the Bug and Needs Triage labels Jun 25, 2021
@smarie
Contributor

smarie commented Jun 28, 2021

If I may jump in, I do not think that the current behaviour is intuitive for "the masses".

The fundamental problem is that most users do not understand the difference between the concepts of timezone and time offset. As a result, most available APIs (in many programming languages, not only Python) are of poor quality, because they do not convey a clear explanation. pandas could try to make it clear and easy, a bit like Moment.js does.

In real-world data, it is extremely frequent to

  • not see any time offset nor any timezone in a string, as in 2020-01-01
  • see the UTC timezone in the string, as in 2020-01-01T00:00:00Z
  • see a time offset (not a timezone) in the string, as in 2020-01-01T00:00:00+0100

Even though the stdlib's strptime() supports %Z to parse string-represented timezones (such as 'Europe/Paris'), we rarely meet those in datasets; far more often we meet time offsets like +0100, which can be parsed with %z.
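
The three string shapes listed above can be checked with the standard library alone. This is just an illustrative sketch; the variable names are made up for the example, and it relies on `%z` accepting a literal "Z", which holds on Python 3.7 and later.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%dT%H:%M:%S%z"  # %z matches a numeric UTC offset such as +0100

# A time offset (not a timezone): the result carries a fixed-offset tzinfo.
with_offset = datetime.strptime("2020-01-01T00:00:00+0100", FMT)

# The UTC designator "Z" is treated by %z as +00:00 (Python 3.7+).
with_utc = datetime.strptime("2020-01-01T00:00:00Z", FMT)

# No offset and no timezone in the string: the result is naive.
naive = datetime.strptime("2020-01-01", "%Y-%m-%d")

print(with_offset.utcoffset())
print(with_utc.utcoffset())
print(naive.tzinfo)
```

Note that the parsed offset is a `datetime.timezone` with a fixed shift; nothing in it says which named timezone (if any) produced that offset.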

To come back to the issue at hand, when I write this:

pd.to_datetime(["2020-10-25T01:30:00+02:00", "2020-10-25T02:00:00+01:00"])

The datetimes are not of mixed timezone. They are both in the Europe/Paris timezone. But the timezone information is not written in the string, because it is almost never written in the string (as explained above).
The resulting index has dtype O (even though the documentation says it shouldn't when the timezone is not mixed), and we have to set utc=True, which... well... is not intuitive at all, since the data is not UTC!
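
For reference, the utc=True route mentioned above looks roughly like this. It is a sketch under one assumption that the strings themselves cannot confirm: that the data is known to be Paris local time, so converting back with `tz_convert("Europe/Paris")` is meaningful.

```python
import pandas as pd

raw = ["2020-10-25T01:30:00+02:00", "2020-10-25T02:00:00+01:00"]

# utc=True normalizes every offset to UTC, giving one tz-aware DatetimeIndex.
as_utc = pd.to_datetime(raw, utc=True)

# Assumption: the data is Paris local time; the strings only carry offsets.
as_paris = as_utc.tz_convert("Europe/Paris")

print(type(as_utc).__name__, as_utc.tz)
print(list(as_paris.strftime("%Y-%m-%dT%H:%M:%S%z")))
```

After the conversion, both original offsets reappear (+0200 before the daylight saving change, +0100 after it), which illustrates that one timezone can carry more than one offset.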

If you agree, I would suggest, as a first step, modifying #42244 so that "mixed time offset" appears instead of "mixed timezone" (I'll make the suggestion directly in the PR). For future versions, we could maybe discuss how to improve or replace this "utc=True" parameter with a more readable, intuitive alternative...

What do you think?
Thanks for maintaining this great project

@vlsd
Author

vlsd commented Jun 28, 2021

@smarie That's a really good point. I hadn't noticed at first, but I think the data that prompted my confusion was, just as you describe, not in two different time zones but in a single time zone whose offset changed with daylight saving time. In fact, the entries at the start and end of my (long) time series had the same offset, so it took me forever to figure out what was happening. Perhaps the default behavior could be to print a warning or raise an exception when mixed-offset data is passed to the function, with an option to turn that off if the object dtype is desired as the return value.
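
To make the "raise on mixed offsets" idea concrete, here is a rough standard-library sketch. `parse_uniform_offset` is a hypothetical helper, not a pandas function or an existing option, and the default format string is simply the one matching the strings in the original report.

```python
from datetime import datetime


def parse_uniform_offset(values, fmt="%Y-%m-%d %H:%M:%S %z"):
    """Hypothetical helper: parse strings and raise if their UTC offsets differ."""
    parsed = [datetime.strptime(v, fmt) for v in values]
    # Collect the distinct offsets as strings such as "-0400".
    offsets = {dt.strftime("%z") for dt in parsed}
    if len(offsets) > 1:
        raise ValueError(f"mixed UTC offsets: {sorted(offsets)}")
    return parsed


# Same offset throughout: parsing succeeds.
same = parse_uniform_offset(
    ['2021-03-14 12:05:45 -0400', '2021-03-14 03:06:00 -0400']
)

# Offsets -0400 and -0500 in one input: the helper raises instead of staying silent.
try:
    parse_uniform_offset(
        ['2021-03-14 03:06:00 -0400', '2021-03-14 01:06:30 -0500']
    )
except ValueError as exc:
    print(exc)
```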

jreback added this to the 1.4 milestone Jul 1, 2021
smarie added a commit to smarie/pandas that referenced this issue Jul 12, 2021
Following pandas-dev#42244, improved documentation about datetime parsing.

See also pandas-dev#42229 (comment)