BUG: to_datetime() returns object instead of datetime type or raising exception #42229

vlsd · 2021-06-25T16:59:35Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd

# Each call below has a single offset across its inputs.
print(pd.to_datetime(['2021-03-14 12:05:45 -0400', '2021-03-14 03:06:00 -0400']))
print(pd.to_datetime(['2021-03-14 01:06:30 -0500', '2021-03-13 23:23:40 -0500']))
# Each call below mixes -0400 and -0500 in one input, once per `errors` mode.
print(pd.to_datetime(['2021-03-14 03:06:00 -0400', '2021-03-14 01:06:30 -0500'], errors='raise'))
print(pd.to_datetime(['2021-03-14 03:06:00 -0400', '2021-03-14 01:06:30 -0500'], errors='coerce'))
print(pd.to_datetime(['2021-03-14 03:06:00 -0400', '2021-03-14 01:06:30 -0500'], errors='ignore'))

The output is

DatetimeIndex(['2021-03-14 12:05:45-04:00', '2021-03-14 03:06:00-04:00'], dtype='datetime64[ns, pytz.FixedOffset(-240)]', freq=None)
DatetimeIndex(['2021-03-14 01:06:30-05:00', '2021-03-13 23:23:40-05:00'], dtype='datetime64[ns, pytz.FixedOffset(-300)]', freq=None)
Index([2021-03-14 03:06:00-04:00, 2021-03-14 01:06:30-05:00], dtype='object')
Index([2021-03-14 03:06:00-04:00, 2021-03-14 01:06:30-05:00], dtype='object')
Index([2021-03-14 03:06:00-04:00, 2021-03-14 01:06:30-05:00], dtype='object')

Problem description

The first two calls work as expected, but the last three neither return a DatetimeIndex nor raise an exception. This behavior is undocumented and quite hard to debug or track down, since the failure is completely silent. The cause seems to be that the input contains times from two different time zones.
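
One way to check that hypothesis without pandas is to parse the same strings with the standard library and collect the UTC offsets they carry. This is only a minimal sketch: the helper name `distinct_offsets` and the `FORMAT` constant are hypothetical, and the format string assumes the exact layout used in the sample above.

```python
from datetime import datetime, timedelta

# Assumed layout, matching the strings in the code sample of this report.
FORMAT = "%Y-%m-%d %H:%M:%S %z"


def distinct_offsets(values):
    """Return the set of UTC offsets (as timedelta) found in the strings."""
    return {datetime.strptime(v, FORMAT).utcoffset() for v in values}


same = ['2021-03-14 12:05:45 -0400', '2021-03-14 03:06:00 -0400']
mixed = ['2021-03-14 03:06:00 -0400', '2021-03-14 01:06:30 -0500']

print(len(distinct_offsets(same)))   # 1: a single offset
print(len(distinct_offsets(mixed)))  # 2: two different offsets
```

More than one distinct offset in the input is the condition under which the calls above came back with object dtype.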

Expected Output

Either a DatetimeIndex object or an exception to be raised.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 7c48ff4
python : 3.9.1.final.0
python-bits : 64
OS : Darwin
OS-release : 20.5.0
Version : Darwin Kernel Version 20.5.0: Sat May 8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.5
numpy : 1.19.5
pytz : 2020.5
dateutil : 2.8.1
pip : 21.1.2
setuptools : 49.2.1
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.22.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.0
sqlalchemy : 1.4.2
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None


mroeschke · 2021-06-25T17:19:09Z

Thanks for the report. We have been testing this behavior for a while so it's unlikely to change. Documentation was recently added for this behavior (which will be in the 1.3 docs)

https://pandas.pydata.org/pandas-docs/dev/reference/api/pandas.to_datetime.html

vlsd · 2021-06-25T18:08:20Z

Would it be possible to move the notice about the return type sometimes being object (and when that's expected behavior) into the "Returns:" section of the documentation? It's very easy to miss at the bottom of the page.

mroeschke · 2021-06-25T18:09:51Z

Good point. The return type section should specifically call out an Index of object dtype holding Timestamp objects with mixed timezones.

smarie · 2021-06-28T20:21:10Z

If I may jump in, I do not think that the current behaviour is intuitive for "the masses".

The fundamental problem is that most users do not understand the difference between the concepts of timezone and time offset. As a result, most available APIs (in many programming languages, not only Python) are of poor quality, because they do not explain the distinction clearly. pandas could try to make it clear and easy, a bit like Moment.js does.

In real-world data, it is extremely frequent to:

- not see any time offset nor any timezone in a string, as in 2020-01-01
- see the UTC timezone in the string, as in 2020-01-01T00:00:00Z
- see a time offset (not a timezone) in the string, as in 2020-01-01T00:00:00+0100

Even though the stdlib's strptime() has a %Z directive for timezone names, we rarely meet named timezones (such as 'Europe/Paris') in datasets; far more often we meet time offsets like +0100, which can be parsed with %z.
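
To illustrate the three cases, here is a small standard-library sketch that parses each example string. The format strings are my own choice for these particular inputs, and accepting a trailing `Z` with `%z` assumes Python 3.7 or later.

```python
from datetime import datetime, timedelta

# Case 1: no offset and no timezone in the string -> naive datetime.
naive = datetime.strptime("2020-01-01", "%Y-%m-%d")

# Case 2: explicit UTC marker; %z accepts "Z" on Python 3.7+.
utc = datetime.strptime("2020-01-01T00:00:00Z", "%Y-%m-%dT%H:%M:%S%z")

# Case 3: a fixed time offset, which is not a timezone name.
offset = datetime.strptime("2020-01-01T00:00:00+0100", "%Y-%m-%dT%H:%M:%S%z")

print(naive.tzinfo)                              # None
print(utc.utcoffset() == timedelta(0))           # True
print(offset.utcoffset() == timedelta(hours=1))  # True
```

In the third case the parsed object only knows "one hour ahead of UTC"; nothing in it says which timezone produced that offset.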

To come back to the issue at hand, when I write this:

pd.to_datetime(["2020-10-25T01:30:00+02:00", "2020-10-25T02:00:00+01:00"])

The datetimes are not of mixed timezone: they are both in the Europe/Paris timezone. But the timezone information is not written in the string, because it almost never is (as explained above).
The resulting index has object dtype (even though the documentation states that it shouldn't when the timezone is not mixed), and we have to set the utc=True flag, which is not intuitive at all, since the data is not UTC!
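
For reference, the workaround I am describing looks roughly like the sketch below: parse with `utc=True`, then convert back with `tz_convert`. It assumes we already know from outside the data that the timestamps are Europe/Paris wall-clock times, since that information is not in the strings.

```python
import pandas as pd

values = ["2020-10-25T01:30:00+02:00", "2020-10-25T02:00:00+01:00"]

# utc=True converts every offset to UTC, so the result has one tz-aware dtype.
as_utc = pd.to_datetime(values, utc=True)

# Assumption: the data is known to be Europe/Paris local time.
local = as_utc.tz_convert("Europe/Paris")

print(local)
```

After the conversion the two timestamps show their original +02:00 and +01:00 offsets again, this time inside a single DatetimeIndex.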

If you agree, I would suggest, as a first step, modifying #42244 so that "mixed time offset" appears instead of "mixed timezone" (I'll make the suggestion directly in the PR). For future versions, we could maybe discuss how to improve or replace this utc=True parameter with a more readable, more intuitive alternative.

What do you think?
Thanks for maintaining this great project

vlsd · 2021-06-28T21:01:36Z

@smarie That's a really good point. I hadn't even noticed at first, but I think the data that prompted my confusion was, just like you describe, not in two different time zones but in one time zone whose offset changed with daylight saving time. In fact, the entries at the start and end of my (long) timeseries had the same offset, so it took me forever to figure out what was happening. Perhaps the default behavior could be to print a warning or raise an exception when mixed-offset data is sent to the function, with an option to turn that off if the object dtype is desired as a return value.
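
In the meantime, something like the following user-side wrapper would give the "fail loudly" behavior. This is only a sketch: `to_datetime_strict` is a hypothetical helper, not a pandas API, and it assumes list-like input. Depending on the pandas version, `pd.to_datetime` may itself warn or raise on mixed offsets, which the wrapper tolerates.

```python
import warnings

import pandas as pd


def to_datetime_strict(values):
    """Hypothetical helper: return a DatetimeIndex or raise ValueError."""
    with warnings.catch_warnings():
        # Some pandas versions emit a FutureWarning for mixed offsets;
        # it is silenced here because the check below handles that case.
        warnings.simplefilter("ignore", FutureWarning)
        # Some pandas versions may raise ValueError here on their own.
        result = pd.to_datetime(values)
    if not isinstance(result, pd.DatetimeIndex):
        raise ValueError(
            f"expected a DatetimeIndex, got dtype {result.dtype!r}; "
            "the input may contain mixed UTC offsets (consider utc=True)"
        )
    return result


ok = to_datetime_strict(['2021-03-14 12:05:45 -0400', '2021-03-14 03:06:00 -0400'])
print(type(ok).__name__)
```

Anyone who actually wants the object-dtype result can keep calling `pd.to_datetime` directly.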


vlsd added the Bug and Needs Triage labels Jun 25, 2021

mroeschke closed this as completed Jun 25, 2021

mroeschke reopened this Jun 25, 2021

mroeschke added the Docs, good first issue, and Datetime labels and removed the Bug and Needs Triage labels Jun 25, 2021

prabha-git mentioned this issue Jun 26, 2021

Updated the return type section of to_datetime #42244

Merged

4 tasks

jreback added this to the 1.4 milestone Jul 1, 2021

mroeschke closed this as completed in #42244 Jul 11, 2021

smarie added a commit to smarie/pandas that referenced this issue Jul 12, 2021

Update datetimes.py

bd4061c

Following pandas-dev#42244 , improved documentation about datetime parsing. See also pandas-dev#42229 (comment)

smarie mentioned this issue Jul 12, 2021

Improved docstring and return type hints for to_datetime #42494

Merged

1 task
