Skip to content

BUG: to_datetime with box=False returning DatetimeIndex with mixed naive and aware datetime.datetimes #23722

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mroeschke opened this issue Nov 15, 2018 · 18 comments
Labels
Bug Datetime Datetime data dtype Timezones Timezone data dtype

Comments

@mroeschke
Copy link
Member

xref #23702 (comment)

In [1]: pd.__version__
Out[1]: '0.24.0.dev0+1022.ga23f90142'

In [2]: import pytz; import datetime

In [3]: pd.to_datetime([datetime.datetime(2018, 1, 1), datetime.datetime(2018, 1, 1, tzinfo=pytz.UTC)], box=False)
Out[3]: DatetimeIndex(['2018-01-01 00:00:00+00:00', '2018-01-01 00:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)

The current output also is incorrect if box=True I believe since the first argument is getting coerced to UTC.

Expected Result

In [3]: pd.to_datetime([datetime.datetime(2018, 1, 1), datetime.datetime(2018, 1, 1, tzinfo=pytz.UTC)], box=False)
Out[3]:
array([datetime.datetime(2018, 1, 1, 0, 0),
       datetime.datetime(2018, 1, 1, 0, 0, tzinfo=<UTC>)], dtype=object)

In [4]: pd.to_datetime([datetime.datetime(2018, 1, 1), datetime.datetime(2018, 1, 1, tzinfo=pytz.UTC)], box=True)
Out[4]: Index([2018-01-01 00:00:00, 2018-01-01 00:00:00+00:00], dtype='object')
@mroeschke mroeschke added Bug Datetime Datetime data dtype Timezones Timezone data dtype labels Nov 15, 2018
@mroeschke mroeschke changed the title BUG: to_datetime with box=False returning DatetimeIndex with mixed naive and aware datetimes BUG: to_datetime with box=False returning DatetimeIndex with mixed naive and aware datetime.datetimess Nov 15, 2018
@jbrockmendel
Copy link
Member

I'm pretty sure this ends up going through the datetime_to_datetime64 branch of to_datetime. The fix is going to involve merging that into array_to_datetime and handling the tzinfo checks more carefully (like you already did for the string cases).

Did the issue with hashability of dateutil timezones ever get resolved? I think we're going to need to use actual sets of tzinfo objects if extending that to the datetime cases.

BTW note that passing the list in the reversed order raises correctly.

@mroeschke
Copy link
Member Author

It appears there's some caution making dateutil timezones hashable: dateutil/dateutil#792 (comment)

Agreed this probably involves just merging datetime_to_datetime64 into array_to_datetime.

@mroeschke mroeschke changed the title BUG: to_datetime with box=False returning DatetimeIndex with mixed naive and aware datetime.datetimess BUG: to_datetime with box=False returning DatetimeIndex with mixed naive and aware datetime.datetimes Nov 15, 2018
@jbrockmendel
Copy link
Member

If hashing is still a problem then sets are out; could we map tzinfo objects to unique strings and use a dict? Actually we might even use the same dict that tslibs.timezones uses

@mroeschke
Copy link
Member Author

That could work. We could have a function that does the reverse of get_timezone that returns a string from a timezone object and just continue using sets. Would be manageable now with just pytz and dateutil timezones but haven't though too hard about the feasibility.

@jorisvandenbossche
Copy link
Member

Agreed that the boxing into an index is wrong, but the fact that it is actually parsed into datetime64, is that necessarily wrong? Or if we don't want to parse such a mixture of tz-aware/tz-naive, shouldn't the default be to raise an error?

@mroeschke
Copy link
Member Author

I guess in essence this appears as if to_datetime is then just wrapping this passed argument in either array(..., dtype=object) or Index(..., dtype=object), but is the intention of to_datetime to always coerce to datetime64 or raise (at least that was never clear to me)?

My impression was to always try preserve mixed types of arguments into a "lowest common datetime format"; therefore allowing cases like above to come in as an object array of datetime.datetimes

@jbrockmendel
Copy link
Member

The raise/not behavior should be determined by the errors kwarg I'd assume. It would not be surprising if the behavior is not internally consistent

@jorisvandenbossche
Copy link
Member

The expected behaviour is indeed not very well defined (at least not in the docstring). But there is the errors keyword, that is 'raise' by default and not 'ignore'.

Eg with out of range timestamps, by default it raises, but with 'ignore' it gives you back datetime.datetime objects:

In [10]: pd.to_datetime([datetime.datetime(1000, 1, 1)])
...
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 1000-01-01 00:00:00

In [11]: pd.to_datetime([datetime.datetime(1000, 1, 1)], errors='ignore') 
Out[11]: Index([1000-01-01 00:00:00], dtype='object')

In [14]: pd.to_datetime([datetime.datetime(1000, 1, 1)], errors='ignore', box=False)
Out[14]: array([datetime.datetime(1000, 1, 1, 0, 0)], dtype=object)

And as @jbrockmendel noted above, if you switch the order to put a tz-aware first, it also raises:

In [18]: pd.to_datetime([datetime.datetime(2018, 1, 1, tzinfo=pytz.UTC), datetime.datetime(2018, 1, 1)], box=False)
...
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True

Although here the errors='ignore' option does not work (it still raises the same error).

So there are clearly some inconsistencies that would be nice to solve.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Nov 16, 2018

And some more inconsistencies:

The equivalent with strings (one tz aware, one not), "works" as well (so this is similar as the case with datetime.datetime):

In [26]: pd.to_datetime(["2018-01-01 00:00:00+00:00", "2018-01-02 00:00:00"])
Out[26]: DatetimeIndex(['2018-01-01 00:00:00+00:00', '2018-01-02 00:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)

But now switching them in order returns datetime.datetime objects:

In [27]: pd.to_datetime(["2018-01-01 00:00:00", "2018-01-02 00:00:00+00:00"])
Out[27]: Index([2018-01-01 00:00:00, 2018-01-02 00:00:00+00:00], dtype='object')

In [28]: _.values
Out[28]: 
array([datetime.datetime(2018, 1, 1, 0, 0),
       datetime.datetime(2018, 1, 2, 0, 0, tzinfo=tzutc())], dtype=object)

@mroeschke
Copy link
Member Author

Well my impression is that errors is specifically for invalid parsing of dates and not that the input can be nicely converted to datetime64.

errors : {‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’

If ‘raise’, then invalid parsing will raise an exception
If ‘coerce’, then invalid parsing will be set as NaT
If ‘ignore’, then invalid parsing will return the input

So I think in the case of the original example input [datetime.datetime(2018, 1, 1), datetime.datetime(2018, 1, 1, tzinfo=pytz.UTC)] (similar to your last string example) these inputs can be "parsed" but shouldn't necessarily raise just because the inputs can't be nicely coerced to datetime64, so returning an array/Index of datetime.datetime objects is okay.

But I agree your last 2 examples should be made consistent regardless of order of arguments.

@jorisvandenbossche
Copy link
Member

these inputs can be "parsed" but shouldn't necessarily raise just because the inputs can't be nicely coerced to datetime64, so returning an array/Index of datetime.datetime objects is okay.

There's certainly some logic in there, but then from a user's standpoint: what's the point of returning those object dtype values (same as input)? I cannot do anything with it, I exactly passed them to to_datetime to make it proper datetime64 data.

Or differently put: if not with to_datetime, how can I convert those values (for this example of mixed datetime.datetime values) to datetime64[ns] ?

@jorisvandenbossche
Copy link
Member

Or differently put: if not with to_datetime, how can I convert those values (for this example of mixed datetime.datetime values) to datetime64[ns] ?

To answer myself: by passing utc=True it should still work I suppose.
But then, I personally find a default behaviour of raising an informative error message saying that you have mixed tz-aware/tz-naive data and that you can use utc=True more useful I think than silently returning the input objects.

@mroeschke
Copy link
Member Author

I cannot do anything with it, I exactly passed them to to_datetime to make it proper datetime64 data

So if this is the minimum assumption (which we should document) for to_datetime then yes, we shouldn't ever return datetime.datetime objects unless errors='ignore'. I can get on board with this behavior.

I have a slight suspicion some internals may depend on retaining mixed timezone or mixed tz-aware/tz-naive values (e.g. I just changed this for parsing utc offsets: https://pandas-docs.github.io/pandas-docs-travis/whatsnew/v0.24.0.html#parsing-datetime-strings-with-timezone-offsets)

@jbrockmendel
Copy link
Member

There are a number of overlapping issues involved here. The OutOfBoundsDatetime behavior is a particularly difficult one. To the extent possible, I think we should consider that separately for the time being.

boxing is currently done at several places inside to_datetime. As a first/early step, can we trim that down to being down in just one place?

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Nov 16, 2018

boxing is currently done at several places inside to_datetime. As a first/early step, can we trim that down to being down in just one place?

+1

The boxing can indeed be discussed separately, but I think the original issue was not only about boxing, but also the question about should it return datetime64 data or object data (or raise).

In any case, if you have a mixture of tz-aware and tz-naive values, the behaviour should not depend on the order and which of the two happened to be the first value ..

@jorisvandenbossche
Copy link
Member

So if this is the minimum assumption (which we should document) for to_datetime then yes, we shouldn't ever return datetime.datetime objects unless errors='ignore'. I can get on board with this behavior.

I am not sure it will be best in all cases or what we would break when trying to do this, but I think it is at least a "clear" explanation: the function returns datetime64[ns] data (boxed or not), or raises that it cannot do that (by default, the keywords can then still influence it)

@jbrockmendel
Copy link
Member

I just made a GH Project for to_datetime to collect all of these issues/discussions

@mroeschke
Copy link
Member Author

Since box was deprecated and removed. This is no longer an issue. Closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype Timezones Timezone data dtype
Projects
None yet
Development

No branches or pull requests

3 participants