Skip to content

ENH: Format datetime.datetime and pd.Timestamp objects in pd.to_datetime #49338

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 15 commits into from
Closed

ENH: Format datetime.datetime and pd.Timestamp objects in pd.to_datetime #49338

wants to merge 15 commits into from

Conversation

aaossa
Copy link
Contributor

@aaossa aaossa commented Oct 26, 2022


Description of changes

When passing datetime.datetime or pd.Timestamp objects to pd.to_datetime, while using the format argument, a ValueError is raised because those object does not match the expected format.

The implemented solution looks for these non-string objects in the array_strptime method, and then each object is processed to maintain its contents (which might differ from the expected format for strings), and then continuing processing as usual.

When applying `pd.to_datetime` on array-like structure that contain a
`datetime.datetime` object, while using the `format` argument, a
`ValueError` is raised because the `datetime.datetime` object does not
match the expected format.

The implemented solution looks for `datetime.datetime` instances in the
`array_strptime` method. If an instance of this type is found, it's
properly handled by the new `_parse_python_datetime_object`, which
returns the expected Numpy datetime object.

Signed-off-by: Antonio Ossa Guerra <[email protected]>
@aaossa aaossa changed the title Parse datetime.datetime objects properly in pd.to_datetime ENH: Parse datetime.datetime objects properly in pd.to_datetime Oct 26, 2022
Two tests are introduced to check that `datetime.datetime` objects are
skipped while parsing array-like structures to `pd.to_datetime` and
the `format` argument is being used.

Before the fix, all of these tests and their test cases failed, except
`test_to_datetime_arraylike_contains_pydatetime_utc` when cache was
`True` and the `init_constructor` was `Index`.

The test creates an input of data that must be formated and a
`datetime.datetime` object. Then, the values are passed to the
array-like `init_constructor` to be converted to datetime using the
`pd.to_datetime` method. The test must not fail, and the result must
match the expected data.

Signed-off-by: Antonio Ossa Guerra <[email protected]>
The introduced enhancement of handling `datetime.datetime` objects in
the input of `pd.to_datetime` when using the `format` argument is now
included in the whatsnew

Signed-off-by: Antonio Ossa Guerra <[email protected]>
@aaossa aaossa marked this pull request as ready for review October 26, 2022 19:29
@mroeschke mroeschke added the Datetime Datetime data dtype label Oct 27, 2022
Instead of manually handling the `datetime.datetime` and `Timestamp`
objects by skipping formatting and leaving them untouched, apply the
passed format (`fmt`). This allow us to return a consistent output
while reusing existing parsing code

Signed-off-by: Antonio Ossa Guerra <[email protected]>
@aaossa aaossa marked this pull request as draft October 28, 2022 17:37
The new expecte bahavior is apply the format (`fmt`) on
`datetime.datetime` and `Timestamp` objects to create a consistent
output. This test includes some unexpected cases, like the limitation
of resolution on `datetime.datetime` objects (Python does not support
nanosecond resolution [1]) and also when formatting `Timestamp`
objects (same limitation [2])

[1]: python/cpython#59648
[2]: #29461

Signed-off-by: Antonio Ossa Guerra <[email protected]>
As described in the previous commit, the new expected behavior when
handling `datetime.datetime` and `Timestamp` objects changed, by
applying the passed format (`fmt`) instead of raising `ValueError`

Signed-off-by: Antonio Ossa Guerra <[email protected]>
@aaossa aaossa changed the title ENH: Parse datetime.datetime objects properly in pd.to_datetime ENH: Format datetime.datetime and pd.Timestamp objects in pd.to_datetime Oct 28, 2022
@aaossa aaossa marked this pull request as ready for review October 28, 2022 20:08
@aaossa aaossa requested review from mroeschke and jbrockmendel and removed request for mroeschke October 28, 2022 20:09
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for working on this

I've just left some comments

@aaossa aaossa marked this pull request as draft November 2, 2022 03:06
The `PyDateTime_Check` function is used in other parts of the source
to properly check if an object is an instance of type
`PyDateTime_DateTimeType` or a subtype of the same type.

Signed-off-by: Antonio Ossa Guerra <[email protected]>
Parametrized constructors add no information that is relevant in the
modified test, so they can be safely removed

Signed-off-by: Antonio Ossa Guerra <[email protected]>
When using ISO8601 the format is inferred, so no format has to be
explicitly defined. This test case includes input strings in ISO8601
to assert the behavior when the input includes `Timestamp` and
`datetime.datetime` objects

Signed-off-by: Antonio Ossa Guerra <[email protected]>
@aaossa aaossa marked this pull request as ready for review November 2, 2022 16:08
Improve description by explaining the change of behavior in the result

Signed-off-by: Antonio Ossa Guerra <[email protected]>
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating!

I think the handling of timezones might not be quite right:

In [2]: ts = Timestamp('1950-01-01').tz_localize('Europe/London')

In [3]: to_datetime([ts])
Out[3]: DatetimeIndex(['1950-01-01 00:00:00+00:00'], dtype='datetime64[ns, Europe/London]', freq=None)

In [4]: to_datetime([ts], format='%m-%d-%Y')
Out[4]: DatetimeIndex(['1950-01-01'], dtype='datetime64[ns]', freq=None)

@aaossa aaossa requested review from MarcoGorelli and removed request for jbrockmendel November 7, 2022 17:13
To avoid introducing an unexpected behavior, we'll skip objects that
are already `pd.Timestamp` or `datetime.datetime`

Signed-off-by: Antonio Ossa Guerra <[email protected]>
Simplify the tests further by testing specific cases that are actually
relevant to the introduced behavior. The first test includes cases for
both custom and ISO8601, while the second tests timezone-aware objects
while forcing UTC and while leaving the offsets alone

Signed-off-by: Antonio Ossa Guerra <[email protected]>
`pd.to_datetime` now simply skips `datetime.datetime` and `Timestamp`
objects when the `format` argument is present, instead of trying to
use the same format in the output

Signed-off-by: Antonio Ossa Guerra <[email protected]>
When using the `format` argument on `pd.to_timestamp`, and the input
contains localized `pd.Timestamp` or `datetime.datetime` objects, the
result should be localized too.

A bug was triggered when `format` did not contain `%z` or `%Z`,
because the `_array_strptime_with_fallback` function did not expect
any type different than string. Now this assumption is no longer
valid, since we now let `pd.Timestamp` and `datetime.datetime` objects
pass unmodified.

The fix is to change the condition that checks if returning a
localized output is necessary: instead of looking at the passed
`format`, return a localized result when any input object contains a
timezone.

Also, a new test is added to check this case. A case is included for a
`pd.Timestamp` and another for a `datetime.datetime` object, and each
compares the actual result with the expected result when the input is
localized and when is not.

Signed-off-by: Antonio Ossa Guerra <[email protected]>
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating

It looks like now, mixing timezone-aware and timezone-naive is allowed if format is specified? Is this right? Shouldn't the behaviour be consistent with when format is not passed?

In [1]: ts = Timestamp('1950-01-01').tz_localize('Europe/London')

In [2]: ts1 = Timestamp('1950-01-01')

In [3]: to_datetime([ts, ts1])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [3], line 1
----> 1 to_datetime([ts, ts1])

File ~/pandas-dev/pandas/core/tools/datetimes.py:1127, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
   1125         result = _convert_and_box_cache(argc, cache_array)
   1126     else:
-> 1127         result = convert_listlike(argc, format)
   1128 else:
   1129     result = convert_listlike(np.array([arg]), format)[0]

File ~/pandas-dev/pandas/core/tools/datetimes.py:440, in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
    438 assert format is None or infer_datetime_format
    439 utc = tz == "utc"
--> 440 result, tz_parsed = objects_to_datetime64ns(
    441     arg,
    442     dayfirst=dayfirst,
    443     yearfirst=yearfirst,
    444     utc=utc,
    445     errors=errors,
    446     require_iso8601=require_iso8601,
    447     allow_object=True,
    448 )
    450 if tz_parsed is not None:
    451     # We can take a shortcut since the datetime64 numpy array
    452     # is in UTC
    453     dta = DatetimeArray(result, dtype=tz_to_dtype(tz_parsed))

File ~/pandas-dev/pandas/core/arrays/datetimes.py:2204, in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
   2202 order: Literal["F", "C"] = "F" if flags.f_contiguous else "C"
   2203 try:
-> 2204     result, tz_parsed = tslib.array_to_datetime(
   2205         data.ravel("K"),
   2206         errors=errors,
   2207         utc=utc,
   2208         dayfirst=dayfirst,
   2209         yearfirst=yearfirst,
   2210         require_iso8601=require_iso8601,
   2211         allow_mixed=allow_mixed,
   2212     )
   2213     result = result.reshape(data.shape, order=order)
   2214 except OverflowError as err:
   2215     # Exception is raised when a part of date is greater than 32 bit signed int

File ~/pandas-dev/pandas/_libs/tslib.pyx:444, in pandas._libs.tslib.array_to_datetime()
    442 @cython.wraparound(False)
    443 @cython.boundscheck(False)
--> 444 cpdef array_to_datetime(
    445     ndarray[object] values,
    446     str errors='raise',

File ~/pandas-dev/pandas/_libs/tslib.pyx:553, in pandas._libs.tslib.array_to_datetime()
    551 found_naive = True
    552 if found_tz and not utc_convert:
--> 553     raise ValueError('Cannot mix tz-aware with '
    554                      'tz-naive values')
    555 if isinstance(val, _Timestamp):

ValueError: Cannot mix tz-aware with tz-naive values

In [4]: to_datetime([ts, ts1], format='%m-%d-%Y')
Out[4]: Index([1950-01-01 00:00:00+00:00, 1950-01-01 00:00:00], dtype='object')

@aaossa aaossa marked this pull request as draft November 8, 2022 15:04
Just like `to_datetime` when `format` is not passed, we should raise a
`ValueError` if the input contains some tz-aware objects and some
naive objects. The change is to identify if the `format` expects
tz-aware objects (check for `"%z"` and `"%Z"`) and then compare this
expectation with the `tzinfo` of each object. This modification
includes relevant tests.

Signed-off-by: Antonio Ossa Guerra <[email protected]>
@aaossa aaossa marked this pull request as ready for review November 8, 2022 18:55
@aaossa aaossa requested a review from MarcoGorelli November 8, 2022 18:55
Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sticking with this!

I'm still getting errors if we have timezone-aware Timestamps:

In [9]: In [1]: ts = Timestamp('1950-01-01').tz_localize('Europe/London')
   ...: 
   ...: In [2]: ts1 = Timestamp('1950-01-01').tz_localize('Europe/London')

In [10]: to_datetime([ts, ts1], format='%d-%m-%Y')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [10], line 1
----> 1 to_datetime([ts, ts1], format='%d-%m-%Y')

File ~/pandas-dev/pandas/core/tools/datetimes.py:1127, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
   1125         result = _convert_and_box_cache(argc, cache_array)
   1126     else:
-> 1127         result = convert_listlike(argc, format)
   1128 else:
   1129     result = convert_listlike(np.array([arg]), format)[0]

File ~/pandas-dev/pandas/core/tools/datetimes.py:432, in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
    429         format = None
    431 if format is not None:
--> 432     res = _to_datetime_with_format(
    433         arg, orig_arg, name, tz, format, exact, errors, infer_datetime_format
    434     )
    435     if res is not None:
    436         return res

File ~/pandas-dev/pandas/core/tools/datetimes.py:540, in _to_datetime_with_format(arg, orig_arg, name, tz, fmt, exact, errors, infer_datetime_format)
    537         return _box_as_indexlike(result, utc=utc, name=name)
    539 # fallback
--> 540 res = _array_strptime_with_fallback(
    541     arg, name, tz, fmt, exact, errors, infer_datetime_format
    542 )
    543 return res

File ~/pandas-dev/pandas/core/tools/datetimes.py:475, in _array_strptime_with_fallback(arg, name, tz, fmt, exact, errors, infer_datetime_format)
    472 utc = tz == "utc"
    474 try:
--> 475     result, timezones = array_strptime(arg, fmt, exact=exact, errors=errors)
    476 except OutOfBoundsDatetime:
    477     if errors == "raise":

File ~/pandas-dev/pandas/_libs/tslibs/strptime.pyx:151, in pandas._libs.tslibs.strptime.array_strptime()
    149     raise ValueError("Cannot mix tz-aware with tz-naive values")
    150 elif val.tzinfo is not None and not expect_tz_aware:
--> 151     raise ValueError("Cannot mix tz-aware with tz-naive values")
    152 result_timezone[i] = val.tzinfo
    153 continue

ValueError: Cannot mix tz-aware with tz-naive values

In this case, they're both tz-aware, so is it right that the error message refers to tz-naive?

Comment on lines +148 to +150
if val.tzinfo is None and expect_tz_aware:
raise ValueError("Cannot mix tz-aware with tz-naive values")
elif val.tzinfo is not None and not expect_tz_aware:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it work to simplify this into

Suggested change
if val.tzinfo is None and expect_tz_aware:
raise ValueError("Cannot mix tz-aware with tz-naive values")
elif val.tzinfo is not None and not expect_tz_aware:
if val.tzinfo is not None ^ expect_tz_aware:

?

@aaossa
Copy link
Contributor Author

aaossa commented Nov 12, 2022

In this case, they're both tz-aware, so is it right that the error message refers to tz-naive?

To be clear, should this case raise a ValueError or not? Is the message the problem? What's the expected behavior in this case?

Relatedly, what should happen if there are no string to format? That argument would not have any effect, is a warning necessary/appropriate?

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Nov 12, 2022

If they all have the same offset, then something like:

In [3]: to_datetime(['01-01-2000 00:00:00 +01:00', '01-02-2000 00:00:00 +01:00'], format='%d-%m-%Y %H:%M:%S %z')
Out[3]: DatetimeIndex(['2000-01-01 00:00:00+01:00', '2000-02-01 00:00:00+01:00'], dtype='datetime64[ns, pytz.FixedOffset(60)]', freq=None)

If they have different offsets:

In [4]: to_datetime(['01-01-2000 00:00:00 +01:00', '01-02-2000 00:00:00 +02:00'], format='%d-%m-%Y %H:%M:%S %z')
Out[4]: Index([2000-01-01 00:00:00+01:00, 2000-02-01 00:00:00+02:00], dtype='object')

Relatedly, what should happen if there are no string to format? That argument would not have any effect, is a warning necessary/appropriate?

🤔 that's a good one - tempted to say that we should just let the argument not have any effect, wouldn't bother with a warning

EDIT: I wasn't quite correct here. If they have different offsets, then it should be allowed if utc=True, and should raise otherwise. It should be consistent with when format is an ISO8601 format, which already allows for pydatetime objects

@MarcoGorelli
Copy link
Member

I'd suggest looking at

cpdef array_to_datetime(

and perhaps making a function to factor out from there, because in the ISO8601 case then it seems that datetime / Timestamp objects are handled fine:

In [10]: to_datetime(['2000-01-01', ts], format='%Y-%m-%d')
Out[10]: DatetimeIndex(['2000-01-01', '1950-01-01'], dtype='datetime64[ns]', freq=None)

In [11]: to_datetime(['2000-01-01', ts], format='%Y-%d-%m')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [11], line 1
----> 1 to_datetime(['2000-01-01', ts], format='%Y-%d-%m')

File ~/tmp/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:1100, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
   1098         result = _convert_and_box_cache(argc, cache_array)
   1099     else:
-> 1100         result = convert_listlike(argc, format)
   1101 else:
   1102     result = convert_listlike(np.array([arg]), format)[0]

File ~/tmp/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:430, in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
    427         format = None
    429 if format is not None:
--> 430     res = _to_datetime_with_format(
    431         arg, orig_arg, name, tz, format, exact, errors, infer_datetime_format
    432     )
    433     if res is not None:
    434         return res

File ~/tmp/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:538, in _to_datetime_with_format(arg, orig_arg, name, tz, fmt, exact, errors, infer_datetime_format)
    535         return _box_as_indexlike(result, utc=utc, name=name)
    537 # fallback
--> 538 res = _array_strptime_with_fallback(
    539     arg, name, tz, fmt, exact, errors, infer_datetime_format
    540 )
    541 return res

File ~/tmp/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:473, in _array_strptime_with_fallback(arg, name, tz, fmt, exact, errors, infer_datetime_format)
    470 utc = tz == "utc"
    472 try:
--> 473     result, timezones = array_strptime(arg, fmt, exact=exact, errors=errors)
    474 except OutOfBoundsDatetime:
    475     if errors == "raise":

File ~/tmp/.venv/lib/python3.10/site-packages/pandas/_libs/tslibs/strptime.pyx:156, in pandas._libs.tslibs.strptime.array_strptime()

ValueError: unconverted data remains:  00:00:00

@aaossa
Copy link
Contributor Author

aaossa commented Nov 30, 2022

Closing in favor of #49893

@aaossa aaossa closed this Nov 30, 2022
@aaossa aaossa deleted the issue-49298 branch November 30, 2022 17:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Do not fail when parsing pydatetime objects in pd.to_datetime
4 participants