-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Format datetime.datetime
and pd.Timestamp
objects in pd.to_datetime
#49338
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
When applying `pd.to_datetime` on array-like structure that contain a `datetime.datetime` object, while using the `format` argument, a `ValueError` is raised because the `datetime.datetime` object does not match the expected format. The implemented solution looks for `datetime.datetime` instances in the `array_strptime` method. If an instance of this type is found, it's properly handled by the new `_parse_python_datetime_object`, which returns the expected Numpy datetime object. Signed-off-by: Antonio Ossa Guerra <[email protected]>
datetime.datetime
objects properly in pd.to_datetime
datetime.datetime
objects properly in pd.to_datetime
Two tests are introduced to check that `datetime.datetime` objects are skipped while parsing array-like structures to `pd.to_datetime` and the `format` argument is being used. Before the fix, all of these tests and their test cases failed, except `test_to_datetime_arraylike_contains_pydatetime_utc` when cache was `True` and the `init_constructor` was `Index`. The test creates an input of data that must be formated and a `datetime.datetime` object. Then, the values are passed to the array-like `init_constructor` to be converted to datetime using the `pd.to_datetime` method. The test must not fail, and the result must match the expected data. Signed-off-by: Antonio Ossa Guerra <[email protected]>
The introduced enhancement of handling `datetime.datetime` objects in the input of `pd.to_datetime` when using the `format` argument is now included in the whatsnew Signed-off-by: Antonio Ossa Guerra <[email protected]>
Instead of manually handling the `datetime.datetime` and `Timestamp` objects by skipping formatting and leaving them untouched, apply the passed format (`fmt`). This allow us to return a consistent output while reusing existing parsing code Signed-off-by: Antonio Ossa Guerra <[email protected]>
The new expecte bahavior is apply the format (`fmt`) on `datetime.datetime` and `Timestamp` objects to create a consistent output. This test includes some unexpected cases, like the limitation of resolution on `datetime.datetime` objects (Python does not support nanosecond resolution [1]) and also when formatting `Timestamp` objects (same limitation [2]) [1]: python/cpython#59648 [2]: #29461 Signed-off-by: Antonio Ossa Guerra <[email protected]>
As described in the previous commit, the new expected behavior when handling `datetime.datetime` and `Timestamp` objects changed, by applying the passed format (`fmt`) instead of raising `ValueError` Signed-off-by: Antonio Ossa Guerra <[email protected]>
datetime.datetime
objects properly in pd.to_datetime
datetime.datetime
and pd.Timestamp
objects in pd.to_datetime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for working on this
I've just left some comments
The `PyDateTime_Check` function is used in other parts of the source to properly check if an object is an instance of type `PyDateTime_DateTimeType` or a subtype of the same type. Signed-off-by: Antonio Ossa Guerra <[email protected]>
Parametrized constructors add no information that is relevant in the modified test, so they can be safely removed Signed-off-by: Antonio Ossa Guerra <[email protected]>
When using ISO8601 the format is inferred, so no format has to be explicitly defined. This test case includes input strings in ISO8601 to assert the behavior when the input includes `Timestamp` and `datetime.datetime` objects Signed-off-by: Antonio Ossa Guerra <[email protected]>
Improve description by explaining the change of behavior in the result Signed-off-by: Antonio Ossa Guerra <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating!
I think the handling of timezones might not be quite right:
In [2]: ts = Timestamp('1950-01-01').tz_localize('Europe/London')
In [3]: to_datetime([ts])
Out[3]: DatetimeIndex(['1950-01-01 00:00:00+00:00'], dtype='datetime64[ns, Europe/London]', freq=None)
In [4]: to_datetime([ts], format='%m-%d-%Y')
Out[4]: DatetimeIndex(['1950-01-01'], dtype='datetime64[ns]', freq=None)
To avoid introducing an unexpected behavior, we'll skip objects that are already `pd.Timestamp` or `datetime.datetime` Signed-off-by: Antonio Ossa Guerra <[email protected]>
Simplify the tests further by testing specific cases that are actually relevant to the introduced behavior. The first test includes cases for both custom and ISO8601, while the second tests timezone-aware objects while forcing UTC and while leaving the offsets alone Signed-off-by: Antonio Ossa Guerra <[email protected]>
`pd.to_datetime` now simply skips `datetime.datetime` and `Timestamp` objects when the `format` argument is present, instead of trying to use the same format in the output Signed-off-by: Antonio Ossa Guerra <[email protected]>
When using the `format` argument on `pd.to_timestamp`, and the input contains localized `pd.Timestamp` or `datetime.datetime` objects, the result should be localized too. A bug was triggered when `format` did not contain `%z` or `%Z`, because the `_array_strptime_with_fallback` function did not expect any type different than string. Now this assumption is no longer valid, since we now let `pd.Timestamp` and `datetime.datetime` objects pass unmodified. The fix is to change the condition that checks if returning a localized output is necessary: instead of looking at the passed `format`, return a localized result when any input object contains a timezone. Also, a new test is added to check this case. A case is included for a `pd.Timestamp` and another for a `datetime.datetime` object, and each compares the actual result with the expected result when the input is localized and when is not. Signed-off-by: Antonio Ossa Guerra <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for updating
It looks like now, mixing timezone-aware and timezone-naive is allowed if format
is specified? Is this right? Shouldn't the behaviour be consistent with when format
is not passed?
In [1]: ts = Timestamp('1950-01-01').tz_localize('Europe/London')
In [2]: ts1 = Timestamp('1950-01-01')
In [3]: to_datetime([ts, ts1])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In [3], line 1
----> 1 to_datetime([ts, ts1])
File ~/pandas-dev/pandas/core/tools/datetimes.py:1127, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
1125 result = _convert_and_box_cache(argc, cache_array)
1126 else:
-> 1127 result = convert_listlike(argc, format)
1128 else:
1129 result = convert_listlike(np.array([arg]), format)[0]
File ~/pandas-dev/pandas/core/tools/datetimes.py:440, in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
438 assert format is None or infer_datetime_format
439 utc = tz == "utc"
--> 440 result, tz_parsed = objects_to_datetime64ns(
441 arg,
442 dayfirst=dayfirst,
443 yearfirst=yearfirst,
444 utc=utc,
445 errors=errors,
446 require_iso8601=require_iso8601,
447 allow_object=True,
448 )
450 if tz_parsed is not None:
451 # We can take a shortcut since the datetime64 numpy array
452 # is in UTC
453 dta = DatetimeArray(result, dtype=tz_to_dtype(tz_parsed))
File ~/pandas-dev/pandas/core/arrays/datetimes.py:2204, in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, allow_mixed)
2202 order: Literal["F", "C"] = "F" if flags.f_contiguous else "C"
2203 try:
-> 2204 result, tz_parsed = tslib.array_to_datetime(
2205 data.ravel("K"),
2206 errors=errors,
2207 utc=utc,
2208 dayfirst=dayfirst,
2209 yearfirst=yearfirst,
2210 require_iso8601=require_iso8601,
2211 allow_mixed=allow_mixed,
2212 )
2213 result = result.reshape(data.shape, order=order)
2214 except OverflowError as err:
2215 # Exception is raised when a part of date is greater than 32 bit signed int
File ~/pandas-dev/pandas/_libs/tslib.pyx:444, in pandas._libs.tslib.array_to_datetime()
442 @cython.wraparound(False)
443 @cython.boundscheck(False)
--> 444 cpdef array_to_datetime(
445 ndarray[object] values,
446 str errors='raise',
File ~/pandas-dev/pandas/_libs/tslib.pyx:553, in pandas._libs.tslib.array_to_datetime()
551 found_naive = True
552 if found_tz and not utc_convert:
--> 553 raise ValueError('Cannot mix tz-aware with '
554 'tz-naive values')
555 if isinstance(val, _Timestamp):
ValueError: Cannot mix tz-aware with tz-naive values
In [4]: to_datetime([ts, ts1], format='%m-%d-%Y')
Out[4]: Index([1950-01-01 00:00:00+00:00, 1950-01-01 00:00:00], dtype='object')
Just like `to_datetime` when `format` is not passed, we should raise a `ValueError` if the input contains some tz-aware objects and some naive objects. The change is to identify if the `format` expects tz-aware objects (check for `"%z"` and `"%Z"`) and then compare this expectation with the `tzinfo` of each object. This modification includes relevant tests. Signed-off-by: Antonio Ossa Guerra <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sticking with this!
I'm still getting errors if we have timezone-aware Timestamps:
In [9]: In [1]: ts = Timestamp('1950-01-01').tz_localize('Europe/London')
...:
...: In [2]: ts1 = Timestamp('1950-01-01').tz_localize('Europe/London')
In [10]: to_datetime([ts, ts1], format='%d-%m-%Y')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In [10], line 1
----> 1 to_datetime([ts, ts1], format='%d-%m-%Y')
File ~/pandas-dev/pandas/core/tools/datetimes.py:1127, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
1125 result = _convert_and_box_cache(argc, cache_array)
1126 else:
-> 1127 result = convert_listlike(argc, format)
1128 else:
1129 result = convert_listlike(np.array([arg]), format)[0]
File ~/pandas-dev/pandas/core/tools/datetimes.py:432, in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
429 format = None
431 if format is not None:
--> 432 res = _to_datetime_with_format(
433 arg, orig_arg, name, tz, format, exact, errors, infer_datetime_format
434 )
435 if res is not None:
436 return res
File ~/pandas-dev/pandas/core/tools/datetimes.py:540, in _to_datetime_with_format(arg, orig_arg, name, tz, fmt, exact, errors, infer_datetime_format)
537 return _box_as_indexlike(result, utc=utc, name=name)
539 # fallback
--> 540 res = _array_strptime_with_fallback(
541 arg, name, tz, fmt, exact, errors, infer_datetime_format
542 )
543 return res
File ~/pandas-dev/pandas/core/tools/datetimes.py:475, in _array_strptime_with_fallback(arg, name, tz, fmt, exact, errors, infer_datetime_format)
472 utc = tz == "utc"
474 try:
--> 475 result, timezones = array_strptime(arg, fmt, exact=exact, errors=errors)
476 except OutOfBoundsDatetime:
477 if errors == "raise":
File ~/pandas-dev/pandas/_libs/tslibs/strptime.pyx:151, in pandas._libs.tslibs.strptime.array_strptime()
149 raise ValueError("Cannot mix tz-aware with tz-naive values")
150 elif val.tzinfo is not None and not expect_tz_aware:
--> 151 raise ValueError("Cannot mix tz-aware with tz-naive values")
152 result_timezone[i] = val.tzinfo
153 continue
ValueError: Cannot mix tz-aware with tz-naive values
In this case, they're both tz-aware
, so is it right that the error message refers to tz-naive
?
if val.tzinfo is None and expect_tz_aware: | ||
raise ValueError("Cannot mix tz-aware with tz-naive values") | ||
elif val.tzinfo is not None and not expect_tz_aware: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it work to simplify this into
if val.tzinfo is None and expect_tz_aware: | |
raise ValueError("Cannot mix tz-aware with tz-naive values") | |
elif val.tzinfo is not None and not expect_tz_aware: | |
if val.tzinfo is not None ^ expect_tz_aware: |
?
To be clear, should this case raise a ValueError or not? Is the message the problem? What's the expected behavior in this case? Relatedly, what should happen if there are no string to format? That argument would not have any effect, is a warning necessary/appropriate? |
If they all have the same offset, then something like: In [3]: to_datetime(['01-01-2000 00:00:00 +01:00', '01-02-2000 00:00:00 +01:00'], format='%d-%m-%Y %H:%M:%S %z')
Out[3]: DatetimeIndex(['2000-01-01 00:00:00+01:00', '2000-02-01 00:00:00+01:00'], dtype='datetime64[ns, pytz.FixedOffset(60)]', freq=None) If they have different offsets:
🤔 that's a good one - tempted to say that we should just let the argument not have any effect, wouldn't bother with a warning EDIT: I wasn't quite correct here. If they have different offsets, then it should be allowed if |
I'd suggest looking at Line 442 in f3c46cd
and perhaps making a function to factor out from there, because in the ISO8601 case then it seems that datetime / Timestamp objects are handled fine: In [10]: to_datetime(['2000-01-01', ts], format='%Y-%m-%d')
Out[10]: DatetimeIndex(['2000-01-01', '1950-01-01'], dtype='datetime64[ns]', freq=None)
In [11]: to_datetime(['2000-01-01', ts], format='%Y-%d-%m')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In [11], line 1
----> 1 to_datetime(['2000-01-01', ts], format='%Y-%d-%m')
File ~/tmp/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:1100, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
1098 result = _convert_and_box_cache(argc, cache_array)
1099 else:
-> 1100 result = convert_listlike(argc, format)
1101 else:
1102 result = convert_listlike(np.array([arg]), format)[0]
File ~/tmp/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:430, in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
427 format = None
429 if format is not None:
--> 430 res = _to_datetime_with_format(
431 arg, orig_arg, name, tz, format, exact, errors, infer_datetime_format
432 )
433 if res is not None:
434 return res
File ~/tmp/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:538, in _to_datetime_with_format(arg, orig_arg, name, tz, fmt, exact, errors, infer_datetime_format)
535 return _box_as_indexlike(result, utc=utc, name=name)
537 # fallback
--> 538 res = _array_strptime_with_fallback(
539 arg, name, tz, fmt, exact, errors, infer_datetime_format
540 )
541 return res
File ~/tmp/.venv/lib/python3.10/site-packages/pandas/core/tools/datetimes.py:473, in _array_strptime_with_fallback(arg, name, tz, fmt, exact, errors, infer_datetime_format)
470 utc = tz == "utc"
472 try:
--> 473 result, timezones = array_strptime(arg, fmt, exact=exact, errors=errors)
474 except OutOfBoundsDatetime:
475 if errors == "raise":
File ~/tmp/.venv/lib/python3.10/site-packages/pandas/_libs/tslibs/strptime.pyx:156, in pandas._libs.tslibs.strptime.array_strptime()
ValueError: unconverted data remains: 00:00:00 |
Closing in favor of #49893 |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Description of changes
When passing
datetime.datetime
orpd.Timestamp
objects topd.to_datetime
, while using theformat
argument, aValueError
is raised because those object does not match the expected format.The implemented solution looks for these non-string objects in the
array_strptime
method, and then each object is processed to maintain its contents (which might differ from the expectedformat
for strings), and then continuing processing as usual.