Skip to content

BUG: Do not fail when parsing pydatetime objects in pd.to_datetime #49893

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Dec 1, 2022

Conversation

MarcoGorelli
Copy link
Member

@MarcoGorelli MarcoGorelli commented Nov 24, 2022

Granted, there's already a PR open to address this (#49338), it may be stale. As this is a blocker for PDEP0004, I've taken it forwards, but have taken care to include @aaossa as commit author.
I've fixed it up a bit to handle the objections I had and have re-written the tests

aaossa and others added 4 commits November 24, 2022 12:38
author Antonio Ossa Guerra <[email protected]> 1666800993 -0300
committer MarcoGorelli <> 1669293453 +0000

Parse `datetime` properly in `pd.to_datetime`

When applying `pd.to_datetime` on array-like structure that contain a
`datetime.datetime` object, while using the `format` argument, a
`ValueError` is raised because the `datetime.datetime` object does not
match the expected format.

The implemented solution looks for `datetime.datetime` instances in the
`array_strptime` method. If an instance of this type is found, it's
properly handled by the new `_parse_python_datetime_object`, which
returns the expected Numpy datetime object.

Signed-off-by: Antonio Ossa Guerra <[email protected]>
@MarcoGorelli MarcoGorelli changed the title Gh 49298 BUG: Do not attempt to parse datetime.datetime objects in pd.to_datetime Nov 24, 2022
@MarcoGorelli MarcoGorelli added the Datetime Datetime data dtype label Nov 24, 2022
@MarcoGorelli MarcoGorelli changed the title BUG: Do not attempt to parse datetime.datetime objects in pd.to_datetime BUG: Do not fail when parsing pydatetime objects in pd.to_datetime Nov 24, 2022
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good couple comments

pyproject.toml Outdated
@@ -92,6 +92,7 @@ disable = [
"unsupported-assignment-operation",
"unsupported-membership-test",
"unused-import",
"use-a-generator",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which line was throwing this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd originally written

if any(tz is not None for tz in timezones):

as a list comprehension. It just seemed like too minor of a thing to bother annoying other contributors with, but no objections to keeping this check enabled

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to see this enabled. We had a different package check for this previously.

The short circuiting behavior described in https://pylint.pycqa.org/en/latest/user_guide/messages/refactor/use-a-generator.html seems like a useful check to guarantee

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, have re-enabled

I might try to get this back into pyupgrade - it was taken out after some similar rewrites caused some performance degradations, but for any I'd imagine it would be a consistent improvement asottile/pyupgrade#704

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks for the context.

# note: ISO8601 formats go down a fastpath, so we need to check both
# a ISO8601 format and a non-ISO8601 one
ts1 = constructor(input[0])
ts2 = input[1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this supposed to be wrapped by the constructor as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for you review! I didn't intend it to be - like this the input contains a mixture of strings and pydatetime objects

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@jbrockmendel
Copy link
Member

havent looked at the tests yet, but the non-test changes look good

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM merge when ready @jbrockmendel

ids=["non-ISO8601 format", "ISO8601 format"],
)
@pytest.mark.parametrize(
"utc, input, expected",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe avoid using input as argument name could help to avoid possible confusion with the input built-in

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good one, thanks!

@aaossa
Copy link
Contributor

aaossa commented Nov 30, 2022

Thanks for keeping this updated @MarcoGorelli , should I close #49338 in favor of this PR?

@MarcoGorelli
Copy link
Member Author

thanks @aaossa ! sure, yes please

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Dec 1, 2022

havent looked at the tests yet, but the non-test changes look good

merging for now then to unblock the PDEP4 PR, if you have comments on the tests I'll address them separately

thanks all for your reviews!

@MarcoGorelli MarcoGorelli added this to the 2.0 milestone Dec 1, 2022
@MarcoGorelli MarcoGorelli merged commit 2804681 into pandas-dev:main Dec 1, 2022
@topper-123
Copy link
Contributor

I seem to haven gotten a lot of failed tests from this PR, especially in pandas/tests/tools/test_to_datetime.pywhere I get 155 failed tests.

The error message are like TypeError: array_strptime() got an unexpected keyword argument 'utc'

Example error report:

[gw2] darwin -- Python 3.9.7 /opt/homebrew/Caskroom/miniconda/base/envs/pandas/bin/python

self = <pandas.tests.tools.test_to_datetime.TestToDatetimeInferFormat object at 0x1404f9070>, cache = False

    def test_to_datetime_infer_datetime_format_series_start_with_nans(self, cache):
        ser = Series(
            np.array(
                [
                    np.nan,
                    np.nan,
                    "01/01/2011 00:00:00",
                    "01/02/2011 00:00:00",
                    "01/03/2011 00:00:00",
                ],
                dtype=object,
            )
        )

        tm.assert_series_equal(
            to_datetime(ser, infer_datetime_format=False, cache=cache),
>           to_datetime(ser, infer_datetime_format=True, cache=cache),
        )

pandas/tests/tools/test_to_datetime.py:2405:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/core/tools/datetimes.py:1083: in to_datetime
    values = convert_listlike(arg._values, format)
pandas/core/tools/datetimes.py:438: in _convert_listlike_datetimes
    res = _to_datetime_with_format(
pandas/core/tools/datetimes.py:544: in _to_datetime_with_format
    res = _array_strptime_with_fallback(
pandas/core/tools/datetimes.py:478: in _array_strptime_with_fallback
    result, timezones = array_strptime(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   'p': 18,  # an additional key, only with I
E   TypeError: array_strptime() got an unexpected keyword argument 'utc'

pandas/_libs/tslibs/strptime.pyx:56: TypeError
___________________________ TestToDatetimeInferFormat.test_infer_datetime_format_tz_name[UTC-0] ____________________________
[gw2] darwin -- Python 3.9.7 /opt/homebrew/Caskroom/miniconda/base/envs/pandas/bin/python

self = <pandas.tests.tools.test_to_datetime.TestToDatetimeInferFormat object at 0x1404f9220>, tz_name = 'UTC', offset = 0

    @pytest.mark.parametrize(
        "tz_name, offset", [("UTC", 0), ("UTC-3", 180), ("UTC+3", -180)]
    )
    def test_infer_datetime_format_tz_name(self, tz_name, offset):
        # GH 33133
        ser = Series([f"2019-02-02 08:07:13 {tz_name}"])
>       result = to_datetime(ser, infer_datetime_format=True)

pandas/tests/tools/test_to_datetime.py:2414:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/core/tools/datetimes.py:1083: in to_datetime
    values = convert_listlike(arg._values, format)
pandas/core/tools/datetimes.py:438: in _convert_listlike_datetimes
    res = _to_datetime_with_format(
pandas/core/tools/datetimes.py:544: in _to_datetime_with_format
    res = _array_strptime_with_fallback(
pandas/core/tools/datetimes.py:478: in _array_strptime_with_fallback
    result, timezones = array_strptime(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   'p': 18,  # an additional key, only with I
E   TypeError: array_strptime() got an unexpected keyword argument 'utc'

pandas/_libs/tslibs/strptime.pyx:56: TypeError

My OS is MacOs 12.5.1 and AFAIKS my packages are up-to-date:

INSTALLED VERSIONS

commit : 2804681
python : 3.9.7.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Wed Aug 10 14:28:35 PDT 2022; root:xnu-8020.141.5~2/RELEASE_ARM64_T8101
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 2.0.0.dev0+805.g2804681c45
numpy : 1.23.4
pytz : 2022.1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 22.2.2
Cython : 0.29.32
pytest : 7.1.2
hypothesis : 6.37.0
sphinx : 5.0.2
blosc : 1.10.6
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.9.1
html5lib : None
pymysql : 1.0.2
psycopg2 : 2.9.3
jinja2 : 3.1.2
IPython : 7.29.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.5
brotli :
fastparquet : 0.8.3
fsspec : 2022.10.0
gcsfs : None
matplotlib : 3.5.3
numba : 0.56.4
numexpr : 2.8.4
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : 1.0.10
s3fs : 2022.10.0
scipy : 1.9.1
snappy : None
sqlalchemy : 1.4.43
tables : 3.7.0
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
zstandard : 0.18.0
tzdata : None
qtpy : None
pyqt5 : None

@MarcoGorelli
Copy link
Member Author

this is fine in CI - does this function in pandas/_libs/tslibs/strptime.pyx have a utc argument? did you re-compile your C extensions? not sure what else it could be

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Dec 4, 2022

ah wait, you're right @topper-123 , the PY39 macos job has started timing out

I'll take a look


EDIT: looks like it's passing in recent runs. So I think it might be unrelated to this PR

@topper-123
Copy link
Contributor

topper-123 commented Dec 4, 2022

Hmm I'll try out various stuff, maybe reinstall the python environment anew.

Thanks for responding so quickly.

@topper-123
Copy link
Contributor

A short follow-up: I got this fixed by compiling the c code using python setup.py build_ext --inplace --force -j 4 instead of the usual python setup.py build_ext -j 4.

So this error was caused by something in my build process and is unrelated to this PR.

@@ -495,7 +504,7 @@ def _array_strptime_with_fallback(
# Indicates to the caller to fallback to objects_to_datetime64ns
return None
else:
if "%Z" in fmt or "%z" in fmt:
if any(tz is not None for tz in timezones):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the difference here for if all the strings are tznaive but we saw a tzaware datetime object?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's right

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Do not fail when parsing pydatetime objects in pd.to_datetime
6 participants