BUG: pd.to_datetime does not treat YYYYMMDD and YYYY/MM/DD in the same way as strptime #48440

Closed
3 tasks done
rhjmoore opened this issue Sep 7, 2022 · 6 comments
Labels
Bug Datetime Datetime data dtype Duplicate Report Duplicate issue or pull request

rhjmoore commented Sep 7, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

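import pandas as pd
import datetime

# One correctly formatted date and one missing the '/' separators
dates = ['2023/01/23','20230123']

# Reference behaviour: parse each string with the standard library
strptime_dates = []
for d in dates:
    try:
        dd = datetime.datetime.strptime(d, '%Y/%m/%d')
    except ValueError:
        dd = 'Bad Format'
    strptime_dates.append(dd)

# pandas parses both entries; strptime rejects the second one
print(pd.to_datetime(dates, errors='coerce', format='%Y/%m/%d'))
print(strptime_dates)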

Issue Description

The documentation on this function references the formatting, and thus behaviour, of the standard strftime and strptime functions.
https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html

By and large this is correct. However, in the above example strptime correctly rejects an improperly formatted date (in this case, one missing the forward slashes between the date components), whilst pd.to_datetime silently accepts it, even though exact=True is the default.

Pandas thus makes it impossible for the data coder to discover this error in the data without resorting to df.apply type functionality.
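
For anyone hitting this in the meantime, a per-element check with the standard library is one possible workaround. This is only a minimal sketch: `strict_parse` is a hypothetical helper name, not part of pandas, and it assumes the inputs are plain strings.

```python
from datetime import datetime


def strict_parse(values, fmt):
    """Parse each string with datetime.strptime; None marks a mismatch.

    Hypothetical helper for illustration, not a pandas API.
    """
    parsed = []
    for value in values:
        try:
            parsed.append(datetime.strptime(value, fmt))
        except ValueError:
            # strptime raises ValueError when the string does not match fmt
            parsed.append(None)
    return parsed


dates = ["2023/01/23", "20230123"]
result = strict_parse(dates, "%Y/%m/%d")

# Collect the raw strings that failed to match the format
bad = [d for d, p in zip(dates, result) if p is None]
print(bad)
```

This loops in Python, so it will be slow on large columns, but it flags exactly the entries that strptime rejects.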

Expected Behavior

pd.to_datetime should mirror the behaviour of datetime.datetime.strptime.

Installed Versions

INSTALLED VERSIONS

commit : 66e3805
python : 3.9.9.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Wed Aug 10 14:28:23 PDT 2022; root:xnu-8020.141.5~2/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.3.5
numpy : 1.20.3
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 59.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.0.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2022.01.0
fastparquet : None
gcsfs : None
matplotlib : 3.5.1
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 6.0.1
pyxlsb : None
s3fs : None
scipy : 1.7.3
sqlalchemy : None
tables : None
tabulate : None
xarray : 0.20.2
xlrd : 2.0.1
xlwt : None
numba : 0.54.1

@rhjmoore rhjmoore added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 7, 2022
@mroeschke
Member

Thanks for the report. Agreed: with exact=True, the format '%Y/%m/%d' should not be able to parse '20230123'.

Internally, it appears '%Y/%m/%d' is deemed an "iso8601" format, so a custom require_iso8601 code path is used instead of the provided format.

@mroeschke mroeschke added Datetime Datetime data dtype and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 7, 2022
@andrewchen1216
Contributor

take

@MarcoGorelli
Member

is this the same as #12649? If so, it should be addressed by #49333

@MarcoGorelli MarcoGorelli added the Duplicate Report Duplicate issue or pull request label Nov 6, 2022
@MarcoGorelli
Member

closing as I'm pretty sure it's a duplicate, but please do let me know if I've misunderstood and I'll reopen

@rhjmoore
Author

rhjmoore commented Nov 7, 2022

Looks likely; interesting that the other (quite old) bug didn't come up in the search, but glad that it's being done. Is it possible to verify the failure case at the top of this report against the fix patch you link to?

@MarcoGorelli
Member

MarcoGorelli commented Nov 7, 2022

yeah there's some tests in https://github.com/pandas-dev/pandas/pull/49333/files#diff-388d9e4dc158bf81c94ed5df7ac7027cde97d599db685376f7988ed33bdba9b7 which check this kind of thing:

    @pytest.mark.parametrize(
        "input, format",
        [
            ("2020-01", "%Y/%m"),
            ("2020-01-01", "%Y/%m/%d"),
            ("2020-01-01 00", "%Y/%m/%dT%H"),
            ("2020-01-01T00", "%Y/%m/%d %H"),
            ("2020-01-01 00:00", "%Y/%m/%dT%H:%M"),
            ("2020-01-01T00:00", "%Y/%m/%d %H:%M"),
            ("2020-01-01 00:00:00", "%Y/%m/%dT%H:%M:%S"),
            ("2020-01-01T00:00:00", "%Y/%m/%d %H:%M:%S"),
        ],
    )
    def test_to_datetime_iso8601_separator(self, input, format):
        # https://github.com/pandas-dev/pandas/issues/12649
        with pytest.raises(
            ValueError,
            match=(
                rf"time data \"{input}\" at position 0 doesn\'t match format "
                rf"\"{format}\""
            ),
        ):
            to_datetime(input, format=format)

should be fixed by the time the next non-patch release comes out (2.0)
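
To check the original failure case from this report directly, something like the following could be run. It is a sketch that assumes a pandas version containing the fix (2.0 or later, per the comment above) is installed; on older versions the second entry is expected to parse instead of being flagged.

```python
import pandas as pd

dates = ["2023/01/23", "20230123"]

# With errors="coerce", entries that do not match the format become NaT
parsed = pd.to_datetime(dates, errors="coerce", format="%Y/%m/%d")

# List the raw strings that were coerced to NaT
mismatched = [d for d, is_bad in zip(dates, parsed.isna()) if is_bad]
print(pd.__version__, mismatched)
```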
