
BUG: to_datetime with decimal number doesn't fail for %Y%m%d #50051


Closed
3 tasks done
MarcoGorelli opened this issue Dec 4, 2022 · 4 comments · Fixed by #50242
Labels
Bug Datetime Datetime data dtype
Milestone

Comments

@MarcoGorelli
Member

MarcoGorelli commented Dec 4, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from pandas import to_datetime
to_datetime(['19801212.0'], format='%Y%m%d', exact=True)

Issue Description

This should fail: there is no interpretation of the format under which 19801212.0 matches %Y%m%d.

Expected Behavior

Probably something like

ValueError: unconverted data remains: .0

but it certainly should not pass without an error.
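
For comparison, here is a minimal sketch of how the standard library's datetime.strptime treats the same string. This is only an illustration of the expected behavior using the stdlib, not the pandas code path; the variable name `message` is chosen for this example.

```python
from datetime import datetime

# The stdlib parser consumes "19801212" and then finds leftover text ".0",
# so it refuses the input instead of silently dropping the suffix.
try:
    datetime.strptime("19801212.0", "%Y%m%d")
except ValueError as err:
    message = str(err)
else:
    message = None

print(message)  # unconverted data remains: .0

# Without the decimal suffix the same format parses cleanly.
print(datetime.strptime("19801212", "%Y%m%d"))
```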

Installed Versions

INSTALLED VERSIONS

commit : 732ad9f281bf6efffa12b8a872df98e18c11e3b7
python : 3.8.15.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.102.1-microsoft-standard-WSL2
Version : #1 SMP Wed Mar 2 00:30:59 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.0.0.dev0+838.g732ad9f281.dirty
numpy : 1.23.5
pytz : 2022.6
dateutil : 2.8.2
setuptools : 65.5.1
pip : 22.3.1
Cython : 0.29.32
pytest : 7.2.0
hypothesis : 6.59.0
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.9.1
html5lib : 1.1
pymysql : 1.0.2
psycopg2 : 2.9.3
jinja2 : 3.0.3
IPython : 8.7.0
pandas_datareader: 0.10.0
bs4 : 4.11.1
bottleneck : 1.3.5
brotli :
fastparquet : 2022.11.0
fsspec : 2021.11.0
gcsfs : 2021.11.0
matplotlib : 3.6.2
numba : 0.56.4
numexpr : 2.8.3
odfpy : None
openpyxl : 3.0.10
pandas_gbq : 0.17.9
pyarrow : 9.0.0
pyreadstat : 1.2.0
pyxlsb : 1.0.10
s3fs : 2021.11.0
scipy : 1.9.3
snappy :
sqlalchemy : 1.4.44
tables : 3.7.0
tabulate : 0.9.0
xarray : 2022.11.0
xlrd : 2.0.1
zstandard : 0.19.0
tzdata : None
qtpy : None
pyqt5 : None
None

MarcoGorelli added the Bug, Needs Triage, and Datetime labels and removed the Needs Triage label on Dec 4, 2022
MarcoGorelli changed the title from "BUG: to_datetime with decimal number doesn't fail for ISO formats" to "BUG: to_datetime with decimal number doesn't fail for %Y%m%d" on Dec 4, 2022
@MarcoGorelli
Member Author

We don't do this for any other format, so why are we special-casing %Y%m%d?

In [1]: print(to_datetime(['198012.0'], format='%Y%m'))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 1
----> 1 print(to_datetime(['198012.0'], format='%Y%m'))

File ~/pandas-dev/pandas/core/tools/datetimes.py:1115, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
   1113         result = _convert_and_box_cache(argc, cache_array)
   1114     else:
-> 1115         result = convert_listlike(argc, format)
   1116 else:
   1117     result = convert_listlike(np.array([arg]), format)[0]

File ~/pandas-dev/pandas/core/tools/datetimes.py:438, in _convert_listlike_datetimes(arg, format, name, utc, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
    435         require_iso8601 = not infer_datetime_format
    437 if format is not None and not require_iso8601:
--> 438     res = _to_datetime_with_format(
    439         arg, orig_arg, name, utc, format, exact, errors, infer_datetime_format
    440     )
    441     if res is not None:
    442         return res

File ~/pandas-dev/pandas/core/tools/datetimes.py:544, in _to_datetime_with_format(arg, orig_arg, name, utc, fmt, exact, errors, infer_datetime_format)
    541         return _box_as_indexlike(result, utc=utc, name=name)
    543 # fallback
--> 544 res = _array_strptime_with_fallback(
    545     arg, name, utc, fmt, exact, errors, infer_datetime_format
    546 )
    547 return res

File ~/pandas-dev/pandas/core/tools/datetimes.py:478, in _array_strptime_with_fallback(arg, name, utc, fmt, exact, errors, infer_datetime_format)
    474 """
    475 Call array_strptime, with fallback behavior depending on 'errors'.
    476 """
    477 try:
--> 478     result, timezones = array_strptime(
    479         arg, fmt, exact=exact, errors=errors, utc=utc
    480     )
    481 except OutOfBoundsDatetime:
    482     if errors == "raise":

File ~/pandas-dev/pandas/_libs/tslibs/strptime.pyx:185, in pandas._libs.tslibs.strptime.array_strptime()
    183             iresult[i] = NPY_NAT
    184             continue
--> 185         raise ValueError(f"unconverted data remains: {val[found.end():]}")
    186 
    187 # search

ValueError: unconverted data remains: .0

@MarcoGorelli
Member Author

Pretty sure I disagree with this test:

def test_to_datetime_format_YYYYMMDD_with_nat(self, cache):
    ser = Series([19801222, 19801222] + [19810105] * 5)
    # with NaT
    expected = Series(
        [Timestamp("19801222"), Timestamp("19801222")] + [Timestamp("19810105")] * 5
    )
    expected[2] = np.nan
    ser[2] = np.nan
    result = to_datetime(ser, format="%Y%m%d", cache=cache)
    tm.assert_series_equal(result, expected)
    # string with NaT
    ser2 = ser.apply(str)
    ser2[2] = "nat"
    result = to_datetime(ser2, format="%Y%m%d", cache=cache)
    tm.assert_series_equal(result, expected)

These should raise: the inputs don't match the given format.

I don't think the %Y%m%d special path serves any purpose; we should just remove it.
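
To see why the inputs in the quoted test don't match the format, here is a rough stdlib-only sketch. It assumes that assigning np.nan into the integer Series upcasts it to float64, so ser.apply(str) produces strings with a trailing ".0". The list `values` is a stand-in for that Series and `matches_format` is a hypothetical helper for this example, not a pandas function.

```python
from datetime import datetime

# Stand-in for the Series after `ser[2] = np.nan` (assumed to hold floats).
values = [19801222.0, 19801222.0, float("nan"), 19810105.0]

# Mirrors `ser.apply(str)`: each float is rendered with a trailing ".0".
as_text = [str(v) for v in values]
print(as_text[0])  # 19801222.0


def matches_format(text, fmt="%Y%m%d"):
    """Hypothetical helper: True if stdlib strptime accepts text for fmt."""
    try:
        datetime.strptime(text, fmt)
    except ValueError:
        return False
    return True


print(matches_format(as_text[0]))  # False
print(matches_format("19801222"))  # True
```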

@MarcoGorelli
Member Author

This doesn't look right:

ser = Series([20121231, 20141231, 99991231])
result = to_datetime(ser, format="%Y%m%d", errors="ignore", cache=cache)
expected = Series(
    [datetime(2012, 12, 31), datetime(2014, 12, 31), datetime(9999, 12, 31)],
    dtype=object,
)
tm.assert_series_equal(result, expected)

If the input is invalid, errors='ignore' should return the input unchanged.
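
A minimal sketch of that contract, using only the standard library. It assumes the invalid element is 99991231, on the grounds that 9999-12-31 does not fit in int64 nanoseconds since the epoch (the nanosecond-precision representation). Both `fits_in_ns` and `parse_or_return_input` are hypothetical names for illustration, not pandas APIs.

```python
from datetime import datetime

INT64_MAX = 2**63 - 1
EPOCH = datetime(1970, 1, 1)


def fits_in_ns(dt):
    """True if dt is representable as int64 nanoseconds since the epoch."""
    delta = dt - EPOCH
    ns = (delta.days * 86_400 + delta.seconds) * 10**9 + delta.microseconds * 1_000
    return -INT64_MAX <= ns <= INT64_MAX


def parse_or_return_input(values, fmt="%Y%m%d"):
    """Hypothetical helper: parse everything, or hand back the input untouched."""
    parsed = []
    for value in values:
        try:
            dt = datetime.strptime(str(value), fmt)
        except ValueError:
            return values  # format mismatch: return the input as-is
        if not fits_in_ns(dt):
            return values  # out of bounds: return the input as-is
        parsed.append(dt)
    return parsed


data = [20121231, 20141231, 99991231]
print(parse_or_return_input(data))        # [20121231, 20141231, 99991231]
print(parse_or_return_input([20121231]))  # [datetime.datetime(2012, 12, 31, 0, 0)]
```

Under this reading, a single invalid element means the whole input comes back unmodified, rather than a partially converted object Series.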

@jorisvandenbossche
Member

Resolved by #50242

Projects
None yet
2 participants