to_datetime() with errors=coerce and without return different values #25143
Comments
Agreed this looks strange. Investigation and PRs welcome! |
So in |
I will have a look at this. |
Perhaps this is a bit of useful information. In pandas 0.23.4 the functionality looks to be correct:
|
I think we need some discussion on what the correct behavior is in some of the corner cases, for example the case below (on version 0.24.0). Since errors is 'raise', I think the second command should raise an error. Currently, we retry parsing on error if 'infer_datetime_format' is True. And reading the code here ( pandas/pandas/core/tools/datetimes.py Line 296 in 4a20d5b
I think that if errors = 'raise', then any time we encounter an error (even with infer_datetime_format) we should raise and stop.
|
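The "raise and stop" proposal above can be sketched with the standard library alone. This is only an illustration of the intended semantics, not pandas code: `parse_all` is a hypothetical helper, `None` stands in for NaT, and `datetime.strptime` stands in for the pandas parser.

```python
from datetime import datetime


def parse_all(values, fmt, errors="raise"):
    """Parse every string in `values` with one fixed format.

    Hypothetical helper, not part of pandas. With errors="raise" the first
    failure propagates immediately; with errors="coerce" a failure becomes
    None (standing in for NaT). There is no retry with a different parser.
    """
    if errors not in ("raise", "coerce"):
        raise ValueError(f"unsupported errors value: {errors!r}")
    parsed = []
    for value in values:
        try:
            parsed.append(datetime.strptime(value, fmt))
        except ValueError:
            if errors == "raise":
                raise
            parsed.append(None)
    return parsed


dates = ["2016-05-19T10:27:05", "20/05/2016 11:28:06"]
fmt = "%Y-%m-%dT%H:%M:%S"

# coerce: the mismatching second element becomes None
coerced = parse_all(dates, fmt, errors="coerce")
print(coerced)

# raise: the same element aborts the whole call
try:
    parse_all(dates, fmt, errors="raise")
except ValueError as exc:
    print("raised:", exc)
```

Under this sketch the two modes can only differ on elements that fail to parse, which is the consistency being asked for.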
@mroeschke : any thoughts on this. |
So the core issue is that our format guessing function is unable to guess %z, UTC offsets. pandas/pandas/_libs/tslibs/parsing.pyx Line 635 in 2909b83
I agree with your suggestion though; EDIT: Actually this would fail independently of timezones. Since |
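To show why a guessed format without %z cannot handle strings that carry a UTC offset, here is a small standard-library sketch. It uses `datetime.strptime` as a stand-in and makes no claim about how the pandas format-guessing function works internally; the sample timestamp is made up for illustration.

```python
from datetime import datetime, timedelta

stamp = "2016-05-19T10:27:05+0100"  # illustrative value with a UTC offset

# With %z the offset is consumed and the result is timezone-aware.
aware = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S%z")
print(aware.utcoffset())

# Without %z the trailing "+0100" is left unconverted and strptime raises.
try:
    datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
    failed = False
except ValueError as exc:
    failed = True
    print("failed:", exc)
```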
I have a similar issue on 0.24.1; it also seems to be related to different time zones in the same query:
dates = ['2016-05-19T10:27:05', '20/05/2016 11:28:06', '']
print(pd.to_datetime(dates, errors='raise', infer_datetime_format=True, box=False))
print(pd.to_datetime(dates, errors='coerce', infer_datetime_format=True, box=False))
returns
but adding a
dates = ['2016-05-19T10:27:05Z', '20/05/2016 11:28:06', '']
print(pd.to_datetime(dates, errors='raise', infer_datetime_format=True, box=False))
print(pd.to_datetime(dates, errors='coerce', infer_datetime_format=True, box=False))
returns
Output of
|
same issue in 1.0.4
|
Still reproduced in 1.2.4:
>>pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=True, errors="coerce")
DatetimeIndex(['2021-05-01', 'NaT'], dtype='datetime64[ns]', freq=None)
>>pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=True, errors="raise")
DatetimeIndex(['2021-05-01', '2021-09-01'], dtype='datetime64[ns]', freq=None)
>>pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=False, errors="coerce")
DatetimeIndex(['2021-05-01', '2021-09-01'], dtype='datetime64[ns]', freq=None)
The output depends on the combination of the infer_datetime_format and errors arguments.
Output of
|
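One way to sidestep the dependence on format inference in the report above is to pass an explicit `format`, so that nothing has to be guessed. The sketch below assumes an English locale for the `%b` month abbreviations and only checks that `errors="raise"` and `errors="coerce"` agree for these two strings; it is a possible workaround, not a fix for the underlying bug.

```python
import pandas as pd

dates = ["01-May-2021 00:00:00", "01-Sep-2021 00:00:00"]
fmt = "%d-%b-%Y %H:%M:%S"  # explicit format, so no inference is involved

strict = pd.to_datetime(dates, format=fmt, errors="raise")
coerced = pd.to_datetime(dates, format=fmt, errors="coerce")

# Both calls should parse both elements, with no NaT in either result.
print(strict.equals(coerced))
print(list(coerced.strftime("%Y-%m-%d")))
```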
@evgeniikozlov hence the open issue label. PRs from the community are how these things get fixed |
As of PDEP4, I've tried all the examples here, and they're all now behaving as expected. For example:
In [26]: dates = ['2016-05-19T10:27:05Z', '20/05/2016 11:28:06', '']
    ...: pd.to_datetime(dates)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[26], line 2
1 dates = ['2016-05-19T10:27:05Z', '20/05/2016 11:28:06', '']
----> 2 pd.to_datetime(dates)
File ~/pandas-dev/pandas/core/tools/datetimes.py:1098, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
1096 result = _convert_and_box_cache(argc, cache_array)
1097 else:
-> 1098 result = convert_listlike(argc, format)
1099 else:
1100 result = convert_listlike(np.array([arg]), format)[0]
File ~/pandas-dev/pandas/core/tools/datetimes.py:452, in _convert_listlike_datetimes(arg, format, name, utc, unit, errors, dayfirst, yearfirst, exact)
441 if format is not None and not require_iso8601:
442 return _to_datetime_with_format(
443 arg,
444 orig_arg,
(...)
449 errors,
450 )
--> 452 result, tz_parsed = objects_to_datetime64ns(
453 arg,
454 dayfirst=dayfirst,
455 yearfirst=yearfirst,
456 utc=utc,
457 errors=errors,
458 require_iso8601=require_iso8601,
459 allow_object=True,
460 format=format,
461 exact=exact,
462 )
464 if tz_parsed is not None:
465 # We can take a shortcut since the datetime64 numpy array
466 # is in UTC
467 dta = DatetimeArray(result, dtype=tz_to_dtype(tz_parsed))
File ~/pandas-dev/pandas/core/arrays/datetimes.py:2162, in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, format, exact)
2160 order: Literal["F", "C"] = "F" if flags.f_contiguous else "C"
2161 try:
-> 2162 result, tz_parsed = tslib.array_to_datetime(
2163 data.ravel("K"),
2164 errors=errors,
2165 utc=utc,
2166 dayfirst=dayfirst,
2167 yearfirst=yearfirst,
2168 require_iso8601=require_iso8601,
2169 format=format,
2170 exact=exact,
2171 )
2172 result = result.reshape(data.shape, order=order)
2173 except OverflowError as err:
2174 # Exception is raised when a part of date is greater than 32 bit signed int
File ~/pandas-dev/pandas/_libs/tslib.pyx:453, in pandas._libs.tslib.array_to_datetime()
451 @cython.wraparound(False)
452 @cython.boundscheck(False)
--> 453 cpdef array_to_datetime(
454 ndarray[object] values,
455 str errors="raise",
File ~/pandas-dev/pandas/_libs/tslib.pyx:614, in pandas._libs.tslib.array_to_datetime()
612 continue
613 elif is_raise:
--> 614 raise ValueError(
615 f"time data \"{val}\" at position {i} doesn't "
616 f"match format \"{format}\""
ValueError: time data "20/05/2016 11:28:06" at position 1 doesn't match format "%Y-%m-%dT%H:%M:%S%z"
It's correct to raise here, as the second element doesn't match the format inferred from the first element. In the
In [28]: pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], format='%d-%b-%Y %H:%M:%S')
Out[28]: DatetimeIndex(['2021-05-01', '2021-09-01'], dtype='datetime64[ns]', freq=None)
Closing for now then, but thanks for the report, and please do let me know if I've misunderstood anything here |
@MarcoGorelli thanks for the explanation about "May" parsing, but it is still unclear why it works if |
That was a bug, and was fixed as part of the PDEP4 change. On the main branch, you'd get an error:
This behaviour will be available in pandas 2.0.0, which'll hopefully come out around February |
Code Sample, a copy-pastable example if possible
Without errors='coerce'
With errors='coerce'
Problem description
to_datetime() behaves differently with errors='coerce' than without it. If I understand some of the other issues raised on this topic correctly, the behavior differs by design in some cases. In this case, the dates are very similar, although in different formats.
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.1
pytest: None
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 3.8.0
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.8.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None