to_datetime() with errors=coerce and without return different values #25143

grleblanc · 2019-02-04T14:52:39Z

Code Sample, a copy-pastable example if possible

Without errors='coerce'

>>> d = { 'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00'] }
>>> df = pd.DataFrame(data=d)
>>> d
{'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00']}
>>> df
                          ts
0     2019-02-02 08:07:13+00
1  2019-02-02 08:03:22.54+00
>>> df.dtypes
ts    object
dtype: object

>>> df['ts'] = pd.to_datetime(df['ts'], infer_datetime_format=True)
>>> df
                                ts
0        2019-02-02 08:07:13+00:00
1 2019-02-02 08:03:22.540000+00:00
>>> df.dtypes
ts    datetime64[ns, UTC]
dtype: object

With errors='coerce'

>>> d = { 'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00'] }
>>> df = pd.DataFrame(data=d)
>>> d
{'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00']}
>>> df
                          ts
0     2019-02-02 08:07:13+00
1  2019-02-02 08:03:22.54+00
>>> df.dtypes
ts    object
dtype: object
>>> df['ts'] = pd.to_datetime(df['ts'], infer_datetime_format=True, errors='coerce')
>>> df
                   ts
0 2019-02-02 08:07:13
1                 NaT

Problem description

The functionality of to_datetime() with errors='coerce' is different than without. If I understand some of the other issues raised on this topic correctly, the functionality is different in some cases by design. In this case, the dates are very similiar, although different format.

Expected Output

>>> d = { 'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00'] }
>>> df = pd.DataFrame(data=d)
>>> d
{'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00']}
>>> df
                          ts
0     2019-02-02 08:07:13+00
1  2019-02-02 08:03:22.54+00
>>> df.dtypes
ts    object
dtype: object

>>> df['ts'] = pd.to_datetime(df['ts'], infer_datetime_format=True, errors='coerce')
>>> df
                                ts
0        2019-02-02 08:07:13+00:00
1 2019-02-02 08:03:22.540000+00:00
>>> df.dtypes
ts    datetime64[ns, UTC]
dtype: object

Output of `pd.show_versions()`

[paste the output of pd.show_versions() here below this line]

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: None
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 3.8.0
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.8.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

The text was updated successfully, but these errors were encountered:

mroeschke · 2019-02-04T22:42:09Z

Agreed this looks strange. Investigation and PR's welcome!

mroeschke · 2019-02-10T01:15:23Z

So in to_datetime, we have a lot of sub functions that also attempts error handling while the the logic of when to return a result is inconsistent as well. I have a feeling that pulling out the error handling logic at more of a top level would help make this keyword return a more consistent result.

saurav2608 · 2019-02-13T17:42:03Z

I will have a look at this.

ngutzmann · 2019-02-13T21:15:12Z

Perhaps this is a bit of useful information. In pandas 0.23.4 the functionality looks to be correct:

>>> import pandas as pd
>>> pd.__version__
'0.23.4'
>>> d = { 'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00'] }
>>> df = pd.DataFrame(data=d)
>>> df
                          ts
0     2019-02-02 08:07:13+00
1  2019-02-02 08:03:22.54+00
>>> df['ts'] = pd.to_datetime(df['ts'], infer_datetime_format=True, errors='coerce')
>>> df
                       ts
0 2019-02-02 08:07:13.000
1 2019-02-02 08:03:22.540
>>> df.dtypes
ts    datetime64[ns]
dtype: object

saurav2608 · 2019-02-17T09:43:12Z

I think we need some discussion on what is the correct behavior in some of the corner cases. For example in the below case (on version 0.24.0). I think as errors is 'raise' the second command should raise an error. Currently, we retry parsing on error if 'infer_datetime_format' is True. And reading the code here(

pandas/pandas/core/tools/datetimes.py

Line 296 in 4a20d5b

if result is None:

) makes me think that this behavior is by design.

I think, if errors = 'raise' and anytime we encounter error (even with infer_datetime_format) we should raise and stop.

In [12]: s = pd.Series(np.array(['01/01/2011 00:00:00', 
    ...:                                 '01-02-2011 00:00:00', 
    ...:                                 '2011-01-03T00:00:00']))                                                                                                                 

In [13]: s                                                                                                                                                                        
Out[13]: 
0    01/01/2011 00:00:00
1    01-02-2011 00:00:00
2    2011-01-03T00:00:00
dtype: object
In [19]: pd.to_datetime(s, infer_datetime_format=True, errors='coerce')                                                                                                           
Out[19]: 
0   2011-01-01
1          NaT
2          NaT
dtype: datetime64[ns]

In [20]: pd.to_datetime(s, infer_datetime_format=True, errors='raise')                                                                                                            
Out[20]: 
0   2011-01-01
1   2011-01-02
2   2011-01-03
dtype: datetime64[ns]

saurav2608 · 2019-02-20T06:01:41Z

I think we need some discussion on what is the correct behavior in some of the corner cases. For example in the below case (on version 0.24.0). I think as errors is 'raise' the second command should raise an error. Currently, we retry parsing on error if 'infer_datetime_format' is True. And reading the code here(

pandas/pandas/core/tools/datetimes.py

Line 296 in 4a20d5b

if result is None:
) makes me think that this behavior is by design.
I think, if errors = 'raise' and anytime we encounter error (even with infer_datetime_format) we should raise and stop.
In [12]: s = pd.Series(np.array(['01/01/2011 00:00:00', 
    ...:                                 '01-02-2011 00:00:00', 
    ...:                                 '2011-01-03T00:00:00']))                                                                                                                 

In [13]: s                                                                                                                                                                        
Out[13]: 
0    01/01/2011 00:00:00
1    01-02-2011 00:00:00
2    2011-01-03T00:00:00
dtype: object
In [19]: pd.to_datetime(s, infer_datetime_format=True, errors='coerce')                                                                                                           
Out[19]: 
0   2011-01-01
1          NaT
2          NaT
dtype: datetime64[ns]

In [20]: pd.to_datetime(s, infer_datetime_format=True, errors='raise')                                                                                                            
Out[20]: 
0   2011-01-01
1   2011-01-02
2   2011-01-03
dtype: datetime64[ns]

@mroeschke : any thoughts on this.

mroeschke · 2019-02-20T06:33:32Z

So the core issue is that our format guessing function is unable to guess %z, UTC offsets.

pandas/pandas/_libs/tslibs/parsing.pyx

Line 635 in 2909b83

def _guess_datetime_format(dt_str, dayfirst=False, dt_str_parse=du_parse,

I agree with your suggestion though; errors='raise' should always raise whenever there's an error.

EDIT: Actually this would fail independently of timezones. Since infer_datetime_format guesses a format based on the first non-null time string, and the 2nd string is the example already is in a different format than the first.

nicolasdaviaud · 2019-02-26T11:26:46Z

I have a similar issue on 0.24.1, it also seems to be related to different time zones in the same query:

dates = ['2016-05-19T10:27:05', '20/05/2016 11:28:06', '']
print(pd.to_datetime(dates, errors='raise', infer_datetime_format=True, box=False))
print(pd.to_datetime(dates, errors='coerce', infer_datetime_format=True, box=False))

returns

['2016-05-19T10:27:05.000000000' '2016-05-20T11:28:06.000000000' 'NaT']
['2016-05-19T10:27:05.000000000' '2016-05-20T11:28:06.000000000' 'NaT']

but adding a Z to the first date changes the behaviour on the second

dates = ['2016-05-19T10:27:05Z', '20/05/2016 11:28:06', '']
print(pd.to_datetime(dates, errors='raise', infer_datetime_format=True, box=False))
print(pd.to_datetime(dates, errors='coerce', infer_datetime_format=True, box=False))

returns

[datetime.datetime(2016, 5, 19, 10, 27, 5, tzinfo=tzutc()) datetime.datetime(2016, 5, 20, 11, 28, 6) 'NaT']
['2016-05-19T10:27:05.000000000' 'NaT' 'NaT']

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 3.7.2.final.0 python-bits: 64 OS: Darwin OS-release: 18.2.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: fr_FR.UTF-8 LOCALE: fr_FR.UTF-8

pandas: 0.24.1
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.16.1
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

ericman93 · 2020-06-09T14:35:34Z

same issue in 1.0.4

args = {
    'cache': True,
    'format': None,
    'infer_datetime_format': True
}
ser = pd.Series(['03-12-2016', '29-07-2017', '07-05-2016'])

pd.to_datetime(ser, errors='raise', **args)

outputs:
0   2016-03-12
1   2017-07-29
2   2016-07-05
dtype: datetime64[ns]

pd.to_datetime(ser, errors='coerce', **args)

outputs:
0   2016-03-12
1          NaT
2   2016-07-05
dtype: datetime64[ns]

evgeniikozlov · 2021-06-11T11:46:35Z

Still reproduced in 1.2.4

>>pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=True, errors="coerce")
DatetimeIndex(['2021-05-01', 'NaT'], dtype='datetime64[ns]', freq=None)

>>pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=True, errors="raise")
DatetimeIndex(['2021-05-01', '2021-09-01'], dtype='datetime64[ns]', freq=None)

>>pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=False, errors="coerce")
DatetimeIndex(['2021-05-01', '2021-09-01'], dtype='datetime64[ns]', freq=None)

The output is dependent on combination of infer_datetime_format and errors arguments.

Output of `pd.show_versions()`

pandas.show_versions()
INSTALLED VERSIONS

commit : 2cb9652
python : 3.7.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.2.4
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.2
setuptools : 52.0.0.post20210125
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

jreback · 2021-06-11T11:48:20Z

@evgeniikozlov hence the open issue label

PRs from the community are how these things get fixed

MarcoGorelli · 2022-12-17T14:40:53Z

As of PDEP4, infer_datetime_format is obsolete

I've tried all the examples here, and they're all now behaving as expected

For example:

In [26]: dates = ['2016-05-19T10:27:05Z', '20/05/2016 11:28:06', '']
    ...: pd.to_datetime(dates)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[26], line 2
      1 dates = ['2016-05-19T10:27:05Z', '20/05/2016 11:28:06', '']
----> 2 pd.to_datetime(dates)

File ~/pandas-dev/pandas/core/tools/datetimes.py:1098, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
   1096         result = _convert_and_box_cache(argc, cache_array)
   1097     else:
-> 1098         result = convert_listlike(argc, format)
   1099 else:
   1100     result = convert_listlike(np.array([arg]), format)[0]

File ~/pandas-dev/pandas/core/tools/datetimes.py:452, in _convert_listlike_datetimes(arg, format, name, utc, unit, errors, dayfirst, yearfirst, exact)
    441 if format is not None and not require_iso8601:
    442     return _to_datetime_with_format(
    443         arg,
    444         orig_arg,
   (...)
    449         errors,
    450     )
--> 452 result, tz_parsed = objects_to_datetime64ns(
    453     arg,
    454     dayfirst=dayfirst,
    455     yearfirst=yearfirst,
    456     utc=utc,
    457     errors=errors,
    458     require_iso8601=require_iso8601,
    459     allow_object=True,
    460     format=format,
    461     exact=exact,
    462 )
    464 if tz_parsed is not None:
    465     # We can take a shortcut since the datetime64 numpy array
    466     # is in UTC
    467     dta = DatetimeArray(result, dtype=tz_to_dtype(tz_parsed))

File ~/pandas-dev/pandas/core/arrays/datetimes.py:2162, in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, format, exact)
   2160 order: Literal["F", "C"] = "F" if flags.f_contiguous else "C"
   2161 try:
-> 2162     result, tz_parsed = tslib.array_to_datetime(
   2163         data.ravel("K"),
   2164         errors=errors,
   2165         utc=utc,
   2166         dayfirst=dayfirst,
   2167         yearfirst=yearfirst,
   2168         require_iso8601=require_iso8601,
   2169         format=format,
   2170         exact=exact,
   2171     )
   2172     result = result.reshape(data.shape, order=order)
   2173 except OverflowError as err:
   2174     # Exception is raised when a part of date is greater than 32 bit signed int

File ~/pandas-dev/pandas/_libs/tslib.pyx:453, in pandas._libs.tslib.array_to_datetime()
    451 @cython.wraparound(False)
    452 @cython.boundscheck(False)
--> 453 cpdef array_to_datetime(
    454     ndarray[object] values,
    455     str errors="raise",

File ~/pandas-dev/pandas/_libs/tslib.pyx:614, in pandas._libs.tslib.array_to_datetime()
    612     continue
    613 elif is_raise:
--> 614     raise ValueError(
    615         f"time data \"{val}\" at position {i} doesn't "
    616         f"match format \"{format}\""

ValueError: time data "20/05/2016 11:28:06" at position 1 doesn't match format "%Y-%m-%dT%H:%M:%S%z"

It's correct to raise here, as the second element doesn't match the format inferred from the first element

In the 01-Sep-2021 example, you'll need to pass %b in the format, else it'll be inferred to be %B and will (correctly) error, as Sep doesn't match that directive:

In [28]: pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], format='%d-%b-%Y %H:%M:%S')
Out[28]: DatetimeIndex(['2021-05-01', '2021-09-01'], dtype='datetime64[ns]', freq=None)

Closing for now then, but thanks for the report, and please do let me know if I've misunderstood anything heree

evgeniikozlov · 2022-12-19T07:36:45Z

@MarcoGorelli thanks for the explanation about "May" parsing, but it is still unclear why it works if errors="raise" (my second example). It is confusing a lot why errors option has any affect on parsing in this case.

MarcoGorelli · 2022-12-19T08:46:36Z

That was a bug, and was fixed as part of the PDEP4 change

On the main branch, you'd get an error:

In [3]: pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=True, errors="raise")
<ipython-input-3-08582d0c0651>:1: UserWarning: The argument 'infer_datetime_format' is deprecated and will be removed in a future version. A strict version of it is now the default, see https://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html. You can safely remove this argument.
  pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=True, errors="raise")
---------------------------------------------------------------------------
ValueError: time data '01-Sep-2021 00:00:00' does not match format '%d-%B-%Y %H:%M:%S' (match)

This behaviour will be available in pandas 2.0.0, which'll hopefully come out around February

mroeschke added Bug Datetime Datetime data dtype Timezones Timezone data dtype labels Feb 4, 2019

mroeschke mentioned this issue Mar 6, 2019

tricky timestamp conversion #25571

Closed

jreback mentioned this issue Sep 10, 2019

BUG: Fix to_datetime(errors='coerce') not swallowing all parser exceptions... #28367

Merged

5 tasks

mroeschke removed the Timezones Timezone data dtype label Mar 29, 2020

ghost mentioned this issue Sep 5, 2021

BUG: pd.to_datetime cannot infer and coerce dates with AM/PM and infer_datetime_format=True and errors="coerce". #43414

Closed

3 tasks

simonjayhawkins mentioned this issue Sep 5, 2021

BUG GH#43414 Infer and coerce datetimes am pm #43416

Closed

4 tasks

MarcoGorelli closed this as completed Dec 17, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

to_datetime() with errors=coerce and without return different values #25143

to_datetime() with errors=coerce and without return different values #25143

grleblanc commented Feb 4, 2019

INSTALLED VERSIONS

mroeschke commented Feb 4, 2019

mroeschke commented Feb 10, 2019 •

edited

Loading

saurav2608 commented Feb 13, 2019

ngutzmann commented Feb 13, 2019 •

edited

Loading

saurav2608 commented Feb 17, 2019

saurav2608 commented Feb 20, 2019

mroeschke commented Feb 20, 2019 •

edited

Loading

nicolasdaviaud commented Feb 26, 2019

ericman93 commented Jun 9, 2020 •

edited

Loading

evgeniikozlov commented Jun 11, 2021 •

edited

Loading

pandas.show_versions()
INSTALLED VERSIONS

jreback commented Jun 11, 2021

MarcoGorelli commented Dec 17, 2022

evgeniikozlov commented Dec 19, 2022

MarcoGorelli commented Dec 19, 2022

to_datetime() with errors=coerce and without return different values #25143

to_datetime() with errors=coerce and without return different values #25143

Comments

grleblanc commented Feb 4, 2019

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

mroeschke commented Feb 4, 2019

mroeschke commented Feb 10, 2019 • edited Loading

saurav2608 commented Feb 13, 2019

ngutzmann commented Feb 13, 2019 • edited Loading

saurav2608 commented Feb 17, 2019

saurav2608 commented Feb 20, 2019

mroeschke commented Feb 20, 2019 • edited Loading

nicolasdaviaud commented Feb 26, 2019

Output of pd.show_versions()

ericman93 commented Jun 9, 2020 • edited Loading

evgeniikozlov commented Jun 11, 2021 • edited Loading

Output of pd.show_versions()

pandas.show_versions() INSTALLED VERSIONS

jreback commented Jun 11, 2021

MarcoGorelli commented Dec 17, 2022

evgeniikozlov commented Dec 19, 2022

MarcoGorelli commented Dec 19, 2022

Output of `pd.show_versions()`

mroeschke commented Feb 10, 2019 •

edited

Loading

ngutzmann commented Feb 13, 2019 •

edited

Loading

mroeschke commented Feb 20, 2019 •

edited

Loading

Output of `pd.show_versions()`

ericman93 commented Jun 9, 2020 •

edited

Loading

evgeniikozlov commented Jun 11, 2021 •

edited

Loading

Output of `pd.show_versions()`

pandas.show_versions()
INSTALLED VERSIONS