Skip to content

to_datetime() with errors=coerce and without return different values #25143

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
grleblanc opened this issue Feb 4, 2019 · 14 comments
Closed

to_datetime() with errors=coerce and without return different values #25143

grleblanc opened this issue Feb 4, 2019 · 14 comments
Labels
Bug Datetime Datetime data dtype

Comments

@grleblanc
Copy link

Code Sample, a copy-pastable example if possible

Without errors='coerce'

>>> d = { 'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00'] }
>>> df = pd.DataFrame(data=d)
>>> d
{'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00']}
>>> df
                          ts
0     2019-02-02 08:07:13+00
1  2019-02-02 08:03:22.54+00
>>> df.dtypes
ts    object
dtype: object

>>> df['ts'] = pd.to_datetime(df['ts'], infer_datetime_format=True)
>>> df
                                ts
0        2019-02-02 08:07:13+00:00
1 2019-02-02 08:03:22.540000+00:00
>>> df.dtypes
ts    datetime64[ns, UTC]
dtype: object

With errors='coerce'

>>> d = { 'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00'] }
>>> df = pd.DataFrame(data=d)
>>> d
{'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00']}
>>> df
                          ts
0     2019-02-02 08:07:13+00
1  2019-02-02 08:03:22.54+00
>>> df.dtypes
ts    object
dtype: object
>>> df['ts'] = pd.to_datetime(df['ts'], infer_datetime_format=True, errors='coerce')
>>> df
                   ts
0 2019-02-02 08:07:13
1                 NaT

Problem description

The functionality of to_datetime() with errors='coerce' is different than without. If I understand some of the other issues raised on this topic correctly, the functionality is different in some cases by design. In this case, the dates are very similiar, although different format.

Expected Output

>>> d = { 'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00'] }
>>> df = pd.DataFrame(data=d)
>>> d
{'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00']}
>>> df
                          ts
0     2019-02-02 08:07:13+00
1  2019-02-02 08:03:22.54+00
>>> df.dtypes
ts    object
dtype: object

>>> df['ts'] = pd.to_datetime(df['ts'], infer_datetime_format=True, errors='coerce')
>>> df
                                ts
0        2019-02-02 08:07:13+00:00
1 2019-02-02 08:03:22.540000+00:00
>>> df.dtypes
ts    datetime64[ns, UTC]
dtype: object

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: None
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 3.8.0
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.8.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@mroeschke
Copy link
Member

Agreed this looks strange. Investigation and PR's welcome!

@mroeschke mroeschke added Bug Datetime Datetime data dtype Timezones Timezone data dtype labels Feb 4, 2019
@mroeschke
Copy link
Member

mroeschke commented Feb 10, 2019

So in to_datetime, we have a lot of sub functions that also attempts error handling while the the logic of when to return a result is inconsistent as well. I have a feeling that pulling out the error handling logic at more of a top level would help make this keyword return a more consistent result.

@saurav2608
Copy link

I will have a look at this.

@ngutzmann
Copy link

ngutzmann commented Feb 13, 2019

Perhaps this is a bit of useful information. In pandas 0.23.4 the functionality looks to be correct:

>>> import pandas as pd
>>> pd.__version__
'0.23.4'
>>> d = { 'ts': ['2019-02-02 08:07:13+00', '2019-02-02 08:03:22.54+00'] }
>>> df = pd.DataFrame(data=d)
>>> df
                          ts
0     2019-02-02 08:07:13+00
1  2019-02-02 08:03:22.54+00
>>> df['ts'] = pd.to_datetime(df['ts'], infer_datetime_format=True, errors='coerce')
>>> df
                       ts
0 2019-02-02 08:07:13.000
1 2019-02-02 08:03:22.540
>>> df.dtypes
ts    datetime64[ns]
dtype: object

@saurav2608
Copy link

I think we need some discussion on what is the correct behavior in some of the corner cases. For example in the below case (on version 0.24.0). I think as errors is 'raise' the second command should raise an error. Currently, we retry parsing on error if 'infer_datetime_format' is True. And reading the code here(

if result is None:
) makes me think that this behavior is by design.

I think, if errors = 'raise' and anytime we encounter error (even with infer_datetime_format) we should raise and stop.

In [12]: s = pd.Series(np.array(['01/01/2011 00:00:00', 
    ...:                                 '01-02-2011 00:00:00', 
    ...:                                 '2011-01-03T00:00:00']))                                                                                                                 

In [13]: s                                                                                                                                                                        
Out[13]: 
0    01/01/2011 00:00:00
1    01-02-2011 00:00:00
2    2011-01-03T00:00:00
dtype: object
In [19]: pd.to_datetime(s, infer_datetime_format=True, errors='coerce')                                                                                                           
Out[19]: 
0   2011-01-01
1          NaT
2          NaT
dtype: datetime64[ns]

In [20]: pd.to_datetime(s, infer_datetime_format=True, errors='raise')                                                                                                            
Out[20]: 
0   2011-01-01
1   2011-01-02
2   2011-01-03
dtype: datetime64[ns]

@saurav2608
Copy link

I think we need some discussion on what is the correct behavior in some of the corner cases. For example in the below case (on version 0.24.0). I think as errors is 'raise' the second command should raise an error. Currently, we retry parsing on error if 'infer_datetime_format' is True. And reading the code here(

pandas/pandas/core/tools/datetimes.py

Line 296 in 4a20d5b

if result is None:
) makes me think that this behavior is by design.
I think, if errors = 'raise' and anytime we encounter error (even with infer_datetime_format) we should raise and stop.

In [12]: s = pd.Series(np.array(['01/01/2011 00:00:00', 
    ...:                                 '01-02-2011 00:00:00', 
    ...:                                 '2011-01-03T00:00:00']))                                                                                                                 

In [13]: s                                                                                                                                                                        
Out[13]: 
0    01/01/2011 00:00:00
1    01-02-2011 00:00:00
2    2011-01-03T00:00:00
dtype: object
In [19]: pd.to_datetime(s, infer_datetime_format=True, errors='coerce')                                                                                                           
Out[19]: 
0   2011-01-01
1          NaT
2          NaT
dtype: datetime64[ns]

In [20]: pd.to_datetime(s, infer_datetime_format=True, errors='raise')                                                                                                            
Out[20]: 
0   2011-01-01
1   2011-01-02
2   2011-01-03
dtype: datetime64[ns]

@mroeschke : any thoughts on this.

@mroeschke
Copy link
Member

mroeschke commented Feb 20, 2019

So the core issue is that our format guessing function is unable to guess %z, UTC offsets.

def _guess_datetime_format(dt_str, dayfirst=False, dt_str_parse=du_parse,

I agree with your suggestion though; errors='raise' should always raise whenever there's an error.

EDIT: Actually this would fail independently of timezones. Since infer_datetime_format guesses a format based on the first non-null time string, and the 2nd string is the example already is in a different format than the first.

@nicolasdaviaud
Copy link

I have a similar issue on 0.24.1, it also seems to be related to different time zones in the same query:

dates = ['2016-05-19T10:27:05', '20/05/2016 11:28:06', '']
print(pd.to_datetime(dates, errors='raise', infer_datetime_format=True, box=False))
print(pd.to_datetime(dates, errors='coerce', infer_datetime_format=True, box=False))

returns

['2016-05-19T10:27:05.000000000' '2016-05-20T11:28:06.000000000' 'NaT']
['2016-05-19T10:27:05.000000000' '2016-05-20T11:28:06.000000000' 'NaT']

but adding a Z to the first date changes the behaviour on the second

dates = ['2016-05-19T10:27:05Z', '20/05/2016 11:28:06', '']
print(pd.to_datetime(dates, errors='raise', infer_datetime_format=True, box=False))
print(pd.to_datetime(dates, errors='coerce', infer_datetime_format=True, box=False))

returns

[datetime.datetime(2016, 5, 19, 10, 27, 5, tzinfo=tzutc()) datetime.datetime(2016, 5, 20, 11, 28, 6) 'NaT']
['2016-05-19T10:27:05.000000000' 'NaT' 'NaT']

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.7.2.final.0 python-bits: 64 OS: Darwin OS-release: 18.2.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: fr_FR.UTF-8 LOCALE: fr_FR.UTF-8

pandas: 0.24.1
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.16.1
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@ericman93
Copy link

ericman93 commented Jun 9, 2020

same issue in 1.0.4

args = {
    'cache': True,
    'format': None,
    'infer_datetime_format': True
}
ser = pd.Series(['03-12-2016', '29-07-2017', '07-05-2016'])
pd.to_datetime(ser, errors='raise', **args)

outputs:
0   2016-03-12
1   2017-07-29
2   2016-07-05
dtype: datetime64[ns]
pd.to_datetime(ser, errors='coerce', **args)

outputs:
0   2016-03-12
1          NaT
2   2016-07-05
dtype: datetime64[ns]

@evgeniikozlov
Copy link

evgeniikozlov commented Jun 11, 2021

Still reproduced in 1.2.4

>>pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=True, errors="coerce")
DatetimeIndex(['2021-05-01', 'NaT'], dtype='datetime64[ns]', freq=None)

>>pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=True, errors="raise")
DatetimeIndex(['2021-05-01', '2021-09-01'], dtype='datetime64[ns]', freq=None)

>>pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=False, errors="coerce")
DatetimeIndex(['2021-05-01', '2021-09-01'], dtype='datetime64[ns]', freq=None)

The output is dependent on combination of infer_datetime_format and errors arguments.

Output of pd.show_versions()

pandas.show_versions()
INSTALLED VERSIONS

commit : 2cb9652
python : 3.7.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.2.4
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.2
setuptools : 52.0.0.post20210125
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@jreback
Copy link
Contributor

jreback commented Jun 11, 2021

@evgeniikozlov hence the open issue label

PRs from the community are how these things get fixed

@MarcoGorelli
Copy link
Member

As of PDEP4, infer_datetime_format is obsolete

I've tried all the examples here, and they're all now behaving as expected

For example:

In [26]: dates = ['2016-05-19T10:27:05Z', '20/05/2016 11:28:06', '']
    ...: pd.to_datetime(dates)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[26], line 2
      1 dates = ['2016-05-19T10:27:05Z', '20/05/2016 11:28:06', '']
----> 2 pd.to_datetime(dates)

File ~/pandas-dev/pandas/core/tools/datetimes.py:1098, in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
   1096         result = _convert_and_box_cache(argc, cache_array)
   1097     else:
-> 1098         result = convert_listlike(argc, format)
   1099 else:
   1100     result = convert_listlike(np.array([arg]), format)[0]

File ~/pandas-dev/pandas/core/tools/datetimes.py:452, in _convert_listlike_datetimes(arg, format, name, utc, unit, errors, dayfirst, yearfirst, exact)
    441 if format is not None and not require_iso8601:
    442     return _to_datetime_with_format(
    443         arg,
    444         orig_arg,
   (...)
    449         errors,
    450     )
--> 452 result, tz_parsed = objects_to_datetime64ns(
    453     arg,
    454     dayfirst=dayfirst,
    455     yearfirst=yearfirst,
    456     utc=utc,
    457     errors=errors,
    458     require_iso8601=require_iso8601,
    459     allow_object=True,
    460     format=format,
    461     exact=exact,
    462 )
    464 if tz_parsed is not None:
    465     # We can take a shortcut since the datetime64 numpy array
    466     # is in UTC
    467     dta = DatetimeArray(result, dtype=tz_to_dtype(tz_parsed))

File ~/pandas-dev/pandas/core/arrays/datetimes.py:2162, in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object, format, exact)
   2160 order: Literal["F", "C"] = "F" if flags.f_contiguous else "C"
   2161 try:
-> 2162     result, tz_parsed = tslib.array_to_datetime(
   2163         data.ravel("K"),
   2164         errors=errors,
   2165         utc=utc,
   2166         dayfirst=dayfirst,
   2167         yearfirst=yearfirst,
   2168         require_iso8601=require_iso8601,
   2169         format=format,
   2170         exact=exact,
   2171     )
   2172     result = result.reshape(data.shape, order=order)
   2173 except OverflowError as err:
   2174     # Exception is raised when a part of date is greater than 32 bit signed int

File ~/pandas-dev/pandas/_libs/tslib.pyx:453, in pandas._libs.tslib.array_to_datetime()
    451 @cython.wraparound(False)
    452 @cython.boundscheck(False)
--> 453 cpdef array_to_datetime(
    454     ndarray[object] values,
    455     str errors="raise",

File ~/pandas-dev/pandas/_libs/tslib.pyx:614, in pandas._libs.tslib.array_to_datetime()
    612     continue
    613 elif is_raise:
--> 614     raise ValueError(
    615         f"time data \"{val}\" at position {i} doesn't "
    616         f"match format \"{format}\""

ValueError: time data "20/05/2016 11:28:06" at position 1 doesn't match format "%Y-%m-%dT%H:%M:%S%z"

It's correct to raise here, as the second element doesn't match the format inferred from the first element

In the 01-Sep-2021 example, you'll need to pass %b in the format, else it'll be inferred to be %B and will (correctly) error, as Sep doesn't match that directive:

In [28]: pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], format='%d-%b-%Y %H:%M:%S')
Out[28]: DatetimeIndex(['2021-05-01', '2021-09-01'], dtype='datetime64[ns]', freq=None)

Closing for now then, but thanks for the report, and please do let me know if I've misunderstood anything heree

@evgeniikozlov
Copy link

@MarcoGorelli thanks for the explanation about "May" parsing, but it is still unclear why it works if errors="raise" (my second example). It is confusing a lot why errors option has any affect on parsing in this case.

@MarcoGorelli
Copy link
Member

That was a bug, and was fixed as part of the PDEP4 change

On the main branch, you'd get an error:

In [3]: pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=True, errors="raise")
<ipython-input-3-08582d0c0651>:1: UserWarning: The argument 'infer_datetime_format' is deprecated and will be removed in a future version. A strict version of it is now the default, see https://pandas.pydata.org/pdeps/0004-consistent-to-datetime-parsing.html. You can safely remove this argument.
  pandas.to_datetime(['01-May-2021 00:00:00', '01-Sep-2021 00:00:00'], infer_datetime_format=True, errors="raise")
---------------------------------------------------------------------------
ValueError: time data '01-Sep-2021 00:00:00' does not match format '%d-%B-%Y %H:%M:%S' (match)

This behaviour will be available in pandas 2.0.0, which'll hopefully come out around February

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype
Projects
None yet
Development

No branches or pull requests

9 participants