to_datetime(foo, errors='coerce') does not swallow all errors #28299

Closed
miggec opened this issue Sep 5, 2019 · 4 comments · Fixed by #28367
Labels
Bug Datetime Datetime data dtype

@miggec
Contributor

miggec commented Sep 5, 2019

Code Sample

import pandas

# this fails with ValueError in 0.25.1:
pandas.to_datetime('200622-12-31', errors='coerce')
# but returns Timestamp('2022-06-21 19:00:00') in pandas 0.23.4

# this also fails: 
pandas.to_datetime('111111-24-11', errors='coerce')

# but this does not:
pandas.to_datetime('111111-23-11', errors='coerce')

Problem description

I have some text files containing malformed dates, which I parse with the code above. While migrating my code from pandas 0.23.4 to 0.25.1, I got the following traceback:

.../my_file.py in <module>
----> 1 pandas.to_datetime('200622-12-31', errors='coerce')

.../lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    206                 else:
    207                     kwargs[new_arg_name] = new_arg_value
--> 208             return func(*args, **kwargs)
    209 
    210         return wrapper

.../lib/python3.7/site-packages/pandas/core/tools/datetimes.py in to_datetime(arg, errors, dayfirst, yearfirst, utc, box, format, exact, unit, infer_datetime_format, origin, cache)
    794             result = convert_listlike(arg, box, format)
    795     else:
--> 796         result = convert_listlike(np.array([arg]), box, format)[0]
    797 
    798     return result

.../lib/python3.7/site-packages/pandas/core/tools/datetimes.py in _convert_listlike_datetimes(arg, box, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
    461             errors=errors,
    462             require_iso8601=require_iso8601,
--> 463             allow_object=True,
    464         )
    465 

.../lib/python3.7/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object)
   1982             return values.view("i8"), tz_parsed
   1983         except (ValueError, TypeError):
-> 1984             raise e
   1985 
   1986     if tz_parsed is not None:

.../lib/python3.7/site-packages/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object)
   1973             dayfirst=dayfirst,
   1974             yearfirst=yearfirst,
-> 1975             require_iso8601=require_iso8601,
   1976         )
   1977     except ValueError as e:

pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()

pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()

ValueError: offset must be a timedelta strictly between -timedelta(hours=24) and timedelta(hours=24).

Expected Output

The main expectation is that an exception is not raised.

I would expect pandas.to_datetime('200622-12-31', errors='coerce') to return NaT, but pandas 0.23.4 parses it into Timestamp('2022-06-21 19:00:00').

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-58-generic
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : 0.29.10
pytest : 4.6.3
hypothesis : 4.7.3
sphinx : None
blosc : None
feather : 0.4.0
xlsxwriter : 1.1.8
lxml.etree : 4.3.3
html5lib : 1.0.1
pymysql : 0.9.3
psycopg2 : 2.7.7 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : 0.3.0
lxml.etree : 4.3.3
matplotlib : 3.1.1
numexpr : 2.6.9
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
s3fs : None
scipy : 1.3.0
sqlalchemy : 1.2.14
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.1.8

@WillAyd
Member

WillAyd commented Sep 5, 2019

Yeah, that does seem buggy. I think it should return NaT as well. Care to investigate?

@WillAyd added the Bug and Datetime labels on Sep 5, 2019
@miggec
Contributor Author

miggec commented Sep 6, 2019

@WillAyd thanks - I'm happy to dig into the project to investigate, though I'm quite unfamiliar with the inner workings of pandas.

Specifically, the error is raised at the bottom of this block, where the ValueError from tslib.array_to_datetime is re-raised:

# pandas/core/arrays/datetimes.py from line 1965, in objects_to_datetime64ns:

    # if str-dtype, convert
    data = np.array(data, copy=False, dtype=np.object_)

    try:
        result, tz_parsed = tslib.array_to_datetime(
            data,
            errors=errors,
            utc=utc,
            dayfirst=dayfirst,
            yearfirst=yearfirst,
            require_iso8601=require_iso8601,
        )
    except ValueError as e:
        try:
            values, tz_parsed = conversion.datetime_to_datetime64(data)
            # If tzaware, these values represent unix timestamps, so we
            #  return them as i8 to distinguish from wall times
            return values.view("i8"), tz_parsed
        except (ValueError, TypeError):
            raise e

Simple reproducible example:

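import numpy as np
from pandas._libs import tslib  # private module, imported the same way in pandas/core/arrays/datetimes.py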

data = np.array(['200622-12-31'], copy=False, dtype=np.object_)
tslib.array_to_datetime(data, errors='coerce', utc=False, dayfirst=False, yearfirst=False, require_iso8601=False)

It looks like the outer try block in objects_to_datetime64ns does not respect the errors argument. It seems overly simple, but the fix here is probably to return NaT if errors == 'coerce' and any Exception is raised.

EDIT: corrected the reproducible example to include errors='coerce'

@miggec
Contributor Author

miggec commented Sep 6, 2019

(happy to raise a PR if the above suggestion seems sensible)

EDIT: the above suggestion would not work at all. The error needs to be handled somewhere in the Cython internals (in pandas._libs.tslib.array_to_datetime)

@miggec
Contributor Author

miggec commented Sep 9, 2019

After further investigation it looks like this is caused by this open issue in the dateutil project:
dateutil/dateutil#188

To reproduce:

In [6]: import dateutil.parser
In [7]: timestamp_value = '200622-12-31'
In [8]: timestamp = dateutil.parser.parse(timestamp_value)
In [10]: timestamp.utcoffset()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-a2bb0af2bddf> in <module>
----> 1 timestamp.utcoffset()

ValueError: offset must be a timedelta strictly between -timedelta(hours=24) and timedelta(hours=24).

The error isn't raised on parsing, but only when retrieving the offset parsed from the string.
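
The same deferred failure can be sketched with the standard library alone, without dateutil. This is only an illustration under assumptions: `LazyOffset` is a hypothetical stand-in for a fixed-offset tzinfo that stores its offset unchecked (it is not dateutil's class), and the -31 hour offset is an arbitrary out-of-range value, not the offset dateutil actually derives from '200622-12-31'.

```python
from datetime import datetime, timedelta, tzinfo


class LazyOffset(tzinfo):
    """Hypothetical fixed-offset tzinfo that does not validate its offset."""

    def __init__(self, offset):
        # Stored as-is; no range check happens at construction time.
        self._offset = offset

    def utcoffset(self, dt):
        return self._offset

    def dst(self, dt):
        return timedelta(0)

    def tzname(self, dt):
        return None


# Building the datetime succeeds even though the offset is out of range.
stamp = datetime(2006, 12, 31, tzinfo=LazyOffset(timedelta(hours=-31)))

# The datetime module only validates the offset when it is requested.
try:
    stamp.utcoffset()
    offset_error = None
except ValueError as exc:
    offset_error = exc

print(type(offset_error).__name__, offset_error)
```

So any code that guards only the construction (or parsing) step never sees the ValueError; it surfaces at the first later call that needs the offset.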

This means that the try/except around datetime-string parsing in pandas/_libs/tslib.pyx does not catch the error, because utcoffset() is called after the try block ends:

# from line 608
                        try:
                            py_dt = parse_datetime_string(val,
                                                          dayfirst=dayfirst,
                                                          yearfirst=yearfirst)
                        except Exception:
                            if is_coerce:
                                iresult[i] = NPY_NAT
                                continue
                            raise TypeError("invalid string coercion to "
                                            "datetime")

                        # If the dateutil parser returned tzinfo, capture it
                        # to check if all arguments have the same tzinfo
                        tz = py_dt.utcoffset()

The call to utcoffset() should move inside the try block here, which would clear up this bug. Happy to make this change when I get the time.
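
A rough pure-Python sketch of that pattern, not the actual Cython change: `fake_parse`, `LazyOffset` and `coerce_all` are hypothetical names made up for this illustration, the input format ("YYYY-MM-DD" plus an optional offset in hours) is invented, and `None` stands in for NaT. The point is only that the offset lookup sits inside the same try as the parse, so a late ValueError is coerced like any other parse failure.

```python
from datetime import datetime, timedelta, tzinfo


class LazyOffset(tzinfo):
    """Hypothetical fixed-offset tzinfo that does not validate its offset."""

    def __init__(self, offset):
        self._offset = offset

    def utcoffset(self, dt):
        return self._offset

    def dst(self, dt):
        return timedelta(0)

    def tzname(self, dt):
        return None


def fake_parse(text):
    """Hypothetical parser: 'YYYY-MM-DD' optionally followed by ' <hours>'."""
    date_part, _, hours = text.partition(" ")
    parsed = datetime.strptime(date_part, "%Y-%m-%d")
    if hours:
        # The offset is attached without any range check, as in the bug.
        parsed = parsed.replace(tzinfo=LazyOffset(timedelta(hours=int(hours))))
    return parsed


def coerce_all(values):
    """Parse each value; anything that fails becomes None (stand-in for NaT)."""
    results = []
    for val in values:
        try:
            parsed = fake_parse(val)
            # Inside the try: an out-of-range offset raises here, not later.
            parsed.utcoffset()
        except Exception:
            results.append(None)
            continue
        results.append(parsed)
    return results


results = coerce_all(["2019-09-05 2", "2006-12-31 -31", "garbage"])
```

With these inputs, the first value parses with a valid +2 hour offset, the second fails only at the offset lookup, and the third fails at parsing; both failures end up coerced instead of raising.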
