BUG: pandas.to_datetime() does not respect exact format string with ISO8601 #12649

ssche · 2016-03-17T02:28:38Z

Code Sample, a copy-pastable example if possible

import pandas as pd
s = pd.Series(['2012-01-01 10:00', '2012-02-10 23:20', '2012-03-22 12:40', '2012-05-02 02:00', '2012-06-11 15:20', '2012-07-22 04:40','2012-08-31 18:00', '2012-10-11 07:20', '2012-11-20 20:40', '2012-12-31 10:00'])
print s
0    2012-01-01 10:00
1    2012-02-10 23:20
2    2012-03-22 12:40
3    2012-05-02 02:00
4    2012-06-11 15:20
5    2012-07-22 04:40
6    2012-08-31 18:00
7    2012-10-11 07:20
8    2012-11-20 20:40
9    2012-12-31 10:00
dtype: object
print pd.to_datetime(s, format='%Y-%m-%d', errors='raise', infer_datetime_format=False, exact=True)
0   2012-01-01 10:00:00
1   2012-02-10 23:20:00
2   2012-03-22 12:40:00
3   2012-05-02 02:00:00
4   2012-06-11 15:20:00
5   2012-07-22 04:40:00
6   2012-08-31 18:00:00
7   2012-10-11 07:20:00
8   2012-11-20 20:40:00
9   2012-12-31 10:00:00
dtype: datetime64[ns]

Expected Output

A ValueError exception because string '2012-01-01 10:00' (and all others of this series) does not comply with format string '%Y-%m-%d' and exact conversion is required.

output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 15.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None

pandas: 0.17.1
nose: 1.3.6
pip: 8.0.3
setuptools: 18.0
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.16.0
statsmodels: None
IPython: 3.0.0
sphinx: None
patsy: None
dateutil: 2.5.0
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.4.3
openpyxl: 2.2.2
xlrd: 0.9.3
xlwt: 0.8.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: None
html5lib: None
httplib2: 0.9.1
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
Jinja2: None

The text was updated successfully, but these errors were encountered:

ssche · 2016-03-17T02:44:18Z

The problem is this part in pd.to_datetime():

       if format is not None:
            # There is a special fast-path for iso8601 formatted
            # datetime strings, so in those cases don't use the inferred
            # format because this path makes process slower in this
            # special case
            format_is_iso8601 = _format_is_iso(format)
            print 'format_is_iso8601', format_is_iso8601
            if format_is_iso8601:
                require_iso8601 = not infer_datetime_format
                format = None

When I change the condition to if format is not None and infer_datetime_format, it works as expected. Is this something that could be changed in Pandas. I find it quite confusing that despite explicitly setting infer_datetime_format=False there's still some internal magic happening in this function. Others may feel the same, not sure...

jreback · 2016-03-17T02:47:59Z

nothing to do with that (well you can disable if you set infer_datetime_format=False). Its here: https://docs.python.org/3/library/re.html#search-vs-match

if format is specified and exact=True (the default) its a .match like search, meaning it must match at the beginning (as opposed to anywhere). So this is a valid match.

I suppose we could be more stricted on exact, e.g. maybe make it exact='match|search|False or something.

but why is this a problem? (as an aside you have an ISO8601 string, so you don't need to do anything to parse it).

ssche · 2016-03-17T02:54:03Z

I'm trying out a few date formats and want to determine the correct one. If I try '%Y-%m-%d' and '%Y-%m-%d %H:%M:%S', they both match which isn't quite true.

The format parameter is definitely set to None in the code I posted above, so I think it has something to do with the snippet... I find this counter-intuitive given the parameters I passed into this function.

jreback · 2016-03-17T03:01:50Z

hmm, actually I take that back. it IS removing the format BECAUSE its ISO8601. I am not sure it makes sense to change this as there is a perf penalty for parsing ISO strings with an actual format string (via regex). And we expect (and users expect) that the perf will be the same whether they pass a compat format or not.

Further if it actually does match ISO how is this a problem?

ssche · 2016-03-17T03:14:15Z

In my case, I can easily fix this in the application by not checking for multiple iso format strings. I am trying to detect date format ambiguities, i.e. more than one date format matching which is a generalisation of discerning the %m/%d vs. %d/%m case.

However, I was surprised to get this kind of behaviour from this function as I thought I disabled all auto-magic (infer_datetime_format=False, exact=True) behaviour. I wasn't concerned with possible performance hits.

Perhaps a comment in the documentation would suffice to clarify this behaviour if it cannot be turned off through parameters.

ssche · 2016-03-17T03:16:09Z

nothing to do with that (well you can disable if you set infer_datetime_format=False). Its here: https://docs.python.org/3/library/re.html#search-vs-match

Your intuition seemed to have been the same as mine... But infer_datetime_format=False doesn't work like this for this iso date case.

jreback · 2016-03-17T03:16:17Z

@ssche I don't see any ambiguity in an ISO8601 string. That's the point, it cannot be anything but dayfirst=False, yearfirst=False by-definition.

jreback · 2016-03-17T03:17:55Z

        if format is not None:
            # There is a special fast-path for iso8601 formatted
            # datetime strings, so in those cases don't use the inferred
            # format because this path makes process slower in this
            # special case
            format_is_iso8601 = _format_is_iso(format)
            if format_is_iso8601:
                require_iso8601 = not infer_datetime_format
                format = None

the code says it all. If you would like to update the doc-string, by all means.

ssche · 2016-03-17T03:18:08Z

The ambiguity detection is something I do in my application. I just provided this bit of information as background info because you asked.
The point here is that infer_datetime_format=False, exact=True do some auto-magic behaviour which I find unexpected.

jreback · 2016-03-17T03:19:38Z

@ssche its not unexpected, you gave it a format string which is ISO8601. I don't see any problem here. If you give it a non-ISO8601 string then it would indeed raise.

jreback · 2016-03-17T03:20:42Z

In [16]: pd.to_datetime(s, format='%Y-%m-%d%H', infer_datetime_format=False, exact=True, errors='raise')
ValueError: time data '2012-01-01 10:00' does not match format '%Y-%m-%d%H' (match)

# you don't need any arguments actually, the default will be to raise if it is not a matched format.
In [17]: pd.to_datetime(s, format='%Y-%m-%d%H')
ValueError: time data '2012-01-01 10:00' does not match format '%Y-%m-%d%H' (match)

ssche · 2016-03-17T03:21:48Z

So you don't find it confusing that the behaviour of this function is date format-specific, i.e. given another date format an exception is raised?

jreback · 2016-03-17T03:22:57Z

@ssche I am not sure what you mean. Of course its date-format specific. Its just special cased to ISO8601 to match the general format.

ssche · 2016-03-17T03:24:23Z

In the iso-format case, the date format doesn't need to be exact, but for all other formats an exception will be raised when the format doesn't match. That's what I find confusing.

jreback · 2016-03-17T03:28:00Z

and so? again why is this a problem. by definition they are ISO8601 which matches a variety of non-ambiguous formats (where the time is optional), and separators can be optional (pandas actually allows a lot more separators than the standard).

So we tend to parse more easily lots of formats. I still don't understand why this is a problem. It parses correctly as it should, even though you gave an underspecified format.

ssche · 2016-03-17T03:35:36Z

So we tend to parse more easily lots of formats. I still don't understand why this is a problem. It parses correctly as it should, even though you gave an underspecified format.

That's exactly my problem. I have a very specific use case which requires to raise an exception when the date format doesn't match. Trying '%Y-%m-%d' and '%Y-%m-%d %H:%M:%S' both won't raise an exception for parsing '2012-01-01 10:00', so the application is unable to detect the proper (=exact, =non underspecified, =non overspecified) date format (that's all I care about).

I tried to make this more strict through the parameters I quoted a number of times already, but this didn't make the date conversion more strict. I don't know how else I can make this function more strict to only accept date formats that I specify and raise an exception if they don't.

jreback · 2016-03-17T03:39:20Z

well you can try to make a change to the code to see if u can make it very strict but I suspect this will be quite tricky (maybe add exact='strict') or something

alternately in your code it's easy enough to check for lengths of the input strings

ssche · 2016-03-17T03:50:20Z

So the if format is not None and infer_datetime_format "fix" I proposed wouldn't be acceptable?

jreback · 2016-03-17T03:53:29Z

no that defeats the purpose of iso format detection

you could allow infer_datetime_format='strict' for maybe exact='striict') which when s format is specified could bypass the inference check (and then would fail because it's not an exact match)

rsheftel · 2016-10-13T22:05:24Z

Just adding my two cents because I was running into the same exact problem. My use case is that I have a program that uses read_csv to load a csv file where the index is either just a date, or date and time. If it has a time I want to make it UTC time zone, if not then no time zone. The pythonic way to do it I thought would be try to read_csv with a parser using to_datetime that would try to match one format, and if that raised an exception try the other format. But by not raising the exception if there is not an exact match that approach does not work.

I would be in favor of an exact='strict' or similar to truly enforce the exact match.

fersarr · 2018-04-24T15:36:56Z

this breaks a lot of tests for us as well. Very similar use case as @rsheftel. The strict mode that raises exceptions would work for us too.

philipCrunchboards · 2019-03-13T16:35:48Z

We're having the same problem.
I'd like to be able to add some argument(s) to the following code to make it either raise or return NaT:

> to_datetime(0.0, format="%Y-%m-%d", exact=True)
Timestamp('1970-01-01 00:00:00')

I like the sound of exact="strict".

jo47011 · 2019-11-04T15:47:55Z

We ran into the same problem. I already raised a question at stackoverflow:
Pandas to_datetime not setting NaT on wrong format
Is there a decision yet on how to proceed with this issue?

jrmistry · 2020-08-25T21:30:47Z

Hello all,

Not sure if my issue is the same but posting here for suggestions.

Core problem: the pd.to_datetime function does not respect original values even if I set errors = 'ignore'

See example below. The column of values is stored in a UTC timestamp in milliseconds. When I try to convert it to a datetime object, pandas automatically fills in the NaN values with NaT even though I specifically told it to ignore errors. Any suggestions on how to fix this? Thanks

IngErnestoAlvarez · 2020-12-12T17:21:36Z

I think that's not a problem, to_datetime changes the dtype of the list to datetime, and NaT is the equivalent of NaN on datetime dtype. When u convert NaN to datetime it will change to NaT and that's not an error.

rsheftel · 2021-05-22T18:24:52Z

The PR #35741 seems to say that in the future the date_parser in read_csv will be deprecated. Does anyone know what the right way to invoke a custom date parser will be in the future?

brandon-leapyear · 2022-02-28T17:35:30Z

✨ This is an old work account. Please reference @brandonchinn178 for all future communication ✨

We're also coming up against this. As a minimal example, this is very confusing behavior for me:

>>> pd.to_datetime('2022/02/01', format="%Y-%m-%d")
Timestamp('2022-02-01 00:00:00')
>>> pd.to_datetime('2022/02/01', format="%Y-%m")
Timestamp('2022-02-01 00:00:00')

We're trying to require the input to EXACTLY be %Y-%m-%d; all other formats should error. The logic mentioned earlier in this thread is very odd behavior: if the string passed to format is a prefix of any ISO formats, format is implicitly converted to allow any ISO string. IMO this "any ISO format" logic should be under a completely different flag, like iso_format=True, and format should be an exact match

pandas/pandas/_libs/tslibs/parsing.pyx

Line 856 in 3cf3073

def format_is_iso(f: str) -> bint:

EDIT: as a hacky workaround, you can do:

# some string of characters unlikely to be in the column
# if they are in the column, the error message could get messed up;
# you might want to explicitly parse the error message with regex
PREFIX = "!&*@&!#"

try:
    pd.to_datetime(PREFIX + df["col"], format=PREFIX + "%Y-%m-%d")
except ValueError as e:
    # fix error message to remove the prefix
    raise ValueError(e.args[0].replace(PREFIX, ""))

aflag · 2022-04-12T17:25:25Z

We have also been bitten by this. Not only does it match a longer date format, but it parses integer types too. For instance:

 pd.to_datetime([1], format="%Y-%m-%d %H:%M")

doesn't raise any errors and parses the date to be 1970-01-01 00:00:00.000000001 however, by just changing the separator it works in a very different (and, imho, less surprising) way:

 pd.to_datetime([1], format="%Y|%m|%d %H:%M")

raises ValueError: time data '1' does not match format '%Y|%m|%d %H:%M' (match).

I agree with @brandon-leapyear, a new flag should be created to enable this behaviour and format should always match exactly.

ssche · 2022-09-18T23:49:18Z

xref #12585, #48621

MarcoGorelli · 2022-10-18T10:56:52Z

I think it should be possible to make the fast iso8601 path respect exact, that'd make it unnecessary to add an extra argument - I'm giving this a go (will require #49024 to be in first though)

Right, I think this can be done, there's just two issues to solve here:

that

pandas/pandas/_libs/tslibs/src/datetime/np_datetime_strings.c

Lines 69 to 72 in 009c4c6

    
           int parse_iso_8601_datetime(const char *str, int len, int want_exc, 
        
                                       npy_datetimestruct *out, 
        
                                       NPY_DATETIMEUNIT *out_bestunit, 
        
                                       int *out_local, int *out_tzoffset) {

isn't strict with respect to exact, it just keeps parsing the string. The format could be preserved and passed to it.

that if an element isn't a string, then it won't even reach the iso8601 function, it'll just go to

pandas/pandas/_libs/tslib.pyx

Line 570 in 307c69a

elif is_integer_object(val) or is_float_object(val):

So, we should only let it get there if format is None

MarcoGorelli · 2022-10-19T15:27:53Z

I've opened https://github.com/MarcoGorelli/pandas/pull/5/files as a POC to address this

The code is very messy, and I'll clean it up before opening a PR to this repo, but @ssche @aflag @brandon-leapyear @philipCrunchboards @fersarr @rsheftel I just wanted to ask if you could take a look at

https://github.com/MarcoGorelli/pandas/blob/fc419d526ad1f7cf2a1174dae6f9385734e7283e/pandas/tests/tools/test_to_datetime.py#L1736-L1824

and check if it addresses the issue you ran into - thanks 🙏

ssche · 2022-10-20T23:03:54Z

Looks good @MarcoGorelli. One thing I would add is a test that tries to convert a list of datetime strings with a mix of strings (some comply with the format, some don't, so I would expect the whole conversion to fail). I would also add a few ambiguous strings with month/day swapped (#12649 (comment)).

MarcoGorelli · 2022-10-21T06:32:28Z

thanks for taking a look!

I would also add a few ambiguous strings with month/day swapped

Not sure what you mean here, sorry - isn't ISO8601 always month before day? If it doesn't hit this ISO8601 fast-path, then it should be addressed by PDEP0004

If you have a specific test case in mind I can check

ssche · 2022-10-23T12:08:30Z

From your question and the fact that you renamed the ticket, I think there’s a misunderstanding. This ticket had nothing to do with ISO8601 compliant formats.
The problem was that it would parse dates even though they were not compliant with the format string without raising an exception by using some “auto-magic” behaviour. My use case (and the one of many others here) required exact parsing across a vector of dates. The fact that some date strings resembled/equalled ISO8601 and therefore would be going through some fasttrack code path didn’t matter to me (from a user’s point of view). I simply couldn’t use the current behaviour in my application as it would corerce some dates (what I called “auto-magic) because I was looking for an error message to let the end user decide which format is valid.

MarcoGorelli · 2022-10-23T12:26:10Z

The problem was that it would parse dates even though they were not compliant with the format string without raising an exception by using some “auto-magic” behaviour

Currently, it only does that if the format is ISO8601. So yes, this has everything to with ISO8601

For example, the following raises (format isn't ISO8601)

pd.to_datetime('01-01-2000 00:00:00', format='%d-%m-%Y')

whereas the following doesn't (format is ISO8601):

pd.to_datetime('2000-01-01 00:00:00', format='%Y-%m-%d')

I agree that to a user, this shouldn't make a difference. So the issue is about fixing the ISO8601 fast-path so that it also respects exact

ssche · 2022-10-23T13:06:52Z

Currently, it only does that if the format is ISO8601. So yes, this has everything to with ISO8601

Oh right, so this is not a general problem, but one that only arises when ISO8601 formats are used. That explains a lot (in particular this lenghty discussion I had at the time).

Thanks for clarifying!

Not sure what you mean here, sorry - isn't ISO8601 always month before day?

I meant how an array of [‘2022-01-13’, ‘2022-13-01’] would be interpreted with format='%Y-%m-%d', but I guess this would raise an error now and not try to coerce the 2nd date also to Jan 13 (not sure if it ever did).

jreback added Datetime Datetime data dtype API Design labels Mar 17, 2016

jreback added the Docs label Mar 17, 2016

jreback added this to the Next Major Release milestone Mar 17, 2016

mroeschke removed the API Design label Apr 23, 2021

mroeschke added Enhancement Needs Discussion Requires discussion from core team before further action and removed Docs labels Apr 23, 2021

MarcoGorelli mentioned this issue Sep 22, 2022

BUG: to_datetime with specific format are incorrectly matching #48707

Closed

3 tasks

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

MarcoGorelli changed the title ~~pandas.to_datetime() does not respect exact format string~~ BUG: pandas.to_datetime() does not respect exact format string with ISO8601 Oct 19, 2022

MarcoGorelli mentioned this issue Oct 19, 2022

BUG: pandas.to_datetime() does not respect exact format string with ISO8601 MarcoGorelli/pandas#5

Closed

5 tasks

MarcoGorelli mentioned this issue Oct 19, 2022

BUG: pandas.to_datetime() does not respect exact format string with ISO8601 #49185

Closed

5 tasks

This was referenced Oct 21, 2022

DOC: remove statement that %S parses all the way up to nanosecond even if no decimal place present #49231

Closed

BUG: pandas.to_datetime() does not respect exact format string with ISO8601 #49232

Closed

MarcoGorelli mentioned this issue Oct 28, 2022

BUG: pandas.to_datetime() does not respect exact format string with ISO8601 #49333

Merged

5 tasks

MarcoGorelli mentioned this issue Nov 6, 2022

BUG: pd.to_datetime does not treat YYYYMMDD and YYYY/MM/DD in the same way as strptime #48440

Closed

3 tasks

MarcoGorelli closed this as completed in #49333 Nov 17, 2022

MarcoGorelli mentioned this issue Jan 18, 2023

Datetime parsing (PDEP-4): allow mixture of ISO formatted strings? #50411

Closed

BUG: pandas.to_datetime() does not respect exact format string with ISO8601 #12649

BUG: pandas.to_datetime() does not respect exact format string with ISO8601 #12649

Comments

ssche commented Mar 17, 2016

Code Sample, a copy-pastable example if possible

Expected Output

output of pd.show_versions()

INSTALLED VERSIONS

ssche commented Mar 17, 2016

jreback commented Mar 17, 2016

ssche commented Mar 17, 2016

jreback commented Mar 17, 2016

ssche commented Mar 17, 2016

ssche commented Mar 17, 2016

jreback commented Mar 17, 2016

jreback commented Mar 17, 2016

ssche commented Mar 17, 2016

jreback commented Mar 17, 2016

jreback commented Mar 17, 2016

ssche commented Mar 17, 2016

jreback commented Mar 17, 2016

ssche commented Mar 17, 2016

jreback commented Mar 17, 2016

ssche commented Mar 17, 2016

jreback commented Mar 17, 2016

ssche commented Mar 17, 2016

jreback commented Mar 17, 2016

rsheftel commented Oct 13, 2016

fersarr commented Apr 24, 2018 • edited Loading

philipCrunchboards commented Mar 13, 2019

jo47011 commented Nov 4, 2019

jrmistry commented Aug 25, 2020 • edited Loading

IngErnestoAlvarez commented Dec 12, 2020

rsheftel commented May 22, 2021

brandon-leapyear commented Feb 28, 2022 • edited Loading

aflag commented Apr 12, 2022 • edited Loading

ssche commented Sep 18, 2022 • edited Loading

MarcoGorelli commented Oct 18, 2022 • edited Loading

MarcoGorelli commented Oct 19, 2022

ssche commented Oct 20, 2022

MarcoGorelli commented Oct 21, 2022

ssche commented Oct 23, 2022

MarcoGorelli commented Oct 23, 2022

ssche commented Oct 23, 2022

output of `pd.show_versions()`

fersarr commented Apr 24, 2018 •

edited

Loading

jrmistry commented Aug 25, 2020 •

edited

Loading

brandon-leapyear commented Feb 28, 2022 •

edited

Loading

aflag commented Apr 12, 2022 •

edited

Loading

ssche commented Sep 18, 2022 •

edited

Loading

MarcoGorelli commented Oct 18, 2022 •

edited

Loading