BUG: pandas.to_datetime() does not respect exact format string with ISO8601 #12649
Comments
The problem is this part in
When I change the condition to |
nothing to do with that (well you can disable if you set if I suppose we could be more strict on but why is this a problem? (As an aside, you have an ISO8601 string, so you don't need to do anything to parse it.) |
I'm trying out a few date formats and want to determine the correct one. If I try The |
Hmm, actually I take that back. It IS removing the format BECAUSE it's ISO8601. I am not sure it makes sense to change this, as there is a perf penalty for parsing ISO strings with an actual format string (via regex). We expect (and users expect) that the perf will be the same whether they pass a compatible format or not. Further, if it actually does match ISO, how is this a problem? |
In my case, I can easily fix this in the application by not checking for multiple iso format strings. I am trying to detect date format ambiguities, i.e. more than one date format matching which is a generalisation of discerning the However, I was surprised to get this kind of behaviour from this function as I thought I disabled all auto-magic ( Perhaps a comment in the documentation would suffice to clarify this behaviour if it cannot be turned off through parameters. |
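For readers who want to see what such an ambiguity check could look like, here is a minimal sketch. It uses only the standard library's `datetime.strptime`, which raises `ValueError` unless the whole string matches the format. The helper name `matching_formats` and the candidate list are made up for illustration; this is not the code from the application discussed here, and it is not how pandas parses dates.

```python
from datetime import datetime


def matching_formats(value, candidates):
    """Return the candidate formats that parse `value` exactly."""
    matches = []
    for fmt in candidates:
        try:
            # strptime raises ValueError on a mismatch or on leftover text
            datetime.strptime(value, fmt)
        except ValueError:
            continue
        matches.append(fmt)
    return matches


# Hypothetical candidates: day-first, month-first, and ISO-style dates
candidates = ["%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d"]
ambiguous = matching_formats("01/02/2012", candidates)    # two formats match
unambiguous = matching_formats("13/02/2012", candidates)  # only day-first matches
```

A value is ambiguous when more than one format is returned. The check only works if every candidate is matched strictly, which is why the lenient ISO8601 handling matters for this use case.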
Your intuition seemed to have been the same as mine... But |
@ssche I don't see any ambiguity in an ISO8601 string. That's the point, it cannot be anything but |
the code says it all. If you would like to update the doc-string, by all means. |
The ambiguity detection is something I do in my application. I just provided this bit of information as background info because you asked. |
@ssche it's not unexpected, you gave it a |
|
So you don't find it confusing that the behaviour of this function is date format-specific, i.e. given another date format an exception is raised? |
@ssche I am not sure what you mean. Of course it's date-format specific. It's just special-cased to ISO8601 to match the general format. |
In the iso-format case, the date format doesn't need to be exact, but for all other formats an exception will be raised when the format doesn't match. That's what I find confusing. |
and so? again why is this a problem. by definition they are ISO8601 which matches a variety of non-ambiguous formats (where the time is optional), and separators can be optional (pandas actually allows a lot more separators than the standard). So we tend to parse more easily lots of formats. I still don't understand why this is a problem. It parses correctly as it should, even though you gave an underspecified format. |
That's exactly my problem. I have a very specific use case which requires raising an exception when the date format doesn't match. Trying I tried to make this more strict through the parameters I quoted a number of times already, but this didn't make the date conversion more strict. I don't know how else I can make this function strict enough to accept only the date formats that I specify and raise an exception for anything else. |
Well, you can try to make a change to the code to see if you can make it very strict, but I suspect this will be quite tricky (maybe add exact='strict' or something). Alternatively, in your code it's easy enough to check the lengths of the input strings. |
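The length check suggested above could be sketched roughly as follows. This is an illustration only: the helper name `wrong_length` is hypothetical, and it assumes the format uses only fixed-width numeric directives such as `%Y`, `%m`, `%d`, `%H`, `%M` and `%S`, so that formatting any reference date gives the expected string length.

```python
from datetime import datetime


def wrong_length(values, fmt):
    """Return the values whose length differs from what `fmt` produces.

    Assumes `fmt` contains only fixed-width numeric directives.
    """
    # Format an arbitrary reference date to learn the expected width
    expected = len(datetime(2000, 1, 1).strftime(fmt))
    return [v for v in values if len(v) != expected]


# The second string carries a time component the format does not describe
rejected = wrong_length(["2012-01-01", "2012-01-01 10:00"], "%Y-%m-%d")
```

This only catches length mismatches. A string of the right length with different separators, such as '2022/02/01', would still pass the check.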
So the |
No, that defeats the purpose of ISO format detection. You could allow infer_datetime_format='strict' (or maybe exact='strict'), which, when a format is specified, could bypass the inference check (and then would fail because it's not an exact match). |
Just adding my two cents because I was running into the same exact problem. My use case is that I have a program that uses read_csv to load a csv file where the index is either just a date, or date and time. If it has a time I want to make it UTC time zone, if not then no time zone. The pythonic way to do it I thought would be try to read_csv with a parser using to_datetime that would try to match one format, and if that raised an exception try the other format. But by not raising the exception if there is not an exact match that approach does not work. I would be in favor of an exact='strict' or similar to truly enforce the exact match. |
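The try-one-format-then-the-other approach described in this comment depends on the first attempt failing loudly. A minimal sketch of that pattern, using the standard library's strict `datetime.strptime` rather than `to_datetime`, might look like the following. The function name `parse_index_value` and the two format strings are assumptions chosen for illustration.

```python
from datetime import datetime, timezone


def parse_index_value(text,
                      datetime_fmt="%Y-%m-%d %H:%M:%S",
                      date_fmt="%Y-%m-%d"):
    """Parse `text` as a UTC date-time if it has a time, else as a naive date."""
    try:
        # Strict match: a date-only string raises ValueError here
        parsed = datetime.strptime(text, datetime_fmt)
    except ValueError:
        # Fall back to date only; this raises ValueError if it also fails
        return datetime.strptime(text, date_fmt)
    return parsed.replace(tzinfo=timezone.utc)


with_time = parse_index_value("2012-01-01 10:00:00")  # timezone-aware, UTC
date_only = parse_index_value("2012-01-01")           # naive, no timezone
```

With a parser that silently accepts a partial match, the first branch would never fail and the fallback would never run, which is the problem reported here.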
this breaks a lot of tests for us as well. Very similar use case as @rsheftel. The strict mode that raises exceptions would work for us too. |
We're having the same problem.
I like the sound of |
We ran into the same problem. I already raised a question at stackoverflow: |
Hello all, Not sure if my issue is the same but posting here for suggestions. Core problem: the pd.to_datetime function does not respect original values even if I set errors = 'ignore' See example below. The column of values is stored in a UTC timestamp in milliseconds. When I try to convert it to a datetime object, pandas automatically fills in the NaN values with NaT even though I specifically told it to ignore errors. Any suggestions on how to fix this? Thanks |
I think that's not a problem: to_datetime changes the dtype of the list to datetime, and NaT is the equivalent of NaN for the datetime dtype. When you convert NaN to datetime it becomes NaT, and that's not an error. |
The PR #35741 seems to say that in the future the date_parser in read_csv will be deprecated. Does anyone know what the right way to invoke a custom date parser will be in the future? |
✨ This is an old work account. Please reference @brandonchinn178 for all future communication ✨
We're also coming up against this. As a minimal example, this is very confusing behavior for me:
>>> pd.to_datetime('2022/02/01', format="%Y-%m-%d")
Timestamp('2022-02-01 00:00:00')
>>> pd.to_datetime('2022/02/01', format="%Y-%m")
Timestamp('2022-02-01 00:00:00')
We're trying to require the input to EXACTLY be pandas/pandas/_libs/tslibs/parsing.pyx Line 856 in 3cf3073
EDIT: as a hacky workaround, you can do:
# some string of characters unlikely to be in the column
# if they are in the column, the error message could get messed up;
# you might want to explicitly parse the error message with regex
PREFIX = "!&*@&!#"
try:
    pd.to_datetime(PREFIX + df["col"], format=PREFIX + "%Y-%m-%d")
except ValueError as e:
    # fix error message to remove the prefix
    raise ValueError(e.args[0].replace(PREFIX, "")) |
We have also been bitten by this. Not only does it match a longer date format, but it parses integer types too. For instance:
doesn't raise any errors and parses the date to be
raises I agree with @brandon-leapyear, a new flag should be created to enable this behaviour and format should always match exactly. |
I think it should be possible to make the fast iso8601 path respect Right, I think this can be done, there's just two issues to solve here:
pandas/pandas/_libs/tslibs/src/datetime/np_datetime_strings.c Lines 69 to 72 in 009c4c6
isn't strict with respect to
Line 570 in 307c69a
So, we should only let it get there if |
I've opened https://github.com/MarcoGorelli/pandas/pull/5/files as a POC to address this. The code is very messy, and I'll clean it up before opening a PR to this repo, but @ssche @aflag @brandon-leapyear @philipCrunchboards @fersarr @rsheftel I just wanted to ask if you could take a look and check whether it addresses the issue you ran into - thanks 🙏 |
Looks good @MarcoGorelli. One thing I would add is a test that tries to convert a list of datetime strings with a mix of strings (some comply with the format, some don't, so I would expect the whole conversion to fail). I would also add a few ambiguous strings with month/day swapped (#12649 (comment)). |
thanks for taking a look!
Not sure what you mean here, sorry - isn't ISO8601 always month before day? If it doesn't hit this ISO8601 fast-path, then it should be addressed by PDEP0004 If you have a specific test case in mind I can check |
From your question and the fact that you renamed the ticket, I think there’s a misunderstanding. This ticket had nothing to do with ISO8601 compliant formats. |
Currently, it only does that if the For example, the following raises (
whereas the following doesn't (
I agree that to a user, this shouldn't make a difference. So the issue is about fixing the ISO8601 fast-path so that it also respects |
Oh right, so this is not a general problem, but one that only arises when ISO8601 formats are used. That explains a lot (in particular this lengthy discussion I had at the time). Thanks for clarifying!
I meant how an array of |
Code Sample, a copy-pastable example if possible
Expected Output
A ValueError exception, because the string '2012-01-01 10:00' (and all others of this series) does not comply with the format string '%Y-%m-%d' and exact conversion is required.
output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Darwin
OS-release: 15.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
pandas: 0.17.1
nose: 1.3.6
pip: 8.0.3
setuptools: 18.0
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.16.0
statsmodels: None
IPython: 3.0.0
sphinx: None
patsy: None
dateutil: 2.5.0
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.4.3
openpyxl: 2.2.2
xlrd: 0.9.3
xlwt: 0.8.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: None
html5lib: None
httplib2: 0.9.1
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
Jinja2: None