Determine datetime_format from more than one element #15075

j-abi · 2017-01-06T13:28:16Z

Code Sample, a copy-pastable example if possible

import pandas as pd

series_1 = pd.Series(['13/03/2000', '01/03/2000', '02/03/2000'])  # dates in European format
series_2 = pd.Series(['01/03/2000', '13/03/2000', '02/03/2000'])  # same dates, reordered

converted_1 = pd.to_datetime(series_1, infer_datetime_format=True)
converted_2 = pd.to_datetime(series_2, infer_datetime_format=True)

print sorted(series_1) == sorted(series_2)  # True
print sorted(converted_1) == sorted(converted_2)  # False

Problem description

One would hope to obtain the same result when applying pd.to_datetime to the same series twice, but shuffled.

Expected Output

My understanding is that the format determined when setting infer_datetime_format=True is obtained from the first non-null value of the Series (see function _guess_datetime_format_for_array in tseries.tools), which explains the result above. I understand this logic in terms of optimizing the operation, and it does work as expected most of the time.
However, I feel like the example provided above is fairly generic. Ideally, the function would find the best datetime_format for the entire Series. Any ideas on how to implement this ?

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 2.7.12.final.0 python-bits: 64 OS: Linux OS-release: 4.4.0-57-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.utf8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.11.3
scipy: 0.18.0
statsmodels: None
xarray: None
IPython: 2.4.1
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

jreback · 2017-01-06T13:33:31Z

xref #3341

you need to be explicit. with ambiguous dayfirst dates.

In [19]: series_1 = pd.Series(['13/03/2000', '01/03/2000', '02/03/2000'])  # dates in European format
    ...: series_2 = pd.Series(['01/03/2000', '13/03/2000', '02/03/2000'])  # same dates, reordered
    ...:
    ...: converted_1 = pd.to_datetime(series_1, dayfirst=True)
    ...: converted_2 = pd.to_datetime(series_2, dayfirst=True)
    ...:
    ...:

In [20]: converted_1
Out[20]:
0   2000-03-13
1   2000-03-01
2   2000-03-02
dtype: datetime64[ns]

In [21]: converted_2
Out[21]:
0   2000-03-01
1   2000-03-13
2   2000-03-02
dtype: datetime64[ns]

j-abi · 2017-01-06T14:39:08Z

I understand that dayfirst=True solves my issue ... if I already know whether my dates are in European format.
The point of this feature request was not to report a bug but, more generally, whether we would be able to determine this automatically based on the input Series: the dates individually are ambiguous, but the Series as a whole is not.
Perhaps this is not feasible within reasonable computation cost, but as far as I am concerned it would be a big improvement.

jreback · 2017-01-06T14:43:38Z

see this issue: #12585

we have explicity discussed this. its is just waiting for someone to come along and do it. its not very difficult actually, pull -requests are welcome.

jorisvandenbossche · 2017-01-06T14:54:26Z

@jreback It is not completely related to issue #12585.
I suspect that in pd.to_datetime(series_2, infer_datetime_format=True), the inferring of the format fails (or at least, you get an error with the inferred format based on the first date), and as a result it falls back to default pd.to_datetime(series_2) parsing with dateutil, and this has indeed the known inconsistent parsing -> #12585

But here, there are also other issues:

what @abiteboul asks: can the inferring of the datetime be based on more than the first value? (so the format inference would not fail in this case). Should look it up in the original issues/PRs discussing this feature, but probably performance is a good reason to not do this.
The failure to infer the datetime format is silent in this case, and for this reason the inconsistent parsing is maybe even more suprising, because the user thinks a format has been inferred.

jreback · 2017-01-06T15:12:17Z

yeah I suppose the format could be guess better here, but (maybe take up to three elements) and see if they match. BUT you still can contrive an example where this fails, so I would not be in favor of changing this, rather having the parser raise if the day/year first are violated, which is #12585

j-abi · 2017-01-06T15:13:24Z

I suspect that in pd.to_datetime(series_2, infer_datetime_format=True), the inferring of the format fails (or at least, you get an error with the inferred format based on the first date), and as a result it falls back to default pd.to_datetime(series_2) parsing with dateutil, and this has indeed the known inconsistent parsing -> #12585

That is correct, I tried checking with timeit and it does seem that with the second series the inferring fails (on bigger series than in my example, setting infer_datetime_format=True leads to a ~50x speedup for the first series, and no speedup for the second).
Since this option is mainly there for speedup, it probably makes sense not to infer from more than the first value, and maybe some kind of warning would be appropriate when the inferring fails.

If we agree on this point, then I guess my issue is indeed basically the same as #12585. Sorry for mixing up two different problems, and thanks for your comments.

jreback closed this as completed Jan 6, 2017

jreback added Datetime Datetime data dtype Usage Question labels Jan 6, 2017

jreback added this to the No action milestone Jan 6, 2017

jorisvandenbossche mentioned this issue Jan 6, 2017

Inconsistent date parsing of to_datetime #12585

Closed

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Determine datetime_format from more than one element #15075

Determine datetime_format from more than one element #15075

j-abi commented Jan 6, 2017

jreback commented Jan 6, 2017

j-abi commented Jan 6, 2017

jreback commented Jan 6, 2017

jorisvandenbossche commented Jan 6, 2017

jreback commented Jan 6, 2017

j-abi commented Jan 6, 2017

Determine datetime_format from more than one element #15075

Determine datetime_format from more than one element #15075

Comments

j-abi commented Jan 6, 2017

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

jreback commented Jan 6, 2017

j-abi commented Jan 6, 2017

jreback commented Jan 6, 2017

jorisvandenbossche commented Jan 6, 2017

jreback commented Jan 6, 2017

j-abi commented Jan 6, 2017

Output of `pd.show_versions()`