BUG: Unexpected pd.to_datetime result #12583

Closed
dolaameng opened this issue Mar 10, 2016 · 7 comments
Labels
Datetime (Datetime data dtype), Duplicate Report (Duplicate issue or pull request)

Comments

@dolaameng
Code Sample, a copy-pastable example if possible

pd.to_datetime(["31/12/2014", "10/03/2011"])
Out[37]:
DatetimeIndex(['2014-12-31', '2011-10-03'], dtype='datetime64[ns]', freq=None)

Expected Output

DatetimeIndex(['2014-12-31', '2011-03-10'], dtype='datetime64[ns]', freq=None)

_Reason: the default behavior (without an explicit format or dayfirst argument) should parse dates consistently within the same column._

Comments

Within the internal helper function pd.tseries.tools._to_datetime, the datetime format is first inferred from the first non-NaN element and then used for parsing via tslib.array_strptime (Line 357), which gives the expected result:

pd.tseries.tools._guess_datetime_format_for_array(["31/12/2014", "10/03/2011"])
Out[25]:
'%d/%m/%Y'


pd.tslib.array_strptime(pd.tseries.tools.com._ensure_object(["31/12/2014", "10/03/2011"]), "%d/%m/%Y")
Out[46]:
array(['2014-12-31T08:00:00.000000000+0800',
       '2011-03-10T08:00:00.000000000+0800'], dtype='datetime64[ns]')

But the fallback function tslib.array_to_datetime doesn't use the inferred format (Line 373):

pd.tslib.array_to_datetime(pd.tseries.tools.com._ensure_object(["31/12/2014", "10/03/2011"]), format="%d/%m/%Y")
Out[51]:
array(['2014-12-31T08:00:00.000000000+0800',
       '2011-10-03T08:00:00.000000000+0800'], dtype='datetime64[ns]')

_This is caused by the default value (False) of the infer_datetime_format parameter of _to_datetime (Line 285)._

_Suggestion: make infer_datetime_format default to True in _to_datetime. It won't solve all problems, but at least half of them, depending on the format of the first element._

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Linux
OS-release: 3.8.13-44.el6uek.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.1
nose: 1.3.7
pip: 8.0.2
setuptools: 19.6.2
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
IPython: 4.0.3
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
Jinja2: None
@dolaameng
Author

I also suggest at least warning users about this behavior in the online documentation.

@crazycodingcat
Agree with the warning suggestion.

A common-sense assumption is that the date format within a single column is consistent. Currently, however, pandas infers the format of each date independently and silently, without checking whether all dates were parsed with the same format. This is a pitfall for users who do not carefully check the parsing results. A consistency check with a warning would be very helpful.
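
As a rough illustration of the kind of check meant here, the sketch below flags strings that are valid under both a day-first and a month-first reading but yield different dates. It uses only the standard library; the function name `ambiguous_dates` and the two hard-coded formats are assumptions made for this example and are not part of pandas.

```python
from datetime import datetime


def parse_or_none(text, fmt):
    """Parse text with fmt, returning None instead of raising on a mismatch."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None


def ambiguous_dates(values, day_first="%d/%m/%Y", month_first="%m/%d/%Y"):
    """Return the values that parse under both formats to different dates.

    Hypothetical helper for illustration only; it is not a pandas API.
    """
    flagged = []
    for value in values:
        as_day_first = parse_or_none(value, day_first)
        as_month_first = parse_or_none(value, month_first)
        if as_day_first and as_month_first and as_day_first != as_month_first:
            flagged.append(value)
    return flagged


# "31/12/2014" is only valid day-first, "05/05/2015" reads the same both ways,
# so only "10/03/2011" is reported as ambiguous.
flagged = ambiguous_dates(["31/12/2014", "10/03/2011", "05/05/2015"])
print(flagged)
```

A non-empty result does not prove the column was parsed incorrectly; it only marks the rows worth checking by hand or worth parsing with an explicit format.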

@jorisvandenbossche
Member

This is a known issue with to_datetime, caused by the (excessive) flexibility of dateutil. You can 'solve' it by providing a format argument.
We should probably recommend more strongly in the docs that you pass a format argument if you want to be sure of consistent parsing.
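
A minimal sketch of that workaround, using the two strings from the report. Both `format` and `dayfirst` are documented parameters of `pd.to_datetime`; the variable names are just for this example.

```python
import pandas as pd

dates = ["31/12/2014", "10/03/2011"]

# An explicit format removes the ambiguity: every string is read as day/month/year.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

# dayfirst=True is a looser alternative: it states a preference for day-first
# parsing without fixing the exact layout of the strings.
parsed_dayfirst = pd.to_datetime(dates, dayfirst=True)

print(parsed)
print(parsed_dayfirst)
```

With an explicit format, a string that does not match raises an error by default instead of being reinterpreted silently, which is usually what you want for a column that is supposed to be uniform.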

See also #12501, #7348 and #3341

And we are indeed considering changing the default of infer_datetime_format to True, see #12061 for that.

As far as I know, raising a warning is currently not possible, because pandas just passes the strings to dateutil and does not know how dateutil parsed them (possibly inconsistently). The flexibility of dateutil is also a feature of to_datetime, since it makes it possible to handle messy datetime columns.
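
To make the dateutil behavior concrete, here is a small sketch that calls `dateutil.parser.parse` directly on the two strings from the report. It only illustrates dateutil's per-string guessing; it is not how pandas invokes dateutil internally.

```python
from datetime import datetime

from dateutil import parser

# dateutil prefers month-first, but falls back to day-first when the first
# field cannot be a month (31 > 12). Each string is judged on its own.
first = parser.parse("31/12/2014")
second = parser.parse("10/03/2011")

# Passing dayfirst=True changes the reading of the ambiguous string.
second_dayfirst = parser.parse("10/03/2011", dayfirst=True)

print(first.date(), second.date(), second_dayfirst.date())
```

The caller gets back only the resulting datetime, with no indication of which field order was chosen, which is why a consistency warning is hard to add on top of this path.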

@jorisvandenbossche added the Datetime (Datetime data dtype) and Duplicate Report (Duplicate issue or pull request) labels Mar 10, 2016
@dolaameng
Author

Thanks for linking to the previous issues. I am thinking of the heuristics used by the readr package in R, which infers column types and date formats from the first 100 rows instead of a single one. Could this be done before passing an explicit format to dateutil?
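
A rough sketch of what such a sample-based heuristic could look like, using only the standard library. This is neither readr's actual algorithm nor anything pandas implements; the function name `guess_format`, the candidate list and the `sample_size` parameter are all assumptions made for illustration.

```python
from datetime import datetime
from itertools import islice

# Hypothetical shortlist of layouts to try, month-first listed first to mimic
# a month-first default.
CANDIDATE_FORMATS = ("%m/%d/%Y", "%d/%m/%Y", "%Y-%m-%d")


def guess_format(values, sample_size=100, candidates=CANDIDATE_FORMATS):
    """Return the first candidate format that parses every sampled value.

    Returns None if the sample is empty or no candidate fits all of it.
    """
    sample = [v for v in islice(values, sample_size) if v]
    if not sample:
        return None
    for fmt in candidates:
        try:
            for value in sample:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # at least one sampled value does not fit this format
        return fmt
    return None


dates = ["10/03/2011", "31/12/2014"]

# Looking at one row only, the ambiguous first value is read month-first.
one_row = guess_format(dates, sample_size=1)

# Looking at both rows, "31/12/2014" rules out month-first.
both_rows = guess_format(dates)

print(one_row, both_rows)
```

The guessed format could then be passed explicitly as the format argument of to_datetime. A sample can still be entirely ambiguous, so this narrows the problem rather than removing it.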

@jorisvandenbossche
Member

There could maybe be improvements to infer_datetime_format (indeed, a single row is often not enough to infer day-first).
Note that no format is passed to dateutil. When there is a format (passed or inferred), pandas does the parsing itself; dateutil is used only when there is no fixed format.

I made an overview issue for this: #12585

BTW, if you want to make some clarifications to the docs for now, PRs are also very welcome!

@dolaameng
Author

Proper documentation sounds like a good temporary solution. I will try to help if possible.

@jreback added this to the 0.18.1 milestone Mar 29, 2016
@jreback modified the milestones: 0.18.2, 0.18.1 Apr 26, 2016
@jorisvandenbossche
Member

Closing in favor of the master issue #12585

@jorisvandenbossche modified the milestones: No action, 0.19.0 Aug 13, 2016

4 participants