PERF: consider changing default of infer_datetime_format to True #12061
Comments
If someone would run the benchmarks with this turned on, we could prove whether this affects anything. |
@chris-b1 can you do a PR for this? and see if any of the asv timings change? |
@jreback - here's the general tradeoff: it adds some overhead (large relative, small absolute) to parsing iso8601, and of course it speeds up non-iso8601 a lot. Does that seem worth it?
In [35]: s = pd.date_range('1900-1-1', periods=1000).strftime('%Y-%m-%d')
In [36]: %timeit pd.to_datetime(s)
1000 loops, best of 3: 316 µs per loop
In [37]: %timeit pd.to_datetime(s, infer_datetime_format=True)
1000 loops, best of 3: 642 µs per loop
In [38]: s_noniso = pd.date_range('1900-1-1', periods=1000).strftime('%m/%d/%Y')
In [42]: %timeit pd.to_datetime(s_noniso)
10 loops, best of 3: 79.9 ms per loop
In [43]: %timeit pd.to_datetime(s_noniso, infer_datetime_format=True)
100 loops, best of 3: 3.82 ms per loop |
So is the iso parsing overhead just a couple of function calls? Can you update with 10x (or 100x) the size as well, to see how this scales? |
Below is this same test at different sizes. Infer only looks at the first element, so I'm not completely sure why the overhead isn't flat.
|
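To make the "infer only looks at the first element" point concrete, here is a minimal standard-library sketch of that strategy. It is an illustration only, not pandas' actual inference code: the names `CANDIDATE_FORMATS`, `guess_format`, and `parse_with_inferred_format` are hypothetical, and the short candidate list is an assumption made for the example.

```python
from datetime import datetime

# Hypothetical candidate list for illustration; real inference is more general.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y")


def guess_format(sample):
    """Return the first candidate format that parses `sample`, else None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(sample, fmt)
        except ValueError:
            continue
        return fmt
    return None


def parse_with_inferred_format(values):
    """Guess the format once from values[0], then reuse it for every element."""
    if not values:
        return []
    fmt = guess_format(values[0])
    if fmt is None:
        raise ValueError(f"could not infer a format from {values[0]!r}")
    # The guess is a one-time cost; the per-element work is a fixed-format parse.
    return [datetime.strptime(v, fmt) for v in values]


parsed = parse_with_inferred_format(["01/31/1900", "02/01/1900"])
```

In this sketch the guessing cost does not depend on the input length, which is why a flat overhead would be the naive expectation.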
would it be dumb to change the default to This way you get the perf benefit without overhead on smaller samples. |
@chris-b1 so what do you think? |
It didn't seem unreasonable - but I got to thinking - there are actually cases with ambiguous dates where
In [3]: pd.to_datetime(['24/1/2015', '1/2/2015'], infer_datetime_format=True)
Out[3]: DatetimeIndex(['2015-01-24', '2015-02-01'], dtype='datetime64[ns]', freq=None)
In [4]: pd.to_datetime(['24/1/2015', '1/2/2015'])
Out[4]: DatetimeIndex(['2015-01-24', '2015-01-02'], dtype='datetime64[ns]', freq=None) |
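The difference between the two outputs above can be reproduced with a small standard-library sketch. This is an assumption-laden illustration, not how pandas is implemented: `parse_each_independently` and `parse_infer_from_first` are hypothetical helpers, and the month-first preference with a day-first fallback is assumed for the example.

```python
from datetime import datetime

# Assumed preference order: month-first, then day-first.
FORMATS = ("%m/%d/%Y", "%d/%m/%Y")


def parse_each_independently(values):
    """Parse every value on its own, trying month-first before day-first."""
    out = []
    for v in values:
        for fmt in FORMATS:
            try:
                out.append(datetime.strptime(v, fmt))
                break
            except ValueError:
                continue
        else:
            raise ValueError(f"unparseable date: {v!r}")
    return out


def parse_infer_from_first(values):
    """Pick the format that fits values[0], then apply it to all values."""
    for fmt in FORMATS:
        try:
            datetime.strptime(values[0], fmt)
        except ValueError:
            continue
        return [datetime.strptime(v, fmt) for v in values]
    raise ValueError(f"could not infer a format from {values[0]!r}")


dates = ["24/1/2015", "1/2/2015"]
per_element = parse_each_independently(dates)  # second value read as Jan 2
inferred = parse_infer_from_first(dates)       # second value read as Feb 1
```

Because '24/1/2015' can only be day-first, the inferred format forces '1/2/2015' to be read as 1 February, while independent parsing reads it month-first as 2 January.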
ahh, so the |
As of PDEP4, this parameter is deprecated and a stricter version is the default - closing then
xref #12060