
PERF: consider changing default of infer_datetime_format to True #12061


Closed
chris-b1 opened this issue Jan 16, 2016 · 10 comments
Labels
Datetime Datetime data dtype Enhancement Performance Memory or execution speed performance

Comments

@chris-b1
Contributor

xref #12060

@jreback jreback added Datetime Datetime data dtype API Design IO CSV read_csv, to_csv labels Jan 16, 2016
@jreback jreback added this to the 0.18.0 milestone Jan 16, 2016
@jreback
Contributor

jreback commented Jan 16, 2016

If someone would run the benchmarks with this turned on, we could see whether this affects anything.

@jreback
Contributor

jreback commented Feb 9, 2016

@chris-b1 can you do a PR for this?

and see if any of the asv timings change?

@chris-b1
Contributor Author

@jreback - here's the general tradeoff: it adds some overhead (large in relative terms, small in absolute terms) to parsing ISO 8601 strings, and of course speeds up non-ISO 8601 parsing a lot. Does that seem worth it?

In [35]: s = pd.date_range('1900-1-1', periods=1000).strftime('%Y-%m-%d')

In [36]: %timeit pd.to_datetime(s)
1000 loops, best of 3: 316 µs per loop

In [37]: %timeit pd.to_datetime(s, infer_datetime_format=True)
1000 loops, best of 3: 642 µs per loop

In [38]: s_noniso = pd.date_range('1900-1-1', periods=1000).strftime('%m/%d/%Y')

In [42]: %timeit pd.to_datetime(s_noniso)
10 loops, best of 3: 79.9 ms per loop

In [43]: %timeit pd.to_datetime(s_noniso, infer_datetime_format=True)
100 loops, best of 3: 3.82 ms per loop

@jreback
Contributor

jreback commented Feb 10, 2016

So is the ISO parsing overhead just a couple of function calls? Can you update with 10x (or 100x) the size as well, to see how this scales?

@chris-b1
Contributor Author

Below is the same test at different sizes. diff is the increase (or decrease, when negative) in time with inference turned on; ratio is (infer) / (don't infer).

Inference only looks at the first element, so I'm not completely sure why the overhead isn't flat.

n        iso_diff (s)  iso_ratio  noniso_diff (s)  noniso_ratio
1        0.000282747   4.23485     0.00024401      2.40649
100      0.000324795   4.44211    -0.00697929      0.0946573
1000     0.000332079   2.02872    -0.0772536       0.0462967
100000   0.00309632    1.11218    -7.62098         0.0440372
1000000  0.0303454     1.10112    -82.5801         0.0421022
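
For readers who want to check the derived columns, the following minimal sketch recomputes diff and ratio from a pair of timings. It uses the 1000-element `%timeit` numbers quoted earlier in this thread as input; those came from a separate run, so the results only approximately match the n = 1000 row. The helper name `diff_and_ratio` is made up for illustration.

```python
def diff_and_ratio(t_default, t_infer):
    """Return (diff, ratio): infer minus default, and infer divided by default."""
    return t_infer - t_default, t_infer / t_default

# Timings in seconds, copied from the earlier %timeit output for 1000 elements.
iso_diff, iso_ratio = diff_and_ratio(316e-6, 642e-6)
noniso_diff, noniso_ratio = diff_and_ratio(79.9e-3, 3.82e-3)

print(f"iso:    diff={iso_diff:.6f} s, ratio={iso_ratio:.3f}")
print(f"noniso: diff={noniso_diff:.6f} s, ratio={noniso_ratio:.4f}")
```

A ratio above 1 means inference made parsing slower (the ISO case); a ratio below 1 means it made parsing faster (the non-ISO case).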

@jreback
Contributor

jreback commented Feb 10, 2016

Would it be dumb to change the default to None, then set it based on the length of the passed array? E.g. say <= 100 -> False, > 100 -> True. (Of course, if it's explicitly passed then just use that.)

This way you get the perf benefit without overhead on smaller samples.
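
The proposal above could be sketched roughly as follows. This is only an illustration of the idea, not code from pandas: the helper name `resolve_infer` and the `threshold` parameter are assumptions, and 100 is simply the cutoff suggested in the comment.

```python
def resolve_infer(values, infer_datetime_format=None, threshold=100):
    """Decide whether to infer the format (hypothetical helper, not a pandas API).

    An explicit True/False always wins; None defers to the input length.
    """
    if infer_datetime_format is not None:
        return bool(infer_datetime_format)
    # Short inputs skip inference to avoid its fixed overhead.
    return len(values) > threshold

small = ["2015-01-01"] * 100
large = ["2015-01-01"] * 101

print(resolve_infer(small))         # False: at or below the threshold
print(resolve_infer(large))         # True: above the threshold
print(resolve_infer(large, False))  # False: explicit value is respected
```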

@jreback
Contributor

jreback commented Feb 15, 2016

@chris-b1 so what do you think?

@chris-b1
Contributor Author

It didn't seem unreasonable - but I got to thinking - there are actually cases with ambiguous dates where False and True don't do the same thing (e.g. below). So now I'm not sure what to do.

In [3]: pd.to_datetime(['24/1/2015', '1/2/2015'], infer_datetime_format=True)
Out[3]: DatetimeIndex(['2015-01-24', '2015-02-01'], dtype='datetime64[ns]', freq=None)

In [4]: pd.to_datetime(['24/1/2015', '1/2/2015'])
Out[4]: DatetimeIndex(['2015-01-24', '2015-01-02'], dtype='datetime64[ns]', freq=None)
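
The difference between the two results can be reproduced with the standard library alone. The sketch below is a simplified model, not pandas' actual implementation: it assumes only two candidate formats, tried month-first and then day-first, and all function names are made up for illustration. One strategy fixes a single format from the first element; the other picks a format for each element independently.

```python
from datetime import datetime

# Candidate formats, month-first tried before day-first (an assumption of this sketch).
CANDIDATES = ("%m/%d/%Y", "%d/%m/%Y")

def guess_format(text):
    """Return the first candidate format that parses `text`."""
    for fmt in CANDIDATES:
        try:
            datetime.strptime(text, fmt)
            return fmt
        except ValueError:
            continue
    raise ValueError(f"no candidate format matches {text!r}")

def parse_with_inferred_format(values):
    """Infer one format from the first element and apply it to every element."""
    fmt = guess_format(values[0])
    return [datetime.strptime(v, fmt) for v in values]

def parse_each_independently(values):
    """Choose a format separately for each element."""
    return [datetime.strptime(v, guess_format(v)) for v in values]

values = ["24/1/2015", "1/2/2015"]
inferred = [d.date().isoformat() for d in parse_with_inferred_format(values)]
independent = [d.date().isoformat() for d in parse_each_independently(values)]
print(inferred)     # ['2015-01-24', '2015-02-01']
print(independent)  # ['2015-01-24', '2015-01-02']
```

Because '24/1/2015' can only be day-first, the inferred format is day-first for the whole input, so '1/2/2015' becomes 1 February. Parsed on its own, '1/2/2015' fits the month-first format and becomes 2 January.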

@jreback
Contributor

jreback commented Feb 15, 2016

Ahh, so yearfirst & dayfirst may not be respected. So I think these need to be of 'standard' format (e.g. dayfirst=False and yearfirst=False), or we should raise if infer_datetime_format=True as these are incompatible (or just make it False).
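
One possible reading of this suggestion, sketched as a standalone check. The function name `check_infer_options` and the `strict` switch are hypothetical and exist only to show both alternatives mentioned above (raise, or quietly turn inference off); nothing here is taken from the pandas code base.

```python
def check_infer_options(infer_datetime_format, dayfirst=False, yearfirst=False, strict=True):
    """Return the effective infer setting (hypothetical helper, not a pandas API).

    If inference is requested together with dayfirst/yearfirst, either raise
    (strict=True) or fall back to no inference (strict=False).
    """
    if infer_datetime_format and (dayfirst or yearfirst):
        if strict:
            raise ValueError(
                "infer_datetime_format=True cannot be combined with dayfirst or yearfirst"
            )
        return False
    return bool(infer_datetime_format)

print(check_infer_options(True))                               # True
print(check_infer_options(True, dayfirst=True, strict=False))  # False
```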

@jreback jreback modified the milestones: 0.18.1, 0.18.0 Feb 27, 2016
@jreback jreback modified the milestones: 0.18.1, 0.18.2 Apr 25, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.20.0, 0.19.0 Aug 15, 2016
@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017
@mroeschke mroeschke added Performance Memory or execution speed performance and removed IO CSV read_csv, to_csv labels Mar 31, 2020
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@MarcoGorelli
Member

As of PDEP4, this parameter is deprecated and a stricter version is the default - closing then.
