PERF: add exact kw to to_datetime to enable faster regex format parsing #8904

jreback · 2014-11-27T02:48:55Z

closes #8989
closes #8903

The default is exact=True for back-compat.
With exact=False, the match still starts at the beginning of the string (it has always been like this),
but the string is not required to ONLY match the format; IOW, there can be extra stuff after the match.

Avoids having to do a regex replace first.

In [21]: s = Series(['19MAY11','19MAY11:00:00:00']*100000)

In [22]: %timeit pd.to_datetime(s.str.replace(':\S+$',''),format='%d%b%y')
1 loops, best of 3: 828 ms per loop

In [23]: %timeit pd.to_datetime(s,format='%d%b%y',exact=False)
1 loops, best of 3: 603 ms per loop
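
For reference, the two-pass workaround timed in In [22] can be sketched with the standard library alone. This is only an illustration of the "regex replace first, then strict parse" idea; it uses `re.sub` and `datetime.strptime` rather than the pandas code paths, and the variable names are made up for the example.

```python
import re
from datetime import datetime

values = ["19MAY11", "19MAY11:00:00:00"]

# Pass 1: drop everything from the first ':' to the end of the string.
trimmed = [re.sub(r":\S+$", "", v) for v in values]

# Pass 2: strict parse; every string must now match the format completely.
dates = [datetime.strptime(v, "%d%b%y") for v in trimmed]
```

With exact=False the first pass is not needed, which is where the saving comes from.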

jorisvandenbossche · 2014-11-27T14:19:39Z

Is the proposed keyword exact or complete?
Can the match be anywhere (not just at the start)? I suppose the most common use case is with / without a time part, subsecond part, or timezone part (endings).

jreback · 2014-11-27T18:02:33Z

Yeh, changing back to exact. Also going to change it to do a search rather than a match if exact=False, so it can find the date anywhere in the string.

Actually, there are other use-cases for this:

In [4]: s = Series(['19MAY11','foobar19MAY11','19MAY11:00:00:00','19MAY11 00:00:00Z'])

In [5]: s
Out[5]: 
0              19MAY11
1        foobar19MAY11
2     19MAY11:00:00:00
3    19MAY11 00:00:00Z
dtype: object

In [6]: pd.to_datetime(s,format='%d%b%y',exact=False)
Out[6]: 
0   2011-05-19
1   2011-05-19
2   2011-05-19
3   2011-05-19
dtype: datetime64[ns]

jreback · 2014-12-01T00:30:15Z

@jorisvandenbossche ?

jreback · 2014-12-03T22:17:18Z

@jorisvandenbossche ?

jorisvandenbossche · 2014-12-03T23:02:42Z

pandas/tseries/tools.py

@@ -195,6 +195,9 @@ def to_datetime(arg, errors='ignore', dayfirst=False, utc=None, box=True,
        If True returns a DatetimeIndex, if False returns ndarray of values
    format : string, default None
        strftime to parse time, eg "%d/%m/%Y"
+    exact : boolean, True by default
+        if True, require an exact format match


Can you add punctuation and capitals here? (When rendered to HTML, Sphinx does not keep the newline, so the visual formatting does not work there.)

jorisvandenbossche · 2014-12-03T23:15:33Z

Does this also work for an example that currently fails (reading in nanoseconds when specifying a format)?

In [11]: pd.to_datetime("2012-01-01 09:00:00.000000001", format="%Y-%m-%d %H:%M:%S.%f")
-> 
ValueError: unconverted data remains: 001

jorisvandenbossche · 2014-12-03T23:19:58Z

doc/source/whatsnew/v0.15.2.txt

@@ -79,6 +79,8 @@ Performance
 ~~~~~~~~~~~
 - Reduce memory usage when skiprows is an integer in read_csv (:issue:`8681`)

+- Performance boost for ``to_datetime`` conversions with a passed ``format=`` kw, and the new ``exact=False`` for kw with non-exact matching (:issue:`8904`)


Can you add the new exact keyword as a separate entry in the enhancements section?

jorisvandenbossche · 2014-12-03T23:20:20Z

for the keyword name: match is also a possibility, but not sure it is better

Out of curiosity, where does the speed-up come from? The extra Cython type declarations?

jreback · 2014-12-03T23:25:18Z

@jorisvandenbossche the speedup comes from not having to do a .str.extract(.....) first and then parse the result normally. Currently the regex parser uses .match (and allows nothing trailing). This allows it to use .search.
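
A minimal sketch of the .match vs .search distinction, assuming a hand-written regex as a stand-in for whatever the '%d%b%y' format compiles to. `PATTERN` and `parse_ddmonyy` are hypothetical names for illustration, not the pandas internals.

```python
import re
from datetime import datetime

# Hand-written stand-in for the regex a '%d%b%y' format compiles to (assumption).
PATTERN = re.compile(r"(?P<d>\d{2})(?P<b>[A-Za-z]{3})(?P<y>\d{2})")


def parse_ddmonyy(value, exact=True):
    """Parse a '%d%b%y' date out of value; hypothetical helper, not pandas code."""
    if exact:
        # Anchor at the start and reject anything left over after the match.
        found = PATTERN.match(value)
        if found is None or found.end() != len(value):
            raise ValueError(f"no exact match for {value!r}")
    else:
        # Scan for the first occurrence anywhere in the string.
        found = PATTERN.search(value)
        if found is None:
            raise ValueError(f"no match in {value!r}")
    return datetime.strptime(found.group(0), "%d%b%y")


samples = ["19MAY11", "foobar19MAY11", "19MAY11:00:00:00", "19MAY11 00:00:00Z"]
parsed = [parse_ddmonyy(s, exact=False) for s in samples]
```

In this sketch, all four samples parse to 2011-05-19 with exact=False, while exact=True only accepts the first one.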

jreback · 2014-12-03T23:25:49Z

ok with match rather than exact

jreback · 2014-12-03T23:27:33Z

This works; is there an issue about this?

In [3]: pd.to_datetime("2012-01-01 09:00:00.000000001", format="%Y-%m-%d %H:%M:",exact=False)
Out[3]: Timestamp('2012-01-01 09:00:00')

jorisvandenbossche · 2014-12-03T23:31:10Z

No, no issue I think, and using this keyword solves the error, but of course not the problem that the nanoseconds are not read in. I don't know if we can do something about that, as this is defined by the Python format strings, where %f means microseconds.
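
To illustrate the %f limit outside pandas, here is a small standard-library sketch. `datetime.strptime` accepts at most six fractional digits for %f, so a nine-digit fraction leaves data unconverted. `parse_with_nanos` is a hypothetical helper that splits the fraction off by hand; it is not how pandas handles this.

```python
from datetime import datetime

text = "2012-01-01 09:00:00.000000001"

# %f accepts at most six fractional digits, so the last three are left over.
try:
    datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")
    error = None
except ValueError as exc:
    error = str(exc)


def parse_with_nanos(value):
    """Hypothetical helper: return (datetime to the second, nanoseconds)."""
    head, _, fraction = value.partition(".")
    whole = datetime.strptime(head, "%Y-%m-%d %H:%M:%S")
    # Right-pad to nine digits so '.5' means 500_000_000 ns.
    nanos = int(fraction.ljust(9, "0")[:9]) if fraction else 0
    return whole, nanos


whole, nanos = parse_with_nanos(text)
```

Since `datetime` itself only stores microseconds, the sketch returns the nanoseconds as a separate integer.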

jreback · 2014-12-03T23:33:55Z

You don't actually need to use a format string. But we could easily change the 'f' to allow for nanoseconds, I think. Let me see.

In [4]: pd.to_datetime("2012-01-01 09:00:00.000000001")
Out[4]: Timestamp('2012-01-01 09:00:00.000000001')

jorisvandenbossche · 2014-12-03T23:36:26Z

vb_suite/timeseries.py

+"""
+timeseries_with_format = Benchmark("to_datetime(s,format='%d%b%y',exact=False)", \
+     setup, start_date=datetime(2014, 11, 26))
+timeseries_with_format_replace = Benchmark("to_datetime(s.str.replace(':\S+$',''),format='%d%b%y')", \


maybe timeseries_with_format_exact is a better name? (the replace does not really matter for what it is about)

jorisvandenbossche · 2014-12-03T23:37:52Z

@jreback yes, I know, the format string is not needed in this example because I used a default format, but in general that does not always hold.

jorisvandenbossche · 2014-12-03T23:42:09Z

opened an issue for the nanoseconds #8989

jorisvandenbossche · 2014-12-03T23:44:39Z

And I copied only part of the line. I meant pd.to_datetime("2012-01-01 09:00:00.000000001", format="%Y-%m-%d %H:%M:%S.%f") above

jreback · 2014-12-05T01:59:57Z

@jorisvandenbossche ok on the kw?

jorisvandenbossche · 2014-12-05T14:12:10Z

You mean exact vs match? No strong opinion; exact seems a bit more self-explanatory, match is a bit more in line with the regex naming.

jreback · 2014-12-05T14:32:49Z

ok, merging then.

jreback added the API Design, Strings, Datetime, and Performance labels Nov 27, 2014

jreback added this to the 0.15.2 milestone Nov 27, 2014

jreback force-pushed the timere branch from f001999 to 1578ba6 Compare November 27, 2014 03:13

jorisvandenbossche mentioned this pull request Nov 27, 2014

ENH/PERF: new kw complete to to_datetime to allow partial matches on a datetime format #8903

Closed

jreback force-pushed the timere branch 4 times, most recently from 8c7f505 to cc8a3de Compare November 30, 2014 17:24

jorisvandenbossche reviewed Dec 3, 2014

jorisvandenbossche mentioned this pull request Dec 3, 2014

to_datetime with format string cannot read nanoseconds #8989

Closed

jreback force-pushed the timere branch 5 times, most recently from 2e165e2 to 5d972d0 Compare December 4, 2014 11:19

jreback added 2 commits December 4, 2014 22:06

PERF: add exact kw to to_datetime to enable faster regex format parsing for datetimes (GH8904)

ea2489d

BUG: fix GH8989 to parse nanoseconds with %f format

d6e4337

jreback force-pushed the timere branch from 5d972d0 to d6e4337 Compare December 5, 2014 03:07

jreback added a commit that referenced this pull request Dec 5, 2014

Merge pull request #8904 from jreback/timere

6f7f5f8

PERF: add exact kw to to_datetime to enable faster regex format parsing

jreback merged commit 6f7f5f8 into pandas-dev:master Dec 5, 2014

jgehrcke mentioned this pull request Nov 7, 2019

Timestamp.strftime(): missing support for nanoseconds #29461

Open
