
PERF: speed up pd.to_datetime and co. by extracting dt format from data and using strptime to parse #5490


Closed · lexual opened this issue Nov 11, 2013 · 14 comments · Fixed by #6021
Labels: API Design, Performance
Comments

@lexual (Contributor) commented Nov 11, 2013

I had a series containing strings like these:

"November 1, 2013"

Series length was about 500,000

A)
running pd.to_datetime(s) takes just over a minute.
B)
running pd.to_datetime(s, format="%B %d, %Y") takes about 7 seconds!
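
Roughly, to reproduce (IPython session; timings as reported above):

import pandas as pd

s = pd.Series(["November 1, 2013"] * 500000)

%timeit pd.to_datetime(s)                      # case A: just over a minute
%timeit pd.to_datetime(s, format="%B %d, %Y")  # case B: about 7 seconds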

My suggestion is a way to make case A (where user doesn't specify the format type) take about as long as case B (user does specify).

Basically it looks like the code is always using the dateutil parser for case A.

My suggestion is based upon the idea that it's highly likely that the date strings are all in a consistent format (it's highly unlikely in this case that they would be in 500K separate formats!).

In a nutshell:

  • figure out the date format of the first entry.
  • try to use that against the entire series, using the speedy code in tslib.array_strptime.
  • if that works, we've saved heaps of time; if not, fall back to the current slower behaviour of calling dateutil's parser on each string.

Here's some pseudo-code:

datestr1 = s[0]
# I'm assuming dateutil has something like this, that can tell you what
# the format is for a given date string.
date_format = figure_out_datetime_format(datestr1)

try:
    # use the super-speedy code that pandas uses when you tell it what
    # the format is.
    dt_series = tslib.array_strptime(s, date_format)
except ValueError:
    # date strings aren't consistent after all; let's do it the old slow way.
    dt_series = tslib.array_to_datetime(s)

return dt_series

@jreback (Contributor) commented Nov 11, 2013

you could allow format='infer' to support this, keeping in mind that dayfirst/yearfirst will then matter (and could potentially make the format ambiguous to interpret)
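
To make the ambiguity concrete (a quick illustration, not part of the proposal): the same string yields two different dates, and hence two different inferred format strings (%m/%d/%Y vs. %d/%m/%Y), depending on dayfirst.

import dateutil.parser

s = "01/02/2013"
print dateutil.parser.parse(s)                 # 2013-01-02, month first (the default)
print dateutil.parser.parse(s, dayfirst=True)  # 2013-02-01, day first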

@danbirken (Contributor)

This is a really smart idea, and I think this would be a huge speed-up for a common use case. +1

@cancan101 (Contributor)

You do want to deal with the first entry being NaT / None.
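
A minimal sketch of that guard (the helper name is made up; scan for the first usable string instead of blindly taking s[0]):

def first_usable_entry(values):
    # skip leading None/NaN/NaT entries; return the first actual string,
    # or None if the series holds nothing parseable
    for v in values:
        if isinstance(v, basestring):
            return v
    return None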

@lexual (Contributor, Author) commented Jan 13, 2014

Perhaps the way to do it would be a "format" parameter (better name suggestions welcome) that is set to either "heterogeneous" or "homogeneous" (i.e. different date formats vs. all date formats the same).

If heterogeneous, the current behaviour occurs; if homogeneous, use the speedy algorithm from above, where once you infer the format of the first date, you use it for all datetime entries.

@ghost commented Jan 18, 2014

This is a promising idea, worth pursuing. So I did, in the context of pd.read_csv, which helpfully supports a date_parser argument that lets you pass in a tailored parser, in the form of a call to strptime with a hardwired format string.

Disregarding the stealth (#5993) fast-path for iso8601 strings, measurement shows a potential 4x speedup for large data.

dateutil's parser is not amenable to producing a format string, as it consumes and backtracks through the string. However, it does provide a handy lexer function for dt strings (_timelex.split(tzstr)). By reverse-mapping the fields of the returned datetime object onto the tokens, this could be used to reconstruct a format string, which could then be used to speed up the rest of the process, at least for a good-sized chunk of cases.

PRs welcome.

import datetime as dt
import dateutil.parser
import pandas as pd

N = 10000
r = pd.date_range("2000-01-01", "2013-01-01", freq="H")[:N]
df = pd.DataFrame(range(N), index=r)
df.to_csv("/tmp/1.csv")
def f1(s):
    return dt.datetime.strptime(s,"%Y-%m-%d %H:%M:%S")
def f2(s):
    return dateutil.parser.parse(s)
print "Measuring iso8601"
%timeit f1("2000-01-01 00:01:02")
%timeit f2("2000-01-01 00:01:02")
%timeit x1=pd.read_csv("/tmp/1.csv",parse_dates=True,index_col=0)
%timeit x2=pd.read_csv("/tmp/1.csv",parse_dates=True,index_col=0,date_parser=f1)
%timeit x3=pd.read_csv("/tmp/1.csv",parse_dates=True,index_col=0,date_parser=f2)
#assert x1==x2==x3
# now try with non-iso-8601

with open("/tmp/1.csv","wb") as f:
    f.write("12/1/2013 00:01:02,1\n"*N)

def f1a(s):
    return dt.datetime.strptime(s,"%d/%m/%Y %H:%M:%S")

print "\nMeasuring an innovative dt format string"
%timeit f1a("12/1/2013 00:01:02")
%timeit f2("12/1/2013 00:01:02")
%timeit x1=pd.read_csv("/tmp/1.csv",parse_dates=True,index_col=0)
%timeit x2=pd.read_csv("/tmp/1.csv",parse_dates=True,index_col=0,date_parser=f1a)
%timeit x3=pd.read_csv("/tmp/1.csv",parse_dates=True,index_col=0,date_parser=f2)
#assert x1==x2==x3

10000 loops, best of 3: 24.2 µs per loop # strptime is faster than
10000 loops, best of 3: 95.2 µs per loop # dateutil.parse by 4x
100 loops, best of 3: 14.1 ms per loop # but read_csv has a fastpath for iso8601
1 loops, best of 3: 259 ms per loop
1 loops, best of 3: 956 ms per loop

Measuring a non-iso8601 dt format string
10000 loops, best of 3: 24 µs per loop # in the general case
10000 loops, best of 3: 96.8 µs per loop # 4x
1 loops, best of 3: 1.1 s per loop # and read_csv falls back to dateutil.parse() (*)
1 loops, best of 3: 256 ms per loop # in which case strptime yields a nice 3-4x speedup.
1 loops, best of 3: 1.17 s per loop #(*)

@jreback (Contributor) commented Jan 18, 2014

FYI there is also a stealth fast-path for the format '%Y%m%d', e.g. '20130101', as it can be converted directly to an int64 and processed that way (there is a commit for it but I don't remember which)

In [4]: s = Series(date_range('20000101',periods=10000)).apply(lambda x: x.strftime('%Y%m%d'))

In [5]: %timeit pd.to_datetime(s)
1 loops, best of 3: 518 ms per loop

In [6]: %timeit pd.to_datetime(s,format='%Y%m%d')
100 loops, best of 3: 10.5 ms per loop
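
The idea behind such a fastpath, roughly (a sketch of the general technique in numpy, not the actual pandas code): once the strings are cast to int64, year/month/day fall out with integer arithmetic and the conversion is fully vectorized.

import numpy as np

def yyyymmdd_to_datetime64(ints):
    # ints: int64 array of dates encoded as YYYYMMDD, e.g. 20130101
    years = ints // 10000
    months = (ints // 100) % 100
    days = ints % 100
    # assemble datetime64[D] values from the components -- no per-element
    # string parsing at all
    return ((years - 1970).astype('datetime64[Y]')
            + (months - 1).astype('timedelta64[M]')
            + (days - 1).astype('timedelta64[D]'))

yyyymmdd_to_datetime64(np.array([20130101, 20131231]))
# -> array(['2013-01-01', '2013-12-31'], dtype='datetime64[D]')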

@ghost commented Jan 18, 2014

Code reference? I can't see it in tslib.pyx, inference.pyx, or datetime.pxd, nor can I grep the string anywhere. Show me where and I'll update the docs again. Users can't optimize for fastpaths without knowing what they are.

@danbirken (Contributor)

My guess is that if you can get the datetime string format, you could write a heavily optimized C/Cython code path that would provide a healthy speed-up over the simple solution (similar to the iso8601 path). In fact this might already exist in tslib.array_strptime.

@ghost commented Jan 18, 2014

@danbirken... you know what I'm about to say. :)

@ghost commented Jan 18, 2014

I'll get the ball rolling. Let's start small.

def infer_df_format(s):
    import dateutil.parser

    def maybe_int(v):
        try:
            return int(v)
        except ValueError:
            return v

    dmap = {'day': '%d',
            'month': '%m',
            'year': '%Y',
            'hour': '%H',
            'minute': '%M',
            'second': '%S'}

    # lex the string into tokens, converting numeric tokens to ints
    tokens = map(maybe_int, dateutil.parser._timelex.split(s))
    # map each token value to its position in the token list
    token_dict = dict((v, i) for i, v in enumerate(tokens))

    # reverse-map the fields of the parsed datetime onto the tokens
    parsed = dateutil.parser.parse(s)
    for k, fmt in dmap.items():
        val = getattr(parsed, k)
        pos = token_dict.get(val)
        if pos is not None:
            tokens[pos] = fmt

    try:
        fmt = "".join(tokens)
        # only trust the inferred format if it round-trips the input
        if parsed.strftime(fmt) == s:
            return fmt
    except ValueError:
        pass

s = "2013-12-01T11:58:59"
infer_df_format(s)

Out[15]: '%Y-%m-%dT%H:%M:%S'

your move.

@jreback (Contributor) commented Jan 18, 2014

#4826

This applies to %Y%m%d for strings or int types

@danbirken (Contributor)

I feel a little worried about depending on a non-public interface of dateutil. At the same time, getting the change into dateutil itself (where it theoretically belongs) seems pretty unsatisfactory as well.

Out of left field suggestion: what about a function that just tries 10 or 20 common datetime string formats, and if one of them works, uses a fast code path to parse them all? If they all fail, it just falls back to the slow method.
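
A rough sketch of that idea (the format list and helper name here are made up for illustration):

from datetime import datetime

COMMON_FORMATS = [
    '%Y-%m-%d', '%Y-%m-%d %H:%M:%S', '%Y%m%d',
    '%m/%d/%Y', '%d/%m/%Y', '%m/%d/%Y %H:%M:%S',
    '%B %d, %Y', '%d %B %Y',
]

def guess_common_format(sample):
    # return the first common format that round-trips the sample string,
    # or None so the caller can fall back to the slow path
    for fmt in COMMON_FORMATS:
        try:
            # the round-trip check is deliberately strict -- it rejects
            # un-zero-padded fields, which simply means falling back
            if datetime.strptime(sample, fmt).strftime(fmt) == sample:
                return fmt
        except ValueError:
            pass
    return None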

@ghost commented Jan 18, 2014

re the non-public interface: very true, but we can always fall back if something raises, or duplicate the lexer code inline. It's not that much code.

For now, assume that you have a function that reliably returns a format string or None. The specifics we can sort out later and expand over time.

@danbirken (Contributor)

Well, I'm pretty familiar with this section of the code, so I'll take a stab at this on Monday, provided nobody else does it between now and then :)

@ghost mentioned this issue Jan 20, 2014

danbirken added a commit to danbirken/pandas that referenced this issue Jan 24, 2014 (…das-dev#5490):

    Given an array of strings that represent datetimes, infer_format=True
    will attempt to guess the format of the datetimes, and if it can infer
    the format, it will use a faster function to convert/import the
    datetimes. In cases where this speed-up can be used, the function
    should be about 10x faster.

danbirken added a commit to danbirken/pandas that referenced this issue Jan 24, 2014:

    This allows read_csv() to attempt to infer the datetime format for any
    columns where parse_dates is enabled. In cases where the datetime
    format can be inferred, this should speed up processing datetimes
    by ~10x.

    Additionally add documentation and benchmarks for read_csv().