
PERF: speed up pd.to_datetime and co. by extracting dt format from data and using strptime to parse #5490


Closed · lexual opened this issue Nov 11, 2013 · 14 comments · Fixed by #6021
Labels: API Design, Performance
Comments

@lexual (Contributor) commented Nov 11, 2013

I had a series containing strings like these:

"November 1, 2013"

Series length was about 500,000

A)
running pd.to_datetime(s) takes just over a minute.
B)
running pd.to_datetime(s, format="%B %d, %Y") takes about 7 seconds!
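
Roughly, to reproduce (IPython session; timings as reported above):

import pandas as pd

s = pd.Series(["November 1, 2013"] * 500000)

%timeit pd.to_datetime(s)                      # case A: just over a minute
%timeit pd.to_datetime(s, format="%B %d, %Y")  # case B: about 7 seconds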

My suggestion is a way to make case A (where user doesn't specify the format type) take about as long as case B (user does specify).

Basically it looks like the code is always using the dateutil parser for case A.

My suggestion is based upon the idea that it's highly likely that the date strings are all in a consistent format (it's highly unlikely in this case that they would be in 500K separate formats!).

In a nutshell:

  • figure out the date format of the first entry.
  • try to use that against the entire series, using the speedy code in tslib.array_strptime.
  • if that works, we've saved heaps of time; if not, fall back to the current slower behaviour of calling dateutil's parser on each string.

Here's some pseudo-code:

datestr1 = s[0]
# I'm assuming dateutil has something like this, that can tell you what
# the format is for a given date string.
date_format = figure_out_datetime_format(datestr1)

try:
    # use the super-speedy code that pandas uses when you tell it what
    # the format is.
    dt_series = tslib.array_strptime(s, date_format)
except ValueError:
    # date strings aren't consistent after all; let's do it the old slow way.
    dt_series = tslib.array_to_datetime(s)

return dt_series

@jreback (Contributor) commented Nov 11, 2013

you could allow format='infer' to support this, keeping in mind that dayfirst/yearfirst will then matter (and could potentially make the format ambiguous to interpret)
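
To make the ambiguity concrete (a quick illustration, not part of the proposal): the same string yields two different dates, and hence two different inferred format strings (%m/%d/%Y vs. %d/%m/%Y), depending on dayfirst.

import dateutil.parser

s = "01/02/2013"
print dateutil.parser.parse(s)                 # 2013-01-02, month first (the default)
print dateutil.parser.parse(s, dayfirst=True)  # 2013-02-01, day first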

@danbirken (Contributor)

This is a really smart idea, and I think this would be a huge speed-up for a common use case. +1

@cancan101 (Contributor)

You do want to deal with the first entry being NaT / None.
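
A minimal sketch of that guard (the helper name is made up; scan for the first usable string instead of blindly taking s[0]):

def first_usable_entry(values):
    # skip leading None/NaN/NaT entries; return the first actual string,
    # or None if the series holds nothing parseable
    for v in values:
        if isinstance(v, basestring):
            return v
    return None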

@lexual (Contributor, Author) commented Jan 13, 2014

Perhaps the way to do it would be a "format" parameter (better name suggestions welcome) that is set to either "heterogeneous" or "homogeneous" (i.e. different date formats vs. all date formats the same).

If heterogeneous, the current behaviour occurs; if homogeneous, use the speedy algorithm from above, where once you infer the format of the first date, you use it for all datetime entries.

@ghost commented Jan 18, 2014

This is a promising idea, worth pursuing. So I did, in the context of pd.read_csv, which helpfully supports a date_parser argument that lets you pass in a tailored parser, in the form of a call to strptime with a hardwired format string.

Disregarding the stealth (#5993) fast-path for iso8601 strings, measurement shows a potential 4x speedup for large data.

dateutil's parser is not amenable to producing a format string, as it consumes and backtracks through the string. However, it does provide a handy lexer function for dt strings (_timelex.split(tzstr)). By reverse-mapping the fields of the returned datetime object onto the tokens, this could be used to reconstruct a format string, which could then be used to speed up the rest of the process, at least for a good-sized chunk of cases.

PRs welcome.

import datetime as dt
import dateutil.parser
import pandas as pd

N = 10000
r = pd.date_range("2000-01-01", "2013-01-01", freq="H")[:N]
df = pd.DataFrame(range(N), index=r)
df.to_csv("/tmp/1.csv")
def f1(s):
    return dt.datetime.strptime(s,"%Y-%m-%d %H:%M:%S")
def f2(s):
    return dateutil.parser.parse(s)
print "Measuring iso8601"
%timeit f1("2000-01-01 00:01:02")
%timeit f2("2000-01-01 00:01:02")
%timeit x1=pd.read_csv("/tmp/1.csv",parse_dates=True,index_col=0)
%timeit x2=pd.read_csv("/tmp/1.csv",parse_dates=True,index_col=0,date_parser=f1)
%timeit x3=pd.read_csv("/tmp/1.csv",parse_dates=True,index_col=0,date_parser=f2)
#assert x1==x2==x3
# now try with non-iso-8601

with open("/tmp/1.csv","wb") as f:
    f.write("12/1/2013 00:01:02,1\n"*N)

def f1a(s):
    return dt.datetime.strptime(s,"%d/%m/%Y %H:%M:%S")

print "\nMeasuring an innovative dt format string"
%timeit f1a("12/1/2013 00:01:02")
%timeit f2("12/1/2013 00:01:02")
%timeit x1=pd.read_csv("/tmp/1.csv",parse_dates=True,index_col=0)
%timeit x2=pd.read_csv("/tmp/1.csv",parse_dates=True,index_col=0,date_parser=f1a)
%timeit x3=pd.read_csv("/tmp/1.csv",parse_dates=True,index_col=0,date_parser=f2)
#assert x1==x2==x3

10000 loops, best of 3: 24.2 µs per loop # strptime is faster than
10000 loops, best of 3: 95.2 µs per loop # dateutil.parse by 4x
100 loops, best of 3: 14.1 ms per loop # but read_csv has a fastpath for iso8601
1 loops, best of 3: 259 ms per loop
1 loops, best of 3: 956 ms per loop

Measuring a non-iso8601 dt format string
10000 loops, best of 3: 24 µs per loop # in the general case
10000 loops, best of 3: 96.8 µs per loop # 4x
1 loops, best of 3: 1.1 s per loop # and read_csv falls back to dateutil.parse() (*)
1 loops, best of 3: 256 ms per loop # in which case strptime yields a nice 3-4x speedup.
1 loops, best of 3: 1.17 s per loop #(*)

@jreback (Contributor) commented Jan 18, 2014

FYI there is also a stealth fast-path for the format '%Y%m%d', e.g. '20130101', as it can be converted directly to an int64 and processed that way (there is a commit for it but I don't remember which)

In [4]: s = Series(date_range('20000101',periods=10000)).apply(lambda x: x.strftime('%Y%m%d'))

In [5]: %timeit pd.to_datetime(s)
1 loops, best of 3: 518 ms per loop

In [6]: %timeit pd.to_datetime(s,format='%Y%m%d')
100 loops, best of 3: 10.5 ms per loop
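
The idea behind such a fastpath, roughly (a sketch of the general technique in numpy, not the actual pandas code): once the strings are cast to int64, year/month/day fall out with integer arithmetic and the conversion is fully vectorized.

import numpy as np

def yyyymmdd_to_datetime64(ints):
    # ints: int64 array of dates encoded as YYYYMMDD, e.g. 20130101
    years = ints // 10000
    months = (ints // 100) % 100
    days = ints % 100
    # assemble datetime64[D] values from the components -- no per-element
    # string parsing at all
    return ((years - 1970).astype('datetime64[Y]')
            + (months - 1).astype('timedelta64[M]')
            + (days - 1).astype('timedelta64[D]'))

yyyymmdd_to_datetime64(np.array([20130101, 20131231]))
# -> array(['2013-01-01', '2013-12-31'], dtype='datetime64[D]')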

@ghost commented Jan 18, 2014

Code reference? I can't see it in tslib.pyx, inference.pyx, or datetime.pxd, nor can I grep the string anywhere. Show me where and I'll update the docs again. Users can't optimize for fastpaths without knowing what they are.

@danbirken (Contributor)

My guess is that if you can get the datetime string format, you could write a heavily optimized C/Cython code path that would provide a healthy speed-up over the simple solution (similar to the iso8601 path). In fact this might already exist in tslib.array_strptime.

@ghost commented Jan 18, 2014

@danbirken... you know what I'm about to say. :)

@ghost commented Jan 18, 2014

I'll get the ball rolling. Let's start small.

def infer_df_format(s):
    import dateutil.parser

    def maybe_int(v):
        try:
            return int(v)
        except ValueError:
            return v

    dmap = {'day': '%d',
            'month': '%m',
            'year': '%Y',
            'hour': '%H',
            'minute': '%M',
            'second': '%S'}

    # lex the string into tokens, converting numeric tokens to ints
    tokens = map(maybe_int, dateutil.parser._timelex.split(s))
    # map each token value to its position in the token list
    token_dict = dict((v, i) for i, v in enumerate(tokens))

    # reverse-map the fields of the parsed datetime onto the tokens
    parsed = dateutil.parser.parse(s)
    for k, fmt in dmap.items():
        val = getattr(parsed, k)
        pos = token_dict.get(val)
        if pos is not None:
            tokens[pos] = fmt

    try:
        fmt = "".join(tokens)
        # only trust the inferred format if it round-trips the input
        if parsed.strftime(fmt) == s:
            return fmt
    except ValueError:
        pass

s = "2013-12-01T11:58:59"
infer_df_format(s)

Out[15]: '%Y-%m-%dT%H:%M:%S'

your move.

@jreback (Contributor) commented Jan 18, 2014

#4826

This applies to %Y%m%d for strings or int types

@danbirken (Contributor)

I feel a little worried about depending on a non-public interface of dateutil. At the same time, getting the change into dateutil itself (where it theoretically belongs) seems pretty unsatisfactory as well.

Out of left field suggestion: what about a function that just tries 10 or 20 common datetime string formats, and if one of them works, uses a fast code path to parse them all? If they all fail, it just falls back to the slow method.
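
A rough sketch of that idea (the format list and helper name here are made up for illustration):

from datetime import datetime

COMMON_FORMATS = [
    '%Y-%m-%d', '%Y-%m-%d %H:%M:%S', '%Y%m%d',
    '%m/%d/%Y', '%d/%m/%Y', '%m/%d/%Y %H:%M:%S',
    '%B %d, %Y', '%d %B %Y',
]

def guess_common_format(sample):
    # return the first common format that round-trips the sample string,
    # or None so the caller can fall back to the slow path
    for fmt in COMMON_FORMATS:
        try:
            # the round-trip check is deliberately strict -- it rejects
            # un-zero-padded fields, which simply means falling back
            if datetime.strptime(sample, fmt).strftime(fmt) == sample:
                return fmt
        except ValueError:
            pass
    return None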

@ghost commented Jan 18, 2014

re the non-public interface: very true, but we can always fall back if something raises, or duplicate the lexer code inline. It's not that much code.

For now, assume that you have a function that reliably returns a format string or None. The specifics we can sort out later and expand over time.

@danbirken (Contributor)

Well, I'm pretty familiar with this section of the code, so I'll take a stab at this on Monday, provided nobody else does it between now and then :)

@ghost mentioned this issue Jan 20, 2014

danbirken added a commit to danbirken/pandas that referenced this issue Jan 24, 2014 (…das-dev#5490):

    Given an array of strings that represent datetimes, infer_format=True
    will attempt to guess the format of the datetimes, and if it can infer
    the format, it will use a faster function to convert/import the
    datetimes. In cases where this speed-up can be used, the function
    should be about 10x faster.

danbirken added a commit to danbirken/pandas that referenced this issue Jan 24, 2014:

    This allows read_csv() to attempt to infer the datetime format for any
    columns where parse_dates is enabled. In cases where the datetime
    format can be inferred, this should speed up processing datetimes
    by ~10x.

    Additionally add documentation and benchmarks for read_csv().