PERF: speed up pd.to_datetime and co. by extracting dt format from data and using strptime to parse #5490
Comments
you could allow
This is a really smart idea, and I think this is a huge speed-up for a common use case. +1
You do want to deal with the first entry being NaT / None.
Perhaps the way to do it would be a "format" parameter (better name suggestions welcome) that is set to either "heterogeneous" or "homogeneous" (i.e. different date formats vs. all date formats the same). If heterogeneous, the current behaviour occurs; if homogeneous, use the speedy algorithm from above, where once you infer the format of the first date, you use it for all datetime entries.
This is a promising idea, worth pursuing. So I did, in the context of pd.read_csv. Disregarding that, dateutil's parser is not amenable to producing a format string, as it consumes the input without recording how it was parsed. PRs welcome.
FYI there is also a stealth fast path for the format '%Y%m%d', e.g. '20130101', as it can be converted directly to an int64 and processed that way (there is a commit for it but I don't remember which).
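For illustration only, a minimal sketch of the underlying idea (this is not the pandas code path referred to above): all-digit '%Y%m%d' strings can be converted with integer arithmetic alone, with no per-element strptime call.

```python
import numpy as np

def parse_yyyymmdd(values):
    ints = np.asarray(values, dtype="int64")   # '20130101' -> 20130101
    years = ints // 10000
    months = (ints // 100) % 100
    days = ints % 100
    # Assemble datetime64 values from the integer parts
    return (
        (years - 1970).astype("datetime64[Y]")
        + (months - 1).astype("timedelta64[M]")
    ).astype("datetime64[D]") + (days - 1).astype("timedelta64[D]")

parse_yyyymmdd(["20130101", "20131105"])
# array(['2013-01-01', '2013-11-05'], dtype='datetime64[D]')
```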
Code reference? I can't see it in tslib.pyx, inference.pyx, nor datetime.pxd, nor can I grep the string anywhere.
My guess is that if you can get the datetime string format, you could write a heavily optimized C/Cython code path that would provide a healthy speed-up over the simple solution (similar to the ISO 8601 path). In fact this might already exist in
@danbirken... you know what I'm about to say. :)
I'll get the ball rolling. Let's start small.
your move.
This applies to
I feel a little worried about depending on a non-public interface of dateutil. At the same time, getting the change into dateutil itself (where it theoretically belongs) seems pretty unsatisfactory as well. Out-of-left-field suggestion: what about a function that just tries 10 or 20 common datetime string formats? If one of them works, it uses a fast code path to parse them all; if they all fail, it just falls back to the slow method.
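A sketch of that out-of-left-field idea (the helper name and the format list are illustrative, not an existing pandas API):

```python
from datetime import datetime

COMMON_FORMATS = [
    "%Y-%m-%d", "%Y%m%d", "%m/%d/%Y", "%d/%m/%Y",
    "%Y-%m-%d %H:%M:%S", "%B %d, %Y",
]

def guess_common_format(sample):
    """Return the first common format that parses `sample`, else None."""
    for fmt in COMMON_FORMATS:
        try:
            datetime.strptime(sample, fmt)
            return fmt
        except ValueError:
            continue
    return None   # None -> caller falls back to the slow dateutil path
```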
Re: the non-public interface, very true, but we can always fall back if something raises. For now, assume that you have a function that reliably returns a format string or None.
Well, I'm pretty familiar with this section of the code, so I'll take a stab at this on Monday provided nobody else does it between now and then :)
…das-dev#5490 Given an array of strings that represent datetimes, infer_format=True will attempt to guess the format of the datetimes, and if it can infer the format, it will use a faster function to convert/import the datetimes. In cases where this speed-up can be used, the function should be about 10x faster.
This allows read_csv() to attempt to infer the datetime format for any columns where parse_dates is enabled. In cases where the datetime format can be inferred, this should speed up processing datetimes by ~10x. Additionally add documentation and benchmarks for read_csv().
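For context, a usage sketch: the commit message above calls the flag infer_format, while released pandas spells it infer_datetime_format, which is what the snippet below assumes.

```python
import pandas as pd
from io import StringIO

csv = StringIO("date,value\n2013-11-01,1\n2013-11-02,2\n")

# read_csv infers the column's datetime format once and reuses it per row
df = pd.read_csv(csv, parse_dates=["date"], infer_datetime_format=True)

# the same flag works for an existing Series of strings
s = pd.Series(["November 1, 2013", "November 2, 2013"])
pd.to_datetime(s, infer_datetime_format=True)
```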
I had a series containing strings like these:
"November 1, 2013"
Series length was about 500,000
A) Running pd.to_datetime(s) takes just over a minute.
B) Running pd.to_datetime(s, format="%B %d, %Y") takes about 7 seconds!
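The two cases side by side, for reference (same shape of data as described above; exact timings will vary by machine):

```python
import pandas as pd

s = pd.Series(["November 1, 2013"] * 500000)

# Case A: every entry goes through dateutil's general-purpose parser
pd.to_datetime(s)

# Case B: an explicit format sends every entry down a much cheaper path
pd.to_datetime(s, format="%B %d, %Y")
```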
My suggestion is a way to make case A (where the user doesn't specify the format) take about as long as case B (where the user does).
Basically it looks like the code always uses the dateutil parser for case A.
My suggestion is based upon the idea that it's highly likely that the date strings are all in a consistent format (it's highly unlikely in this case that they would be in 500K separate formats!).
In a nutshell: infer the format from the first (non-null) entry, parse the whole series with that one format, and fall back to the current behaviour if anything fails to parse. Here's some pseudo-code:
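(A sketch of that flow; infer_format_from_example is a hypothetical helper for turning a sample string into a strptime format, and the fallback mirrors what pd.to_datetime does today.)

```python
import pandas as pd

def to_datetime_fast(str_series):
    # Use the first non-null entry as a representative sample
    sample = str_series.dropna().iloc[0]

    # Hypothetical helper: derive a strptime format from the sample,
    # e.g. "November 1, 2013" -> "%B %d, %Y"; returns None if it can't
    fmt = infer_format_from_example(sample)

    if fmt is None:
        # No format could be inferred -- keep today's behaviour
        return pd.to_datetime(str_series)

    try:
        # Fast path: parse every entry with the single inferred format
        dt_series = pd.to_datetime(str_series, format=fmt)
    except ValueError:
        # Formats turned out to be mixed -- fall back to the slow path
        dt_series = pd.to_datetime(str_series)
    return dt_series
```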