PERF: add datetime caching kw in to_datetime #11665
Comments
I recently stumbled across the exact same optimization. Converting 5 million strings to datetimes:

    pd.to_datetime(df['date'])  # takes 10 minutes
    df['date'].map({k: pd.to_datetime(k) for k in df['date'].unique()})  # takes 2 seconds

It seems like the caching/optimization strategy should live with `to_datetime`. It even seems as though you might want this optimization on by default. The only real overhead is the call to `unique()`.
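A quick way to reproduce this comparison is a synthetic Series with heavily repeated date strings; the size, date values, and variable names below are made up for illustration, not taken from the original report:

    import numpy as np
    import pandas as pd
    from timeit import default_timer as timer

    # a large Series with only a handful of distinct date strings
    dates = pd.Series(np.random.choice(
        ['2015-11-20', '2015-11-21', '2015-11-22'], size=1_000_000))

    start = timer()
    plain = pd.to_datetime(dates)  # parses every element individually
    t_plain = timer() - start

    start = timer()
    # parse each unique string once, then broadcast the results with map
    cached = dates.map({k: pd.to_datetime(k) for k in dates.unique()})
    t_cached = timer() - start

    assert (plain == cached).all()
    print(f"plain: {t_plain:.2f}s  cached: {t_cached:.2f}s")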
@stevenmanton yes, this could be implemented internally in `to_datetime`; not saying this needs to be a kw to `read_csv`.
Note that the above method is also about 10x faster than using `infer_datetime_format`:

    pd.to_datetime(df['date'], infer_datetime_format=True)  # takes 25 seconds
How about a simple recursive wrapper?

    import pandas as pd

    def to_datetime(s, cache=True, **kwargs):
        if cache:
            # parse each unique value once, then broadcast
            return s.map({k: to_datetime(k, cache=False, **kwargs) for k in s.unique()})
        else:
            return pd.to_datetime(s, **kwargs)
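A usage sketch for the wrapper above, assuming pandas is imported as `pd` and the wrapper is defined as shown; the sample data is made up:

    s = pd.Series(['2015-11-20', '2015-11-20', '2015-11-21'] * 1000)
    ts_cached = to_datetime(s)              # parses only the two unique strings
    ts_plain = to_datetime(s, cache=False)  # falls through to pd.to_datetime
    assert (ts_cached == ts_plain).all()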
you could do this inside `to_datetime` itself.
note
It looks like some of the parser tools call `to_datetime` internally. Not sure what you mean by using the ... Couldn't it just be something like:

    def to_datetime(s, **kwargs):
        return s.map({k: pd.to_datetime(k, **kwargs) for k in s.unique()})
@stevenmanton it could be in either place; what I mean is you probably want ...
@stevenmanton want to do a PR for this?

@jreback I just got back from vacation, but I'll take a look and try to PR something.

great this would be a nice perf enhancement!
I took a stab at this, but after a couple hours of looking through the code I'm not sure of the best way to proceed. The small changes I made are failing for a number of reasons. It seems like a lot of the complexity is due to the number of inputs and outputs (tuple, list, string, Series, Index, etc.) accepted by the to_datetime function. Any thoughts on a better approach?
@stevenmanton the way to do this is to add a kw arg to `to_datetime`, then here ...
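A rough sketch of what threading a cache keyword through the conversion path could look like; the names below (`cached_to_datetime`, `_maybe_build_cache`) are hypothetical and only illustrate the idea, they are not the actual pandas internals:

    import pandas as pd

    def _maybe_build_cache(arg, cache, **kwargs):
        # hypothetical helper: build a unique-value -> Timestamp mapping only
        # when caching was requested and the input actually has duplicates
        if not cache:
            return None
        uniques = pd.unique(pd.Series(arg))
        if len(uniques) == len(arg):
            return None  # all values distinct, nothing to gain
        return pd.Series(pd.to_datetime(uniques, **kwargs), index=uniques)

    def cached_to_datetime(arg, cache=False, **kwargs):
        # hypothetical wrapper: look values up in the cache when one was built,
        # otherwise defer to the normal element-wise parsing
        mapping = _maybe_build_cache(arg, cache, **kwargs)
        if mapping is not None:
            return pd.Series(arg).map(mapping)
        return pd.to_datetime(arg, **kwargs)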
@charles-cooper suggested this feature 8 months earlier in #9594, but it was rejected by @jreback. Just to give credit where it's due. I'm 👍 for this.
The suggested feature was not parsing the uniques in a vectorized way and broadcasting, but rather memoization of repeated inefficient calls. To be honest, I don't care about the credit - I was quite clear on that other issue about why it was closed.
Is there still interest in getting this optimization working? When it was originally reported, it looks like it was giving a 5x or better speedup. Testing again, I'm seeing pretty consistently a 2x speedup at best, even for very large series. I am willing to tackle it but wanted to check on the status first, since it appears that it's going to make the ...
oh I think we should always do this. Basically, if you have, say, 1000 elements or fewer you can skip it; otherwise it's always worth it (yes, there is a degenerate case with a really long unique series, but detecting that is often not worth it). You can implement it, then we can test a couple of cases and see. FYI we do something similar in pandas.core.util.hashing.
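A sketch of the size cutoff described above; the 1000-element threshold comes from the comment, while the sample-based uniqueness check and the parameter names are assumptions added for illustration:

    import numpy as np

    def should_cache(arg, min_elements=1000, sample_size=500, max_unique_share=0.7):
        # below the cutoff, caching is unlikely to pay for its own overhead
        if len(arg) < min_elements:
            return False
        # look at a slice of the data and only cache when most values repeat
        sample = np.asarray(arg)[:sample_size]
        return len(np.unique(sample)) / len(sample) < max_unique_share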
Okay; I will dive into it tomorrow.
I'll propose a `cache_datetime=False` keyword as an addition to `read_csv` and `pd.to_datetime`.

This would use a lookup cache (a dict will probably work) to map datetime strings to Timestamp objects. For repeated dates this will lead to some dramatic speedups.

Care must be taken if a `format` kw is provided (in `to_datetime`, as the cache will have to be exposed). This would be optional (and default `False`), as I think if you have unique dates this could modestly slow things down (but can be revisited if needed).

This might also want to accept a list of column names (like `parse_dates`) to enable per-column caching (e.g. you might want to apply it to a column, but not the index, for example). Possibly we could overload `parse_dates='cache'` to mean this as well.

Trivial example:
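A minimal sketch of the kind of trivial example this points to; note that `cache_datetime=True` is the keyword being proposed here, not an existing argument (so it is shown commented out), and the dict-based lookup only illustrates the intended mechanism:

    import pandas as pd

    s = pd.Series(['2015-11-20 10:00:00', '2015-11-20 10:00:00',
                   '2015-11-21 10:00:00'] * 100000)

    # today: every string is parsed individually
    result = pd.to_datetime(s)

    # proposed: pd.to_datetime(s, cache_datetime=True)

    # the proposal amounts to a lookup cache roughly like this:
    cache = {}
    def lookup(x):
        if x not in cache:
            cache[x] = pd.Timestamp(x)
        return cache[x]

    assert (result == s.map(lookup)).all()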