-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Pandas conversion of Timedelta is very slow #18092
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
so the issue is that https://github.com/pandas-dev/pandas/blob/master/pandas/core/indexes/timedeltas.py#L383 is not done in a vectorized way, IOW needs to simply construct the returned arrays and then use you can't actually divide by want to have a go at a PR? |
I took a look at this and re-factored the _get_field function to look as follows:
However, I didn't really see any tangible performance improvement. The np.vectorize docs mention that the function is for convenience and not necessarily performance, as it's essentially a for loop. @jreback - is the code above that I provided in line with what you were expecting? If that's the case I'm not sure that is really the root cause from my initial tests |
no this is not vectorization, see how other things are done in datetimes.py |
Thanks Jeff. I refactored to the below and did see significant speed improvements:
Do you have a point of view as to where to the put the freqs dict I have above? I've included it in the method here for visibility, but I was thinking it could be better served as a classmethod to map the properties to their appropriate frequency codes |
@Stanpol great! you shouldn't need the freq dict, simply pass in the correct value in the accessor itself (which I think is what it was doing). and pls add some asv's for this. |
Thanks for the tip on the accessors - that's easy enough. One issue I'm seeing now though is that I might need to be careful with handling dates vs time deltas. I noticed the test_fields method in test_timedelta.py is failing with the below:
The difference of one day I assume is due to me using the get_date_field method in libts while passing in Timedelta objects. Any tips on how to best handle that? I'll take a look at adding some asv's as you suggest |
what does this have to do with dates? |
you shouldn't be using this |
My mistake. Looking further into this I noticed that all of the logic for converting between days, hours, minutes, seconds, etc... is contained within the I'll keep plugging at it but I haven't done much in C / Cython before so it may be slow going on my end to figure out how to make timedelta field access work similar to date objects. If anyone else out there has thoughts on how to tackle then by all means. FWIW here's what I tried to implement in fields.pyx to mimic what exists for dates. I did this solely to check performance so there isn't any error handling. My very un-scientific tests weren't showing any improvement over existing code.
|
note that #18161 was merged so the code you are working is slightly moved around but substantially the same |
Thanks for the heads up. Already revised - hope to have something over in the next few days |
Code Sample, a copy-pastable example if possible
Problem description
For large Series it takes very long time to do a simple conversion, this should be optimised.
Expected Output
.dt.days should be as quick as dividing by np.timedelta64(1, 'D')
Output of
pd.show_versions()
INSTALLED VERSIONS
commit: None
pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.4.0
Cython: 0.26
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
The text was updated successfully, but these errors were encountered: