Skip to content

Pandas conversion of Timedelta is very slow #18092

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Stanpol opened this issue Nov 3, 2017 · 11 comments · Fixed by #18225
Closed

Pandas conversion of Timedelta is very slow #18092

Stanpol opened this issue Nov 3, 2017 · 11 comments · Fixed by #18225
Labels
Performance Memory or execution speed performance Timedelta Timedelta data type
Milestone

Comments

@Stanpol
Copy link

Stanpol commented Nov 3, 2017

Code Sample, a copy-pastable example if possible

a=pd.Series([pd.Timedelta(days=x) for x in np.random.randint(0, 10, 60000)])

# Let's convert Timedelta to days with Pandas
%time a.dt.days
# CPU times: user 457 ms, sys: 4.08 ms, total: 461 ms
# Wall time: 464 ms

# Let's convert Timedelta to days by division 
%time (a / np.timedelta64(1, 'D')).astype(np.int64)
# CPU times: user 3.19 ms, sys: 1.79 ms, total: 4.98 ms
# Wall time: 3.18 ms

# Make sure results are the same
((a / np.timedelta64(1, 'D')).astype(np.int64)==a.dt.days).value_counts()
# True    60000
# dtype: int64

Problem description

For large Series it takes very long time to do a simple conversion, this should be optimised.

Expected Output

.dt.days should be as quick as dividing by np.timedelta64(1, 'D')

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.4.0
Cython: 0.26
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@jreback
Copy link
Contributor

jreback commented Nov 3, 2017

so the issue is that https://github.com/pandas-dev/pandas/blob/master/pandas/core/indexes/timedeltas.py#L383 is not done in a vectorized way, IOW needs to simply construct the returned arrays and then use _maybe_mask_results, rather than doing this in a list-comp with individual Timedelta construction.

you can't actually divide by np.timedelta64 (well strictly speaking for days you can but not for any other unit), so rather fix the general problem.

want to have a go at a PR?

@jreback jreback added Performance Memory or execution speed performance Timedelta Timedelta data type Difficulty Intermediate labels Nov 3, 2017
@jreback jreback added this to the Next Major Release milestone Nov 3, 2017
@WillAyd
Copy link
Member

WillAyd commented Nov 4, 2017

I took a look at this and re-factored the _get_field function to look as follows:

def _get_field(self, m):
    def map_attr(x):
        nonlocal m
        return getattr(Timedelta(x), m)
    
    values = self.asi8
    hasnans = self.hasnans
    if hasnans:
        vfunc = np.vectorize(map_attr, otypes=[np.float])
    else:
        vfunc = np.vectorize(map_attr, otypes=[np.int])
    
    result = vfunc(values)
        
    return Index(result, name=self.name)

However, I didn't really see any tangible performance improvement. The np.vectorize docs mention that the function is for convenience and not necessarily performance, as it's essentially a for loop.

@jreback - is the code above that I provided in line with what you were expecting? If that's the case I'm not sure that is really the root cause from my initial tests

@jreback
Copy link
Contributor

jreback commented Nov 4, 2017

no this is not vectorization, see how other things are done in datetimes.py

@WillAyd
Copy link
Member

WillAyd commented Nov 4, 2017

Thanks Jeff. I refactored to the below and did see significant speed improvements:

def _get_field(self, m):
    freqs = {
        'days' : 'D',
        'seconds' : 's',
        'microseconds' : 'us',
        'nanoseconds' : 'ns'
        }
    freq = freqs[m]
    
    values = self.asi8
    hasnans = self.hasnans
    result = libts.get_date_field(values, freq)
    if hasnans:
        result = self._maybe_mask_results(result, convert='float64')
    
    return Index(result, name=self.name)

Do you have a point of view as to where to the put the freqs dict I have above? I've included it in the method here for visibility, but I was thinking it could be better served as a classmethod to map the properties to their appropriate frequency codes

@jreback
Copy link
Contributor

jreback commented Nov 4, 2017

@Stanpol great!

you shouldn't need the freq dict, simply pass in the correct value in the accessor itself (which I think is what it was doing). and pls add some asv's for this.

@WillAyd
Copy link
Member

WillAyd commented Nov 4, 2017

Thanks for the tip on the accessors - that's easy enough. One issue I'm seeing now though is that I might need to be careful with handling dates vs time deltas. I noticed the test_fields method in test_timedelta.py is failing with the below:

E       Index values are different (100.0 %)
E       [left]:  Int64Index([2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='int64')
E       [right]: Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype='int64')

The difference of one day I assume is due to me using the get_date_field method in libts while passing in Timedelta objects. Any tips on how to best handle that?

I'll take a look at adding some asv's as you suggest

@jreback
Copy link
Contributor

jreback commented Nov 4, 2017

what does this have to do with dates?

@jreback
Copy link
Contributor

jreback commented Nov 4, 2017

you shouldn't be using this get_date_field, you are solely calling routings on timedeltas.

@WillAyd
Copy link
Member

WillAyd commented Nov 7, 2017

My mistake. Looking further into this I noticed that all of the logic for converting between days, hours, minutes, seconds, etc... is contained within the _ensure_components method of the _Timedelta class in timedeltas.pyx, so it's difficult to avoid some kind of individual Timedelta construction as the code currently stands.

I'll keep plugging at it but I haven't done much in C / Cython before so it may be slow going on my end to figure out how to make timedelta field access work similar to date objects. If anyone else out there has thoughts on how to tackle then by all means.

FWIW here's what I tried to implement in fields.pyx to mimic what exists for dates. I did this solely to check performance so there isn't any error handling. My very un-scientific tests weren't showing any improvement over existing code.

@cython.wraparound(False)
@cython.boundscheck(False)
def get_timedelta_field(ndarray[int64_t] tdindex, object field):
    cdef:
        Py_ssize_t i, count = 0
        ndarray[int32_t] out

    count = len(tdindex)
    out = np.empty(count, dtype='i4')

    for i in range(count):
        if tdindex[i] == NPY_NAT:
            out[i] = -1
            continue

        out[i] = getattr(_Timedelta(microseconds = tdindex[i] / 1000), field)
    
  return out

@jreback
Copy link
Contributor

jreback commented Nov 8, 2017

note that #18161 was merged so the code you are working is slightly moved around but substantially the same

@WillAyd
Copy link
Member

WillAyd commented Nov 9, 2017

Thanks for the heads up. Already revised - hope to have something over in the next few days

WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 11, 2017
WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 11, 2017
WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 11, 2017
WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 11, 2017
WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 11, 2017
WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 11, 2017
WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 11, 2017
WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 11, 2017
WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 11, 2017
@jreback jreback modified the milestones: Next Major Release, 0.22.0 Nov 12, 2017
WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 12, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Timedelta Timedelta data type
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants