Pandas conversion of Timedelta is very slow #18092

Stanpol · 2017-11-03T12:49:26Z

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

a = pd.Series([pd.Timedelta(days=x) for x in np.random.randint(0, 10, 60000)])

# Let's convert Timedelta to days with Pandas
%time a.dt.days
# CPU times: user 457 ms, sys: 4.08 ms, total: 461 ms
# Wall time: 464 ms

# Let's convert Timedelta to days by division 
%time (a / np.timedelta64(1, 'D')).astype(np.int64)
# CPU times: user 3.19 ms, sys: 1.79 ms, total: 4.98 ms
# Wall time: 3.18 ms

# Make sure results are the same
((a / np.timedelta64(1, 'D')).astype(np.int64)==a.dt.days).value_counts()
# True    60000
# dtype: int64

Problem description

For a large Series, a simple conversion takes a very long time; this should be optimised.

Expected Output

.dt.days should be as quick as dividing by np.timedelta64(1, 'D')

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.4.0
Cython: 0.26
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None


jreback · 2017-11-03T12:57:27Z

so the issue is that https://github.com/pandas-dev/pandas/blob/master/pandas/core/indexes/timedeltas.py#L383 is not done in a vectorized way, IOW needs to simply construct the returned arrays and then use _maybe_mask_results, rather than doing this in a list-comp with individual Timedelta construction.

you can't actually divide by np.timedelta64 (well strictly speaking for days you can but not for any other unit), so rather fix the general problem.

want to have a go at a PR?
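A minimal sketch of the vectorized idea described in the comment above: compute the fields for the whole int64 nanosecond array at once and mask missing values afterwards. The helper name `timedelta_fields` and the constants are hypothetical, plain NumPy arithmetic stands in for the Cython routine, and `np.where` stands in for `_maybe_mask_results`; this is not the pandas implementation.

```python
import numpy as np
import pandas as pd

NS_PER_SECOND = 1_000_000_000
NS_PER_DAY = 86_400 * NS_PER_SECOND
NAT = np.iinfo(np.int64).min  # integer sentinel that represents NaT


def timedelta_fields(values_ns):
    """Return (days, seconds) for an array of int64 nanosecond timedeltas.

    Hypothetical helper for illustration only.
    """
    values_ns = np.asarray(values_ns, dtype=np.int64)
    mask = values_ns == NAT
    # Floor division and modulo normalise negatives the same way
    # datetime.timedelta does: -1 ns becomes days=-1, seconds=86399.
    days, remainder = np.divmod(values_ns, NS_PER_DAY)
    seconds = remainder // NS_PER_SECOND
    if mask.any():
        # Missing values force a float result so NaN can be stored.
        days = np.where(mask, np.nan, days)
        seconds = np.where(mask, np.nan, seconds)
    return days, seconds


tds = pd.Series([
    pd.Timedelta(days=3, seconds=7),
    pd.Timedelta(seconds=-86399),
    pd.Timedelta(0),
])
# View the underlying values as int64 nanoseconds.
ns = tds.to_numpy().astype("timedelta64[ns]").view("i8")
days, seconds = timedelta_fields(ns)
```

No `Timedelta` object is constructed per element here, which is the part the list comprehension pays for.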

WillAyd · 2017-11-04T03:24:47Z

I took a look at this and re-factored the _get_field function to look as follows:

def _get_field(self, m):
    def map_attr(x):
        # m is only read here, so no nonlocal declaration is needed
        return getattr(Timedelta(x), m)

    values = self.asi8
    hasnans = self.hasnans
    # the np.float / np.int aliases were removed in NumPy 1.24; use explicit dtypes
    if hasnans:
        vfunc = np.vectorize(map_attr, otypes=[np.float64])
    else:
        vfunc = np.vectorize(map_attr, otypes=[np.int64])

    result = vfunc(values)

    return Index(result, name=self.name)

However, I didn't really see any tangible performance improvement. The np.vectorize docs mention that the function is for convenience and not necessarily performance, as it's essentially a for loop.

@jreback - is the code above that I provided in line with what you were expecting? If so, my initial tests leave me unsure that this is really the root cause.
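The point about `np.vectorize` can be checked with a small counter. This is an illustrative sketch (the function `plus_one` and the counter are made up for the demonstration): the wrapped Python function is still invoked for every element, whereas the array expression runs in compiled code.

```python
import numpy as np

calls = 0


def plus_one(x):
    """Toy function that counts how often Python code is entered."""
    global calls
    calls += 1
    return x + 1


values = np.arange(1000)

# np.vectorize only wraps a Python-level loop around plus_one.
vfunc = np.vectorize(plus_one, otypes=[np.int64])
looped = vfunc(values)

# The array expression gives the same result without calling plus_one at all.
vectorized = values + 1
```

After running this, `calls` reflects one Python call per element for the `np.vectorize` path, which is why it brings no real speed-up over a list comprehension.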

jreback · 2017-11-04T15:12:21Z

no this is not vectorization, see how other things are done in datetimes.py

WillAyd · 2017-11-04T17:04:28Z

Thanks Jeff. I refactored to the below and did see significant speed improvements:

def _get_field(self, m):
    freqs = {
        'days' : 'D',
        'seconds' : 's',
        'microseconds' : 'us',
        'nanoseconds' : 'ns'
        }
    freq = freqs[m]
    
    values = self.asi8
    hasnans = self.hasnans
    result = libts.get_date_field(values, freq)
    if hasnans:
        result = self._maybe_mask_results(result, convert='float64')
    
    return Index(result, name=self.name)

Do you have a point of view as to where to put the freqs dict I have above? I've included it in the method here for visibility, but I was thinking it could be better served as a classmethod to map the properties to their appropriate frequency codes.

jreback · 2017-11-04T18:08:39Z

@Stanpol great!

you shouldn't need the freq dict, simply pass in the correct value in the accessor itself (which I think is what it was doing). and pls add some asv's for this.
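For readers unfamiliar with asv (airspeed velocity), a benchmark for this case might look roughly like the sketch below. asv runs `setup` before each benchmark and times methods whose names start with `time_`; the class name, method names, and data size are assumptions chosen for illustration and are not taken from the pandas benchmark suite.

```python
import numpy as np
import pandas as pd


class TimedeltaFieldAccess:
    """Hypothetical asv benchmark; all names here are illustrative."""

    def setup(self):
        # Same shape of data as the report: 60000 timedeltas of 0-9 days.
        days = np.random.randint(0, 10, 60000)
        self.series = pd.Series(pd.to_timedelta(days, unit="D"))

    def time_dt_days(self):
        self.series.dt.days

    def time_dt_seconds(self):
        self.series.dt.seconds
```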

WillAyd · 2017-11-04T21:18:14Z

Thanks for the tip on the accessors - that's easy enough. One issue I'm seeing now though is that I might need to be careful with handling dates vs time deltas. I noticed the test_fields method in test_timedelta.py is failing with the below:

E       Index values are different (100.0 %)
E       [left]:  Int64Index([2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='int64')
E       [right]: Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype='int64')

The difference of one day I assume is due to me using the get_date_field method in libts while passing in Timedelta objects. Any tips on how to best handle that?

I'll take a look at adding some asv's as you suggest
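The off-by-one result can be reproduced outside the pandas internals. The sketch below reinterprets timedelta nanoseconds as datetimes, which is what happens conceptually when a date-field routine receives timedelta values: a span of N days becomes the date N days after 1970-01-01, and the day-of-month field of that date is N + 1. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Timedeltas of 1..10 days, as in the failing test.
tdi = pd.to_timedelta(np.arange(1, 11), unit="D")
ns = tdi.to_numpy().astype("timedelta64[ns]").view("i8")

# Reinterpret the same integers as nanoseconds since the epoch.
as_dates = pd.DatetimeIndex(ns.view("datetime64[ns]"))

day_of_month = list(as_dates.day)  # calendar field: 1970-01-02 has day 2
elapsed_days = list(tdi.days)      # timedelta field: 1 day elapsed
```

Here `day_of_month` runs from 2 to 11 while `elapsed_days` runs from 1 to 10, matching the left and right sides of the assertion error.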

jreback · 2017-11-04T21:37:18Z

what does this have to do with dates?

jreback · 2017-11-04T21:38:06Z

you shouldn't be using this get_date_field, you are solely calling routines on timedeltas.

WillAyd · 2017-11-07T02:54:33Z

My mistake. Looking further into this I noticed that all of the logic for converting between days, hours, minutes, seconds, etc... is contained within the _ensure_components method of the _Timedelta class in timedeltas.pyx, so it's difficult to avoid some kind of individual Timedelta construction as the code currently stands.

I'll keep plugging at it, but I haven't done much in C / Cython before, so it may be slow going on my end to figure out how to make timedelta field access work similarly to date objects. If anyone else out there has thoughts on how to tackle this, then by all means.

FWIW here's what I tried to implement in fields.pyx to mimic what exists for dates. I did this solely to check performance so there isn't any error handling. My very un-scientific tests weren't showing any improvement over existing code.

@cython.wraparound(False)
@cython.boundscheck(False)
def get_timedelta_field(ndarray[int64_t] tdindex, object field):
    cdef:
        Py_ssize_t i, count = 0
        ndarray[int32_t] out

    count = len(tdindex)
    out = np.empty(count, dtype='i4')

    for i in range(count):
        if tdindex[i] == NPY_NAT:
            out[i] = -1
            continue

        out[i] = getattr(_Timedelta(microseconds=tdindex[i] / 1000), field)

    return out

jreback · 2017-11-08T11:13:16Z

note that #18161 was merged, so the code you are working on has moved around slightly but is substantially the same

WillAyd · 2017-11-09T12:28:44Z

Thanks for the heads up. Already revised - hope to have something over in the next few days


jreback added the Performance, Timedelta, and Difficulty Intermediate labels Nov 3, 2017

jreback added this to the Next Major Release milestone Nov 3, 2017

WillAyd added a commit to WillAyd/pandas that referenced this issue Nov 11, 2017

PERF: Vectorized Timedelta property access (pandas-dev#18092)

4e51ee0

WillAyd mentioned this issue Nov 11, 2017

PERF: Vectorized Timedelta property access (#18092) #18225

Merged

jreback modified the milestones: Next Major Release, 0.22.0 Nov 12, 2017

jreback closed this as completed in #18225 Nov 12, 2017

jreback pushed a commit that referenced this issue Nov 12, 2017

PERF: Vectorized Timedelta property access (#18092) (#18225)

cd80f08
