PERF: cythonize vectorized string routines #16542

chris-b1 · 2017-05-30T22:02:19Z

Not a night-and-day improvement, since all we're doing is removing some Python overhead, but there does seem to be a 2x+ speedup to be picked up. Some of the template machinery could possibly be used to make these easy to write.

I wouldn't consider this high priority given the long-term plans to replace the string dtype, but it could be worth it.

import cython
import numpy as np
import pandas as pd
%load_ext cython

# 20,000 short strings stored in an object-dtype Series
s = pd.Series(np.random.choice(['aaaaaaaaaa', 'bbbbbbbb', 'ccccc',
                                'dddd'], size=20000).astype('O'))

%%cython
from numpy cimport *
import numpy as np

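def fast_upper(ndarray values):
    # Upper-case every element of an object array of str; no NA handling.
    cdef:
        Py_ssize_t i, n = values.shape[0]
        ndarray output = np.empty_like(values)
        str val
    for i in range(n):
        # Read the i-th element (not element 0 on every iteration)
        val = values[i]
        output[i] = val.upper()
    return output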

%timeit s.str.upper()
100 loops, best of 3: 4.94 ms per loop

%timeit pd.Series(fast_upper(s.values), index=s.index)
100 loops, best of 3: 2.02 ms per loop

jreback · 2017-05-30T22:46:39Z

You can actually get even better perf by using C functions and maybe even releasing the GIL (though that code is a bit trickier).

jreback · 2017-05-30T22:48:19Z

xref to #4694

chris-b1 · 2017-05-30T22:54:09Z

Yeah, it looks like the cythonization isn't really what's helping in my example; it's the avoidance of NA checks.

In [27]: %timeit pd.Series([x.upper() for x in s], index=s.index)
100 loops, best of 3: 2.74 ms per loop
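To make the NA-check point concrete, here is a minimal pure-Python sketch, not pandas' actual implementation. The helper names `is_missing`, `upper_no_na_check`, and `upper_with_na_check` are hypothetical, and `is_missing` is only a rough stand-in for whatever null check the `.str` accessor performs. The unchecked version mirrors the list comprehension above; the checked version pays a per-element test so that missing values pass through untouched.

```python
import math


def is_missing(x):
    # Hypothetical stand-in for a null check: treats None and float NaN as missing.
    return x is None or (isinstance(x, float) and math.isnan(x))


def upper_no_na_check(values):
    # Same shape as the list comprehension above: assumes every element is a str.
    return [x.upper() for x in values]


def upper_with_na_check(values):
    # Pays an extra check per element, but leaves missing values as they are.
    return [x if is_missing(x) else x.upper() for x in values]


clean = ["aaaaaaaaaa", "bbbbbbbb", "ccccc", "dddd"]
with_missing = ["aaa", "bbb", float("nan"), None]

clean_result = upper_no_na_check(clean)
checked_result = upper_with_na_check(with_missing)

# The unchecked version fails as soon as it meets a missing value.
try:
    upper_no_na_check(with_missing)
    unchecked_failed = False
except AttributeError:
    unchecked_failed = True
```

On all-string input the two functions return the same result, so any timing gap between them reflects only the per-element check, which is the kind of overhead the unchecked comparison above sidesteps.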

jorisvandenbossche · 2022-01-21T09:51:59Z

Now that users have the option to use the Arrow-backed string dtype if they want better performance, it might not be necessary to keep this issue open?

jbrockmendel · 2023-02-11T21:13:29Z

I agree with Joris, closing as "supported via pyarrow"

chris-b1 added the Difficulty Intermediate, Performance (memory or execution speed performance), and Strings (string extension data type and string data) labels May 30, 2017

chris-b1 added this to the Next Major Release milestone May 30, 2017

jbrockmendel removed Effort Medium labels Oct 21, 2019

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

jbrockmendel closed this as completed Feb 11, 2023
