PERF: cythonize vectorized string routines #16542

Closed
chris-b1 opened this issue May 30, 2017 · 5 comments
Labels
Performance (memory or execution speed performance), Strings (string extension data type and string data)

Comments

@chris-b1
Contributor

Not a night-and-day improvement, since all we're doing is removing some Python overhead, but there does seem to be a 2x+ speedup to be picked up. Possibly could use some of the template machinery to make these easy to write.

I wouldn't consider this high priority given long-term plans to replace the string dtype, but it could be worth it.

import cython
import numpy as np
import pandas as pd
%load_ext cython

s = pd.Series(np.random.choice(['aaaaaaaaaa', 'bbbbbbbb', 'ccccc',
                                'dddd'], size=20000).astype('O'))

%%cython
from numpy cimport *
import numpy as np

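def fast_upper(ndarray values):
    # Uppercase every element of an object array of str, with no NA handling.
    cdef:
        Py_ssize_t i, n = values.shape[0]
        ndarray output = np.empty_like(values)
        str val
    for i in range(n):
        # Read the i-th element (not element 0 on every iteration).
        val = values[i]
        output[i] = val.upper()
    return output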

%timeit s.str.upper()
100 loops, best of 3: 4.94 ms per loop

%timeit pd.Series(fast_upper(s.values), index=s.index)
100 loops, best of 3: 2.02 ms per loop
chris-b1 added the Difficulty Intermediate, Performance, and Strings labels on May 30, 2017
chris-b1 added this to the Next Major Release milestone on May 30, 2017
@jreback
Contributor

jreback commented May 30, 2017

You can actually get even better perf by using C functions and maybe even releasing the GIL (though this is a bit trickier code).

@jreback
Contributor

jreback commented May 30, 2017

xref to #4694

@chris-b1
Contributor Author

Yeah, it looks like the cythonization isn't really what's helping in my example; it's the avoidance of NA checks.

In [27]: %timeit pd.Series([x.upper() for x in s], index=s.index)
100 loops, best of 3: 2.74 ms per loop
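
To make the NA-check point concrete, here is a minimal pure-Python sketch of the two code paths. It is not how pandas implements `s.str.upper()`; the helper names `is_missing`, `upper_with_na_check`, and `upper_no_check` are hypothetical and exist only for illustration. The checked version pays for an extra test on every element, while the unchecked version assumes every value is a `str`.

```python
import math


def is_missing(value):
    # Hypothetical stand-in for a per-element NA check: None or a float NaN.
    return value is None or (isinstance(value, float) and math.isnan(value))


def upper_with_na_check(values):
    # Test each element first and pass missing values through unchanged.
    return [v if is_missing(v) else v.upper() for v in values]


def upper_no_check(values):
    # Assume every element is a str; None or NaN raises AttributeError.
    return [v.upper() for v in values]


data = ["aaaaaaaaaa", "bbbbbbbb", None, "dddd"]
checked = upper_with_na_check(data)
```

The unchecked version is only safe when the data is known to contain no missing values, which is the case for the `s` built above but not for string columns in general.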

@jorisvandenbossche
Member

Now that users have the option to use the Arrow-backed string dtype if they want better performance, it might not be necessary to keep this issue open?

mroeschke removed this from the Contributions Welcome milestone on Oct 13, 2022
@jbrockmendel
Member

I agree with Joris, closing as "supported via pyarrow"
