(WIP) PERF: improve .str perf for all-string values (about 2x-) #10135

Closed
wants to merge 1 commit into from

sinhrks
Member

@sinhrks sinhrks commented May 14, 2015

Related to #10081. Add a fast path that uses numpy's string functions when all the target values are strings. Otherwise, use the current path.
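A minimal sketch of the dispatch idea described above, not the code in this PR. The helper names `is_all_str` and `lower_with_fast_path` are hypothetical, a plain `isinstance` scan stands in for `lib.is_string_array`, and the fallback's mapping of non-strings to NaN is an assumption about what the current path does.

```python
import numpy as np


def is_all_str(values):
    """Return True if every element of the array is a Python str."""
    return all(isinstance(v, str) for v in values)


def lower_with_fast_path(values):
    """Lowercase an object array, using numpy's string routines when possible."""
    values = np.asarray(values, dtype=object)
    if is_all_str(values):
        # Fast path: convert to a fixed-width unicode array and let
        # np.char.lower operate on the whole array at once.
        return np.char.lower(values.astype(str)).astype(object)
    # Fallback: element-wise loop that maps non-strings to NaN (assumed behavior).
    return np.array(
        [v.lower() if isinstance(v, str) else np.nan for v in values],
        dtype=object,
    )


fast = lower_with_fast_path(["AbC", "xYz"])  # takes the numpy path
slow = lower_with_fast_path(["AbC", 1])      # takes the fallback loop
```

Both branches return an object array, so callers would not need to know which path ran.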

The following are the current comparison results:

import pandas as pd
import numpy as np
import string
import random

# random.choice draws from the stdlib RNG, so seed that one for reproducibility
random.seed(1)
s = [''.join([random.choice(string.ascii_letters + string.digits) for i in range(3)]) for i in range(1000000)]

# s_str uses short path
s_str = pd.Series(s)

# replace the last element with a non-string so the Series holds mixed objects
s[-1] = 1

# s_obj uses current path
s_obj = pd.Series(s)
%timeit s_str.str.lower()
#1 loops, best of 3: 696 ms per loop
%timeit s_obj.str.lower()
#1 loops, best of 3: 1.46 s per loop

%timeit s_str.str.split('a')
#1 loops, best of 3: 1.55 s per loop
%timeit s_obj.str.split('a')
#1 loops, best of 3: 3.52 s per loop

The logic adds the overhead of checking whether the target values are all strings, using lib.is_string_array. But this should still be a speed-up in most cases, because the check takes much less time than the string ops themselves, and (I believe) the values are all strings in most cases.

%timeit pd.lib.is_string_array(s_str.values)
#10 loops, best of 3: 21.9 ms per loop
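The overhead argument above can be checked with a small timing sketch like the one below. It is an illustration only: `is_all_str` and `lower_loop` are hypothetical helpers, and the pure-Python scan is slower than the compiled `lib.is_string_array`, so the ratio it prints will not match the numbers quoted in this thread.

```python
import random
import string
import timeit

import numpy as np

random.seed(1)
alphabet = string.ascii_letters + string.digits
# 100,000 random 3-character strings stored in an object array
values = np.array(
    ["".join(random.choice(alphabet) for _ in range(3)) for _ in range(100_000)],
    dtype=object,
)


def is_all_str(arr):
    """Stand-in for the all-string check (pure Python, not lib.is_string_array)."""
    return all(isinstance(v, str) for v in arr)


def lower_loop(arr):
    """Element-wise lowercasing, used here as a representative string op."""
    return np.array([v.lower() for v in arr], dtype=object)


# Take the best of three single runs for each callable
check_time = min(timeit.repeat(lambda: is_all_str(values), number=1, repeat=3))
op_time = min(timeit.repeat(lambda: lower_loop(values), number=1, repeat=3))
print(f"check: {check_time * 1e3:.1f} ms, lower: {op_time * 1e3:.1f} ms")
```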

If this looks OK, I'll extend it to all the functions that numpy supports.

@sinhrks sinhrks added the Performance and Strings labels May 14, 2015
@sinhrks sinhrks added this to the 0.17.0 milestone May 14, 2015
@sinhrks sinhrks force-pushed the str_perf branch 2 times, most recently from 8b39ce5 to 80f436e Compare May 14, 2015 15:14
@sinhrks
Member Author

sinhrks commented May 14, 2015

Ah, I noticed the comparison above is not fair; preparing valid ones...

@sinhrks
Member Author

sinhrks commented May 14, 2015

I mistakenly thought the difference between str and object was caused by the numpy logic. Since numpy appears to use logic similar to pandas', we cannot expect such a performance gain... Allow me to close this.

@sinhrks sinhrks closed this May 14, 2015
@jorisvandenbossche
Member

Where did the initial speed-up come from then?

@sinhrks
Member Author

sinhrks commented May 17, 2015

The difference above already exists on current master, depending on the dtypes handled. I mistakenly thought it was caused by the logic I changed, but it is not.

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.17.0, 0.16.2, No action Jun 2, 2015
@sinhrks sinhrks deleted the str_perf branch November 13, 2015 05:40
2 participants