PERF: Vectorized string operations are slower than for-loops #35864

Open
3 tasks done
Nick-Morgan opened this issue Aug 23, 2020 · 10 comments
Labels
Performance Memory or execution speed performance Strings String extension data type and string data

Comments

@Nick-Morgan

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

In [1]:

import pandas as pd
import numpy as np
print(pd.__version__)

non_padded = pd.Series(np.random.randint(100, 99999, size=10000))

def for_loop(series):
    return pd.Series([str(zipcode).zfill(5) for zipcode in series])

def vectorized(series):
    return series.astype(str).str.zfill(5)

%timeit for_loop(non_padded)
%timeit vectorized(non_padded)

Out [1]:

0.25.1
3.32 ms ± 44.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.18 ms ± 60.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Problem description

In most cases, using a for-loop in pandas is much slower than its vectorized equivalent. However, the above operation takes over twice as long when using vectorization. I have replicated this issue on macOS and Ubuntu.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.3
dateutil : 2.8.0
pip : 20.0.2
setuptools : 41.4.0
Cython : 0.29.13
pytest : 5.2.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.9
tables : 3.5.2
xarray : 0.14.0
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1

@Nick-Morgan Nick-Morgan added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 23, 2020
@dsaxton dsaxton added Performance Memory or execution speed performance Strings String extension data type and string data and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 23, 2020
@dsaxton
Member

dsaxton commented Aug 23, 2020

This is an old version of pandas; if you upgrade, the difference doesn't seem to be as large.

In [1]: import pandas as pd
   ...: import numpy as np
   ...:
   ...: print(pd.__version__)
   ...:
   ...: non_padded = pd.Series(np.random.randint(100, 99999, size=10000))
   ...:
   ...: %timeit pd.Series([str(zipcode).zfill(5) for zipcode in non_padded])
   ...: %timeit non_padded.astype(str).str.zfill(5)
   ...:
1.1.1
3.44 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.44 ms ± 85.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It's worth pointing out though that the string accessor method is doing a bit more work than the pure Python version, e.g., trying to handle any missing values that it might encounter.

In [3]: ser = pd.Series([None, "1", "2"])

In [4]: ser.str.zfill(5)
Out[4]:
0     None
1    00001
2    00002
dtype: object

In [5]: pd.Series([x.zfill(5) for x in ser])
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-35047bd45c0a> in <module>
----> 1 pd.Series([x.zfill(5) for x in ser])

<ipython-input-5-35047bd45c0a> in <listcomp>(.0)
----> 1 pd.Series([x.zfill(5) for x in ser])

AttributeError: 'NoneType' object has no attribute 'zfill'
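For reference, the loop can be made null-safe to match the accessor's behavior, at the cost of an explicit check per element (a minimal sketch):

```python
import pandas as pd

ser = pd.Series([None, "1", "2"])

# Null-safe equivalent of ser.str.zfill(5): pass missing values through
# instead of raising AttributeError.
padded = pd.Series([x if pd.isna(x) else x.zfill(5) for x in ser])
print(padded.tolist())  # [None, '00001', '00002']
```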

@dsaxton dsaxton changed the title BUG: Vectorized string operations are slower than for-loops PERF: Vectorized string operations are slower than for-loops Aug 23, 2020
@asishm
Contributor

asishm commented Aug 24, 2020

On 1.1.0, I still see a big performance difference with the same test code:

In [65]: pd.__version__
Out[65]: '1.1.0'

In [66]: import pandas as pd
    ...: import numpy as np

In [67]: non_padded = pd.Series(np.random.randint(100, 99999, size=10000))
    ...: 
    ...: def for_loop(series):
    ...:     return pd.Series([str(zipcode).zfill(5) for zipcode in series])
    ...: 
    ...: def vectorized(series):
    ...:     return series.astype(str).str.zfill(5)
    ...:

In [68]: %timeit for_loop(non_padded)
5.96 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [69]: %timeit vectorized(non_padded)
12.1 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upgraded to 1.1.1 and I seem to get results similar to @dsaxton's, so there might have been some changes.

In [1]: import pandas as pd
   ...: import numpy as np
   ...: print(pd.__version__)
   ...:
   ...: non_padded = pd.Series(np.random.randint(100, 99999, size=10000))
   ...:
   ...: def for_loop(series):
   ...:     return pd.Series([str(zipcode).zfill(5) for zipcode in series])
   ...:
   ...: def vectorized(series):
   ...:     return series.astype(str).str.zfill(5)
   ...:
   ...: %timeit for_loop(non_padded)
   ...: %timeit vectorized(non_padded)
1.1.1
6.06 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.4 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@dsaxton
Member

dsaxton commented Aug 24, 2020

I think #35519 may have caused the performance boost. It makes sense since the main bottleneck looks like it was the astype (which was patched in that change):

In [1]: import pandas as pd
   ...: import numpy as np
   ...:
   ...: print(pd.__version__)
   ...:
   ...: non_padded = pd.Series(np.random.randint(100, 99999, size=10000)).astype(str)  # Do the astype first
   ...:
   ...: %timeit non_padded.str.zfill(5)
   ...: %timeit pd.Series([z.zfill(5) for z in non_padded])
   ...:
1.1.0
2.14 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.69 ms ± 84.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
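The two steps can also be timed separately with the stdlib timeit module to confirm where the cost sits (a sketch; absolute timings will vary by machine and pandas version):

```python
import timeit

import numpy as np
import pandas as pd

non_padded = pd.Series(np.random.randint(100, 99999, size=10_000))
as_str = non_padded.astype(str)

# Time the int -> str conversion and the zfill step independently.
t_astype = timeit.timeit(lambda: non_padded.astype(str), number=100)
t_zfill = timeit.timeit(lambda: as_str.str.zfill(5), number=100)
print(f"astype(str): {t_astype:.3f}s  str.zfill(5): {t_zfill:.3f}s  (100 runs each)")
```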

@bashtage
Contributor

If you remove the astype:

def vectorized2(series):
    return series.str.zfill(5)

np2 = non_padded.astype(str)

%timeit vectorized2(np2)
1.68 ms ± 8.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit for_loop(non_padded)
2.63 ms ± 30.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

so there appears to be no issue with string operations.

@bashtage bashtage added Dtype Conversions Unexpected or buggy dtype conversions and removed Strings String extension data type and string data labels Aug 24, 2020
@asishm
Contributor

asishm commented Sep 12, 2020

@bashtage I don't think that is an equivalent comparison. You are comparing str.zfill on a string series with a for loop that also does the string conversion. If you convert to str first, the for loop still beats str.zfill, albeit by a small margin.

I get similar results in master as well.

In [20]: %timeit pd.Series([k.zfill(5) for k in np2])
    ...: %timeit np2.str.zfill(5)
2.35 ms ± 48.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.75 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@topper-123
Contributor

topper-123 commented Sep 13, 2020

I can verify OP's results.

However, doing just the type conversion (not zfill) is faster vectorized (in master; it's probably slower in v1.1.1):

import pandas as pd
import numpy as np
print(pd.__version__)

non_padded = pd.Series(np.random.randint(100, 99999, size=10000))

# type conversion and zfill, vectorized slower
%timeit pd.Series([str(zipcode).zfill(5) for zipcode in non_padded])
5.2 ms ± 91.5 µs per loop
%timeit non_padded.astype(str).str.zfill(5)
6.52 ms ± 78.5 µs per loop

# only type conversion, vectorized faster
%timeit pd.Series([str(zipcode) for zipcode in non_padded])
4.11 ms ± 58.4 µs per loop
%timeit non_padded.astype(str)
2.47 ms ± 21.6 µs per loop

# only zfill, vectorized slower
non_padded_str = pd.Series(np.random.randint(100, 99999, size=10000)).astype(str)
%timeit pd.Series([zipcode.zfill(5) for zipcode in non_padded_str])
2.71 ms ± 22.4 µs per loop
%timeit non_padded_str.str.zfill(5)
3.18 ms ± 27.4 µs per loop

So it must be something in the pandas Series.str.zfill method, not Series.astype, that should be optimized.

PRs welcome, of course.
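One way to see where Series.str.zfill spends its time beyond the plain str.zfill calls (NA handling, result construction) is to profile a single accessor call. A sketch using the stdlib profiler:

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

ser = pd.Series(np.random.randint(100, 99999, size=10_000)).astype(str)

# Profile one accessor call and print the ten most expensive functions.
profiler = cProfile.Profile()
profiler.enable()
padded = ser.str.zfill(5)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```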

@itamarst

itamarst commented Dec 1, 2020

Note that I have seen the same thing with other .str methods, so I don't think this is specific to zfill.

import time

import pandas as pd

STRINGS = ["{} hello world how are you".format(i) for i in range(1_000_000)]
SERIES = pd.Series(STRINGS)

def measure(what, f):
    start = time.time()
    f()
    print(f"{what} elapsed: {time.time() - start}")

def replace_then_upper(s: str) -> str:
    return s.replace("l", "").upper()

measure("Pandas apply()", lambda: SERIES.apply(replace_then_upper))
measure(
    "Pandas .str",
    lambda: SERIES.str.replace("l", "").str.upper(),
)

Results:

Pandas apply() elapsed: 0.39374780654907227
Pandas .str elapsed: 0.7356271743774414

This is with Pandas 1.1.4.

If I do just replace() it's not quite as bad, but .str is still slower.
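One possible confound for the replace() comparison: in this pandas version, Series.str.replace treats the pattern as a regular expression by default, whereas str.replace inside apply() is a plain literal substitution. Passing regex=False makes it a literal-vs-literal comparison (a sketch):

```python
import pandas as pd

series = pd.Series(["{} hello world".format(i) for i in range(1000)])

# Literal replacement on both sides of the comparison.
via_apply = series.apply(lambda s: s.replace("l", ""))
via_str = series.str.replace("l", "", regex=False)

assert via_apply.equals(via_str)
```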

@lithomas1 lithomas1 added Strings String extension data type and string data and removed Dtype Conversions Unexpected or buggy dtype conversions labels Apr 1, 2021
@gwerbin

gwerbin commented Aug 26, 2021

For what it's worth, I can't replicate this issue with .zfill alone on macOS with pandas 1.3.2 and Python 3.9.6: https://gist.github.com/gwerbin/263e92f9c2fca9ff6487ce3e1ac3d7f7

The code is kind of ugly, but the TL;DR is that, for the zfill operation in the original report, reproducing the .str accessor's behavior (including null handling) via .tolist, .array, .map, or .apply is slower than using the .str accessor itself.

% python ./pd_str_slow.py

.tolist, dtype="object" 0.52
.array, dtype="object" 1.04
.apply, dtype="object" 0.72
.map, dtype="object" 0.70
.str accessor, dtype="object" 0.35

.tolist, dtype="string" 1.00
.array, dtype="string" 2.10
.apply, dtype="string" 0.68
.map, dtype="string" 0.85
.str accessor, dtype="string" 0.54

Based on this benchmark, I'd conclude that str.zfill is fast and that .astype(str) is slow. This appears to contradict the findings of #35864 (comment), so it's hard to draw a strong conclusion.

@itamarst

itamarst commented Aug 28, 2021

The script I ran just used apply(), so focusing on that: the reason this second example from @gwerbin shows .str as faster than .apply() is because of the contents of the function passed to apply(). In particular, it's doing an extra if statement on each call:

def one_zfill(x):
    return x if is_null(x) else x.zfill(5)

time('.apply, dtype="string"', lambda: data_s.apply(one_zfill))

If however I run without an if statement:

time('.apply, dtype="string"', lambda: data_s.apply(lambda s: s.zfill(5)))

then that is faster than .str.zfill().

That explains the discrepancy. Since it's possible to rule out nulls in advance, avoiding the if when benchmarking seems a reasonable apples-to-apples comparison to me, but I could see the argument going the other way.

To summarize the numbers on my computer:

  • apply() of x if is_null(x) else x.zfill(5) statement: 0.54 sec
  • .str.zfill(5): 0.21 sec
  • apply() of x.zfill(5): 0.16 sec

@gwerbin

gwerbin commented Aug 28, 2021

Indeed, I probably wasn't clear enough in my post. I added the extra logic specifically to emulate what the .str accessor does. I actually made a mistake in the .map version, because I used both the if check and na_action='ignore'. Removing the former and using only the latter, .map is about as fast as the .str accessor.

I updated the Gist to reflect this.

New timings:

% python ./pd_str_slow.py

.tolist, dtype="object" 0.61
.array, dtype="object" 1.01
.apply, dtype="object" 0.55
.map, dtype="object" 0.20
.str accessor, dtype="object" 0.29

.tolist, dtype="string" 0.94
.array, dtype="string" 1.00
.apply, dtype="string" 0.53
.map, dtype="string" 0.28
.str accessor, dtype="string" 0.27
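The na_action variant mentioned above can be sketched as follows; it skips missing values inside map itself, mirroring the .str accessor's null handling without a per-element if:

```python
import pandas as pd

ser = pd.Series([None, "1", "2"])

# na_action='ignore' propagates missing values untouched, so the mapped
# function never sees a null.
result = ser.map(lambda s: s.zfill(5), na_action="ignore")
print(result)
```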
