PERF: Vectorized string operations are slower than for-loops #35864

Open
3 tasks done
Nick-Morgan opened this issue Aug 23, 2020 · 10 comments
Labels
Performance Memory or execution speed performance Strings String extension data type and string data

Comments

@Nick-Morgan

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

In [1]:

import pandas as pd
import numpy as np
print(pd.__version__)

non_padded = pd.Series(np.random.randint(100, 99999, size=10000))

def for_loop(series):
    return pd.Series([str(zipcode).zfill(5) for zipcode in series])

def vectorized(series):
    return series.astype(str).str.zfill(5)

%timeit for_loop(non_padded)
%timeit vectorized(non_padded)

Out [1]:

0.25.1
3.32 ms ± 44.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.18 ms ± 60.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Problem description

In most cases, using a for-loop in pandas is much slower than its vectorized equivalent. However, the above operation takes over twice as long when using vectorization. I have replicated this issue on macOS and Ubuntu.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.3
dateutil : 2.8.0
pip : 20.0.2
setuptools : 41.4.0
Cython : 0.29.13
pytest : 5.2.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.9
tables : 3.5.2
xarray : 0.14.0
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1

@Nick-Morgan Nick-Morgan added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 23, 2020
@dsaxton dsaxton added Performance Memory or execution speed performance Strings String extension data type and string data and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 23, 2020
@dsaxton
Member

dsaxton commented Aug 23, 2020

This is an old version of pandas; if you upgrade, the difference doesn't seem to be as large.

In [1]: import pandas as pd
   ...: import numpy as np
   ...:
   ...: print(pd.__version__)
   ...:
   ...: non_padded = pd.Series(np.random.randint(100, 99999, size=10000))
   ...:
   ...: %timeit pd.Series([str(zipcode).zfill(5) for zipcode in non_padded])
   ...: %timeit non_padded.astype(str).str.zfill(5)
   ...:
1.1.1
3.44 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.44 ms ± 85.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

It's worth pointing out though that the string accessor method is doing a bit more work than the pure Python version, e.g., trying to handle any missing values that it might encounter.

In [3]: ser = pd.Series([None, "1", "2"])

In [4]: ser.str.zfill(5)
Out[4]:
0     None
1    00001
2    00002
dtype: object

In [5]: pd.Series([x.zfill(5) for x in ser])
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-35047bd45c0a> in <module>
----> 1 pd.Series([x.zfill(5) for x in ser])

<ipython-input-5-35047bd45c0a> in <listcomp>(.0)
----> 1 pd.Series([x.zfill(5) for x in ser])

AttributeError: 'NoneType' object has no attribute 'zfill'
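For reference, the loop can be made null-safe to match the accessor's behavior, at the cost of an explicit check per element (a minimal sketch):

```python
import pandas as pd

ser = pd.Series([None, "1", "2"])

# Null-safe equivalent of ser.str.zfill(5): pass missing values through
# instead of raising AttributeError.
padded = pd.Series([x if pd.isna(x) else x.zfill(5) for x in ser])
print(padded.tolist())  # [None, '00001', '00002']
```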

@dsaxton dsaxton changed the title BUG: Vectorized string operations are slower than for-loops PERF: Vectorized string operations are slower than for-loops Aug 23, 2020
@asishm
Contributor

asishm commented Aug 24, 2020

On 1.1.0, I still see a big performance difference with the same test code:

In [65]: pd.__version__
Out[65]: '1.1.0'

In [66]: import pandas as pd
    ...: import numpy as np

In [67]: non_padded = pd.Series(np.random.randint(100, 99999, size=10000))
    ...: 
    ...: def for_loop(series):
    ...:     return pd.Series([str(zipcode).zfill(5) for zipcode in series])
    ...: 
    ...: def vectorized(series):
    ...:     return series.astype(str).str.zfill(5)
    ...:

In [68]: %timeit for_loop(non_padded)
5.96 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [69]: %timeit vectorized(non_padded)
12.1 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upgraded to 1.1.1 and I seem to get results similar to @dsaxton's, so there might have been some changes.

In [1]: import pandas as pd
   ...: import numpy as np
   ...: print(pd.__version__)
   ...:
   ...: non_padded = pd.Series(np.random.randint(100, 99999, size=10000))
   ...:
   ...: def for_loop(series):
   ...:     return pd.Series([str(zipcode).zfill(5) for zipcode in series])
   ...:
   ...: def vectorized(series):
   ...:     return series.astype(str).str.zfill(5)
   ...:
   ...: %timeit for_loop(non_padded)
   ...: %timeit vectorized(non_padded)
1.1.1
6.06 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.4 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@dsaxton
Member

dsaxton commented Aug 24, 2020

I think #35519 may have caused the performance boost. It makes sense since the main bottleneck looks like it was the astype (which was patched in that change):

In [1]: import pandas as pd
   ...: import numpy as np
   ...:
   ...: print(pd.__version__)
   ...:
   ...: non_padded = pd.Series(np.random.randint(100, 99999, size=10000)).astype(str)  # Do the astype first
   ...:
   ...: %timeit non_padded.str.zfill(5)
   ...: %timeit pd.Series([z.zfill(5) for z in non_padded])
   ...:
1.1.0
2.14 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.69 ms ± 84.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
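The two steps can also be timed separately with the stdlib timeit module to confirm where the cost sits (a sketch; absolute timings will vary by machine and pandas version):

```python
import timeit

import numpy as np
import pandas as pd

non_padded = pd.Series(np.random.randint(100, 99999, size=10_000))
as_str = non_padded.astype(str)

# Time the int -> str conversion and the zfill step independently.
t_astype = timeit.timeit(lambda: non_padded.astype(str), number=100)
t_zfill = timeit.timeit(lambda: as_str.str.zfill(5), number=100)
print(f"astype(str): {t_astype:.3f}s  str.zfill(5): {t_zfill:.3f}s  (100 runs each)")
```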

@bashtage
Contributor

If you remove the astype:

def vectorized2(series):
    return series.str.zfill(5)

np2 = non_padded.astype(str)

%timeit vectorized2(np2)
1.68 ms ± 8.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit for_loop(non_padded)
2.63 ms ± 30.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

so there appears to be no issue with string operations.

@bashtage bashtage added Dtype Conversions Unexpected or buggy dtype conversions and removed Strings String extension data type and string data labels Aug 24, 2020
@asishm
Contributor

asishm commented Sep 12, 2020

@bashtage I don't think that is an equivalent comparison. You are comparing str.zfill on a string series with a for loop that also does the string conversion. If you convert to str first, the for loop still beats str.zfill, albeit by a small margin.

I get similar results in master as well.

In [20]: %timeit pd.Series([k.zfill(5) for k in np2])
    ...: %timeit np2.str.zfill(5)
2.35 ms ± 48.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.75 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@topper-123
Contributor

topper-123 commented Sep 13, 2020

I can verify OP's results.

However, doing just the type conversion (not zfill) is faster vectorized (in master; it's probably slower in v1.1.1):

import pandas as pd
import numpy as np
print(pd.__version__)

non_padded = pd.Series(np.random.randint(100, 99999, size=10000))

# type conversion and zfill, vectorized slower
%timeit pd.Series([str(zipcode).zfill(5) for zipcode in non_padded])
5.2 ms ± 91.5 µs per loop
%timeit non_padded.astype(str).str.zfill(5)
6.52 ms ± 78.5 µs per loop

# only type conversion, vectorized faster
%timeit pd.Series([str(zipcode) for zipcode in non_padded])
4.11 ms ± 58.4 µs per loop
%timeit non_padded.astype(str)
2.47 ms ± 21.6 µs per loop

# only zfill, vectorized slower
non_padded_str = pd.Series(np.random.randint(100, 99999, size=10000)).astype(str)
%timeit pd.Series([zipcode.zfill(5) for zipcode in non_padded_str])
2.71 ms ± 22.4 µs per loop
%timeit non_padded_str.str.zfill(5)
3.18 ms ± 27.4 µs per loop

So it must be something in the pandas Series.str.zfill method, not Series.astype, that should be optimized.

PRs welcome, of course.
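One way to see where Series.str.zfill spends its time beyond the plain str.zfill calls (NA handling, result construction) is to profile a single accessor call. A sketch using the stdlib profiler:

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

ser = pd.Series(np.random.randint(100, 99999, size=10_000)).astype(str)

# Profile one accessor call and print the ten most expensive functions.
profiler = cProfile.Profile()
profiler.enable()
padded = ser.str.zfill(5)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```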

@itamarst

itamarst commented Dec 1, 2020

Note that I have seen the same thing with other .str methods, so I don't think this is specific to zfill.

import time

import pandas as pd

STRINGS = ["{} hello world how are you".format(i) for i in range(1_000_000)]
SERIES = pd.Series(STRINGS)

def measure(what, f):
    start = time.time()
    f()
    print(f"{what} elapsed: {time.time() - start}")

def replace_then_upper(s: str) -> str:
    return s.replace("l", "").upper()

measure("Pandas apply()", lambda: SERIES.apply(replace_then_upper))
measure(
    "Pandas .str",
    lambda: SERIES.str.replace("l", "").str.upper(),
)

Results:

Pandas apply() elapsed: 0.39374780654907227
Pandas .str elapsed: 0.7356271743774414

This is with Pandas 1.1.4.

If I do just replace() it's not quite as bad, but .str is still slower.
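One possible confound for the replace() comparison: in this pandas version, Series.str.replace treats the pattern as a regular expression by default, whereas str.replace inside apply() is a plain literal substitution. Passing regex=False makes it a literal-vs-literal comparison (a sketch):

```python
import pandas as pd

series = pd.Series(["{} hello world".format(i) for i in range(1000)])

# Literal replacement on both sides of the comparison.
via_apply = series.apply(lambda s: s.replace("l", ""))
via_str = series.str.replace("l", "", regex=False)

assert via_apply.equals(via_str)
```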

@lithomas1 lithomas1 added Strings String extension data type and string data and removed Dtype Conversions Unexpected or buggy dtype conversions labels Apr 1, 2021
@gwerbin

gwerbin commented Aug 26, 2021

For what it's worth, I can't replicate this issue with .zfill alone on macOS with pandas 1.3.2 and Python 3.9.6: https://gist.github.com/gwerbin/263e92f9c2fca9ff6487ce3e1ac3d7f7

The code is kind of ugly, but the TL;DR is that, for the zfill operation in the original report, reproducing the .str accessor's behavior (including null handling) via .tolist, .array, .map, or .apply is slower than using the .str accessor itself.

% python ./pd_str_slow.py

.tolist, dtype="object" 0.52
.array, dtype="object" 1.04
.apply, dtype="object" 0.72
.map, dtype="object" 0.70
.str accessor, dtype="object" 0.35

.tolist, dtype="string" 1.00
.array, dtype="string" 2.10
.apply, dtype="string" 0.68
.map, dtype="string" 0.85
.str accessor, dtype="string" 0.54

Based on this benchmark, I'd conclude that str.zfill is fast and that .astype(str) is slow. This appears to contradict the findings of #35864 (comment), so it's hard to draw a strong conclusion.

@itamarst

itamarst commented Aug 28, 2021

The script I ran just used apply(), so focusing on that: the reason this second example from @gwerbin shows .str as faster than .apply() is because of the contents of the function passed to apply(). In particular, it's doing an extra if statement on each call:

def one_zfill(x):
    return x if is_null(x) else x.zfill(5)

time('.apply, dtype="string"', lambda: data_s.apply(one_zfill))

If however I run without an if statement:

time('.apply, dtype="string"', lambda: data_s.apply(lambda s: s.zfill(5)))

then that is faster than .str.zfill().

That explains the discrepancy. Since it's possible to rule out nulls in advance, avoiding the if when benchmarking seems a reasonable apples-to-apples comparison to me, but I could see the argument going the other way.

To summarize the numbers on my computer:

  • apply() of x if is_null(x) else x.zfill(5) statement: 0.54 sec
  • .str.zfill(5): 0.21 sec
  • apply() of x.zfill(5): 0.16 sec

@gwerbin

gwerbin commented Aug 28, 2021

Indeed, I probably wasn't clear enough in my post. I added the extra logic specifically to emulate what the .str accessor does. I actually made a mistake in the .map version, because I used both the if check and na_action='ignore'. Removing the former and using only the latter, .map is about as fast as the .str accessor.

I updated the Gist to reflect this.

New timings:

% python ./pd_str_slow.py

.tolist, dtype="object" 0.61
.array, dtype="object" 1.01
.apply, dtype="object" 0.55
.map, dtype="object" 0.20
.str accessor, dtype="object" 0.29

.tolist, dtype="string" 0.94
.array, dtype="string" 1.00
.apply, dtype="string" 0.53
.map, dtype="string" 0.28
.str accessor, dtype="string" 0.27
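The na_action variant mentioned above can be sketched as follows; it skips missing values inside map itself, mirroring the .str accessor's null handling without a per-element if:

```python
import pandas as pd

ser = pd.Series([None, "1", "2"])

# na_action='ignore' propagates missing values untouched, so the mapped
# function never sees a null.
result = ser.map(lambda s: s.zfill(5), na_action="ignore")
print(result)
```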
