PERF: Vectorized string operations are slower than for-loops #35864
This is an old version of pandas; if you upgrade, the difference doesn't seem to be as large:

```
In [1]: import pandas as pd
   ...: import numpy as np
   ...:
   ...: print(pd.__version__)
   ...:
   ...: non_padded = pd.Series(np.random.randint(100, 99999, size=10000))
   ...:
   ...: %timeit pd.Series([str(zipcode).zfill(5) for zipcode in non_padded])
   ...: %timeit non_padded.astype(str).str.zfill(5)
   ...:
1.1.1
3.44 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.44 ms ± 85.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

It's worth pointing out, though, that the string accessor method is doing a bit more work than the pure Python version, e.g. trying to handle any missing values it might encounter:

```
In [3]: ser = pd.Series([None, "1", "2"])

In [4]: ser.str.zfill(5)
Out[4]:
0     None
1    00001
2    00002
dtype: object

In [5]: pd.Series([x.zfill(5) for x in ser])
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-35047bd45c0a> in <module>
----> 1 pd.Series([x.zfill(5) for x in ser])

<ipython-input-5-35047bd45c0a> in <listcomp>(.0)
----> 1 pd.Series([x.zfill(5) for x in ser])

AttributeError: 'NoneType' object has no attribute 'zfill'
```
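For a fairer comparison, the pure-Python side would need its own null handling. A minimal sketch (my addition, using `pd.isna` as the guard; not code from the comment above):

```python
import pandas as pd

ser = pd.Series([None, "1", "2"])

# Skip zfill for missing values, mirroring what the .str accessor
# does internally instead of raising AttributeError.
padded = pd.Series([x if pd.isna(x) else x.zfill(5) for x in ser])

print(padded.tolist())  # [None, '00001', '00002']
```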
---

On 1.1.0 I still see a big perf difference with the same test code:

```
In [65]: pd.__version__
Out[65]: '1.1.0'

In [66]: import pandas as pd
    ...: import numpy as np

In [67]: non_padded = pd.Series(np.random.randint(100, 99999, size=10000))
    ...:
    ...: def for_loop(series):
    ...:     return pd.Series([str(zipcode).zfill(5) for zipcode in series])
    ...:
    ...: def vectorized(series):
    ...:     return series.astype(str).str.zfill(5)
    ...:

In [68]: %timeit for_loop(non_padded)
5.96 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [69]: %timeit vectorized(non_padded)
12.1 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Upgraded to 1.1.1 and I get results similar to @dsaxton's, so there might have been some changes:

```
In [1]: import pandas as pd
   ...: import numpy as np
   ...: print(pd.__version__)
   ...:
   ...: non_padded = pd.Series(np.random.randint(100, 99999, size=10000))
   ...:
   ...: def for_loop(series):
   ...:     return pd.Series([str(zipcode).zfill(5) for zipcode in series])
   ...:
   ...: def vectorized(series):
   ...:     return series.astype(str).str.zfill(5)
   ...:
   ...: %timeit for_loop(non_padded)
   ...: %timeit vectorized(non_padded)
1.1.1
6.06 ms ± 109 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.4 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
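As a sanity check (my addition, not part of the benchmark above), the two paths should agree element-for-element before their speed is compared:

```python
import numpy as np
import pandas as pd

non_padded = pd.Series(np.random.randint(100, 99999, size=10_000))

def for_loop(series):
    # Pure-Python path: str() + zfill per element.
    return pd.Series([str(zipcode).zfill(5) for zipcode in series])

def vectorized(series):
    # Vectorized path: astype then the .str accessor.
    return series.astype(str).str.zfill(5)

# Both should produce identical string Series.
assert for_loop(non_padded).equals(vectorized(non_padded))
print("outputs match")
```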
---

I think #35519 may have caused the performance boost. It makes sense, since the main bottleneck looks like it was the astype (which was patched in that change):

```
In [1]: import pandas as pd
   ...: import numpy as np
   ...:
   ...: print(pd.__version__)
   ...:
   ...: non_padded = pd.Series(np.random.randint(100, 99999, size=10000)).astype(str)  # Do the astype first
   ...:
   ...: %timeit non_padded.str.zfill(5)
   ...: %timeit pd.Series([z.zfill(5) for z in non_padded])
   ...:
1.1.0
2.14 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.69 ms ± 84.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
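The same decomposition can be reproduced outside IPython with the stdlib `timeit` module (a sketch; the `number=50` repeat count is arbitrary):

```python
import timeit

import numpy as np
import pandas as pd

non_padded = pd.Series(np.random.randint(100, 99999, size=10_000))
as_str = non_padded.astype(str)  # pay the conversion cost once, up front

# Time each step in isolation to see where the cost lives.
t_astype = timeit.timeit(lambda: non_padded.astype(str), number=50)
t_vec = timeit.timeit(lambda: as_str.str.zfill(5), number=50)
t_loop = timeit.timeit(
    lambda: pd.Series([z.zfill(5) for z in as_str]), number=50
)

print(f"astype only:     {t_astype:.4f}s")
print(f".str.zfill:      {t_vec:.4f}s")
print(f"list-comp zfill: {t_loop:.4f}s")
```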
---

If you remove the `astype` conversion, there appears to be no issue with the string operations themselves.
---

@bashtage I don't think that is an equivalent comparison; you are comparing [...]. I get similar results on master as well:

```
In [20]: %timeit pd.Series([k.zfill(5) for k in np2])
    ...: %timeit np2.str.zfill(5)
2.35 ms ± 48.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.75 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
---

I can verify the OP's results. However, doing just the type conversion (not the `zfill`), the vectorized version is faster:

```
import pandas as pd
import numpy as np

print(pd.__version__)

non_padded = pd.Series(np.random.randint(100, 99999, size=10000))

# type conversion and zfill, vectorized slower
%timeit pd.Series([str(zipcode).zfill(5) for zipcode in non_padded])
5.2 ms ± 91.5 µs per loop
%timeit non_padded.astype(str).str.zfill(5)
6.52 ms ± 78.5 µs per loop

# only type conversion, vectorized faster
%timeit pd.Series([str(zipcode) for zipcode in non_padded])
4.11 ms ± 58.4 µs per loop
%timeit non_padded.astype(str)
2.47 ms ± 21.6 µs per loop

# only zfill, vectorized slower
non_padded_str = pd.Series(np.random.randint(100, 99999, size=10000)).astype(str)
%timeit pd.Series([zipcode.zfill(5) for zipcode in non_padded_str])
2.71 ms ± 22.4 µs per loop
%timeit non_padded_str.str.zfill(5)
3.18 ms ± 27.4 µs per loop
```

So it must be something in the pandas `.str` machinery. PRs welcome, of course.
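If the overhead is in the pandas `.str` machinery itself, one possible workaround (my suggestion, not something proposed in the thread) is NumPy's `np.char` routines, which operate on a plain string array:

```python
import numpy as np
import pandas as pd

non_padded = pd.Series(np.random.randint(100, 99999, size=10_000))

# Convert to a fixed-width numpy unicode array and zero-fill in one call;
# np.char.zfill avoids the per-element dispatch done by Series.str.
arr = non_padded.to_numpy().astype(str)
padded = pd.Series(np.char.zfill(arr, 5), index=non_padded.index)

print(padded.head())
```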
---

Note that I have seen the same thing with other string operations:

```python
import time

import pandas as pd

STRINGS = ["{} hello world how are you".format(i) for i in range(1_000_000)]
SERIES = pd.Series(STRINGS)

def measure(what, f):
    start = time.time()
    f()
    print(f"{what} elapsed: {time.time() - start}")

def replace_then_upper(s: str) -> str:
    return s.replace("l", "").upper()

measure("Pandas apply()", lambda: SERIES.apply(replace_then_upper))
measure(
    "Pandas .str",
    lambda: SERIES.str.replace("l", "").str.upper(),
)
```

Results: [...]

This is with Pandas 1.1.4. If I do just [...]
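One plausible explanation (my reading, not confirmed in the thread) is that each chained `.str` call materializes a full intermediate Series, while a single-pass function does both operations per element. A smaller sketch of that comparison, using `regex=False` for a literal replace:

```python
import time

import pandas as pd

SERIES = pd.Series([f"{i} hello world how are you" for i in range(100_000)])

def measure(what, f):
    start = time.time()
    result = f()
    print(f"{what} elapsed: {time.time() - start:.3f}s")
    return result

# One pass per element: replace and upper done together.
a = measure(
    "list comp", lambda: pd.Series([s.replace("l", "").upper() for s in SERIES])
)

# Two passes, with an intermediate Series between them.
b = measure(
    "chained .str", lambda: SERIES.str.replace("l", "", regex=False).str.upper()
)
```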
---

For what it's worth, I can't replicate this issue with [...]. The code is kind of ugly, but the TL;DR is that, to obtain behavior that is equivalent to [...].

Based on this benchmark, I'd conclude that [...].
---

The script I ran just used:

```python
def one_zfill(x):
    # is_null and time are helpers defined elsewhere in the benchmark script
    return x if is_null(x) else x.zfill(5)

time('.apply, dtype="string"', lambda: data_s.apply(one_zfill))
```

If, however, I run without the if statement:

```python
time('.apply, dtype="string"', lambda: data_s.apply(lambda s: s.zfill(5)))
```

then that is faster than [...]. That explains the discrepancy. Since it's possible to remove the possibility of nulls in advance, avoiding the [...].

To summarize the numbers on my computer: [...]
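The "remove nulls in advance" idea can be sketched as follows (my example: `dropna` up front, then a bare `zfill` with no per-element guard; this assumes the dropped rows aren't needed downstream):

```python
import pandas as pd

data_s = pd.Series(["1", None, "42", "7", None], dtype="string")

# Drop nulls once, so the mapped function needs no null check at all.
clean = data_s.dropna()
padded = clean.apply(lambda s: s.zfill(5))

print(padded.tolist())  # ['00001', '00042', '00007']
```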
---

Indeed, I probably wasn't clear enough in my post. I added the extra logic specifically to try to emulate what the `.str` accessor does (its null handling). I updated the Gist to reflect this. New timings: [...]
---

- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
In [1]:
Out [1]:
Problem description
In most cases, using a for-loop in pandas is much slower than its vectorized equivalent. However, the above operation takes over twice as long when using vectorization. I have replicated this issue on macOS & Ubuntu.

Output of `pd.show_versions()`:
INSTALLED VERSIONS
commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.3
dateutil : 2.8.0
pip : 20.0.2
setuptools : 41.4.0
Cython : 0.29.13
pytest : 5.2.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.9
tables : 3.5.2
xarray : 0.14.0
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1