(df==0) vs. (df.values==0) performance penalty #28056

Closed
trendelkampschroer opened this issue Aug 21, 2019 · 5 comments



trendelkampschroer commented Aug 21, 2019

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
T = 5000
S = 1000
X = np.random.randint(0, 10, size=(T, S))
df = pd.DataFrame(X, index=pd.date_range("2001-01-01", periods=T, freq="D"))
%timeit (df == 0)
327 ms ± 4.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit (df.values == 0)
2.85 ms ± 36.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Problem description

I am using the boolean frame df == 0 to find and set zeros in a given DataFrame df, i.e.
df[df == 0] = FOO, where FOO is some nonzero integer.

I found that using the expression df[df.values == 0] = FOO gives a significant speed-up.

I am wondering whether the large overhead of (df == 0) compared to (df.values == 0) is to be expected. If so, in which situations should I prefer the expression df[df == 0] over the much faster df[df.values == 0]?
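For reference, a minimal self-contained sketch of the two masking variants (FOO here is an arbitrary nonzero placeholder; both produce the same result):

import numpy as np
import pandas as pd

T, S = 5000, 1000
FOO = -1  # arbitrary nonzero placeholder

X = np.random.randint(0, 10, size=(T, S))
df = pd.DataFrame(X, index=pd.date_range("2001-01-01", periods=T, freq="D"))

# Variant 1: mask with a boolean DataFrame (the slow path reported here)
df1 = df.copy()
df1[df1 == 0] = FOO

# Variant 2: mask with the raw boolean ndarray (the fast path)
df2 = df.copy()
df2[df2.values == 0] = FOO

assert df1.equals(df2)  # identical results, very different timings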

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None

pandas: 0.24.2
pytest: None
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.6
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.6.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@jbrockmendel (Member)

The df == 0 operation took a performance hit as part of a bugfix a few months ago. If all goes well, it will be fixed in the next few weeks.

@trendelkampschroer (Author)

Ok, thanks for the quick response.

@TomAugspurger (Contributor)

xref #24990 for that issue.

@jorisvandenbossche (Member)

I revived my experiment from some time ago related to #24990 and briefly looked into this example.

It is indeed a slowdown compared to 0.23 and earlier. A timing with that version:

In [10]: pd.__version__  
Out[10]: '0.23.4'

In [11]: %timeit df == 0
15.2 ms ± 444 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Then let's do some timings on master:

For me, the operation takes around 220 ms (so a bit less than the OP's 327 ms), but clearly slower than 0.23:

In [2]: %timeit df == 0
201 ms ± 5.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Compared to numpy on a 2D array or a set of 1D arrays:

In [4]: arr = df.values 

In [5]: %timeit arr == 0 
2.77 ms ± 38.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: arrays = [df[col].values for col in df.columns]

In [7]: %timeit [a == 0 for a in arrays] 
29.2 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So it is clear that doing things column-wise is slower, even at this low level without any of the additional pandas overhead. But the 29 ms is of course still a lot less than the 200 ms of the pandas operation. In a "column-wise-ops" world, that 29 ms is the lower bound we can target (with current pandas).
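As a rough illustration of where that column-wise lower bound comes from (a simplified sketch reusing the df from above, not pandas' actual ops code): each column's underlying ndarray is compared separately, and the per-column results are reassembled into a new frame.

# Simplified sketch of the column-wise path, not pandas' actual ops code.
def columnwise_eq(df, value):
    # Compare each column's underlying ndarray separately ...
    arrays = [df[col].values == value for col in df.columns]
    # ... then rebuild a boolean DataFrame from the per-column results.
    return pd.DataFrame(dict(zip(df.columns, arrays)), index=df.index)

result = columnwise_eq(df, 0)
assert result.equals(df == 0)  # same answer, paid for column by column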

Now with a few tweaks to the ops code (a6c0d02), I get:

In [2]: %timeit df == 0
38.8 ms ± 948 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So that is not much on top of the numpy column-wise timing, and compared to 0.23 it is "only" a 2.5x slowdown (but of course, still a slowdown).

I don't think it can get much better than the above 39 ms (when using column-wise ops), but the tweaks I made to get from 200 ms down to 39 ms are relatively simple: iterate over arrays instead of columns, and disable the checks when recreating the result (in this case we know we have properly sized arrays of equal length). See a6c0d02.
There is only one "cheat" targeted at this specific case: moving a bool check in get_block_type to the front of an if/elif chain. With a bit more effort we might be able to make this faster in general (dtype checks are expensive ...).
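To illustrate the kind of reordering meant here, a schematic sketch (this is not pandas' actual get_block_type, just the shape of the idea): every comparison result is a bool ndarray, so checking the cheap bool case first skips the more expensive dtype checks on this hot path.

import numpy as np

# Schematic sketch only -- not pandas' actual get_block_type implementation.
def get_block_type_sketch(values):
    dtype = values.dtype
    if dtype == np.bool_:
        # Cheap and common: every comparison op produces bool arrays,
        # so this check goes first to skip the costlier ones below.
        return "bool block"
    elif np.issubdtype(dtype, np.datetime64):
        return "datetime block"
    elif np.issubdtype(dtype, np.number):
        return "numeric block"
    return "object block"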

@jbrockmendel (Member)

The bool check looks like it would be a valid optimization regardless, right?
