(df==0) vs. (df.values==0) performance penalty #28056

Closed
trendelkampschroer opened this issue Aug 21, 2019 · 5 comments



trendelkampschroer commented Aug 21, 2019

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd
T = 5000
S = 1000
X = np.random.randint(0, 10, size=(T, S))
df = pd.DataFrame(X, index=pd.date_range("2001-01-01", periods=T, freq="D"))
%timeit (df == 0)
327 ms ± 4.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit (df.values == 0)
2.85 ms ± 36.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Problem description

I am using the boolean frame df == 0 to find and set zeros in a given DataFrame df, i.e.
df[df == 0] = FOO, where FOO is some nonzero integer.

I found that using the expression df[df.values == 0] = FOO gives a significant speed-up.

I am wondering whether the large overhead of (df == 0) compared to (df.values == 0) is to be expected. If so, in which situations should I prefer the expression df[df == 0] over the much faster df[df.values == 0]?
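For reference, a minimal self-contained sketch of the two masking variants (FOO here is an arbitrary nonzero placeholder; both produce the same result):

import numpy as np
import pandas as pd

T, S = 5000, 1000
FOO = -1  # arbitrary nonzero placeholder

X = np.random.randint(0, 10, size=(T, S))
df = pd.DataFrame(X, index=pd.date_range("2001-01-01", periods=T, freq="D"))

# Variant 1: mask with a boolean DataFrame (the slow path reported here)
df1 = df.copy()
df1[df1 == 0] = FOO

# Variant 2: mask with the raw boolean ndarray (the fast path)
df2 = df.copy()
df2[df2.values == 0] = FOO

assert df1.equals(df2)  # identical results, very different timings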

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None

pandas: 0.24.2
pytest: None
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.6
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.6.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@jbrockmendel (Member)

The df == 0 operation took a performance hit as part of a bugfix a few months ago. If all goes well, it will be fixed in the next few weeks.

@trendelkampschroer (Author)

Ok, thanks for the quick response.

@TomAugspurger (Contributor)

xref #24990 for that issue.

@jorisvandenbossche (Member)

I revived my experiment from some time ago related to #24990 and briefly looked into this example.

It is indeed a slowdown compared to 0.23 and earlier. A timing with that version:

In [10]: pd.__version__  
Out[10]: '0.23.4'

In [11]: %timeit df == 0
15.2 ms ± 444 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Then let's do some timings on master:

For me, the operation takes around 220 ms (so a bit less than the OP's 327 ms), but clearly slower than 0.23:

In [2]: %timeit df == 0
201 ms ± 5.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Compared to numpy on a 2D array or a set of 1D arrays:

In [4]: arr = df.values 

In [5]: %timeit arr == 0 
2.77 ms ± 38.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [6]: arrays = [df[col].values for col in df.columns]

In [7]: %timeit [a == 0 for a in arrays] 
29.2 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

So it is clear that doing things column-wise is slower, even at this low level without any of the additional pandas overhead. But the 29 ms is of course still a lot less than the 200 ms of the pandas operation. In a "column-wise-ops" world, that 29 ms is the lower bound we can target (with current pandas).
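As a rough illustration of where that column-wise lower bound comes from (a simplified sketch reusing the df from above, not pandas' actual ops code): each column's underlying ndarray is compared separately, and the per-column results are reassembled into a new frame.

# Simplified sketch of the column-wise path, not pandas' actual ops code.
def columnwise_eq(df, value):
    # Compare each column's underlying ndarray separately ...
    arrays = [df[col].values == value for col in df.columns]
    # ... then rebuild a boolean DataFrame from the per-column results.
    return pd.DataFrame(dict(zip(df.columns, arrays)), index=df.index)

result = columnwise_eq(df, 0)
assert result.equals(df == 0)  # same answer, paid for column by column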

Now with a few tweaks to the ops code (a6c0d02), I get:

In [2]: %timeit df == 0
38.8 ms ± 948 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So that is not much on top of the numpy column-wise timing, and compared to 0.23 it is "only" a 2.5x slowdown (but of course, still a slowdown).

I don't think it can get much better than the above 39 ms (when using column-wise ops), but the tweaks I made to get from 200 ms down to 39 ms are relatively simple: iterate over arrays instead of columns, and disable the checks when recreating the result (in this case we know we have properly sized arrays of equal length). See a6c0d02.
There is only one "cheat" targeted at this specific case: moving a bool check in get_block_type to the front of an if/elif chain. With a bit more effort we might be able to make this faster in general (dtype checks are expensive ...).
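To illustrate the kind of reordering meant here, a schematic sketch (this is not pandas' actual get_block_type, just the shape of the idea): every comparison result is a bool ndarray, so checking the cheap bool case first skips the more expensive dtype checks on this hot path.

import numpy as np

# Schematic sketch only -- not pandas' actual get_block_type implementation.
def get_block_type_sketch(values):
    dtype = values.dtype
    if dtype == np.bool_:
        # Cheap and common: every comparison op produces bool arrays,
        # so this check goes first to skip the costlier ones below.
        return "bool block"
    elif np.issubdtype(dtype, np.datetime64):
        return "datetime block"
    elif np.issubdtype(dtype, np.number):
        return "numeric block"
    return "object block"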

@jbrockmendel (Member)

The bool check looks like it would be a valid optimization regardless, right?
