(df==0) vs. (df.values==0) performance penalty #28056
Comments
Ok, thanks for the quick response.
xref #24990 for that issue.
So I revived my experiment from some time ago related to #24990 and briefly looked into this example. It is indeed a slowdown compared to 0.23 and before. Doing some timings on master, the operation takes around 220 ms for me (a bit less than the OP reports, but clearly slower than 0.23):
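A minimal `timeit` sketch of that comparison (the frame shape is an assumption; the ~220 ms figure quoted above comes from the comment, not from this sketch):

```python
import timeit
import numpy as np
import pandas as pd

# Assumed setup: a wide frame with many columns, where the
# per-column overhead of the DataFrame op becomes visible.
df = pd.DataFrame(np.random.randn(10_000, 1_000))

# DataFrame comparison (reported at ~220 ms on master at the time)
t_frame = timeit.timeit(lambda: df == 0, number=5) / 5

# Raw ndarray comparison on the underlying 2D block
t_values = timeit.timeit(lambda: df.values == 0, number=5) / 5

print(f"df == 0:        {t_frame * 1e3:.1f} ms per loop")
print(f"df.values == 0: {t_values * 1e3:.1f} ms per loop")
```

Both expressions produce the same boolean mask; only the time spent differs.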
Compared to numpy on a 2D array or a set of 1D arrays:
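A sketch of that pure-numpy comparison, one vectorized pass over a 2D block versus a loop over 1D column arrays (sizes are assumptions):

```python
import numpy as np

arr = np.random.randn(10_000, 1_000)
arr[arr < 0] = 0  # plant some zeros so the comparison is nontrivial

# A "set of 1D arrays", mimicking a column store
cols = [arr[:, i].copy() for i in range(arr.shape[1])]

# One vectorized pass over the whole 2D block
res_2d = arr == 0

# Column-wise: one comparison per 1D array, paying per-call overhead
res_cols = [col == 0 for col in cols]
```

The results are identical; the column-wise loop only adds per-column call overhead, which is what the timings below measure.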
So it is clear that doing things column-wise gives a slowdown, even at the low level without all the additional pandas overhead. But the 29 ms is of course still a lot less than the 200 ms of the pandas operation. In a "column-wise ops" world, the 29 ms is the lower bound we can target (with current pandas). Now, with a few tweaks to the ops code (a6c0d02), I get:
So that is not that much on top of the numpy column-wise timing, and compared to 0.23 "only" a 2.5x slowdown (but of course, still a slowdown). I don't think it can get much better than the above 39 ms (when using column-wise ops), but the tweaks I did to get it from 200 ms to 39 ms are relatively simple: iterate over arrays instead of columns, and disable checks when recreating the result (we know, in this case, that we have all same-sized proper arrays). See a6c0d02
the bool check looks like it would be a valid optimization regardless, right?
Code Sample, a copy-pastable example if possible
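A copy-pastable reproduction consistent with the description below (the frame shape and the value of FOO are assumptions, since the original snippet is not shown):

```python
import numpy as np
import pandas as pd

# Assumed setup: a float frame with plenty of zeros
df = pd.DataFrame(np.random.randint(0, 2, size=(1_000, 200)).astype(float))

FOO = -1  # some nonzero integer (placeholder value)

# Variant 1: boolean DataFrame mask, built column by column internally
df1 = df.copy()
df1[df1 == 0] = FOO

# Variant 2: boolean mask computed on the underlying 2D ndarray
df2 = df.copy()
df2[df2.values == 0] = FOO

# Both variants produce the same result; only the speed differs.
assert df1.equals(df2)
```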
Problem description
I am using the boolean frame `df == 0` to find and set zeros in a given DataFrame `df`, i.e. `df[df == 0] = FOO`, where `FOO` is some nonzero integer. I found that using the expression `df[df.values == 0] = FOO` gives a significant speed-up.
I am wondering if the large overhead of `(df == 0)` compared to `(df.values == 0)` is to be expected. If so, in which situations should I prefer the expression `df[df == 0]` over the much faster `df[df.values == 0]`?
Expected Output
Output of
pd.show_versions()
INSTALLED VERSIONS
commit: None
pandas: 0.24.2
pytest: None
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.6
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.6.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None