PERF: apply on boolean-dtype DataFrame is exceedingly slow #44172
Comments
Side note: I had to uninstall the feather package to get
(This is using the master branch of pandas.)
Show running times using timeit, and make the example minimal, e.g. don't use apply.
@jreback It's as minimal as possible already, I think. I've updated the example above with
Do not use apply, as it has nothing to do with the astype timing (nor does the count).
Why though? It is reasonably performant in most other situations, including the one above. The alternative is inconvenient at best.
You claim the astype is slow, so prove it.
@jreback Indeed, my original claim was only partially correct. It is somewhat slow, but the main source of slowness is the
(I'll update the title to reflect the above.)
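The distinction drawn above, between the cost of the cast and the cost of the row-wise apply, could be checked with a sketch along these lines. The frame size, the random data, and the `count_true` row function are illustrative assumptions, not the code from the original report, and absolute timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: shape and contents are illustrative, not from the report.
rng = np.random.default_rng(0)
df_bool = pd.DataFrame(rng.random((500, 10)) > 0.5)  # plain NumPy bool dtype


def time_once(func):
    """Return the best of three single-run timings, in seconds."""
    return min(timeit.repeat(func, number=1, repeat=3))


def count_true(row):
    """Hypothetical row function standing in for the real workload."""
    return row.sum()


# Step 1: time the cast on its own.
t_astype = time_once(lambda: df_bool.astype("boolean"))

# Step 2: time the row-wise apply separately, with the casts done beforehand.
df_nullable = df_bool.astype("boolean")
df_object = df_bool.astype(object)
t_apply_nullable = time_once(lambda: df_nullable.apply(count_true, axis=1))
t_apply_object = time_once(lambda: df_object.apply(count_true, axis=1))

print(f"astype('boolean'):            {t_astype:.4f} s")
print(f"apply(axis=1), boolean dtype: {t_apply_nullable:.4f} s")
print(f"apply(axis=1), object dtype:  {t_apply_object:.4f} s")
```

Timing the two steps separately makes it clear which one dominates, which is the point the thread converges on.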
Title changed: DataFrame.astype("boolean") is very slow → apply on boolean-dtype DataFrame is exceedingly slow
Sorry, I realise that some copy-pasting of comments may have caused confusion too; I've fixed those now.
@alexreg There certainly could be some unwanted coercion going on in apply. If you can investigate, that would be great.
@jreback I'll have a look, sure. Any tips on what functions/methods to look at to get me going?
Okay, it seems like what's taking most of the time is the branch for
@jbrockmendel The relevant code seems to be from your PR #38272. Any advice?
This is going to come as a shock to @jreback: it looks like the lack of 2D EA support is to blame.
The .apply with axis=1 is iterating over rows, and
@jbrockmendel 'shocked face'
@jbrockmendel @jreback Makes sense. Is there work underway to support 2D EAs?
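To illustrate why the lack of 2D extension arrays matters for row iteration, here is a small sketch using only public pandas accessors. It shows that each column of a boolean-dtype frame is backed by its own one-dimensional array, so a row has to be assembled element by element. The example frame is hypothetical, and this is not a description of the internal code path that apply takes.

```python
import pandas as pd

# Hypothetical three-column frame using the nullable boolean dtype.
df = pd.DataFrame(
    {"a": [True, False], "b": [False, None], "c": [True, True]}
).astype("boolean")

# Each column is backed by its own one-dimensional extension array.
column_array_types = [type(df[col].array).__name__ for col in df.columns]
column_ndims = [df[col].array.ndim for col in df.columns]
print(column_array_types, column_ndims)

# There is no single 2D block holding all columns, so producing one row
# means taking one element out of every column's array.
first_row = [df[col].array[0] for col in df.columns]
print(first_row)
```

With NumPy-backed columns of a single dtype, by contrast, pandas can keep the values in one 2D array and slice a row out of it directly.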
A large part of the slowdown is actually coming from a small refactor commit from @jbrockmendel: #43203 (reverting that speeds up the example of
Another possible patch (less impressive at only 5x) is
(haven't run the test suite with this)
@jbrockmendel I actually get the same speed-up for your patch as I do for @jorisvandenbossche's one (about 4-5x). Is there maybe a good patch for
PR would be welcome.
This somewhat deals with pandas-dev#44172, though that won't be fully resolved until 2D `ExtensionArray`s are supported (per the comments there).
Reproducible Example
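The original snippet is not preserved in this copy of the issue. As a stand-in, the following is a minimal sketch of the two approaches the report compares, under the assumption that the first one detours through object dtype and the second applies directly to the boolean-dtype frame. The helper name `apply_rows_via_object`, the example data, and the row function are all hypothetical.

```python
import pandas as pd


def apply_rows_via_object(df, func):
    """Hypothetical helper: run a row-wise apply on an object-dtype copy."""
    return df.astype(object).apply(func, axis=1)


def any_true(row):
    """Hypothetical row function: is any value in the row True?"""
    return row.any()


df = pd.DataFrame(
    {"a": [True, False, True], "b": [False, False, True]}
).astype("boolean")

# Method 1: cast to object dtype, apply row-wise, then cast the result back.
via_object = apply_rows_via_object(df, any_true).astype("boolean")

# Method 2: apply row-wise directly on the boolean-dtype frame.
direct = df.apply(any_true, axis=1).astype("boolean")

print(via_object.equals(direct))
```

Both methods give the same result here; the difference the report is concerned with is running time on larger frames.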
There is no problem using the first method in some cases, but it is inconvenient in other situations to cast back and forth between object-dtype and boolean-dtype just to make apply efficient.
Installed Versions
commit : aced6ee
python : 3.9.7.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:31 PDT 2021; root:xnu-7195.141.2~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.4.0.dev0+970.gaced6eedf9
numpy : 1.20.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.2.0
Cython : 0.29.24
pytest : 6.2.5
hypothesis : 6.23.4
sphinx : 4.2.0
blosc : 1.10.6
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.2
IPython : None
pandas_datareader: 0.10.0
bs4 : 4.10.0
bottleneck : None
fsspec : 2021.10.1
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : 2021.10.1
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : 0.8.9
xarray : 0.19.0
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.54.1
Prior Performance
Not applicable.