PERF: `apply` on `boolean`-dtype DataFrame is exceedingly slow #44172

alexreg · 2021-10-24T23:14:12Z

I have checked that this issue has not already been reported.
I have confirmed this issue exists on the latest version of pandas.
I have confirmed this issue exists on the master branch of pandas.

Reproducible Example

from timeit import timeit
import pandas as pd

# Create a dummy DataFrame of bools and null values.
df = pd.DataFrame([[True, False, pd.NA] * 200] * 200, dtype = "object")

# This runs very slowly!
print(timeit(lambda: df.astype("boolean").apply(lambda row: row.count(), axis = 1), number = 10)) # 112s

# (As can be easily seen, this is due partly to the `astype` on the entire DataFrame, but mainly to the subsequent `apply` being particularly slow for a boolean-dtype DataFrame).
print(timeit(lambda: df.astype("boolean"), number = 10)) # 3.98s

# This *equivalent* statement runs fast. There seems to be no good reason why the call to `apply` in the previous statement (overall equivalent) must be so slow.
print(timeit(lambda: df.apply(lambda row: row.astype("boolean").count(), axis = 1), number = 10)) # 0.95s

The row-wise cast in the last statement is a workable alternative in some cases, but in other situations it is inconvenient to cast back and forth between object dtype and boolean dtype just to make apply efficient.
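For the specific case of counting non-null values per row, a vectorised call avoids `apply` altogether. The following is a minimal sketch, assuming a much smaller frame than in the report so that it runs quickly. It only covers functions that have a built-in row-wise equivalent and does not address the general slowness of `apply` itself.

```python
import pandas as pd

# Smaller frame than in the report: 5 rows, 12 columns, 4 NA values per row.
df = pd.DataFrame([[True, False, pd.NA] * 4] * 5, dtype="object")
df2 = df.astype("boolean")

# Row-wise apply, as in the report (the slow path on boolean dtype).
via_apply = df2.apply(lambda row: row.count(), axis=1)

# Built-in row-wise count: no Python-level function call per row.
via_count = df2.count(axis=1)

# Works directly on the object-dtype frame too, with no cast at all.
via_notna = df.notna().sum(axis=1)

print(via_apply.tolist(), via_count.tolist(), via_notna.tolist())
```

All three results should agree, so the choice between them is purely about speed and convenience.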

Installed Versions

commit : aced6ee
python : 3.9.7.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:31 PDT 2021; root:xnu-7195.141.2~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.4.0.dev0+970.gaced6eedf9
numpy : 1.20.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.2.0
Cython : 0.29.24
pytest : 6.2.5
hypothesis : 6.23.4
sphinx : 4.2.0
blosc : 1.10.6
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.2
IPython : None
pandas_datareader: 0.10.0
bs4 : 4.10.0
bottleneck : None
fsspec : 2021.10.1
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : 2021.10.1
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : 0.8.9
xarray : 0.19.0
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.54.1

Prior Performance

Not applicable.


alexreg · 2021-10-24T23:17:16Z

Side note: I had to uninstall the feather package to get pd.show_versions() to succeed. See the following error.

Traceback (most recent call last):
  File "/Users/alex/Software/foo.py", line 4, in <module>
    pd.show_versions()
  File "/Users/alex/Software/pandas/pandas/util/_print_versions.py", line 109, in show_versions
    deps = _get_dependency_info()
  File "/Users/alex/Software/pandas/pandas/util/_print_versions.py", line 89, in _get_dependency_info
    result[modname] = get_version(mod) if mod else None
  File "/Users/alex/Software/pandas/pandas/compat/_optional.py", line 60, in get_version
    raise ImportError(f"Can't determine version for {module.__name__}")
ImportError: Can't determine version for feather

(This is using the master branch of pandas.)

jreback · 2021-10-24T23:17:50Z

Show running times using timeit, and make the example minimal, e.g. don't use apply.

alexreg · 2021-10-25T00:06:34Z

@jreback It's as minimal as possible already, I think. I've updated the example above with timeit results, however, and shown the split after the initial .astype("boolean") on the whole dataframe.

jreback · 2021-10-25T00:15:02Z

Do not use apply, as this has nothing to do with the astype timing (nor does the count).

alexreg · 2021-10-25T00:16:56Z

Why, though? It is reasonably performant in most other situations, including the one above. The alternative is inconvenient at best.

jreback · 2021-10-25T01:01:23Z

You claim the astype is slow, so prove it.

alexreg · 2021-10-25T14:32:00Z

@jreback Indeed, my original claim was only partially correct. It is somewhat slow, but the main source of slowness is the apply on a boolean dtype dataframe. Is there a reason apply on such a dtype is particularly slow? The above example makes clear that casting the individual rows when applying is far quicker than casting the entire dataframe first.

alexreg · 2021-10-25T14:32:25Z

(I'll update the title to reflect the above.)

alexreg · 2021-10-25T14:35:04Z

Sorry, I realise that some copy-pasting of comments may have caused confusion too — I've fixed those now.

jreback · 2021-10-25T14:36:42Z

@alexreg There certainly could be some unwanted coercion going on in apply. If you can investigate, that would be great.

alexreg · 2021-10-25T14:38:04Z

@jreback I'll have a look, sure. Any tips on what functions/methods to look at to get me going?

alexreg · 2021-10-25T15:47:22Z

Okay, it seems like what's taking most of the time is the branch for is_extension_array_dtype returning true in FrameColumnApply.series_generator.
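To see why the two frames take different branches, one can check the dtype predicate directly. This is a small sketch using the public `pandas.api.types.is_extension_array_dtype` helper; the variable names are illustrative only, and it says nothing about what the branch does internally.

```python
import pandas as pd
from pandas.api.types import is_extension_array_dtype

df = pd.DataFrame([[True, False, pd.NA] * 4] * 5, dtype="object")
df2 = df.astype("boolean")

# object dtype is a plain NumPy dtype, so the extension-array branch is skipped.
object_is_ea = is_extension_array_dtype(df.dtypes.iloc[0])

# The nullable "boolean" dtype is an extension dtype, so the branch is taken.
boolean_is_ea = is_extension_array_dtype(df2.dtypes.iloc[0])

print(object_is_ea, boolean_is_ea)
```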

alexreg · 2021-10-25T16:05:50Z

@jbrockmendel The relevant code seems to be from your PR #38272. Any advice?

jbrockmendel · 2021-10-25T18:04:14Z

This is going to come as a shock to @jreback: it looks like the lack of 2D EA support is to blame.

df = pd.DataFrame([[True, False, pd.NA] * 200] * 200, dtype = "object")
df2 = df.astype("boolean")

df2.apply(lambda row: row.count(), axis = 1)

The .apply with axis=1 is iterating over rows, and df2.iloc[i] is super-slow because it isn't just slicing an existing array.

%prun -s cumtime df2.apply(lambda row: row.count(), axis = 1)
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    6.026    6.026 {built-in method builtins.exec}
        1    0.000    0.000    6.026    6.026 <string>:1(<module>)
        1    0.000    0.000    6.026    6.026 frame.py:8719(apply)
        1    0.000    0.000    6.026    6.026 apply.py:695(apply)
        1    0.000    0.000    6.026    6.026 apply.py:851(apply_standard)
        1    0.001    0.001    6.025    6.025 apply.py:857(apply_series_generator)
      201    0.000    0.000    6.016    0.030 apply.py:977(series_generator)
      201    0.002    0.000    5.988    0.030 frame.py:3469(_ixs)
      201    0.268    0.001    5.963    0.030 managers.py:951(fast_xs)
   120600    0.206    0.000    4.348    0.000 masked.py:211(__setitem__)
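The profile suggests that each row is assembled one element at a time (120600 `__setitem__` calls for 201 row lookups of 600 columns). The sketch below imitates that pattern with public API only. It is a simplified illustration under that assumption, not the actual `fast_xs` code, and `build_row_elementwise` is a hypothetical helper.

```python
import pandas as pd


def build_row_elementwise(frame: pd.DataFrame, i: int):
    """Assemble row i by filling a fresh boolean array one scalar at a time."""
    n_cols = frame.shape[1]
    # Start from an all-NA nullable boolean array of the right length.
    row = pd.array([pd.NA] * n_cols, dtype="boolean")
    for j in range(n_cols):
        # One __setitem__ call per column: this is the costly part.
        row[j] = frame.iat[i, j]
    return row


df2 = pd.DataFrame([[True, False, pd.NA] * 4] * 5, dtype="object").astype("boolean")

row = build_row_elementwise(df2, 0)
expected = df2.iloc[0].array

# ExtensionArray.equals treats NA in matching positions as equal.
print(row.equals(expected))
```

With a 2D NumPy block, the same row would be a single slice of existing memory, which is why the object-dtype frame does not pay this per-element cost.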

jreback · 2021-10-25T18:54:05Z

@jbrockmendel 'shocked face'

alexreg · 2021-10-25T22:48:44Z

@jbrockmendel @jreback Makes sense. Is there work underway to support 2D EAs?

jorisvandenbossche · 2021-10-25T23:05:14Z

A large part of the slowdown is actually coming from a small refactor commit from @jbrockmendel: #43203 (reverting that speeds up the example of df2.apply(lambda row: row.count(), axis = 1) almost 10x; after that it's still slower than the block version of course, as expected)

jbrockmendel · 2021-10-25T23:40:54Z

Another possible patch (less impressive at only 5x) is to BooleanArray.__setitem__:

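# Method on BooleanArray; assumes `lib` and `libmissing` are imported from pandas._libs.
def __setitem__(self, key, value):
    # Fast path: a plain bool scalar needs no validation or coercion.
    if lib.is_bool(value):
        self._data[key] = value
        self._mask[key] = False
    # Fast path: NA only requires the mask to be updated.
    elif value is libmissing.NA:
        self._mask[key] = True
    else:
        # Everything else goes through the general (slower) path.
        super().__setitem__(key, value)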

(I haven't run the test suite with this.)
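The idea behind such a patch can be shown on a toy class that stores values and a missing-value mask in two NumPy arrays. This is a self-contained sketch: `ToyMaskedBool` is hypothetical and is not the pandas implementation, and `None` stands in for `pd.NA`.

```python
import numpy as np


class ToyMaskedBool:
    """Hypothetical masked boolean array: values plus a 'missing' mask."""

    def __init__(self, n: int):
        self._data = np.zeros(n, dtype=bool)
        self._mask = np.ones(n, dtype=bool)  # True means "missing"

    def __setitem__(self, key, value):
        if isinstance(value, (bool, np.bool_)):
            # Fast path: store the scalar and mark the slot as present.
            self._data[key] = value
            self._mask[key] = False
        elif value is None:
            # Fast path: only the mask changes for a missing value.
            self._mask[key] = True
        else:
            # A real implementation would fall back to full validation here.
            raise TypeError(f"unsupported value: {value!r}")


arr = ToyMaskedBool(3)
arr[0] = True
arr[1] = None
print(arr._data, arr._mask)
```

The saving comes from skipping generic validation and coercion for the two scalar cases that dominate when a row is filled element by element.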

alexreg · 2021-10-26T00:02:24Z

@jbrockmendel I actually get the same speed-up for your patch as I do for @jorisvandenbossche's one (about 4-5x). Is there maybe a good patch for BaseMaskedArray.__setitem__? (Since this problem also seems to apply to other nullable dtypes like Int32.)
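The nullable integer and boolean arrays share a common masked base class, which is why a fix there would cover both. A quick way to check this without importing private modules is to look for the class name in the MRO. This is a sketch, and `has_masked_base` is a hypothetical helper that relies on the class being named `BaseMaskedArray`.

```python
import pandas as pd


def has_masked_base(arr) -> bool:
    """Return True if any class in the array's MRO is named BaseMaskedArray."""
    return any(cls.__name__ == "BaseMaskedArray" for cls in type(arr).__mro__)


bool_arr = pd.array([True, pd.NA], dtype="boolean")
int_arr = pd.array([1, pd.NA], dtype="Int32")
plain_arr = pd.array([1, 2], dtype="int64")  # NumPy-backed, not masked

print(has_masked_base(bool_arr), has_masked_base(int_arr), has_masked_base(plain_arr))
```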

jbrockmendel · 2021-10-26T01:22:00Z

> Is there maybe a good patch for BaseMaskedArray.__setitem__?

PR would be welcome.


alexreg added the Needs Triage and Performance labels Oct 24, 2021

alexreg changed the title ~~PERF: DataFrame.astype("boolean") is very slow~~ PERF: apply on boolean-dtype DataFrame is exceedingly slow Oct 25, 2021

alexreg added a commit to alexreg/pandas that referenced this issue Oct 26, 2021

PERF: improve efficiency of BaseMaskedArray.__setitem__

e0a5472

This somewhat deals with pandas-dev#44172, though that won't be fully resolved until 2D `ExtensionArray`s are supported (per the comments there).

alexreg mentioned this issue Oct 26, 2021

PERF: improve efficiency of BaseMaskedArray.__setitem__ #44186

Closed

4 tasks

alexreg added a commit to alexreg/pandas that referenced this issue Oct 26, 2021

PERF: improve efficiency of BaseMaskedArray.__setitem__

015917c

This somewhat deals with pandas-dev#44172, though that won't be fully resolved until 2D `ExtensionArray`s are supported (per the comments there).

lithomas1 added the Apply and ExtensionArray labels and removed the Needs Triage label Oct 28, 2021

alexreg added a commit to alexreg/pandas that referenced this issue Oct 28, 2021

PERF: improve efficiency of BaseMaskedArray.__setitem__

d468b36

This somewhat deals with pandas-dev#44172, though that won't be fully resolved until 2D `ExtensionArray`s are supported (per the comments there).

alexreg added a commit to alexreg/pandas that referenced this issue Oct 28, 2021

PERF: improve efficiency of BaseMaskedArray.__setitem__

fd645db

This somewhat deals with pandas-dev#44172, though that won't be fully resolved until 2D `ExtensionArray`s are supported (per the comments there).

alexreg added a commit to alexreg/pandas that referenced this issue Nov 28, 2021

PERF: improve efficiency of BaseMaskedArray.__setitem__

fb2b5fe

This somewhat deals with pandas-dev#44172, though that won't be fully resolved until 2D `ExtensionArray`s are supported (per the comments there).

jbrockmendel mentioned this issue Jan 16, 2022

BUG/PERF: MaskedArray.__setitem__ validation #45404

Merged

5 tasks

jreback added this to the 1.5 milestone Jan 17, 2022

jreback closed this as completed in #45404 Jan 17, 2022

This was referenced Mar 1, 2024

Potential performance issue: .apply slow in pandas below 1.5 version microsoft/ContextualSP#65

Open

Potential performance issue: .apply slow in pandas 1.4.0 MannLabs/alphamap#72

Open
