PERF: apply on boolean-dtype DataFrame is exceedingly slow #44172

Closed
3 tasks done
alexreg opened this issue Oct 24, 2021 · 20 comments · Fixed by #45404
Labels
Apply, ExtensionArray, Performance
Milestone

Comments

@alexreg
Contributor

alexreg commented Oct 24, 2021

  • I have checked that this issue has not already been reported.
  • I have confirmed this issue exists on the latest version of pandas.
  • I have confirmed this issue exists on the master branch of pandas.

Reproducible Example

from timeit import timeit
import pandas as pd

# Create a dummy DataFrame of bools and null values.
df = pd.DataFrame([[True, False, pd.NA] * 200] * 200, dtype="object")

# This runs very slowly!
print(timeit(lambda: df.astype("boolean").apply(lambda row: row.count(), axis=1), number=10))  # 112s

# The slowness is due partly to the `astype` on the entire DataFrame, but mainly to the
# subsequent `apply`, which is particularly slow for a boolean-dtype DataFrame.
print(timeit(lambda: df.astype("boolean"), number=10))  # 3.98s

# This *equivalent* statement runs fast. There seems to be no good reason why the
# `apply` in the first statement must be so slow.
print(timeit(lambda: df.apply(lambda row: row.astype("boolean").count(), axis=1), number=10))  # 0.95s

The per-row cast is a workable approach in some cases, but in other situations it is inconvenient to cast back and forth between object dtype and boolean dtype just to make apply efficient.
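For this particular reduction, a built-in row-wise count may avoid apply altogether. The sketch below uses a much smaller frame than the one in the report (the sizes are an assumption chosen for illustration) and checks that `DataFrame.count(axis=1)` gives the same values as the row-wise apply. It sidesteps the slow path rather than fixing it, and it only helps when the applied function has a vectorized equivalent.

```python
import pandas as pd

# Small stand-in for the frame in the report (sizes reduced for illustration).
df = pd.DataFrame([[True, False, pd.NA] * 4] * 3, dtype="object")
df2 = df.astype("boolean")

# Row-wise apply, as in the report.
via_apply = df2.apply(lambda row: row.count(), axis=1)

# Built-in reduction: counts non-missing values per row without calling a
# Python function for each row.
via_count = df2.count(axis=1)

print(via_apply.tolist())
print(via_count.tolist())
```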

Installed Versions

commit : aced6ee
python : 3.9.7.final.0
python-bits : 64
OS : Darwin
OS-release : 20.6.0
Version : Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:31 PDT 2021; root:xnu-7195.141.2~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.4.0.dev0+970.gaced6eedf9
numpy : 1.20.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.2.0
Cython : 0.29.24
pytest : 6.2.5
hypothesis : 6.23.4
sphinx : 4.2.0
blosc : 1.10.6
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.2
IPython : None
pandas_datareader: 0.10.0
bs4 : 4.10.0
bottleneck : None
fsspec : 2021.10.1
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : 2021.10.1
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : 0.8.9
xarray : 0.19.0
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.54.1

Prior Performance

Not applicable.

@alexreg alexreg added the Needs Triage and Performance labels Oct 24, 2021
@alexreg
Contributor Author

alexreg commented Oct 24, 2021

Side note: I had to uninstall the feather package to get pd.show_versions() to succeed. See the following error.

Traceback (most recent call last):
  File "/Users/alex/Software/foo.py", line 4, in <module>
    pd.show_versions()
  File "/Users/alex/Software/pandas/pandas/util/_print_versions.py", line 109, in show_versions
    deps = _get_dependency_info()
  File "/Users/alex/Software/pandas/pandas/util/_print_versions.py", line 89, in _get_dependency_info
    result[modname] = get_version(mod) if mod else None
  File "/Users/alex/Software/pandas/pandas/compat/_optional.py", line 60, in get_version
    raise ImportError(f"Can't determine version for {module.__name__}")
ImportError: Can't determine version for feather

(This is using the master branch of pandas.)

@jreback
Contributor

jreback commented Oct 24, 2021

Show running times using timeit, and make the example minimal, e.g. don't use apply.

@alexreg
Contributor Author

alexreg commented Oct 25, 2021

@jreback It's as minimal as possible already, I think. I've updated the example above with timeit results, however, and shown the split after the initial .astype("boolean") on the whole dataframe.

@jreback
Contributor

jreback commented Oct 25, 2021

Do not use apply, as this has nothing to do with the astype timing (nor does the count).

@alexreg
Contributor Author

alexreg commented Oct 25, 2021

Why though? It is reasonably performant in most other situations, including one above. The alternative is inconvenient at best.

@jreback
Contributor

jreback commented Oct 25, 2021

you claim the astype is slow so prove it

@alexreg
Contributor Author

alexreg commented Oct 25, 2021

@jreback Indeed, my original claim was only partially correct. It is somewhat slow, but the main source of slowness is the apply on a boolean dtype dataframe. Is there a reason apply on such a dtype is particularly slow? The above example makes clear that casting the individual rows when applying is far quicker than casting the entire dataframe first.

@alexreg
Contributor Author

alexreg commented Oct 25, 2021

(I'll update the title to reflect the above.)

@alexreg alexreg changed the title from "PERF: DataFrame.astype("boolean") is very slow" to "PERF: apply on boolean-dtype DataFrame is exceedingly slow" Oct 25, 2021
@alexreg
Contributor Author

alexreg commented Oct 25, 2021

Sorry, I realise that some copy-pasting of comments may have caused confusion too — I've fixed those now.

@jreback
Contributor

jreback commented Oct 25, 2021

@alexreg There certainly could be some unwanted coercion going on in apply. If you can investigate, that would be great.

@alexreg
Contributor Author

alexreg commented Oct 25, 2021

@jreback I'll have a look, sure. Any tips on what functions/methods to look at to get me going?

@alexreg
Contributor Author

alexreg commented Oct 25, 2021

Okay, it seems like what's taking most of the time is the branch for is_extension_array_dtype returning true in FrameColumnApply.series_generator.
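As a small illustration of which frames take that branch, the public helper `pandas.api.types.is_extension_array_dtype` can be called on a column dtype. This is only a sketch of the dtype check itself, on a tiny frame made up for the purpose; it does not reproduce the internals of `FrameColumnApply.series_generator`.

```python
import pandas as pd
from pandas.api.types import is_extension_array_dtype

# Tiny illustrative frame; the real example in this thread is 200 x 600.
df = pd.DataFrame([[True, False, pd.NA]] * 2, dtype="object")
df2 = df.astype("boolean")

# Object columns are backed by NumPy arrays, so the check is False.
object_is_ea = is_extension_array_dtype(df.dtypes.iloc[0])

# The nullable "boolean" dtype is an extension dtype, so the check is True.
boolean_is_ea = is_extension_array_dtype(df2.dtypes.iloc[0])

print(object_is_ea, boolean_is_ea)
```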

@alexreg
Contributor Author

alexreg commented Oct 25, 2021

@jbrockmendel The relevant code seems to be from your PR #38272. Any advice?

@jbrockmendel
Member

This is going to come as a shock to @jreback: it looks like the lack of 2D EA support is to blame.

df = pd.DataFrame([[True, False, pd.NA] * 200] * 200, dtype = "object")
df2 = df.astype("boolean")

df2.apply(lambda row: row.count(), axis = 1)

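The .apply with axis=1 iterates over rows, and df2.iloc[i] is very slow because it isn't just slicing an existing array: each column is its own 1D array, so every row has to be assembled element by element (note the 120600 `__setitem__` calls below, i.e. 201 rows × 600 columns).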

%prun -s cumtime df2.apply(lambda row: row.count(), axis = 1)
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    6.026    6.026 {built-in method builtins.exec}
        1    0.000    0.000    6.026    6.026 <string>:1(<module>)
        1    0.000    0.000    6.026    6.026 frame.py:8719(apply)
        1    0.000    0.000    6.026    6.026 apply.py:695(apply)
        1    0.000    0.000    6.026    6.026 apply.py:851(apply_standard)
        1    0.001    0.001    6.025    6.025 apply.py:857(apply_series_generator)
      201    0.000    0.000    6.016    0.030 apply.py:977(series_generator)
      201    0.002    0.000    5.988    0.030 frame.py:3469(_ixs)
      201    0.268    0.001    5.963    0.030 managers.py:951(fast_xs)
   120600    0.206    0.000    4.348    0.000 masked.py:211(__setitem__)
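The profile above points at single-row extraction as the expensive step. A rough way to look at that step in isolation is to time `iloc[0]` on an object-dtype frame and on its boolean-dtype counterpart. The frame size and repeat count below are arbitrary assumptions, and the timings will vary with the pandas version and machine, so none are quoted here.

```python
from timeit import timeit

import pandas as pd

# Smaller than the frame in the report, to keep the sketch quick to run.
df = pd.DataFrame([[True, False, pd.NA] * 50] * 50, dtype="object")
df2 = df.astype("boolean")

# Extract one row from each frame; the boolean frame has to build a new
# masked array from 150 separate single-column arrays.
row_obj = df.iloc[0]
row_bool = df2.iloc[0]
print(row_obj.dtype, row_bool.dtype)

# Time the row extraction on its own, without apply or count.
t_obj = timeit(lambda: df.iloc[0], number=20)
t_bool = timeit(lambda: df2.iloc[0], number=20)
print(f"object: {t_obj:.4f}s, boolean: {t_bool:.4f}s")
```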

@jreback
Contributor

jreback commented Oct 25, 2021

@jbrockmendel 'shocked face'

@alexreg
Contributor Author

alexreg commented Oct 25, 2021

@jbrockmendel @jreback Makes sense. Is there work underway to support 2D EAs?

@jorisvandenbossche
Member

A large part of the slowdown is actually coming from a small refactor commit from @jbrockmendel: #43203 (reverting that speeds up the example of df2.apply(lambda row: row.count(), axis = 1) almost 10x; after that it's still slower than the block version of course, as expected)

@jbrockmendel
Member

Another possible patch (less impressive at only 5x) is BooleanArray.__setitem__

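def __setitem__(self, key, value):
    if lib.is_bool(value):
        # Fast path: store a plain bool and mark the position as valid.
        self._data[key] = value
        self._mask[key] = False
    elif value is libmissing.NA:
        # Fast path: mark the position as missing; the data value is left as-is.
        self._mask[key] = True
    else:
        # Anything else goes through the general (slower) implementation.
        super().__setitem__(key, value)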

(haven't run the test suite with this)
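To make the idea behind the patch concrete outside of pandas, here is a self-contained toy version of a masked boolean array with the same scalar fast paths. `ToyMaskedBool` and the `NA` sentinel are hypothetical names invented for this sketch and are not pandas classes; the real `BooleanArray` falls back to its parent implementation for other values, whereas this toy simply raises.

```python
import numpy as np

# Hypothetical stand-in for pandas' NA singleton, used only in this sketch.
NA = object()


class ToyMaskedBool:
    """Toy masked array: values in `_data`, missing flags in `_mask`."""

    def __init__(self, size):
        self._data = np.zeros(size, dtype=bool)
        self._mask = np.ones(size, dtype=bool)  # True means "missing"

    def __setitem__(self, key, value):
        if isinstance(value, (bool, np.bool_)):
            # Fast path: store the bool and mark the position as valid.
            self._data[key] = value
            self._mask[key] = False
        elif value is NA:
            # Fast path: only the mask needs to change.
            self._mask[key] = True
        else:
            # The real array would defer to a general, slower code path here.
            raise TypeError(f"unsupported value: {value!r}")

    def count(self):
        # Number of non-missing entries.
        return int((~self._mask).sum())


arr = ToyMaskedBool(3)
arr[0] = True
arr[1] = False
arr[2] = NA
print(arr.count())
```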

@alexreg
Contributor Author

alexreg commented Oct 26, 2021

@jbrockmendel I actually get the same speed-up for your patch as I do for @jorisvandenbossche's one (about 4-5x). Is there maybe a good patch for BaseMaskedArray.__setitem__? (Since this problem also seems to apply to other nullable dtypes like Int32.)
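To check whether other nullable dtypes behave similarly, the same row-wise apply can be timed across several dtypes. The helper `time_row_count` is a hypothetical name introduced for this sketch, and the frame size and repeat count are arbitrary. No timings are quoted, since they depend on the pandas version and machine.

```python
from timeit import timeit

import pandas as pd


def time_row_count(frame, number=3):
    """Time a row-wise count via apply (hypothetical helper for this sketch)."""
    return timeit(lambda: frame.apply(lambda row: row.count(), axis=1), number=number)


# Small frames so the sketch runs quickly; each row has 40 non-missing values.
bool_obj = pd.DataFrame([[True, False, pd.NA] * 20] * 20, dtype="object")
int_obj = pd.DataFrame([[1, 0, pd.NA] * 20] * 20, dtype="object")

frames = {
    "object": bool_obj,
    "boolean": bool_obj.astype("boolean"),
    "Int32": int_obj.astype("Int32"),
}

# Same operation on each frame; only the dtype differs.
timings = {name: time_row_count(frame) for name, frame in frames.items()}
counts = {name: frame.apply(lambda row: row.count(), axis=1).tolist() for name, frame in frames.items()}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f}s")
```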

@jbrockmendel
Member

Is there maybe a good patch for BaseMaskedArray.__setitem__?

PR would be welcome.

alexreg added a commit to alexreg/pandas that referenced this issue Oct 26, 2021
This somewhat deals with pandas-dev#44172, though that won't be fully resolved until 2D `ExtensionArray`s are supported (per the comments there).
@lithomas1 lithomas1 added the Apply and ExtensionArray labels and removed the Needs Triage label Oct 28, 2021
alexreg added a commit to alexreg/pandas that referenced this issue Oct 28, 2021
This somewhat deals with pandas-dev#44172, though that won't be fully resolved until 2D `ExtensionArray`s are supported (per the comments there).
alexreg added a commit to alexreg/pandas that referenced this issue Nov 28, 2021
This somewhat deals with pandas-dev#44172, though that won't be fully resolved until 2D `ExtensionArray`s are supported (per the comments there).
@jreback jreback added this to the 1.5 milestone Jan 17, 2022
5 participants