PERF: Faster SparseArray.__getitem__ for boolean SparseArray mask (#23122) #44841

bdrum · 2021-12-10T12:43:50Z

closes Efficiency of SparseArray.__getitem__(SparseArray[bool]) #23122
tests added / passed
Ensure all linting tests pass, see here for how to run them
Issue #23122 (Efficiency of SparseArray.__getitem__(SparseArray[bool])) contains a recommendation from @TomAugspurger to avoid densifying when a boolean sparse array is used as an index. I tried several solutions, including using sp_index.indices directly when fill_value is False and np.setdiff1d for the opposite case.
Direct indexing is very fast, but computing setdiff1d is very slow. A good solution for both cases (much faster than setdiff1d and a bit slower than direct indices) is to use masks. Here is a plot comparing the old (densifying) implementation with the new one (masks):

I understand that using a mask takes more memory, but I don't think that is a crucial point here, especially for the bool8 type; let's discuss.
We could also combine both approaches: use direct indices when the fill value is False and a mask in the opposite case.
One more question to discuss concerns the case where a True value in the boolean sparse array points to a position that holds the fill value of the original array. I assume the fill value should be returned in that case.
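
To make the two strategies above concrete, here is a minimal sketch of how the True positions of a boolean sparse mask could be computed without densifying it. The helper name `true_positions` is hypothetical and is not the code in this PR; it only relies on the public `sp_index`, `sp_values`, `fill_value` and `take` members of `SparseArray`.

```python
import numpy as np
from pandas.arrays import SparseArray


def true_positions(mask: SparseArray) -> np.ndarray:
    """Return the positions where a boolean SparseArray is True, without densifying it.

    Hypothetical helper for illustration only.
    """
    # Positions of the explicitly stored (non-fill) entries.
    stored = mask.sp_index.to_int_index().indices
    values = np.asarray(mask.sp_values, dtype=bool)
    if not mask.fill_value:
        # fill_value is False: only stored entries can be True.
        return stored[values]
    # fill_value is True: start from "everything selected" and clear
    # the stored entries that are False.
    keep = np.ones(len(mask), dtype=bool)
    keep[stored[~values]] = False
    return np.flatnonzero(keep)


arr = SparseArray([1.0, np.nan, np.nan, 4.0, 5.0])  # fill_value defaults to NaN
dense_mask = np.array([True, True, False, True, False])

for fill in (False, True):
    mask = SparseArray(dense_mask, fill_value=fill)
    picked = arr.take(true_positions(mask))
    print(fill, np.asarray(picked))
```

Position 1 of `arr` holds the fill value (NaN) and is selected by the mask; in this sketch it comes back as NaN, which is the behaviour assumed in the last question above.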

I obtained the benchmark results with the pytest-benchmark framework.

Code for benchmarking:

import pytest
from pandas.arrays import SparseArray
import numpy as np


class TestPerf:
    def setup_method(self, method):
        d = 0.8
        self.n = 1_000_000
        self.data = np.random.rand(self.n)
        genInd = lambda d: np.unique(
            np.random.randint(
                low=0, high=self.n - 1, size=int(self.n * d), dtype=np.int32
            )
        )
        self.inds = genInd(d)
        barrs = np.full(shape=self.n, fill_value=True, dtype=np.bool_)
        arr_dens_inds = genInd(d)

        # Mark the sampled positions: NaN in the data, False in the boolean mask.
        self.data[self.inds] = np.nan
        barrs[arr_dens_inds] = False

        self.sp_arrs = SparseArray(self.data)
        # One mask stored with fill_value=False, and its inverse stored with fill_value=True.
        self.sp_barrs_false = SparseArray(barrs, dtype=np.bool_, fill_value=False)
        self.sp_barrs_true = SparseArray(~barrs, dtype=np.bool_, fill_value=True)

    def perf_sparse_boolean_indexing_fv_true(self):
        return self.sp_arrs[self.sp_barrs_true]

    def perf_sparse_boolean_indexing_fv_false(self):
        return self.sp_arrs[self.sp_barrs_false]

    def test_perf_sparse_boolean_indexing_fv_true(self, benchmark):
        benchmark(self.perf_sparse_boolean_indexing_fv_true)

    def test_perf_sparse_boolean_indexing_fv_false(self, benchmark):
        benchmark(self.perf_sparse_boolean_indexing_fv_false)
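
For a quick check without pytest-benchmark, a rough sketch using the standard `timeit` module might look like the following. The function names `via_dense` and `via_indices`, the array size and the repeat count are assumptions chosen for illustration; the sketch only covers the fill_value=False case and makes no claim about the numbers reported in this PR.

```python
import timeit

import numpy as np
from pandas.arrays import SparseArray

rng = np.random.default_rng(0)
n = 100_000  # smaller than the 1_000_000 used in the benchmark above, to keep the run short

data = rng.random(n)
data[rng.random(n) < 0.5] = np.nan
arr = SparseArray(data)

dense_mask = rng.random(n) < 0.5
mask = SparseArray(dense_mask, fill_value=False)


def via_dense():
    # Densify the mask first, then index with the dense boolean array.
    return arr[mask.to_dense()]


def via_indices():
    # Use the stored True positions directly (valid because fill_value is False).
    idx = mask.sp_index.to_int_index().indices
    return arr.take(idx[mask.sp_values])


repeats = 20
for fn in (via_dense, via_indices):
    seconds = timeit.timeit(fn, number=repeats)
    print(f"{fn.__name__}: {seconds / repeats * 1e3:.3f} ms per call")
```

Both functions should select the same elements, so comparing their results is a cheap sanity check before looking at the timings.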

PERF: Faster SparseArray.__get_item__ for boolean masks (pandas-dev#2…

a8d517a

…3122)

bdrum closed this Dec 13, 2021
