PERF: Faster SparseArray.__getitem__ for boolean SparseArray mask (#23122) #44841

Closed
wants to merge 1 commit into from

bdrum
Contributor

@bdrum bdrum commented Dec 10, 2021

  • closes Efficiency of SparseArray.__getitem__(SparseArray[bool]) #23122
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • The issue Efficiency of SparseArray.__getitem__(SparseArray[bool]) #23122 contains a recommendation from @TomAugspurger to avoid densifying when a boolean sparse array is used as an index. I have tried a few solutions, including using sp_index.indices directly when fill_value is False and np.setdiff1d for the opposite case.
    Direct indexing is very fast, but computing setdiff1d is very slow. A good solution for both cases (much faster than setdiff1d and a bit slower than direct indices) is to use masks. Here is a plot comparing the old (densifying) implementation with the new one (masks):

    I understand that using a mask takes more memory, but I don't think that is a crucial point here, especially for the bool8 type; let's discuss.
    We could combine both approaches: use direct indices when the fill value is False and a mask in the opposite case.
    One more question to discuss concerns the case where a True value of the boolean sparse array selects a position holding the fill value of the original array. I assume the fill value should be returned in that case.
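
To make the two strategies concrete, here is a minimal NumPy-only sketch, not the code from this PR. It assumes the boolean mask is available as its stored positions (`sp_indices`), the values stored at those positions (`sp_values`), a `fill_value` and a `length`. The function names `true_positions` and `true_positions_setdiff` are hypothetical and are not part of pandas.

```python
import numpy as np


def true_positions(sp_indices, sp_values, fill_value, length):
    """Positions where a sparse boolean mask is True (hypothetical helper)."""
    sp_indices = np.asarray(sp_indices, dtype=np.intp)
    sp_values = np.asarray(sp_values, dtype=bool)
    if not fill_value:
        # fill_value is False: only explicitly stored True entries can select.
        return sp_indices[sp_values]
    # fill_value is True: every position selects unless a False is stored there.
    # This allocates one dense bool array of size `length` (the memory trade-off
    # mentioned above).
    keep = np.ones(length, dtype=bool)
    keep[sp_indices[~sp_values]] = False
    return np.flatnonzero(keep)


def true_positions_setdiff(sp_indices, sp_values, length):
    """Same result for fill_value=True, computed with np.setdiff1d instead."""
    sp_indices = np.asarray(sp_indices, dtype=np.intp)
    sp_values = np.asarray(sp_values, dtype=bool)
    return np.setdiff1d(np.arange(length), sp_indices[~sp_values])


# Dense mask [True, False, True, True, False] in both sparse encodings.
from_fill_true = true_positions([1, 4], [False, False], True, 5)
from_fill_false = true_positions([0, 2, 3], [True, True, True], False, 5)
via_setdiff = true_positions_setdiff([1, 4], [False, False], 5)
print(from_fill_true, from_fill_false, via_setdiff)
```

All three calls describe the same dense mask, so they should yield the same positions. The resulting integer positions could then be passed to a positional take on the indexed array, which would return the fill value wherever a selected position is not explicitly stored.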

I've got benchmark results via pytest-benchmark framework.

Code for benchmarking:

import pytest
from pandas.arrays import SparseArray
import numpy as np


class TestPerf:
    def setup_method(self, method):
        d = 0.8  # number of random draws, as a fraction of n
        self.n = 1_000_000
        self.data = np.random.rand(self.n)
        # Draw random positions; np.unique drops repeated draws.
        genInd = lambda d: np.unique(
            np.random.randint(
                low=0, high=self.n - 1, size=int(self.n * d), dtype=np.int32
            )
        )
        self.inds = genInd(d)
        barrs = np.full(shape=self.n, fill_value=True, dtype=np.bool_)
        arr_dens_inds = genInd(d)

        # NaN is the default fill value of a float SparseArray.
        self.data[self.inds] = np.nan
        barrs[arr_dens_inds] = False

        self.sp_arrs = SparseArray(self.data)
        # Two boolean masks: one stored with fill_value=False, one with fill_value=True.
        self.sp_barrs_false = SparseArray(barrs, dtype=np.bool_, fill_value=False)
        self.sp_barrs_true = SparseArray(~barrs, dtype=np.bool_, fill_value=True)

    def perf_sparse_boolean_indexing_fv_true(self):
        return self.sp_arrs[self.sp_barrs_true]

    def perf_sparse_boolean_indexing_fv_false(self):
        return self.sp_arrs[self.sp_barrs_false]

    def test_perf_sparse_boolean_indexing_fv_true(self, benchmark):
        benchmark(self.perf_sparse_boolean_indexing_fv_true)

    def test_perf_sparse_boolean_indexing_fv_false(self, benchmark):
        benchmark(self.perf_sparse_boolean_indexing_fv_false)

@bdrum bdrum closed this Dec 13, 2021