Skip to content

PERF: BoolHashTable #15751

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks
chris-b1 opened this issue Mar 20, 2017 · 2 comments
Closed
3 tasks

PERF: BoolHashTable #15751

chris-b1 opened this issue Mar 20, 2017 · 2 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Internals Related to non-user accessible pandas implementation Performance Memory or execution speed performance

Comments

@chris-b1
Copy link
Contributor

chris-b1 commented Mar 20, 2017

xref #15738 (comment)

Currently bool data is passed to the generic python object hashtable for the following methods, making them all very slow.

  • value_counts
  • rank
  • unique

Linked PR casts to int for factorize, duplicated, drop_duplicates

We could skip the hashing altogether and take advantage of the fixed set of values, e.g. below is a fastpath for unique.

%%cython
from numpy cimport uint8_t
cimport cython

import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def unique(uint8_t[:] data):
    cdef:
        bint seen_true = False
        bint seen_false = False
        Py_ssize_t i, N = len(data)
        uint8_t val
    with nogil:
        for i in range(N):
            val = data[i]
            if val == 0:
                seen_false = True
            elif val == 1:
                seen_true = True
            if seen_true and seen_false:
                break
    if seen_true and seen_false:
        return np.array([True, False])
    elif seen_true:
        return np.array([True])
    elif seen_false:
        return np.array([False])
    else:
        return np.array([], dtype=bool)

In [35]: bools = np.array([False, True] * 100000)

In [36]: %timeit pd.unique(bools.view('uint8')).astype(bool)
1000 loops, best of 3: 1.74 ms per loop

In [37]: %timeit unique(bools.view('uint8'))
100000 loops, best of 3: 3.47 µs per loop
@chris-b1 chris-b1 added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Dtype Conversions Unexpected or buggy dtype conversions Performance Memory or execution speed performance labels Mar 20, 2017
@chris-b1 chris-b1 added this to the Next Major Release milestone Mar 20, 2017
@jreback
Copy link
Contributor

jreback commented Mar 20, 2017

can create a new issue or maybe add checkboxes for perf for bools for .value_counts, rank, duplicated?

@mroeschke mroeschke added Internals Related to non-user accessible pandas implementation and removed Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels May 8, 2021
@jbrockmendel
Copy link
Member

We now have a UInt8HashTable which core.algorithms functions dispatch bool dtype to. closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Internals Related to non-user accessible pandas implementation Performance Memory or execution speed performance
Projects
None yet
Development

No branches or pull requests

4 participants