Skip to content

PERF: short-circuit (left == right).all() comparisons #32339

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jbrockmendel opened this issue Feb 28, 2020 · 4 comments
Open

PERF: short-circuit (left == right).all() comparisons #32339

jbrockmendel opened this issue Feb 28, 2020 · 4 comments
Labels
Numeric Operations Arithmetic, Comparison, and Logical operations Performance Memory or execution speed performance

Comments

@jbrockmendel
Copy link
Member

In places like equals methods and array_equivalent, we do things like (left == right).all() or ((left == right) | (isna(left) & isna(right))).all(). For large arrays that are not equal, we can do much better with something like:

def all_match(left, right) -> bool:
    if left.dtype.kind != "i":
        # viewing as i8 will make NaNs be treated as equal
        return _all_match_i8(left.view("i8"), right.view("i8"))

    return _all_match_i8(left, right)

cdef bint _all_match_i8(const int64_t[:] left, const int64_t[:] right):
    cdef:
        Py_ssize_t i, n = len(left)

    for i in range(n):
        if left[i] != right[i]:
            return False

    return True

Some profiling results:

In [2]: arr = np.arange(10**6)                                                                                                                                                      
In [3]: arr2 = arr.copy()                                                                                                                                                           
In [4]: arr2[0] = -1                                                                                                                                                                

In [5]: %timeit np.array_equal(arr, arr2)                                                                                                                                           
831 µs ± 42.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [6]: %timeit all_match(arr, arr2)                                                                                                                                                
1.27 µs ± 58.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: %timeit np.array_equal(arr, arr)                                                                                                                                            
416 µs ± 16.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: %timeit all_match(arr, arr)                                                                                                                                                 
812 µs ± 5.84 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So in cases that short circuit early, we can get massive speedups, but this implementation is actually 2x slower in cases that dont short-circuit (for reasons that are not clear to me).

@jbrockmendel
Copy link
Member Author

cc @seberg any idea why this is so much slower in the not-short-circuit case?

@seberg
Copy link
Contributor

seberg commented Feb 28, 2020

@jbrockmendel my guess is: We are using SIMD loops in this case (at least sometimes), and cython probably does not manage to do that.

Are you sure what you are doing is a good idea though? First, -0 and 0 are not different normally. Second, there are many many NaNs, and you are making some of them return True.

@jbrockmendel
Copy link
Member Author

We are using SIMD loops in this case (at least sometimes), and cython probably does not manage to do that.

@scoder thoughts?

@scoder
Copy link

scoder commented Feb 29, 2020

SIMD

Might make a difference here, yes. Make sure your CFLAGS allow for auto-vectorisation, and that your C compiler is modern enough to figure it out for you.

It sometimes helps the compiler to make your algorithm redundant, i.e. instead of just a[i] != b[i] make sure the arrays are long enough, take two steps at once and write a[i] != b[i] or a[i+1] != b[i+1], or even using the uglier | instead of or (try to avoid that if you don't need it). That way, the code makes it clear to the C compiler that it can safely read ahead far enough to use wider SIMD instructions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Numeric Operations Arithmetic, Comparison, and Logical operations Performance Memory or execution speed performance
Projects
None yet
Development

No branches or pull requests

4 participants