ENH: Add fast array equal function for indexers #50592

phofl · 2023-01-05T21:58:51Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

The all-equal case is as fast as NumPy on 2 million entries; the early exit is, of course, significantly faster.

pandas/_libs/lib.pyx

lithomas1 · 2023-01-06T01:10:12Z

Scrolling through issues, it looks like this is basically #32339.
(Takeaway from there was that perf for non short-circuit case needed to be checked. It's pretty hard to beat numpy's SIMD extensions.)

phofl · 2023-01-06T07:36:23Z

Commented on this in the initial post. Both run in around 700 µs if the whole array is checked (2 million entries). Ours is actually a bit faster, but not by much.

NumPy is doing some things that are more expensive, e.g. allocating a Boolean array and then calling all on it.

Edit: Looks like NumPy is faster with fewer elements, but not by that much:
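To illustrate the difference described above, here is a minimal pure-Python sketch. It is not the Cython code from this PR: `array_equal_early_exit` and `array_equal_via_mask` are hypothetical names, and the Python loop is far slower than a compiled one. It only shows the logic of stopping at the first mismatch versus building a full Boolean mask first.

```python
import numpy as np


def array_equal_early_exit(left: np.ndarray, right: np.ndarray) -> bool:
    """Hypothetical stand-in: compare 1-D arrays, stopping at the first mismatch."""
    if left.shape != right.shape:
        return False
    for a, b in zip(left, right):
        if a != b:
            # No need to look at the remaining elements.
            return False
    return True


def array_equal_via_mask(left: np.ndarray, right: np.ndarray) -> bool:
    """Mask-based comparison: allocates a Boolean array, then reduces it."""
    if left.shape != right.shape:
        return False
    return bool((left == right).all())


left = np.arange(10, dtype=np.int64)
right = left.copy()
right[0] = -1  # mismatch at the very first position
```

With a mismatch at position 0, the early-exit version inspects a single pair of elements, while the mask-based version still compares every element before reducing.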

10**6
%timeit np.array_equal(left, right)
269 µs ± 5.64 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit array_equal_fast(left, right)
303 µs ± 2.44 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

and with 2 million:

%timeit array_equal_fast(left, right)
701 µs ± 79.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit np.array_equal(left, right)
729 µs ± 7.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Edit2: I'd try using this only for indexers in CoW at first. The advantage is that if an indexer is equal, we avoid a copy, which brings a big speedup. And if it's not equal, we should find differing values early on.
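A rough sketch of the idea, under stated assumptions: `take_rows` is a hypothetical helper, not a pandas API, and "equal" is assumed here to mean that the indexer matches the identity positions `0..n-1`. `np.array_equal` stands in for the fast comparison.

```python
import numpy as np


def take_rows(values: np.ndarray, indexer: np.ndarray) -> np.ndarray:
    """Hypothetical helper: skip the copy when the indexer selects everything in order."""
    # Assumption: an "equal" indexer is one that matches 0..n-1.
    identity = np.arange(len(values), dtype=indexer.dtype)
    if np.array_equal(indexer, identity):
        # Nothing is reordered, so a view of the original data is enough.
        return values.view()
    # Otherwise materialize a new array with the requested rows.
    return values.take(indexer)


values = np.array([10, 20, 30])
same = take_rows(values, np.array([0, 1, 2]))   # shares memory with values
other = take_rows(values, np.array([2, 0, 1]))  # new array
```

The comparison only pays for a full scan when the indexer really is the identity, which is exactly the case where the saved copy outweighs it.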

jorisvandenbossche · 2023-01-06T08:21:31Z

pandas/_libs/lib.pyx

+@cython.wraparound(False)
+@cython.boundscheck(False)
+def array_equal_fast(
+    ndarray[int6432_t, ndim=1] left, ndarray[int6432_t, ndim=1] right,


I checked locally a little bit, and using a memoryview (int[:]) instead of ndarray doesn't seem to give a significant speedup here (I seemed to remember that those could sometimes be faster).

jorisvandenbossche

Looks good to me. For me locally it also gives more or less the same timing as np.array_equal (for fully equal arrays).

lithomas1

LGTM. Nice speedup!

phofl added 2 commits January 5, 2023 22:58

ENH: Add fast array equal function for indexers

43eb356

Add gh ref

3ec714e

phofl requested a review from jorisvandenbossche January 5, 2023 21:59

lithomas1 reviewed Jan 5, 2023

phofl added 2 commits January 5, 2023 23:47

Fix cython code

c9bf2ea

Remove nogil

b359e49

jorisvandenbossche reviewed Jan 6, 2023

jorisvandenbossche approved these changes Jan 6, 2023

phofl and others added 2 commits January 6, 2023 13:45

Merge branch 'main' into enh_array_equal_fast

bd0eaaa

Fix types

0145e9a

lithomas1 approved these changes Jan 6, 2023

phofl added this to the 2.0 milestone Jan 6, 2023

phofl added the Copy / view semantics label Jan 6, 2023

phofl merged commit 857bf37 into pandas-dev:main Jan 6, 2023

phofl deleted the enh_array_equal_fast branch January 6, 2023 18:40
