PERF: improve IntegerArray fast constructor #33359

jorisvandenbossche · 2020-04-07T09:07:13Z

Trying to improve the IntegerArray constructor from its constituents.

Example test case:

a = pd.array([1, 2, 3], dtype="Int64")
values = a._data
mask = a._mask 
pd.arrays.IntegerArray(values, mask)

In [5]: %timeit pd.arrays.IntegerArray(values, mask)  
2.9 µs ± 45.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # master
775 ns ± 41.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR

So what is in the PR is one possible solution, to show the impact of using "general dtype checking functions" (that eg also need to deal with EA dtypes, getting the dtype of values being passed, etc). It is clear that here we can do a much more specialized check, and that this gives a considerable performance boost (in eg block/column-wise operations on IntegerArray, this is reconstructed many times). Some options:

do an inline specialized check as done here
define separate helper functions for those specialized checks (eg restricting to numpy dtypes)
improve performance of is_bool_dtype / is_integer_dtype for those cases (maybe putting a check like this as a "fastpath" at the beginning of those functions might avoid more costly checks done in those functions for this specific case (eg "is_bool_np_dtype")
simply don't do any validation at all in the IntegerArray constructor, and assume this is the responsibility of the caller
provide a keyword in the IntegerArray constructor to turn off validation
add a private constructor (similar as the _simple_new we have for others) that uses this fastpath

For optimal performance, having a way to don't do any validation at all (so also don't check the passed values are ndarray, or their dtypes) would actually even be better than what this PR does. But I also like the clean constructor we have now (eg no separate private constructor)

TomAugspurger

What prevents us from applying these changes to the is_<dtype>_dtype methods? Is it just that non-dtype inputs are accepted, or are there other behavior differences?

jorisvandenbossche · 2020-04-07T10:54:25Z

Eg is_bool_dtype does:

check the value (if it is an array, get the dtype, if it is a string, convert to a dtype object, etc)
check if it is a CategoricalDtype, and in that case check the dtype of the categories
check whether the passed object was an Index, in which case object dtype can be inferred to be boolean
check if it is an extension dtype, in which case we check the _is_boolean property
finally check issubclass(dtype.type, np.bool_) (which is also still slightly slower than dtype == np.bool_ or dtype.kind == "b", but I am not fully sure there are practical differences)

So we need to keep all those checks I think, but we might be able to start that function with a short-cut check for the special case of a numpy dtype (see my third bullet point above)

TomAugspurger · 2020-04-07T11:17:52Z

I'd be happy with deprecating some of those behaviors, but that's a separate issue :)

simonjayhawkins · 2020-04-07T13:41:58Z

pandas/core/arrays/integer.py

            raise TypeError(
                "values should be integer numpy array. Use "
                "the 'integer_array' function instead"
            )
-        if not (isinstance(mask, np.ndarray) and is_bool_dtype(mask.dtype)):
+        if not (isinstance(mask, np.ndarray) and mask.dtype == np.bool_):


this check is common with boolean array (boolean already uses mask.dtype == np.bool_) (and maybe string array constructor should allow a mask). can this check be moved to the base class to avoid duplication.

Also, in boolean array, we check the dim of values and mask, should we also check dim for integer array (may affect perfomance?) or in the base masked array class.

simonjayhawkins · 2020-04-07T13:49:03Z

For optimal performance, having a way to don't do any validation at all (so also don't check the passed values are ndarray, or their dtypes) would actually even be better than what this PR does. But I also like the clean constructor we have now (eg no separate private constructor)

IIUC constructing integer array from the IntegerArray constructor is not part of the public api, so just removing the validation is also an option?

and rely on mypy to check internal uses.

TomAugspurger · 2020-04-07T13:53:27Z

IIUC constructing integer array from the IntegerArray constructor is not part of the public api, so just removing the validation is also an option?

This is a bit hard, since you can do things like type(pd.array([1, 2, 3]))(...), which uses just public APIs to call the IntegerArray constructor. It's kinda hard to make a __init__ private.

jorisvandenbossche · 2020-04-07T14:20:19Z

You can't make the init private, but that doesn't mean you need to accept all kinds of values or do validation of the input in the init. It can be the documented behaviour (now, I suppose in the public init we do want some basic validation, to avoid people shooting themselves in the foot. But it doesn't need to be, or it certainly can be the bare minimum)

jorisvandenbossche · 2020-04-07T14:21:11Z

I'd be happy with deprecating some of those behaviors, but that's a separate issue :)

Do you think any of those items I listed can be deprecated?
Passing an array instead of dtype, yes (which also means the object dtype Index can't be checked anymore), but for the rest I think we want to keep those behaviours?

jbrockmendel · 2020-04-07T14:55:05Z

I'd be happy with deprecating some of those behaviors, but that's a separate issue :)

Stumbled across this comment moments after opening #33368

jbrockmendel · 2020-04-07T14:57:47Z

LGTM. The dtype.kind checks are way more performant than the is_foo_dtype, especially when checking multiple possibilities like ["i", "u"]

jorisvandenbossche · 2020-04-10T08:27:15Z

Are we ok with moving forward with this?
Even if we make the is_.._dtype checks faster (xref the other PRs/discussions), they probably can never beat dtype.kind in ["i", "u"] in case you already checked dtype to be a numpy dtype (eg is_integer_dtype will also need to check for extension integer dtype).

I am personally fine with having the more specialized check here in the constructor.

…constructor

TomAugspurger · 2020-04-10T13:27:32Z

I'm fine with the specialized / high-performance checks here.

jreback · 2020-04-10T17:35:22Z

thanks @jorisvandenbossche yep a specialized check a-ok with me

jorisvandenbossche added 2 commits April 7, 2020 08:50

PERF: improve IntegerArray fast constructor

acaa8c4

add benchmarks

357a59f

jorisvandenbossche added NA - MaskedArrays Related to pd.NA and nullable extension arrays Performance Memory or execution speed performance labels Apr 7, 2020

jorisvandenbossche added this to the 1.1 milestone Apr 7, 2020

TomAugspurger reviewed Apr 7, 2020

View reviewed changes

simonjayhawkins reviewed Apr 7, 2020

View reviewed changes

jorisvandenbossche added 2 commits April 10, 2020 10:30

Merge remote-tracking branch 'upstream/master' into perf-maskedarray-…

fd86c5f

…constructor

consolidate common checks in the base class

d1cc90e

jreback merged commit efa85af into pandas-dev:master Apr 10, 2020

jorisvandenbossche deleted the perf-maskedarray-constructor branch April 10, 2020 17:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: improve IntegerArray fast constructor #33359

PERF: improve IntegerArray fast constructor #33359

jorisvandenbossche commented Apr 7, 2020

TomAugspurger left a comment

jorisvandenbossche commented Apr 7, 2020

TomAugspurger commented Apr 7, 2020

simonjayhawkins Apr 7, 2020

simonjayhawkins commented Apr 7, 2020 •

edited

Loading

TomAugspurger commented Apr 7, 2020

jorisvandenbossche commented Apr 7, 2020

jorisvandenbossche commented Apr 7, 2020

jbrockmendel commented Apr 7, 2020

jbrockmendel commented Apr 7, 2020

jorisvandenbossche commented Apr 10, 2020

TomAugspurger commented Apr 10, 2020

jreback commented Apr 10, 2020

PERF: improve IntegerArray fast constructor #33359

PERF: improve IntegerArray fast constructor #33359

Conversation

jorisvandenbossche commented Apr 7, 2020

TomAugspurger left a comment

Choose a reason for hiding this comment

jorisvandenbossche commented Apr 7, 2020

TomAugspurger commented Apr 7, 2020

simonjayhawkins Apr 7, 2020

Choose a reason for hiding this comment

simonjayhawkins commented Apr 7, 2020 • edited Loading

TomAugspurger commented Apr 7, 2020

jorisvandenbossche commented Apr 7, 2020

jorisvandenbossche commented Apr 7, 2020

jbrockmendel commented Apr 7, 2020

jbrockmendel commented Apr 7, 2020

jorisvandenbossche commented Apr 10, 2020

TomAugspurger commented Apr 10, 2020

jreback commented Apr 10, 2020

simonjayhawkins commented Apr 7, 2020 •

edited

Loading