Skip to content

PERF: improve IntegerArray fast constructor #33359

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

jorisvandenbossche
Copy link
Member

Trying to improve the IntegerArray constructor from its constituents.

Example test case:

a = pd.array([1, 2, 3], dtype="Int64")
values = a._data
mask = a._mask 
pd.arrays.IntegerArray(values, mask)
In [5]: %timeit pd.arrays.IntegerArray(values, mask)  
2.9 µs ± 45.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # master
775 ns ± 41.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # PR

So what is in the PR is one possible solution, to show the impact of using "general dtype checking functions" (that eg also need to deal with EA dtypes, getting the dtype of values being passed, etc). It is clear that here we can do a much more specialized check, and that this gives a considerable performance boost (in eg block/column-wise operations on IntegerArray, this is reconstructed many times). Some options:

  • do an inline specialized check as done here
  • define separate helper functions for those specialized checks (eg restricting to numpy dtypes)
  • improve performance of is_bool_dtype / is_integer_dtype for those cases (maybe putting a check like this as a "fastpath" at the beginning of those functions might avoid more costly checks done in those functions for this specific case (eg "is_bool_np_dtype")
  • simply don't do any validation at all in the IntegerArray constructor, and assume this is the responsibility of the caller
  • provide a keyword in the IntegerArray constructor to turn off validation
  • add a private constructor (similar as the _simple_new we have for others) that uses this fastpath

For optimal performance, having a way to don't do any validation at all (so also don't check the passed values are ndarray, or their dtypes) would actually even be better than what this PR does. But I also like the clean constructor we have now (eg no separate private constructor)

@jorisvandenbossche jorisvandenbossche added NA - MaskedArrays Related to pd.NA and nullable extension arrays Performance Memory or execution speed performance labels Apr 7, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.1 milestone Apr 7, 2020
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What prevents us from applying these changes to the is_<dtype>_dtype methods? Is it just that non-dtype inputs are accepted, or are there other behavior differences?

@jorisvandenbossche
Copy link
Member Author

Eg is_bool_dtype does:

  • check the value (if it is an array, get the dtype, if it is a string, convert to a dtype object, etc)
  • check if it is a CategoricalDtype, and in that case check the dtype of the categories
  • check whether the passed object was an Index, in which case object dtype can be inferred to be boolean
  • check if it is an extension dtype, in which case we check the _is_boolean property
  • finally check issubclass(dtype.type, np.bool_) (which is also still slightly slower than dtype == np.bool_ or dtype.kind == "b", but I am not fully sure there are practical differences)

So we need to keep all those checks I think, but we might be able to start that function with a short-cut check for the special case of a numpy dtype (see my third bullet point above)

@TomAugspurger
Copy link
Contributor

I'd be happy with deprecating some of those behaviors, but that's a separate issue :)

raise TypeError(
"values should be integer numpy array. Use "
"the 'integer_array' function instead"
)
if not (isinstance(mask, np.ndarray) and is_bool_dtype(mask.dtype)):
if not (isinstance(mask, np.ndarray) and mask.dtype == np.bool_):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check is common with boolean array (boolean already uses mask.dtype == np.bool_) (and maybe string array constructor should allow a mask). can this check be moved to the base class to avoid duplication.

Also, in boolean array, we check the dim of values and mask, should we also check dim for integer array (may affect perfomance?) or in the base masked array class.

@simonjayhawkins
Copy link
Member

simonjayhawkins commented Apr 7, 2020

For optimal performance, having a way to don't do any validation at all (so also don't check the passed values are ndarray, or their dtypes) would actually even be better than what this PR does. But I also like the clean constructor we have now (eg no separate private constructor)

IIUC constructing integer array from the IntegerArray constructor is not part of the public api, so just removing the validation is also an option?

and rely on mypy to check internal uses.

@TomAugspurger
Copy link
Contributor

IIUC constructing integer array from the IntegerArray constructor is not part of the public api, so just removing the validation is also an option?

This is a bit hard, since you can do things like type(pd.array([1, 2, 3]))(...), which uses just public APIs to call the IntegerArray constructor. It's kinda hard to make a __init__ private.

@jorisvandenbossche
Copy link
Member Author

You can't make the init private, but that doesn't mean you need to accept all kinds of values or do validation of the input in the init. It can be the documented behaviour (now, I suppose in the public init we do want some basic validation, to avoid people shooting themselves in the foot. But it doesn't need to be, or it certainly can be the bare minimum)

@jorisvandenbossche
Copy link
Member Author

I'd be happy with deprecating some of those behaviors, but that's a separate issue :)

Do you think any of those items I listed can be deprecated?
Passing an array instead of dtype, yes (which also means the object dtype Index can't be checked anymore), but for the rest I think we want to keep those behaviours?

@jbrockmendel
Copy link
Member

I'd be happy with deprecating some of those behaviors, but that's a separate issue :)

Stumbled across this comment moments after opening #33368

@jbrockmendel
Copy link
Member

LGTM. The dtype.kind checks are way more performant than the is_foo_dtype, especially when checking multiple possibilities like ["i", "u"]

@jorisvandenbossche
Copy link
Member Author

Are we ok with moving forward with this?
Even if we make the is_.._dtype checks faster (xref the other PRs/discussions), they probably can never beat dtype.kind in ["i", "u"] in case you already checked dtype to be a numpy dtype (eg is_integer_dtype will also need to check for extension integer dtype).

I am personally fine with having the more specialized check here in the constructor.

@TomAugspurger
Copy link
Contributor

I'm fine with the specialized / high-performance checks here.

@jreback jreback merged commit efa85af into pandas-dev:master Apr 10, 2020
@jreback
Copy link
Contributor

jreback commented Apr 10, 2020

thanks @jorisvandenbossche yep a specialized check a-ok with me

@jorisvandenbossche jorisvandenbossche deleted the perf-maskedarray-constructor branch April 10, 2020 17:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NA - MaskedArrays Related to pd.NA and nullable extension arrays Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants