-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: improve IntegerArray fast constructor #33359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: improve IntegerArray fast constructor #33359
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What prevents us from applying these changes to the is_<dtype>_dtype
methods? Is it just that non-dtype inputs are accepted, or are there other behavior differences?
Eg
So we need to keep all those checks I think, but we might be able to start that function with a short-cut check for the special case of a numpy dtype (see my third bullet point above) |
I'd be happy with deprecating some of those behaviors, but that's a separate issue :) |
pandas/core/arrays/integer.py
Outdated
raise TypeError( | ||
"values should be integer numpy array. Use " | ||
"the 'integer_array' function instead" | ||
) | ||
if not (isinstance(mask, np.ndarray) and is_bool_dtype(mask.dtype)): | ||
if not (isinstance(mask, np.ndarray) and mask.dtype == np.bool_): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this check is common with boolean array (boolean already uses mask.dtype == np.bool_) (and maybe string array constructor should allow a mask). can this check be moved to the base class to avoid duplication.
Also, in boolean array, we check the dim of values and mask, should we also check dim for integer array (may affect perfomance?) or in the base masked array class.
IIUC constructing integer array from the IntegerArray constructor is not part of the public api, so just removing the validation is also an option? and rely on mypy to check internal uses. |
This is a bit hard, since you can do things like |
You can't make the init private, but that doesn't mean you need to accept all kinds of values or do validation of the input in the init. It can be the documented behaviour (now, I suppose in the public init we do want some basic validation, to avoid people shooting themselves in the foot. But it doesn't need to be, or it certainly can be the bare minimum) |
Do you think any of those items I listed can be deprecated? |
Stumbled across this comment moments after opening #33368 |
LGTM. The dtype.kind checks are way more performant than the is_foo_dtype, especially when checking multiple possibilities like |
Are we ok with moving forward with this? I am personally fine with having the more specialized check here in the constructor. |
I'm fine with the specialized / high-performance checks here. |
thanks @jorisvandenbossche yep a specialized check a-ok with me |
Trying to improve the IntegerArray constructor from its constituents.
Example test case:
So what is in the PR is one possible solution, to show the impact of using "general dtype checking functions" (that eg also need to deal with EA dtypes, getting the dtype of values being passed, etc). It is clear that here we can do a much more specialized check, and that this gives a considerable performance boost (in eg block/column-wise operations on IntegerArray, this is reconstructed many times). Some options:
is_bool_dtype
/is_integer_dtype
for those cases (maybe putting a check like this as a "fastpath" at the beginning of those functions might avoid more costly checks done in those functions for this specific case (eg "is_bool_np_dtype")_simple_new
we have for others) that uses this fastpathFor optimal performance, having a way to don't do any validation at all (so also don't check the passed values are ndarray, or their dtypes) would actually even be better than what this PR does. But I also like the clean constructor we have now (eg no separate private constructor)