-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: fastpaths in is_foo_dtype checks #33400
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Related to the discussion we had about this on the chat, to give an example of the strategy that I tried to explain: The current def is_categorical_dtype(arr_or_dtype) -> bool:
if arr_or_dtype is None:
return False
return CategoricalDtype.is_dtype(arr_or_dtype) where So for this case, we can put the fast check as in this PR first: def is_categorical_dtype(arr_or_dtype) -> bool:
if isinstance(dtype, ExtensionDtype):
# fast check for extension dtype
return dtype.name == "category"
elif isinstance(dtype, np.dtype):
# fast check for numpy dtype (always False)
return False
elif arr_or_dtype is None:
return False
# keep the slow check if the fast ones didn't return
return CategoricalDtype.is_dtype(arr_or_dtype) And in that way, we don't have to change every occurrence of It might be that categorical is a simple example, but I would think that a similar pattern can be followed for the others as well. |
Also as discussed on the chat, the advantage of the dtype-only versions is that we dont risk silently going down the slow path. |
yep i would start by using the approach outlined above. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to be changed to use the approach @jorisvandenbossche outlined, e.g. not changing the names of the current is_* functions, just enhancing the impl with a fast path.
I'm happy to update the existing functions to be faster, but we should also have the strict versions internally to make sure we actually use the fastpath |
Actually, adding the fastpath to the existing checks will actually slow them down slightly in cases where we're not passing a dtype obj |
I think I'm OK with slowing down the non-dtype case. I'm also OK with deprecating non-dtype entirely, but that's a larger task that can be saved for later. Short-term, I think we can rely on reviewers to catch places where we're passing non-dtype's to Edit: all of the above is contingent with actually being able to quickly do an |
agreed here |
One more doomed pitch for the strict versions: they're really nice dependency-structure-wise. ATM dtypes.common depends on dtypes.dtypes, which has runtime imports and depends on a bunch of the code. With just the strict versions, we can get dtypes.common (which we import from basically everywhere) to depend only on things "above" it in the dependency structure. I know others dont care about the dependency structure as much as I do, will update. |
updated+green |
* PERF: implement dtype-only dtype checks * remove strict versions
* PERF: implement dtype-only dtype checks * remove strict versions
xref #33364, partially addresses #33368
For exposition I used the new type checking functions in parsers.pyx.
For the implementation I chose
.type
and.kind
checks in order to avoid needing the imports fromdtypes.dtypes
, which has dependencies on a bunch of other parts of the code.