PERF: fastpaths in is_foo_dtype checks #33400

jbrockmendel · 2020-04-08T16:39:15Z

xref #33364, partially addresses #33368

For exposition I used the new type checking functions in parsers.pyx.

For the implementation I chose .type and .kind checks in order to avoid needing the imports from dtypes.dtypes, which has dependencies on a bunch of other parts of the code.

In [1]: import pandas as pd
In [2]: from pandas.core.dtypes.common import *
In [3]: cat = pd.Categorical([])
In [4]: arr = np.arange(5)

In [5]: %timeit is_categorical_dtype(cat.dtype)
1.57 µs ± 34.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [6]: %timeit is_cat_dtype(cat.dtype)
316 ns ± 3.78 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: %timeit is_extension_array_dtype(cat.dtype)
364 ns ± 5.39 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [8]: %timeit is_ea_dtype(cat.dtype)
270 ns ± 7.31 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [9]: %timeit is_extension_array_dtype(arr.dtype)
759 ns ± 5.16 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [10]: %timeit is_ea_dtype(arr.dtype)
199 ns ± 8.77 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
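
To make the `.kind` / `.type` idea from the description concrete, here is a minimal sketch of dtype-only checks that need nothing beyond numpy. The helper names (`is_datetime64_kind`, `is_integer_kind`, `is_bool_type`) are hypothetical and are not the functions added in this PR; the sketch only illustrates that such checks are plain attribute comparisons with no dependency on dtypes.dtypes.

```python
import numpy as np


def is_datetime64_kind(dtype) -> bool:
    # Hypothetical helper: numpy datetime64 dtypes report kind "M".
    return dtype.kind == "M"


def is_integer_kind(dtype) -> bool:
    # Signed integers have kind "i"; unsigned integers have kind "u".
    return dtype.kind in ("i", "u")


def is_bool_type(dtype) -> bool:
    # The scalar type attached to a numpy bool dtype is np.bool_.
    return dtype.type is np.bool_


arr = np.arange(5)
print(is_integer_kind(arr.dtype))
print(is_datetime64_kind(np.dtype("datetime64[ns]")))
print(is_bool_type(arr.dtype))
```

Because these helpers access `.kind` and `.type` directly, passing anything that is not a dtype-like object fails with an `AttributeError` instead of falling back to a slower inference path.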

jorisvandenbossche · 2020-04-10T08:23:09Z

Related to the discussion we had about this on the chat, to give an example of the strategy that I tried to explain:

The current is_categorical_dtype is defined as:

def is_categorical_dtype(arr_or_dtype) -> bool:
    if arr_or_dtype is None:
        return False
    return CategoricalDtype.is_dtype(arr_or_dtype)

where CategoricalDtype.is_dtype() is the catch-all / check-all slow check.

So for this case, we can put the fast check as in this PR first:

def is_categorical_dtype(arr_or_dtype) -> bool:
    if isinstance(arr_or_dtype, ExtensionDtype):
        # fast check for extension dtype
        return arr_or_dtype.name == "category"
    elif isinstance(arr_or_dtype, np.dtype):
        # fast check for numpy dtype (always False)
        return False
    elif arr_or_dtype is None:
        return False
    # keep the slow check if the fast ones didn't return
    return CategoricalDtype.is_dtype(arr_or_dtype)

And in that way, we don't have to change every occurrence of is_categorical_dtype internally to the fast function.

It might be that categorical is a simple example, but I would think that a similar pattern can be followed for the others as well.
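
The fast-path-then-fallback pattern above can be tried as a self-contained sketch against the public pandas API. The function name `is_categorical_fastpath` is hypothetical (chosen so it does not shadow the real `is_categorical_dtype`), and this is an illustration of the strategy under discussion, not the code that was merged.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionDtype


def is_categorical_fastpath(arr_or_dtype) -> bool:
    """Illustrative sketch only; not the function pandas ships."""
    if isinstance(arr_or_dtype, ExtensionDtype):
        # Fast path: an extension dtype is categorical iff its name says so.
        return arr_or_dtype.name == "category"
    if isinstance(arr_or_dtype, np.dtype):
        # Fast path: a numpy dtype is never categorical.
        return False
    if arr_or_dtype is None:
        return False
    # Slow path: arrays, Series, strings, and anything else.
    return pd.CategoricalDtype.is_dtype(arr_or_dtype)


print(is_categorical_fastpath(pd.CategoricalDtype()))  # fast path
print(is_categorical_fastpath(np.dtype("int64")))      # fast path
print(is_categorical_fastpath(pd.Categorical([])))     # slow path
```

Callers that already hold a dtype object return after one or two `isinstance` checks, while callers that pass an array or a string pay for those checks before reaching the slow path.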

jbrockmendel · 2020-04-10T14:59:35Z

> Related to the discussion we had about this on the chat, to give an example of the strategy that I tried to explain:

Also as discussed on the chat, the advantage of the dtype-only versions is that we don't risk silently going down the slow path.

jreback · 2020-04-10T15:54:55Z

> Related to the discussion we had about this on the chat, to give an example of the strategy that I tried to explain:

> Also as discussed on the chat, the advantage of the dtype-only versions is that we don't risk silently going down the slow path.

yep i would start by using the approach outlined above.

jreback

this needs to be changed to use the approach @jorisvandenbossche outlined, i.e. not changing the names of the current is_* functions, just enhancing the implementation with a fast path.

jbrockmendel · 2020-04-16T18:49:31Z

I'm happy to update the existing functions to be faster, but we should also have the strict versions internally to make sure we actually use the fastpath
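
One hypothetical way to picture a "strict" check is a function that refuses non-dtype input outright, so a caller can never end up on the slow path without noticing. The name `is_float_dtype_strict` and the explicit `TypeError` are assumptions made for illustration; the strict versions proposed in this PR may have behaved differently, and this sketch accepts only numpy dtypes to stay self-contained.

```python
import numpy as np


def is_float_dtype_strict(dtype) -> bool:
    # Hypothetical strict check: only dtype objects are accepted, so there
    # is no array/string inference to fall back on.
    if not isinstance(dtype, np.dtype):
        raise TypeError(f"expected a numpy dtype, got {type(dtype).__name__}")
    return dtype.kind == "f"


arr = np.array([1.0, 2.0])
print(is_float_dtype_strict(arr.dtype))

try:
    is_float_dtype_strict(arr)  # passing the array itself is rejected
except TypeError as err:
    print(err)
```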

jbrockmendel · 2020-04-16T18:57:34Z

Actually, adding the fastpath to the existing checks will slow them down slightly in cases where we're not passing a dtype object.

TomAugspurger · 2020-04-16T19:41:42Z

I think I'm OK with slowing down the non-dtype case.

I'm also OK with deprecating non-dtype entirely, but that's a larger task that can be saved for later. Short-term, I think we can rely on reviewers to catch places where we're passing non-dtypes to is_foo_dtype.

Edit: all of the above is contingent on actually being able to quickly do an is_foo_dtype on a dtype as a fastpath. I only checked for categorical. Hopefully it's doable for the rest.

jreback · 2020-04-16T20:42:41Z

> I think I'm OK with slowing down the non-dtype case.

> I'm also OK with deprecating non-dtype entirely, but that's a larger task that can be saved for later. Short-term, I think we can rely on reviewers to catch places where we're passing non-dtypes to is_foo_dtype.

> Edit: all of the above is contingent on actually being able to quickly do an is_foo_dtype on a dtype as a fastpath. I only checked for categorical. Hopefully it's doable for the rest.

agreed here

jbrockmendel · 2020-04-16T20:47:40Z

One more doomed pitch for the strict versions: they're really nice dependency-structure-wise. ATM dtypes.common depends on dtypes.dtypes, which has runtime imports and depends on a bunch of the code. With just the strict versions, we can get dtypes.common (which we import from basically everywhere) to depend only on things "above" it in the dependency structure.

I know others don't care about the dependency structure as much as I do; will update.

jbrockmendel · 2020-04-16T22:50:55Z

updated+green

* PERF: implement dtype-only dtype checks * remove strict versions

jbrockmendel added 2 commits April 7, 2020 18:21

PERF: implement dtype-only dtype checks

c25cad9

Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…

4280236

…rf-dtype-checks

jbrockmendel added the Performance (Memory or execution speed performance) label Apr 9, 2020

jreback added the Dtype Conversions (Unexpected or buggy dtype conversions) label Apr 10, 2020

jbrockmendel mentioned this pull request Apr 16, 2020

PERF: statically define classes for is_dtype checks #33364

Closed

jreback requested changes Apr 16, 2020

Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…

a68a579

…rf-dtype-checks

jbrockmendel added 2 commits April 16, 2020 13:50

remove strict versions

a56482b

Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…

e8a6a36

…rf-dtype-checks

jbrockmendel changed the title ~~PERF: dtype-only is_foo_dtype checks (up to 5x faster)~~ PERF: fastpaths in is_foo_dtype checks Apr 16, 2020

jreback approved these changes Apr 17, 2020

jreback added this to the 1.1 milestone Apr 17, 2020

jreback merged commit 1fa0635 into pandas-dev:master Apr 17, 2020

jbrockmendel deleted the perf-dtype-checks branch April 17, 2020 03:01

CloseChoice pushed a commit to CloseChoice/pandas that referenced this pull request Apr 20, 2020

PERF: fastpaths in is_foo_dtype checks (pandas-dev#33400)

fb582e5

* PERF: implement dtype-only dtype checks * remove strict versions

rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020

PERF: fastpaths in is_foo_dtype checks (pandas-dev#33400)

e2525f5

* PERF: implement dtype-only dtype checks * remove strict versions

xhochy mentioned this pull request Jun 25, 2020

BUG: Cannot create third-party ExtensionArrays for datetime types #34986

Closed

3 tasks

dsaxton mentioned this pull request Aug 21, 2020

BUG: Sparse[datetime64[ns]] TypeError: data type not understood #35762

Closed

3 tasks

simonjayhawkins mentioned this pull request Sep 7, 2020

REGR: Series.__repr__ is broken for SparseDtype("datetime64[ns]") #35843

Closed
