PERF: improve performance of infer_dtype #51054

topper-123 · 2023-01-29T12:20:05Z

Improves performance of api.types.infer_dtype:

>>> import numpy as np
>>> import pandas as pd
>>> from pandas.api.types import infer_dtype
>>>
>>> base_arr = np.arange(10_000_000)
>>> x = np.array(base_arr)
>>> %timeit infer_dtype(x)
596 ns ± 128 ns per loop  # main
146 ns ± 8.73 ns per loop  # this PR
>>> x = pd.Series(base_arr)
>>> %timeit infer_dtype(x)
6.83 µs ± 33.7 ns per loop  # main
583 ns ± 5.4 ns per loop  # this PR
>>> x = pd.Series(base_arr, dtype="Int32")
>>> %timeit infer_dtype(x)
3.92 µs ± 9.9 ns per loop  # main
816 ns ± 4.75 ns per loop  # this PR
>>> x = pd.array(base_arr)
>>> %timeit infer_dtype(x)
310 ns ± 0.704 ns per loop  # main
582 ns ± 4.73 ns per loop  # this PR, a bit slower

jbrockmendel · 2023-01-29T20:47:28Z

pandas/_libs/lib.pyx

@@ -1475,11 +1482,17 @@ def infer_dtype(value: object, skipna: bool = True) -> str:
        bint seen_val = False
        flatiter it

+    if not util.is_array(value):
+        if isinstance(value, (ABCSeries, ABCExtensionArray, ABCPandasArray)):


any way to do this without relying on the ABCFoos?

topper-123 · 2023-01-29

I agree it's not ideal, but it was the best solution I could find that keeps the performance:

Importing Series, Index etc. into the global namespace is not possible because of circular import issues.

Importing Series, Index etc. inside infer_dtype is possible, but costs 300 ns per imported class for each function call, so it adds up to something very noticeable.

I thought of checking class names as a guard before importing inside the function, but that gives problems with subclasses.
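
The subclass problem with a class-name guard can be sketched in plain Python. The classes below are hypothetical stand-ins, not the pandas classes, and the helper names are made up for illustration:

```python
class Series:
    """Hypothetical stand-in for a container class (not pandas.Series)."""


class MySeries(Series):
    """A user-defined subclass, as third-party libraries commonly create."""


def is_series_by_name(obj):
    # Cheap guard: compares only the name of the exact runtime class.
    return type(obj).__name__ == "Series"


def is_series_by_isinstance(obj):
    # Walks the class hierarchy, so subclasses are recognised too.
    return isinstance(obj, Series)


plain = Series()
sub = MySeries()

print(is_series_by_name(plain), is_series_by_isinstance(plain))
print(is_series_by_name(sub), is_series_by_isinstance(sub))
```

The name check misses `MySeries` even though it is a `Series`, which is why a name-based guard alone would skip the unwrapping step for subclasses.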

The advantage of importing ABCFoo into the global namespace is that it's a one-time import, so there is no cost per function call and it's a nice speed boost. But I also see the problem of importing from Python land.

IMO the speedup is worth this, but I can also see it's not ideal, so no biggie if it doesn't get accepted.
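
The per-call cost of a function-level import can be measured with a small sketch like the one below. It uses `collections.OrderedDict` as a stand-in for the pandas classes, so the absolute numbers will differ from the 300 ns quoted above; the function names are illustrative only:

```python
import timeit
from collections import OrderedDict  # one-time, module-level import


def check_global(obj):
    # The class is resolved from module globals; no import machinery runs.
    return isinstance(obj, OrderedDict)


def check_local(obj):
    # The import statement executes on every call: the module is looked up
    # in sys.modules and the name is rebound before isinstance can run.
    from collections import OrderedDict as LocalOrderedDict

    return isinstance(obj, LocalOrderedDict)


def time_call(func, obj, number=200_000):
    # Average seconds per call over `number` calls.
    return timeit.timeit(lambda: func(obj), number=number) / number


sample = OrderedDict(a=1)
t_global = time_call(check_global, sample)
t_local = time_call(check_local, sample)
print(f"module-level import:   {t_global * 1e9:.0f} ns per call")
print(f"function-level import: {t_local * 1e9:.0f} ns per call")
```

Both functions return the same answer; only the cost per call differs, and the difference depends on the machine and Python version.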

jbrockmendel · 2023-01-30

Definitely not suggesting using the non-ABC versions. My thought is that, without adding this check here, we could do the _try_infer_map(value.dtype) check in the elif hasattr(value, "dtype") block before the "Unwrap Series/Index" bit.
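
A rough pure-Python sketch of that ordering follows. It is not the Cython code from lib.pyx: the lookup table is a trimmed-down illustration, `try_infer_from_dtype` and `infer_kind` are hypothetical names, and `SimpleNamespace` objects stand in for arrays and dtypes:

```python
from types import SimpleNamespace

# Illustrative subset of a dtype lookup table (assumed, not the real one).
_KIND_MAP = {
    "i": "integer",
    "u": "integer",
    "f": "floating",
    "c": "complex",
    "b": "boolean",
}


def try_infer_from_dtype(dtype):
    # Answer from dtype metadata alone; None means "cannot tell".
    for attr in ("name", "kind"):
        val = getattr(dtype, attr, None)
        if val in _KIND_MAP:
            return _KIND_MAP[val]
    return None


def infer_kind(value):
    if hasattr(value, "dtype"):
        # Fast path first: no unwrapping and no isinstance checks needed.
        inferred = try_infer_from_dtype(value.dtype)
        if inferred is not None:
            return inferred
        # Only now unwrap the container (stand-in for "Unwrap Series/Index").
        value = getattr(value, "_values", value)
    # Placeholder for the element-by-element scan done by the real function.
    return "needs-scan"


int_like = SimpleNamespace(dtype=SimpleNamespace(name="int64", kind="i"))
obj_like = SimpleNamespace(dtype=SimpleNamespace(name="object", kind="O"))
print(infer_kind(int_like))
print(infer_kind(obj_like))
```

With this ordering, a value whose dtype already determines the answer never reaches the unwrapping step, so no class check is needed on the common path.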

topper-123 · 2023-02-05T13:02:17Z

I've made a new version without ABCs.

topper-123 · 2023-02-05T13:02:50Z

pandas/core/indexes/base.py

-                "u": "integer",
-                "f": "floating",
-                "c": "complex",
-            }[self.dtype.kind]


This is no longer needed for a speed-up.

topper-123 · 2023-02-05T13:04:13Z

pandas/_libs/lib.pyx

@@ -1484,23 +1484,17 @@ def infer_dtype(value: object, skipna: bool = True) -> str:

    if util.is_array(value):
        values = value
-    elif hasattr(value, "inferred_type") and skipna is False:
+    elif hasattr(type(value), "inferred_type") and skipna is False:


The issue in the old version was that using hasattr on a cache_readonly is slow. By doing this we get a good speed-up.
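
The difference between the two `hasattr` calls can be shown with a small sketch. It uses `functools.cached_property` as a stand-in for pandas' `cache_readonly` (an assumption; the two are similar in spirit but not identical), and `FakeIndex` is a hypothetical class:

```python
from functools import cached_property


class FakeIndex:
    """Hypothetical class with a lazily computed attribute."""

    computations = 0  # counts how often the expensive getter runs

    @cached_property
    def inferred_type(self):
        FakeIndex.computations += 1
        return "integer"


idx = FakeIndex()

# Checking the class only looks up the descriptor; the getter does not run.
has_on_type = hasattr(type(idx), "inferred_type")
count_after_type_check = FakeIndex.computations

# Checking the instance performs a real attribute access, which runs the getter.
has_on_instance = hasattr(idx, "inferred_type")
count_after_instance_check = FakeIndex.computations

print(has_on_type, count_after_type_check)
print(has_on_instance, count_after_instance_check)
```

Both calls report that the attribute exists, but only the instance-level check pays for computing its value.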

topper-123 · 2023-02-05T19:09:16Z

Rerunning the tests gives these results:

>>> import numpy as np
>>> import pandas as pd
>>> from pandas.api.types import infer_dtype
>>>
>>> base_arr = np.arange(10_000_000)
>>> x = np.array(base_arr)
>>> %timeit infer_dtype(x)
596 ns ± 128 ns per loop  # main
146 ns ± 8.73 ns per loop  # this PR, v1
140 ns ± 0.0592 ns per loop  # this PR, v2
>>> x = pd.Series(base_arr)
>>> %timeit infer_dtype(x)
6.83 µs ± 33.7 ns per loop  # main
583 ns ± 5.4 ns per loop  # this PR, v1
725 ns ± 5.24 ns per loop  # this PR, v2
>>> x = pd.Series(base_arr, dtype="Int32")
>>> %timeit infer_dtype(x)
3.92 µs ± 9.9 ns per loop  # main
816 ns ± 4.75 ns per loop  # this PR, v1
765 ns ± 4.81 ns per loop  # this PR, v2
>>> x = pd.array(base_arr)
>>> %timeit infer_dtype(x)
310 ns ± 0.704 ns per loop  # main
582 ns ± 4.73 ns per loop  # this PR, v1
426 ns ± 2.96 ns per loop  # this PR, v2
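
For a quick comparison, the mean timings above can be turned into speedup factors (main divided by v2). The snippet only reuses the numbers reported in this thread and ignores the reported standard deviations:

```python
# Mean timings in nanoseconds: (main, this PR v2), copied from the runs above.
timings_ns = {
    "ndarray": (596, 140),
    "Series": (6830, 725),
    "Series[Int32]": (3920, 765),
    "pd.array": (310, 426),
}

# A factor above 1 means v2 is faster than main; below 1 means slower.
speedups = {name: main / v2 for name, (main, v2) in timings_ns.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

The `pd.array` case comes out below 1, matching the earlier note that this input is somewhat slower than on main.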

jbrockmendel

LGTM

topper-123 · 2023-02-09T13:42:12Z

Gentle ping.

mroeschke · 2023-02-09T17:30:46Z

Thanks @topper-123

topper-123 added 2 commits January 29, 2023 12:17

PERF: improve performance of infer_dtype

803cc68

adds GH-number

d983942

jbrockmendel reviewed Jan 29, 2023

mroeschke added the Performance and Dtype Conversions labels Jan 30, 2023

topper-123 added 2 commits February 5, 2023 10:32

Merge branch 'master' into infer_dtype_performance

fda03bd

version without ABCs

027f290

topper-123 commented Feb 5, 2023

jbrockmendel approved these changes Feb 7, 2023

topper-123 mentioned this pull request Feb 8, 2023

BUG: bug in Index._should_fallback_to_positional #51241

Merged

5 tasks

mroeschke approved these changes Feb 9, 2023

mroeschke added this to the 2.0 milestone Feb 9, 2023

mroeschke merged commit 12faa2e into pandas-dev:main Feb 9, 2023

topper-123 deleted the infer_dtype_performance branch February 9, 2023 21:48
