PERF: ensure_string_array with non-numpy input array #37371
Conversation
wow! I assume we have sufficient benchmarks?
pls either add a whatsnew (or since this is similar to others that you have recently done for this type of function, ok to just add this PR number there). ping on green.
pandas/_libs/lib.pyx (Outdated)

```diff
     for i in range(n):
-        val = arr[i]
+        arr_val = arr[i]
```
does arr_val and result_val need to be in cdef?
Specifically for Categorical, we can actually get something (possibly) faster by special-casing it to astype only the categories (this is also something that can work in general for other dtypes as well).
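The idea above can be sketched roughly as follows. This is not the PR's actual Cython code, just a minimal Python illustration of the special case: convert the (small) set of unique categories once, then re-expand via the integer codes. The helper name `categorical_to_string_objects` is made up for this sketch.

```python
import numpy as np
import pandas as pd

def categorical_to_string_objects(cat: pd.Categorical) -> np.ndarray:
    # Convert only the categories (the small set of unique values) ...
    str_cats = np.array([str(x) for x in cat.categories], dtype=object)
    # ... then expand via the integer codes; code -1 marks a missing value.
    codes = np.asarray(cat.codes)
    result = np.empty(len(cat), dtype=object)
    mask = codes == -1
    result[~mask] = str_cats[codes[~mask]]
    result[mask] = pd.NA
    return result

cat = pd.Categorical(["a", None, "b", "a"])
out = categorical_to_string_objects(cat)
```

The work done per unique category is O(n_categories) rather than O(n_rows), which is where the speedup for low-cardinality data would come from.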
The underlying problem is in the string conversion, so it should be fixed there:

```
In [1]: n = 50_000
   ...: cat = pd.Categorical([*['a', pd.NA] * n])

In [2]: %timeit pd.array(cat, dtype="string")  # same underlying problem
508 ms ± 9.6 ms per loop   # master
6.03 ms ± 184 µs per loop  # this PR
```
I think the failures are unrelated. I'll update later with the whatsnew and timing runs, if needed.
Force-pushed from b4ee319 to 0a58696.
I've added this in the whatsnew, but this actually fixes a perf regression coming from #36464 (the change in pandas/_libs/lib.pyx).
possibly, but this is a separate issue and orthogonal here.
pandas/_libs/lib.pyx (Outdated)

```diff
 @@ -651,6 +651,13 @@ cpdef ndarray[object] ensure_string_array(
     cdef:
         Py_ssize_t i = 0, n = len(arr)

+    from pandas.core.dtypes.common import is_extension_array_dtype
+
+    if is_extension_array_dtype(arr):
```
can we avoid doing this as it circularizes things; you can check if it has a .to_numpy method, or maybe @jbrockmendel has a better way here.
Ok, I've changed to use hasattr(arr, "to_numpy").
Force-pushed from 0a58696 to d9f8e6e.
thanks @topper-123
```
    if hasattr(arr, "to_numpy"):
        arr = arr.to_numpy()
    elif not isinstance(arr, np.ndarray):
        arr = np.array(arr, dtype="object")
```
this should probably be asarray
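For context on the `asarray` suggestion, a quick illustration of the difference: `np.array` copies its input by default, while `np.asarray` returns the input unchanged when it is already an ndarray of the requested dtype.

```python
import numpy as np

a = np.array(["x", "y"], dtype=object)
copied = np.array(a, dtype=object)     # np.array copies by default
aliased = np.asarray(a, dtype=object)  # np.asarray returns the same object here
print(copied is a, aliased is a)       # False True
```

For a function that is often handed an object ndarray already, `asarray` avoids a needless copy on the common path.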
not a big deal, but my preference would have been to type the input arr as an ndarray and handle non-ndarray inputs in the calling function
Currently, if the input array to ensure_string_array is a Python object (e.g. an ExtensionArray), the function will constantly switch between Python- and Cython-level code, which is slow. This PR fixes that by ensuring we always have a numpy array, avoiding the trips to Python-level code.
xref #35519, #36464.
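The coercion step the PR adds can be sketched in plain Python. This is an illustrative helper under the same idea as the diff above (the real change lives in the Cython function `ensure_string_array`; the name `ensure_object_ndarray` here is made up):

```python
import numpy as np
import pandas as pd

def ensure_object_ndarray(arr):
    # ExtensionArrays (and Series/Index) expose .to_numpy(); prefer that,
    # so the hot loop below only ever sees a NumPy array.
    if hasattr(arr, "to_numpy"):
        arr = arr.to_numpy()
    # Fall back to np.asarray for other sequences; this is a no-op copy-wise
    # when arr is already an object ndarray.
    if not isinstance(arr, np.ndarray):
        arr = np.asarray(arr, dtype=object)
    return arr

ea_out = ensure_object_ndarray(pd.array([1, 2, None], dtype="Int64"))
list_out = ensure_object_ndarray([1, "a", None])
```

Once the input is a plain ndarray, the element loop stays in Cython without bouncing through `ExtensionArray.__getitem__` on every iteration, which is the perf win the PR description refers to.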