PERF: ensure_string_array with non-numpy input array #37371
Conversation
wow! I assume we have sufficient benchmarks?
pls either add a whatsnew (or since this is similar to others that you have recently done for this type of function, ok to just add this PR number there). ping on green.
pandas/_libs/lib.pyx (Outdated)

```diff
     for i in range(n):
-        val = arr[i]
+        arr_val = arr[i]
```
does arr_val and result_val need to be in cdef?
Specifically for Categorical, we can actually get something (possibly) faster by special-casing it to astype only the categories (this is also something that can work in general for other dtypes as well).
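The idea above can be sketched roughly as follows. This is not the PR's actual Cython code, just a minimal Python illustration of the special case: convert the (small) set of unique categories once, then re-expand via the integer codes. The helper name `categorical_to_string_objects` is made up for this sketch.

```python
import numpy as np
import pandas as pd

def categorical_to_string_objects(cat: pd.Categorical) -> np.ndarray:
    # Convert only the categories (the small set of unique values) ...
    str_cats = np.array([str(x) for x in cat.categories], dtype=object)
    # ... then expand via the integer codes; code -1 marks a missing value.
    codes = np.asarray(cat.codes)
    result = np.empty(len(cat), dtype=object)
    mask = codes == -1
    result[~mask] = str_cats[codes[~mask]]
    result[mask] = pd.NA
    return result

cat = pd.Categorical(["a", None, "b", "a"])
out = categorical_to_string_objects(cat)
```

The work done per unique category is O(n_categories) rather than O(n_rows), which is where the speedup for low-cardinality data would come from.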
The underlying problem is in the string conversion, so it should be fixed there:

```
In [1]: n = 50_000
   ...: cat = pd.Categorical([*['a', pd.NA] * n])

In [2]: %timeit pd.array(cat, dtype="string")  # same underlying problem
508 ms ± 9.6 ms per loop   # master
6.03 ms ± 184 µs per loop  # this PR
```
I think the failures are unrelated. I'll update later with the whatsnew and timing runs, if needed.
Force-pushed from b4ee319 to 0a58696.
I've added this in the whatsnew, but this actually fixes a perf regression coming from #36464 (the change in pandas/_libs/lib.pyx).
possibly, but this is a separate issue and orthogonal here.
pandas/_libs/lib.pyx (Outdated)

```diff
 @@ -651,6 +651,13 @@ cpdef ndarray[object] ensure_string_array(
     cdef:
         Py_ssize_t i = 0, n = len(arr)

+    from pandas.core.dtypes.common import is_extension_array_dtype
+
+    if is_extension_array_dtype(arr):
```
can we avoid doing this as it circularizes things; you can check if it has a .to_numpy method, or maybe @jbrockmendel has a better way here.
Ok, I've changed to use hasattr(arr, "to_numpy").
Force-pushed from 0a58696 to d9f8e6e.
thanks @topper-123
```
    if hasattr(arr, "to_numpy"):
        arr = arr.to_numpy()
    elif not isinstance(arr, np.ndarray):
        arr = np.array(arr, dtype="object")
```
this should probably be asarray
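For context on the `asarray` suggestion, a quick illustration of the difference: `np.array` copies its input by default, while `np.asarray` returns the input unchanged when it is already an ndarray of the requested dtype.

```python
import numpy as np

a = np.array(["x", "y"], dtype=object)
copied = np.array(a, dtype=object)     # np.array copies by default
aliased = np.asarray(a, dtype=object)  # np.asarray returns the same object here
print(copied is a, aliased is a)       # False True
```

For a function that is often handed an object ndarray already, `asarray` avoids a needless copy on the common path.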
not a big deal, but my preference would have been to type the input arr as an ndarray and handle non-ndarray inputs in the calling function
Currently, if the input array to ensure_string_array is a Python object (e.g. an ExtensionArray), the function will constantly switch between Python- and Cython-level code, which is slow. This PR fixes that by ensuring we always have a numpy array, avoiding the trips to Python-level code.
xref #35519, #36464.
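The coercion step the PR adds can be sketched in plain Python. This is an illustrative helper under the same idea as the diff above (the real change lives in the Cython function `ensure_string_array`; the name `ensure_object_ndarray` here is made up):

```python
import numpy as np
import pandas as pd

def ensure_object_ndarray(arr):
    # ExtensionArrays (and Series/Index) expose .to_numpy(); prefer that,
    # so the hot loop below only ever sees a NumPy array.
    if hasattr(arr, "to_numpy"):
        arr = arr.to_numpy()
    # Fall back to np.asarray for other sequences; this is a no-op copy-wise
    # when arr is already an object ndarray.
    if not isinstance(arr, np.ndarray):
        arr = np.asarray(arr, dtype=object)
    return arr

ea_out = ensure_object_ndarray(pd.array([1, 2, None], dtype="Int64"))
list_out = ensure_object_ndarray([1, "a", None])
```

Once the input is a plain ndarray, the element loop stays in Cython without bouncing through `ExtensionArray.__getitem__` on every iteration, which is the perf win the PR description refers to.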