ENH: Support min/max on ArrowStringArray #42597
Comments
Hi Mrocklin, I am pretty new to open-source contributions, but I can give this a try! |
Looks like pyarrow doesn't know how to do min / max on string data yet:
In [12]: import pyarrow.compute
In [13]: pyarrow.compute.min_max(s.array._data)
---------------------------------------------------------------------------
ArrowNotImplementedError Traceback (most recent call last)
<ipython-input-13-22066b22cdbd> in <module>
----> 1 pyarrow.compute.min_max(s.array._data)
~/miniconda3/envs/pandas=1.3.0/lib/python3.9/site-packages/pyarrow/compute.py in min_max(array, options, memory_pool, **kwargs)
~/miniconda3/envs/pandas=1.3.0/lib/python3.9/site-packages/pyarrow/_compute.pyx in pyarrow._compute.Function.call()
~/miniconda3/envs/pandas=1.3.0/lib/python3.9/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
~/miniconda3/envs/pandas=1.3.0/lib/python3.9/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowNotImplementedError: Function min_max has no kernel matching input types (array[string])
@simonjayhawkins remind me, do we have a general policy on whether Arrow StringDtype will cast and fall back to a slow mode, or do we require users to do it explicitly? |
Is there someone on the Arrow side that we should ping? |
I was going to file a JIRA at https://issues.apache.org/jira/login.jsp but I don't have my login saved on this machine :) |
Yes, we basically make ArrowStringArray match StringArray behaviour, falling back where needed. Users do not need to do anything special. There are a few xfailed tests for ArrowStringArray that did not get fixed in time for 1.3.0; min/max is one of those. IIRC I might have fixed it in #40962. Will look soon. |
I'll relabel as a bug and we will backport the fix. |
Does pyarrow 4.0.1 support this? |
mistaken, that PR only changed the error to be a |
Hmm, I might open an issue to revisit that. Those kinds of silent performance cliffs are hard to debug. IMO, we should wait for / help PyArrow implement the efficient kernels and then wrap them here, without the expensive conversion to Python objects. |
+1 . For my very narrow use case I'm very much in favor of focusing on adding min/max than on fallback behavior. I'm not looking at the whole picture though. |
I think we discussed somewhere about adding performance warnings when object fallback was in play. will try to find it. |
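To make the trade-off discussed above concrete, here is a minimal, pure-Python sketch of "fall back to objects, but warn". It is not pandas' implementation: `native_min_max`, `min_max_with_fallback`, and the local `PerformanceWarning` class are hypothetical names invented for illustration (pandas ships its own `pandas.errors.PerformanceWarning`), and the stand-in kernel simply raises for strings to mimic the missing Arrow kernel.

```python
import warnings


class PerformanceWarning(Warning):
    """Hypothetical warning category used only in this sketch."""


def native_min_max(values):
    """Stand-in for a fast kernel that has no string implementation."""
    if any(isinstance(v, str) for v in values):
        raise NotImplementedError("no kernel matching input types (array[string])")
    present = [v for v in values if v is not None]
    if not present:
        return None, None
    return min(present), max(present)


def min_max_with_fallback(values, kernel=native_min_max):
    """Try the fast kernel; if it is missing, warn and use Python objects."""
    try:
        return kernel(values)
    except NotImplementedError:
        # Make the performance cliff visible instead of silent.
        warnings.warn(
            "min/max is falling back to a slow object-based path",
            PerformanceWarning,
            stacklevel=2,
        )
        present = [v for v in values if v is not None]  # skip missing values
        if not present:
            return None, None
        return min(present), max(present)
```

With this shape, callers still get a result for string data, and anyone who wants to find the slow path can escalate the warning to an error with `warnings.simplefilter("error", PerformanceWarning)`.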
our min supported version of PyArrow for ArrowStringArray is 1.0.0 and so the kernels guaranteed available are limited. |
At one point we had #39908 (comment) to make this apparent to users, but it was removed from the release notes and has not yet been added elsewhere in the documentation. |
it does not support
I've not done a comprehensive performance analysis, but |
i suspect we could improve the performance for larger arrays using pc.partition_nth_indices |
IMO, it's not worth rushing a short-term fix. I opened https://issues.apache.org/jira/browse/ARROW-13410. I have no sense for how difficult this would be to implement in Arrow, but if there's already support for sorting then perhaps min / max won't be too difficult. |
I'll change the milestone to "no action" for now as there now does not seem to be an appetite to continue down the object fallback path #42613. (personally I think the object fallback would help adoption) |
This works now on main and 1.5. I believe this was fixed by #47730 when |
Motivation
In order for Dask to perform large shuffles (set_index, join on a non-index column, ...) on a column it needs to be able to compute quantiles.
To do this it is useful to compute min/max values.
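As a rough illustration of why min/max matter here, the sketch below picks partition boundaries from a sample of string keys and routes values to output partitions. It is a simplified stand-in, not Dask's actual algorithm: `divisions_from_sample` and `assign_partition` are hypothetical helpers, and the boundaries come from a plain sort rather than an approximate quantile computation.

```python
from bisect import bisect_right


def divisions_from_sample(sample, npartitions):
    """Pick npartitions + 1 boundaries: the min, evenly spaced quantiles, the max."""
    ordered = sorted(sample)
    last = len(ordered) - 1
    return [ordered[(i * last) // npartitions] for i in range(npartitions + 1)]


def assign_partition(value, divisions):
    """Return the index of the output partition that should receive value."""
    # Only the interior boundaries split the key range;
    # the first and last entries (min and max) just bound it.
    interior = divisions[1:-1]
    return bisect_right(interior, value)


names = ["kiwi", "apple", "fig", "banana", "pear", "cherry", "plum", "date"]
divisions = divisions_from_sample(names, npartitions=2)
```

The first and last boundaries are exactly the column's min and max, which is why a shuffle on a `string[pyarrow]` column needs those reductions to work.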
What actually breaks
When I try to do this on columns of type
string[pyarrow]
I get the following exception
Solution
I am hopeful that Arrow already has a min/max implementation and it just hasn't been hooked up yet.