-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
[ArrowStringArray] Use utf8_is_*
functions from Apache Arrow if available
#41041
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ArrowStringArray] Use utf8_is_*
functions from Apache Arrow if available
#41041
Conversation
Cool!
I opened #41051. In the end it's only a modest speed-up (3x), but I don't think we can improve further ourselves. Also compared to the actual string operation, I assume this conversion will be fast enough. |
with the changes in #41051 now actually see an improvement.
|
@jorisvandenbossche i'll leave this as draft till #41051 is merged. |
I merged #41051, so you can update this now |
Thanks @jorisvandenbossche Have also paramaterised some more tests for the |
|
@@ -758,6 +759,69 @@ def _str_map(self, f, na_value=None, dtype: Dtype | None = None): | |||
# -> We don't know the result type. E.g. `.get` can return anything. | |||
return lib.map_infer_mask(arr, f, mask.view("uint8")) | |||
|
|||
def _str_isalnum(self): | |||
if hasattr(pc, "utf8_is_alnum"): | |||
result = pc.utf8_is_alnum(self._data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point (not necessarily this PR), it might be worth benchmarking to see if calling pc.string_is_ascii
first to then potentially use pc.ascii_is_alnum
instead of pc.utf8_is_alnum
could be worth it (which would be assuming that testing whether it's all ascii takes much less time than the benefit from using the faster ascii algorithm vs the utf8 one)
""" | ||
from pandas.core.arrays.string_arrow import ArrowStringDtype # noqa: F401 | ||
|
||
return request.param |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A similar fixture does not yet exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not yet. this is being added to a couple of PRs and can be promoted to a higher level conftest once one of the PRs has been merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, sounds good
Looking good, thanks for the benchmarks! |
}: | ||
reason = "extract/extractall does not yet dispatch to array" | ||
mark = pytest.mark.xfail(reason=reason) | ||
request.node.add_marker(mark) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is extract
fixed now? (but not in this PR?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the tests no longer fail. there is a change in this PR that "fixes" by special casing ArrowStringArray (like StringArray). extract/extractall will still need to be updated to dispatch to the array. not in this PR. see #41041 (comment)
the change to pandas/core/strings/accessor.py makes ArrowStringArray work like StringArray. If we don't add the change I would need to xfail test_empty_str_methods which totally defeats the purpose of parameterising the tests to get extra test coverage for the is_methods. The alternative is to split test_empty_str_methods and xfail the extract/extractall tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks for the explanation. No need to split off, I was just wondering how the changes in this PR "fixed" it ;)
I don't expect any problem (since you check the presence of the attribute), but can you merge latest master (I just merged a PR to fix CI build with pyarrow 0.15- |
Thanks! |
xref xhochy/fletcher#203
marked as draft since AFAICT there is performance issues with BooleanDtype().from_arrow