-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
[ArrowStringArray] Use utf8_is_*
functions from Apache Arrow if available
#41041
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
2e26503
0987b0e
68ec0ba
1842826
9e2c11b
7fb72f5
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -13,19 +13,11 @@ | |
) | ||
|
||
|
||
def test_string_array(nullable_string_dtype, any_string_method, request): | ||
def test_string_array(nullable_string_dtype, any_string_method): | ||
method_name, args, kwargs = any_string_method | ||
if method_name == "decode": | ||
pytest.skip("decode requires bytes.") | ||
|
||
if nullable_string_dtype == "arrow_string" and method_name in { | ||
"extract", | ||
"extractall", | ||
}: | ||
reason = "extract/extractall does not yet dispatch to array" | ||
mark = pytest.mark.xfail(reason=reason) | ||
request.node.add_marker(mark) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the tests no longer fail. there is a change in this PR that "fixes" by special casing ArrowStringArray (like StringArray). extract/extractall will still need to be updated to dispatch to the array. not in this PR. see #41041 (comment) the change to pandas/core/strings/accessor.py makes ArrowStringArray work like StringArray. If we don't add the change I would need to xfail test_empty_str_methods which totally defeats the purpose of parameterising the tests to get extra test coverage for the is_methods. The alternative is to split test_empty_str_methods and xfail the extract/extractall tests There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. OK, thanks for the explanation. No need to split off, I was just wondering how the changes in this PR "fixed" it ;) |
||
|
||
data = ["a", "bb", np.nan, "ccc"] | ||
a = Series(data, dtype=object) | ||
b = Series(data, dtype=nullable_string_dtype) | ||
|
@@ -93,15 +85,10 @@ def test_string_array_boolean_array(nullable_string_dtype, method, expected): | |
tm.assert_series_equal(result, expected) | ||
|
||
|
||
def test_string_array_extract(nullable_string_dtype, request): | ||
def test_string_array_extract(nullable_string_dtype): | ||
# https://github.com/pandas-dev/pandas/issues/30969 | ||
# Only expand=False & multiple groups was failing | ||
|
||
if nullable_string_dtype == "arrow_string": | ||
reason = "extract does not yet dispatch to array" | ||
mark = pytest.mark.xfail(reason=reason) | ||
request.node.add_marker(mark) | ||
|
||
a = Series(["a1", "b2", "cc"], dtype=nullable_string_dtype) | ||
b = Series(["a1", "b2", "cc"], dtype="object") | ||
pat = r"(\w)(\d)" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point (not necessarily this PR), it might be worth benchmarking to see if calling
pc.string_is_ascii
first to then potentially usepc.ascii_is_alnum
instead ofpc.utf8_is_alnum
could be worth it (which would be assuming that testing whether it's all ascii takes much less time than the benefit from using the faster ascii algorithm vs the utf8 one)