BUG: str.extract Method Not Implemented for pd.ArrowDtype(pa.string()) #56268
Comments
cc @mroeschke what's the status quo here?
Yeah, at the time I never got around to implementing this one (it was tricky since it can return a Series or DataFrame), but it should be available from the pyarrow side with
In 2.2rc0 I get this error:
>>> print(make.str.extract(r'([^a-z A-Z])'))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[114], line 1
----> 1 print(make.str.extract(r'([^a-z A-Z])'))
File ~/.envs/pd22rc/lib/python3.11/site-packages/pandas/core/strings/accessor.py:137, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
132 msg = (
133 f"Cannot use .str.{func_name} with values of "
134 f"inferred dtype '{self._inferred_dtype}'."
135 )
136 raise TypeError(msg)
--> 137 return func(self, *args, **kwargs)
File ~/.envs/pd22rc/lib/python3.11/site-packages/pandas/core/strings/accessor.py:2758, in StringMethods.extract(self, pat, flags, expand)
2755 result = DataFrame(columns=columns, dtype=result_dtype)
2757 else:
-> 2758 result_list = self._data.array._str_extract(
2759 pat, flags=flags, expand=returns_df
2760 )
2762 result_index: Index | None
2763 if isinstance(obj, ABCSeries):
File ~/.envs/pd22rc/lib/python3.11/site-packages/pandas/core/arrays/arrow/array.py:2399, in ArrowExtensionArray._str_extract(self, pat, flags, expand)
2397 groups = re.compile(pat).groupindex.keys()
2398 if len(groups) == 0:
-> 2399 raise ValueError(f"{pat=} must contain a symbolic group name.")
2400 result = pc.extract_regex(self._pa_array, pat)
2401 if expand:
ValueError: pat='([^a-z A-Z])' must contain a symbolic group name.
I got around it with this:
I think this is a regression, as I now have to include a regex group name (which is one of those things I have to look up every time I need to do it).
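To illustrate the check that fails in the traceback above, here is a minimal sketch using only Python's standard `re` module. It mirrors the `re.compile(pat).groupindex` test shown in `_str_extract`: the mapping is empty unless the pattern uses the `(?P<name>...)` syntax. The helper `has_named_group` and the group name `non_letter` are hypothetical, chosen for illustration only.

```python
import re


def has_named_group(pat: str) -> bool:
    """Return True if the pattern contains at least one (?P<name>...) group.

    Hypothetical helper: it mirrors the groupindex check visible in the
    traceback and is not part of pandas.
    """
    return len(re.compile(pat).groupindex) > 0


unnamed = r"([^a-z A-Z])"                  # pattern from the report
named = r"(?P<non_letter>[^a-z A-Z])"      # same pattern with a group name

print(has_named_group(unnamed))
print(has_named_group(named))
```

Under this check, only the second pattern would get past the `ValueError`, so passing a named pattern to `.str.extract` on an Arrow-backed series is the kind of workaround the comment describes.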
Not necessarily a regression, since it didn't work before at all? Am I missing something? cc @mroeschke is this expected with your implementation?
@phofl I mean it worked in pandas 1. You are correct, it didn't work in 2.
Yeah, unfortunately this is a limitation of pyarrow's extract_regex compute method: https://arrow.apache.org/docs/python/generated/pyarrow.compute.extract_regex.html#pyarrow.compute.extract_regex This probably worked in pandas 1 with the object-backed string type, as that iterates through the values using
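The comment above is cut off before it names what the object-backed path iterates with. As a rough sketch only, and assuming for illustration that each value is matched with Python's standard `re` module, per-element extraction with unnamed groups could look like the following. The function `extract_first` is hypothetical and is not the pandas implementation.

```python
import re


def extract_first(values, pat, flags=0):
    """Extract capture groups from the first match in each value.

    Hypothetical sketch: works with unnamed groups because it relies on
    match.groups() rather than on group names.
    """
    regex = re.compile(pat, flags)
    if regex.groups == 0:
        raise ValueError("pattern contains no capture groups")

    rows = []
    for value in values:
        # Missing or non-string values produce a row of None.
        match = regex.search(value) if isinstance(value, str) else None
        if match is None:
            rows.append([None] * regex.groups)
        else:
            rows.append(list(match.groups()))
    return rows


result = extract_first(["Ford F-150", "Toyota", None], r"([^a-z A-Z])")
print(result)
```

Because this approach never asks for group names, an unnamed pattern such as `([^a-z A-Z])` is accepted, which is consistent with the behavior reported for pandas 1.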
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
The str.extract method raises a NotImplementedError when used on a series with the pd.ArrowDtype(pa.string()) data type.
Expected Behavior
Execute like it does with Pandas 1 strings.
Installed Versions