Skip to content

ENH: Allow ArrowDtype(pa.string()) to be compatable with str accessor #51207

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 29 commits into from
Feb 16, 2023

Conversation

mroeschke
Copy link
Member

@mroeschke mroeschke commented Feb 7, 2023

Reboot of #50325

Notable change was that pandas.core.strings.BaseStringArrayMethods in the __init__.py was causing a circular import when used in ArrowExtensionArray, so needed to remove the import into __init__.py. This breaks one of our dask tests but it has been fixed downstream dask/dask#9907

@mroeschke mroeschke added Strings String extension data type and string data Arrow pyarrow functionality labels Feb 7, 2023
@jbrockmendel
Copy link
Member

Looks like dask-related doctests are failing. do we need to pin a version somewhere?

@@ -405,7 +406,7 @@ def _reconstruct_ea_result(
"""
dtype: BaseMaskedDtype | StringDtype

if isinstance(values.dtype, StringDtype):
if is_string_dtype(values.dtype):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesnt make a difference here bc we know we have EA, but we should have a func like is_string_dtype that excludes object

@@ -96,6 +96,7 @@
)

if TYPE_CHECKING:
from pandas.core.arrays.string_ import StringDtype
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the annotation that uses this may no longer be accurate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Added to | ArrowDtype to the annotation where this is used

@mroeschke
Copy link
Member Author

I think the docbuild will fail until dask releases a new version.

@jrbourbeau typically how often does dask release a new version?

@jrbourbeau
Copy link
Contributor

@jrbourbeau typically how often does dask release a new version?

We release every two weeks. In fact, the latest release was pushed out a couple of hours ago

@mroeschke
Copy link
Member Author

Oh perfect looks like this PR picked up the dask update and the tests are passing now!

@mroeschke mroeschke added this to the 2.0 milestone Feb 11, 2023
@mroeschke
Copy link
Member Author

mroeschke commented Feb 15, 2023

Any other comments here? (Failures are unrelated)

)
def test_str_unsupported_methods(method, args):
ser = pd.Series(["abc", None], dtype=ArrowDtype(pa.string()))
with pytest.raises(NotImplementedError, match=f"str.{method} not supported."):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the message be string[pyarrow]-specific?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I noted it specifically with pd.ArrowDtype(pa.string())

def _str_fullmatch(
self, pat, case: bool = True, flags: int = 0, na: Scalar | None = None
):
if not pat.endswith("$") or pat.endswith("//$"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see this is already in the string_arrow code, but why "//$"?

Copy link
Member Author

@mroeschke mroeschke Feb 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, as you mentioned I just copied it over from the string_arrow code

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Best guess is it should be backslashes, meant to exclude escaped dollar signs. OK to consider out of scope.

Copy link
Member

@jbrockmendel jbrockmendel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mroeschke mroeschke merged commit 0796d9d into pandas-dev:main Feb 16, 2023
@mroeschke mroeschke deleted the enh/str/arrow branch February 16, 2023 02:07
@phofl
Copy link
Member

phofl commented Feb 16, 2023

Nice! thx

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants