-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REF (string): de-duplicate ArrowStringArray methods #59555
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Could we do it the other way around? i.e. have the method in ArrowExtensionArray point to the one defined in ArrowStringArray? |
I think the thing to do is put it in the ArrowStringArrayMixin. I'm vaguely planning to move all the relevant stuff up there after the current crop of branches (some still local) are done. |
If we're doing this moving of methods away from ArrowExtensionArray, using |
This REF is behavior-neutral
…On Wed, Aug 21, 2024 at 10:38 AM Matthew Roeschke ***@***.***> wrote:
If we're doing this moving of methods away from ArrowExtensionArray, using
pandas.ArrowDtype(pyarrow.string()) will continue to use NA semantics
correct?
—
Reply to this email directly, view it on GitHub
<#59555 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AB5UM6BAA2VILEGLUIJ6H33ZSTGA5AVCNFSM6AAAAABMYUEVH2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMBSGYZDIMBUGY>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
If that's the plan, my preference would be do to it at once, instead of spread over multiple PRs. I know we generally want smaller more granular PRs, but given that this will need to be backported, doing this in one go instead of multiple steps will make that easier. |
Also a question on the mixin. What's the reason we need the mixin? (compared to keeping things in ArrowStringArray itself) Right now ArrowStringArray doesn't actually inherit from the mixin, but uses explicit Or is the idea that ArrowStringArray(NumpySemantics) eventually actually inherits from ArrowStringArrayMixin? |
It is not really needed. ArrowEA subclasses the mixin and ArrowStringArray subclasses ArrowEA. So all the methods could go directly on ArrowEA, its just that class is already really bulky. Putting the methods directly on ArrowStringArray and reversing the inheritance would surely be possible but that would be more invasive. |
Ah, yes, I forgot about ArrowStringArray subclassing ArrowExtensionArray and therefore also subclassing the mixin. That means we should either add the methods explicitly with |
Yes, one of my local de-dup branches does this for the relevant methods. |
pandas/core/arrays/string_arrow.py
Outdated
_str_strip = ArrowExtensionArray._str_strip | ||
_str_lstrip = ArrowExtensionArray._str_lstrip | ||
_str_rstrip = ArrowExtensionArray._str_rstrip | ||
_str_removesuffix = ArrowStringArrayMixin._str_removesuffix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_str_removesuffix
is also assigned in ArrowStringArrayNumpySemantics
below. That can then be removed there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, will update
If we deduplicate those methods, my preference would be to directly move those methods from ArrowEA to the mixin, and get them from there in ArrowStringArray. Or is there a reason we do not do that here, but use that pattern (shared methods in the mixin instead of in ArrowEA) for other methods? |
b32e828
to
3d9366c
Compare
839f8d2
to
2bbf515
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Some small comments
pandas/core/arrays/string_arrow.py
Outdated
else: | ||
result = pc.utf8_rtrim(self._pa_array, characters=to_strip) | ||
return type(self)(result) | ||
return ArrowExtensionArray._str_slice(self, start=start, stop=stop, step=step) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return ArrowExtensionArray._str_slice(self, start=start, stop=stop, step=step) | |
return ArrowStringArrayMixin._str_slice(self, start=start, stop=stop, step=step) |
?
I see you didn't actually move _str_slice
so the above wouldn't work. But could you move that as well? (most other methods you are touching here are now in the mixin class I think)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also leave out slice because you are fixing that in a separate PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, will update after the slice PR goes in
pandas/core/arrays/string_arrow.py
Outdated
removed = pc.utf8_slice_codeunits(self._pa_array, len(prefix)) | ||
result = pc.if_else(starts_with, removed, self._pa_array) | ||
return type(self)(result) | ||
return ArrowExtensionArray._str_removeprefix(self, prefix) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return ArrowExtensionArray._str_removeprefix(self, prefix) | |
return ArrowStringArrayMixin._str_removeprefix(self, prefix) |
Just for consistency, because for the direct assignments we point to ArrowStringArrayMixin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, will update
pandas/core/arrays/string_arrow.py
Outdated
_str_slice_replace = ArrowStringArrayMixin._str_slice_replace | ||
_str_len = ArrowStringArrayMixin._str_len | ||
|
||
_rank = ArrowExtensionArray._rank |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is strictly speaking not needed? (because it is only otherwise defined on the base EA class, so ArrowEA comes first in the method resolution order)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, will remove
7f891c7
to
f970e3f
Compare
There are a couple of failures after the latest update |
Those failures are happening on other PRs as well, I think, so merging. |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.