-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Series.str.find fix for pd.ArrowDtype(pa.string()) #56792
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add a test for a Series with different length strings to ensure the custom implementation is robust to this case?
Extended e2e test to cover strings of different lengths. |
This reverts commit 7fa21eb.
@@ -2368,20 +2368,27 @@ def _str_fullmatch( | |||
return self._str_match(pat, case, flags, na) | |||
|
|||
def _str_find(self, sub: str, start: int = 0, end: int | None = None) -> Self: | |||
if start != 0 and end is not None: | |||
if (start == 0 or start is None) and end is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any chance you've looked at using a slice object here already? It feels like that could help simplify a lot of the branching being done here for things being None / 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think a slice object would help here, since I believe for this method, we need a positive start index, to calculate the offset of the substring from the original string. This PR essentially converts start into equivalent, but positive index, then it can be added to the resulting index returned by pc.find_substring. For example, if string is "abc" and start is -1, we calculate start_offset as 2, which is the equivalent positive index. We can use this offset to find the index of the substring in the original string, by adding it the position of the substring result.
Any thoughts on this PR? I agree conditional code is not good, but I don't have a better solution with current arrow compute surface area. Should I close this PR if this solution is not desirable? |
Sorry this PR slipped through. I am OK with the changes in this PR could you merge in main once more? |
Merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also lgtm. I don't think ci is related? @mroeschke merge whenever
Thanks @rohanjain101 |
* fix find * gh reference * add test for Nones * fix min version compat * restore test * improve test cases * fix empty string * inline * improve tests * fix * Revert "fix" This reverts commit 7fa21eb. * fix * merge * inline --------- Co-authored-by: Rohan Jain <[email protected]>
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.