Skip to content

Bug pyarrow implementation of str.fullmatch matches partial string. issue #56652 #56691

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

JackCollins91
Copy link
Contributor

@JackCollins91 JackCollins91 commented Dec 31, 2023

Changed array.py: Makes Series(["abc$abc]).str.fullmatch("abc\$") give the same result as when dtype = pyarrow.string as opposed to dtype = str.

Issue reporter (Issue-#56652) requested change "//$" to "\$", but this resulted in DepreciationWarnng in pytest, so used "\\$" instead.

Change test_arrow.py: updated test_str_fullmatch to account for edge cases where string starts with and ends with literal $ as well as ends with the string.

…ssue-pandas-dev#56652

changed array.py: Makes Series(["abc$abc]).str.fullmatch("abc\\$") give the same result as when dtype = pyarow.string as opposed to dtype = str.

Issue reporter (Issue-pandas-dev#56652) requested change "//$" to "\$", but this resulted in DepreciationWarnng, so used "\\$" instead.

Change test_arrow.py: updated test_str_fullmatch to account for edge cases where string starts with and ends with literal $ as well as ends with the string.
Copy link
Contributor

@asishm asishm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the PR!

Grepping the codebase for '//\$' gives

./pandas/core/arrays/arrow/array.py
./pandas/core/arrays/string_arrow.py

Testing further - it looks like the first affects dtype=pd.ArrowDtype(pa.string()) while the second one affects dtype="string[pyarrow]". The fix should be applied to both of the above files.

Also while the test method states test_str_fullmatch it seems to be testing ser.str.match instead. The fullmatch tests are present in ./pandas/tests/strings/test_find_replace.py

@JackCollins91
Copy link
Contributor Author

JackCollins91 commented Dec 31, 2023

Thanks @asishm , I can shortly make a commit which makes the changes in ./pandas/core/arrays/string_arrow.py
and also updates the fullmatch test in ./pandas/tests/strings/test_find_replace.py

To clarify, from testing, the changes to ./pandas/core/arrays/arrow/array.py also affect test_str_fullmatch even though it does so via str.fullmatch().

Changed: string_arrow: because it shoud be conistent with other changes. Affects circumstance where dtype="string[pyarrow]" as opposed to dtype=pd.ArrowDtype(pa.string())

Changed: test_find_replace.py: Added unit test for the edge cases where the user wants o search for a string that ends in a literal dollar sign. The string_arrow.oy updates effect this also.

Question: For consistency, I formatted test_fullmatch_dollar_literal() to match other unit tests in .py - Should I instead aim to implement @pytest.mark.parametrize, which is in consistent, but more in line with developer guidelines?
@JackCollins91
Copy link
Contributor Author

Updated PR.
Question: For consistency, I formatted test_fullmatch_dollar_literal() to match other unit tests in .py - Should I instead aim to implement @pytest.mark.parametrize, which is inconsistent with its neighbour tests, but more in-line with developer guidelines?

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks could you add a whatsnew note in v2.2.0.rst under the Strings section?

@mroeschke mroeschke added Strings String extension data type and string data Arrow pyarrow functionality labels Jan 2, 2024
@JackCollins91
Copy link
Contributor Author

Thanks @mroeschke that update to v2.2.0.rst is added now.

…ntation-of-str.fullmatch-matches-partial-string.-Issue-pandas-dev#56652
@JackCollins91
Copy link
Contributor Author

Resolved merge conflict by rebasing off upstream main.

@mroeschke mroeschke added this to the 2.2 milestone Jan 3, 2024
@mroeschke mroeschke merged commit dc94b98 into pandas-dev:main Jan 3, 2024
@mroeschke
Copy link
Member

Thanks @JackCollins91

Copy link

lumberbot-app bot commented Jan 3, 2024

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
git checkout 2.2.x
git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
git cherry-pick -x -m1 dc94b987a0c6920c0e14351372a374d566313cef
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
git commit -am 'Backport PR #56691: Bug pyarrow implementation of str.fullmatch matches partial string. issue #56652'
  1. Push to a named branch:
git push YOURFORK 2.2.x:auto-backport-of-pr-56691-on-2.2.x
  1. Create a PR against branch 2.2.x, I would have named this PR:

"Backport PR #56691 on branch 2.2.x (Bug pyarrow implementation of str.fullmatch matches partial string. issue #56652)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

mroeschke pushed a commit to mroeschke/pandas that referenced this pull request Jan 3, 2024
mroeschke added a commit that referenced this pull request Jan 3, 2024
….fullmatch matches partial string. issue #56652) (#56715)

Backport PR #56691: Bug pyarrow implementation of str.fullmatch matches partial string. issue #56652

Co-authored-by: JackCollins91 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Pyarrow implementation of str.fullmatc matches partial string
4 participants