-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Allow dictionaries to be passed to pandas.Series.str.replace #56175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
rhshadrach
merged 22 commits into
pandas-dev:main
from
rmhowe425:dev/Series/str-replace
Feb 12, 2024
Merged
Changes from all commits
Commits
Show all changes
22 commits
Select commit
Hold shift + click to select a range
b1ea8c2
Adding implementation, unit tests, and documentation updates.
rmhowe425 573e6e4
Fixing code check and parameterizing unit tests.
rmhowe425 f9fe71e
Added additional documentation.
rmhowe425 77a579e
Updating implementation based on reviewer feedback.
rmhowe425 5de632a
Merge branch 'main' into dev/Series/str-replace
rmhowe425 d472b01
Fixing documentation issues.
rmhowe425 eceb234
Attempting to fix double line break.
rmhowe425 5702ea9
Removed string casting for value parameter in call to _str_replace.
rmhowe425 ab28c9e
Merge branch 'main' into dev/Series/str-replace
rmhowe425 3be3c6b
Merge branch 'main' into dev/Series/str-replace
rmhowe425 f01728f
Updating whatsnew to fix merge conflict.
rmhowe425 9ca546b
Merge branch 'main' into dev/Series/str-replace
rmhowe425 591e380
Updated implementation based on reviewer feedback.
rmhowe425 bae43ed
Cleaning up implementation.
rmhowe425 0359039
Merge branch 'main' into dev/Series/str-replace
rmhowe425 f0dcd55
Merge branch 'main' into dev/Series/str-replace
rmhowe425 f626cfb
Moving contribution note to 3.0
rmhowe425 1f50035
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] a3014de
Update accessor.py
rmhowe425 95947a5
Merge branch 'pandas-dev:main' into dev/Series/str-replace
rmhowe425 470a648
Merge branch 'main' into dev/Series/str-replace
rmhowe425 c6f2d3a
Merge branch 'main' into dev/Series/str-replace
rmhowe425 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe you can call _wrap_result just once at the end, rather than inside the for loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhshadrach wouldn't you need the for loop in the case that pat contained multiple key : value pairs of strings to be replaced?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct - I'm not suggesting to remove the for loop entirely. Just to call
self._wrap_result
once after the for loop is done rather than every iteration. If you think this is incorrect, let me know and I can take a closer look.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So after doing a bit of debugging it looks like we need to call _wrap_result after each iteration so we can save the output of our string replace and update
res_output
.self._data is a Series and
_str_replace()
returns an NDArray. Since we can't update self._data.array, we need a container to save the output of our string replace, so we're converting it to a Series using_wrap_result()
and then updating our container.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see - thanks for checking!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than doing the loop here, is there any immediate advantage to passing the dict onto the arrays
_str_replace
method? like avoiding the_wrap_result
IIUC the accessors should only be validating the passed parameters, defining the "pandas string API", providing the documentation and wrapping the array result into a Series.
IMO the implementation should be at array level and then can be overridden if the array types can be optimized or use native methods.
For example, maybe using "._str_map" could be faster for object type and maybe
pyarrow.compute.replace_substring_regex
could be used for arrow backed strings?The array level optimizations need not be in this PR though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good idea, but agreed it need not be here (this is perf neutral compared to the status quo). If not tackled here, we can throw up an issue noting the performance improvement. @rmhowe425 - thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed! I think this idea is deserving of a separate issue. Happy to work that issue as well!