Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
ENH: added regex argument to Series.str.split #44185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: added regex argument to Series.str.split #44185
Changes from all commits
f55c968
e56f8fb
282ef51
2c5402e
609a77f
837427b
7523b1b
e1a0aa7
d7b3d8e
20dc2a6
03eaa90
0b139f3
1604915
8312d79
a82639c
76e6001
2c43fb5
e95416d
2ed7980
5f0d8df
ba812a1
e2da861
2855fa8
ed37375
057dcfb
ece00f1
b6bbf3e
27ffee7
b97ebe9
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be nice to factor this logic of pattern stuff into a common function to share with str_replace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas/pandas/core/strings/object_array.py
Lines 152 to 160 in b0992ee
Are you referring to this part? Unfortunately, str_replace and str_split handle arguments quite differently. str_replace has two additional arguments
case
andflags
, while str_split does not. str_split handles the case whenregex=None
by using the weird logic withlen(pat)
, while str_replace does not.In a set of another PRs, I can do the following to str_split
deprecation of value dependent regex determination
currently, when
regex==None
,len(pat)==1
is handled as a literal string andlen(pat)!=1
is handled as a regex.addition of
case
andflag
argumentscommon handling of logic with str_replace
But for now, I think str_replace and str_split are not similar enough to share a common logic handling function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then shouldn't we ad
case
andflags
and make these consistent? what do we need to deprecate here?yes there is weird logic by using
len(pat)
but this is new logic you are adding no?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not add the len(pat) logic - it has always been there. I purposely didn't cut it out to maintain current behavior. I think that we should remove it in the future.
pandas/pandas/core/strings/object_array.py
Lines 317 to 325 in 2fa2d5c
In this PR, I am simply adding the regex flag, as requested by several issues. I can go ahead and add the
case
andflags
if you think that is a good idea.Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw that you didn't change the current
len(pat)
logic, which I don't like either. Maybe it is better if we are adding an explicitregex
argument to permit only True, False, and remove thelen(pat)
logic, which only confuses things.Of course need a release note and a versionchanged note.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to get rid of the len(pat) logic and thus the
regex=None
option, but I didn't want to introduce a breaking change.If you guys want me to go ahead, what should the
regex
default be? I don't want to break anyone's existing code. For example, if we makeregex
default toTrue
, it will break someone's code splitting on a period (pat="."
)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback what do you think on breaking change here. I would be inclined to set the default to
True
, since it uses Regex in every case where the pattern length is is greater than 1, and in most cases where the pattern length is 1, then regex would give the same result as string anyway, so it might only break in a few cases.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I agree that removing this logic would be great, I'd be worried about changing the default to
True
without a deprecation. While most cases won't be affected like you mention, I'd imagine many users (who may not even know what a regex is) often split on punctuation like "." or "-". Many of the issues referenced here (or others forreplace
) stem from the user not realizing that the pattern is being treated as regex - I'd guess we'd see a bunch more of those if the default is changedThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this is a new method we could change things, but yeah maybe this is too much for now.
ok what i think we should do is this.
factor to a common method and use it here. when we deprecate this it will deprecate in both places (yes even though this is a new method it is fine). its at least consistent.