Skip to content

ENH: added regex argument to Series.str.split #44185

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 29 commits into from
Nov 4, 2021

Conversation

saehuihwang
Copy link
Contributor

@saehuihwang saehuihwang commented Oct 26, 2021

I've preserved current behavior, in which regex = None. Currently, it handles the pattern as a regex if the length of pattern is not 1. I believe that in the future, this may be worth considering deprecating.

@pep8speaks
Copy link

pep8speaks commented Oct 26, 2021

Hello @saehuihwang! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-11-02 23:23:40 UTC

Copy link
Contributor

@attack68 attack68 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of quick comments @saehuihwang

@@ -657,7 +657,7 @@ def cat(self, others=None, sep=None, na_rep=None, join="left"):

Parameters
----------
pat : str, optional
pat : str, or compiled regex optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pat : str, or compiled regex optional
pat : str or compiled regex, optional

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this! change has been made

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@attack68 sorry, I had missed the first comma. change has now been made according to your suggestion. Sorry for missing it earlier

Comment on lines 789 to 792
>>> s = pd.Series(['fooojpgbar.jpg'])
>>> s.str.split(r".", expand=True)
0 1
0 fooojpgbar jpg
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a fan of this new example, seems confusing, and not sure what difference it is showing. Can it be simplified to better demonstrate the new features? I quite like the original actually, although text based like:

s = pd.Series(["foo and bar plus baz"])
s.str.split(r"and|plus", expand=True)
    0   1   2
0 foo bar baz

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @attack68, thanks for reviewing my pull request!

I assumed that a lot of people use st.split to handle urls and file names, so I wanted to include a similar example that includes a ".xxx" at the end. The new set of examples also helps illustrate "." be used as a regex and as a regular string depending on the regex flag.
What do you think? I can go ahead and include the original examples but also include a few more examples that illustrate the new feature?

@@ -34,6 +34,27 @@ def test_split(any_string_dtype):
exp = Series([["a", "b", "c"], ["c", "d", "e"], np.nan, ["f", "g", "h"]])
tm.assert_series_equal(result, exp)

# explicit regex = True split
values = Series("qweqwejpgqweqwe.jpg", dtype=any_string_dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a less confusing test to work with?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the test from the original issue #43563 so I thought that it might be best to include the same string. Do you think it's better to use a simpler one, perhaps the ones I put in the examples section of the doc?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have tests from all the issues you are closing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the test for #43563 is included, #32835 is a feature request, and #25549 is a doc issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the test from the original issue #43563 so I thought that it might be best to include the same string. Do you think it's better to use a simpler one, perhaps the ones I put in the examples section of the doc?

My philosophy is that you can use either the same test, or modified slightly or a better test depending upon whether it suits the purpose. Whilst a subjective opinion, the text qweqwejpgqweqwe.jpg is obfuscatingly confusing and a modification to this would make the test more readable yet still applicable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! It was a bit of a struggle to come up with a good test string haha. The best I came up with was "xxxjpgzzz.jpg". Let me know if you have better ideas

@@ -669,6 +669,16 @@ def cat(self, others=None, sep=None, na_rep=None, join="left"):
* If ``True``, return DataFrame/MultiIndex expanding dimensionality.
* If ``False``, return Series/Index, containing lists of strings.

regex : bool, default None
Determines whether to handle the pattern as a regular expression.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a versionadded 1.4 tag

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay!

@@ -34,6 +34,27 @@ def test_split(any_string_dtype):
exp = Series([["a", "b", "c"], ["c", "d", "e"], np.nan, ["f", "g", "h"]])
tm.assert_series_equal(result, exp)

# explicit regex = True split
values = Series("qweqwejpgqweqwe.jpg", dtype=any_string_dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have tests from all the issues you are closing?

@jreback jreback added Enhancement Strings String extension data type and string data labels Oct 28, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. if you can do that small refactor in this PR would be great.

cc @simonjayhawkins @Dr-Irv if any comments

n=-1,
expand=False,
regex: bool | None = None,
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice to factor this logic of pattern stuff into a common function to share with str_replace

Copy link
Contributor Author

@saehuihwang saehuihwang Oct 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if case is False:
# add case flag, if provided
flags |= re.IGNORECASE
if regex or flags or callable(repl):
if not isinstance(pat, re.Pattern):
if regex is False:
pat = re.escape(pat)
pat = re.compile(pat, flags=flags)

Are you referring to this part? Unfortunately, str_replace and str_split handle arguments quite differently. str_replace has two additional arguments case and flags, while str_split does not. str_split handles the case when regex=None by using the weird logic with len(pat), while str_replace does not.

In a set of another PRs, I can do the following to str_split

  • deprecation of value dependent regex determination
    currently, when regex==None, len(pat)==1 is handled as a literal string and len(pat)!=1 is handled as a regex.

  • addition of case and flag arguments

  • common handling of logic with str_replace

But for now, I think str_replace and str_split are not similar enough to share a common logic handling function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while str_split does not. str_split handles the case when regex=None by using the weird logic with len(pat),

then shouldn't we ad case and flags and make these consistent? what do we need to deprecate here?

yes there is weird logic by using len(pat) but this is new logic you are adding no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not add the len(pat) logic - it has always been there. I purposely didn't cut it out to maintain current behavior. I think that we should remove it in the future.

if len(pat) == 1:
if n is None or n == 0:
n = -1
f = lambda x: x.split(pat, n)
else:
if n is None or n == -1:
n = 0
regex = re.compile(pat)
f = lambda x: regex.split(x, maxsplit=n)

In this PR, I am simply adding the regex flag, as requested by several issues. I can go ahead and add the case and flags if you think that is a good idea.
Thanks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw that you didn't change the current len(pat) logic, which I don't like either. Maybe it is better if we are adding an explicit regex argument to permit only True, False, and remove the len(pat) logic, which only confuses things.
Of course need a release note and a versionchanged note.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to get rid of the len(pat) logic and thus the regex=None option, but I didn't want to introduce a breaking change.

If you guys want me to go ahead, what should the regex default be? I don't want to break anyone's existing code. For example, if we make regex default to True, it will break someone's code splitting on a period (pat=".")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback what do you think on breaking change here. I would be inclined to set the default to True, since it uses Regex in every case where the pattern length is is greater than 1, and in most cases where the pattern length is 1, then regex would give the same result as string anyway, so it might only break in a few cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree that removing this logic would be great, I'd be worried about changing the default to True without a deprecation. While most cases won't be affected like you mention, I'd imagine many users (who may not even know what a regex is) often split on punctuation like "." or "-". Many of the issues referenced here (or others for replace) stem from the user not realizing that the pattern is being treated as regex - I'd guess we'd see a bunch more of those if the default is changed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this is a new method we could change things, but yeah maybe this is too much for now.

ok what i think we should do is this.

factor to a common method and use it here. when we deprecate this it will deprecate in both places (yes even though this is a new method it is fine). its at least consistent.

@jreback jreback added this to the 1.4 milestone Oct 29, 2021
@jreback
Copy link
Contributor

jreback commented Oct 29, 2021

cc @attack68 if any comments

@attack68
Copy link
Contributor

lgtm

Copy link
Member

@lithomas1 lithomas1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM.

@attack68
Copy link
Contributor

Before merging I think it would be worth just revisiting if we want to get rid of the pattern length 1 logic - its pretty ropey, and there is a good opportunity here.

@simonjayhawkins
Copy link
Member

Before merging I think it would be worth just revisiting if we want to get rid of the pattern length 1 logic - its pretty ropey, and there is a good opportunity here.

Yes, once we have a regex kwarg, we could add a deprecation since we would have a path for users to change their code, but I don't think it needs to be done in this PR.

Some issues refer to consistency with the replace method, but mainly considering split here, I would probably like to see...

Firstly, Series.str.split to be compatible with Python str.split

So that would require that the default is to always treat sep as a literal and rename the n arg maxsplit

Secondly, add the functionality of re.split to the pandas method without affecting the above.

There are two ways to use the split capability of the Python re module, re.split(pattern, string, maxsplit=0, flags=0) and Pattern.split(string, maxsplit=0)

So I am happy that a regex kwarg is added (could perhaps be kwarg only) and also that the sep argument can now accept a compiled regex.

By accepting a compiled regex we are now implicitly adding the case and flags capability in this PR.

Adding flags kwarg (in a follow-up) would make pandas split with regex=True and a string passed to sep consistent with Python re.split(pattern, string, maxsplit=0, flags=0)

And also adding a case kwarg as a special case of flags would make this consistent with pandas replace

I think these points are covered by https://github.com/pandas-dev/pandas/pull/44185/files#r739415484 and can be done as a follow-up.

@attack68 anything else preventing merging this


Thanks @saehuihwang for the PR.

@attack68
Copy link
Contributor

attack68 commented Nov 1, 2021

no , think thats a great plan and lgtm

Copy link
Contributor

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@jreback jreback merged commit 669acb4 into pandas-dev:master Nov 4, 2021
@jreback
Copy link
Contributor

jreback commented Nov 4, 2021

thanks @saehuihwang very nice.

would like to refactor the inner code if possible to have all of the regex handling in one place if you are interested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Strings String extension data type and string data
Projects
None yet
8 participants