ENH: added regex argument to Series.str.split #44185

saehuihwang · 2021-10-26T02:13:17Z

closes BUG: not correct work str.split #43563
closes Feature request: Allow regex flags for str.split #32835
closes DOC: more specific str.split() explanation #25549
xref BUG: Some string methods treat "." as regex, others don't #37963
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

I've preserved the current behavior as the default, regex=None: the pattern is handled as a regex unless its length is exactly 1. I believe this length-based behavior may be worth deprecating in the future.

pep8speaks · 2021-10-26T02:13:21Z

Hello @saehuihwang! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-11-02 23:23:40 UTC

attack68

Just a couple of quick comments @saehuihwang

attack68 · 2021-10-26T17:38:07Z

pandas/core/strings/accessor.py

@@ -657,7 +657,7 @@ def cat(self, others=None, sep=None, na_rep=None, join="left"):

    Parameters
    ----------
-    pat : str, optional
+    pat : str, or compiled regex optional


Suggested change

-    pat : str, or compiled regex optional
+    pat : str or compiled regex, optional

Thanks for catching this! change has been made

@attack68 sorry, I had missed the first comma. change has now been made according to your suggestion. Sorry for missing it earlier

attack68 · 2021-10-26T17:50:40Z

pandas/core/strings/accessor.py

+    >>> s = pd.Series(['fooojpgbar.jpg'])
+    >>> s.str.split(r".", expand=True)
+                0    1
+    0  fooojpgbar  jpg


Not a fan of this new example, seems confusing, and not sure what difference it is showing. Can it be simplified to better demonstrate the new features? I quite like the original actually, although text based like:

    s = pd.Series(["foo and bar plus baz"])
    s.str.split(r"and|plus", expand=True)
          0      1     2
    0  foo    bar    baz

Hi @attack68, thanks for reviewing my pull request!

I assumed that a lot of people use str.split to handle URLs and file names, so I wanted to include a similar example that has a ".xxx" at the end. The new set of examples also helps illustrate how "." can be treated as a regex or as a literal string depending on the regex flag.
What do you think? I can go ahead and include the original examples but also include a few more examples that illustrate the new feature?
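
To make the difference concrete, here is a minimal sketch that uses only Python's `str.split` and `re.split` rather than pandas itself. It assumes the literal and regex modes of `Series.str.split` behave like these two stdlib calls on a single string; the variable names are illustrative.

```python
import re

text = "fooojpgbar.jpg"

# "." taken literally: split only at the actual period.
literal_parts = text.split(".")

# "." taken as a regex: it matches any character, so every character
# becomes a separator and only empty strings remain.
regex_parts = re.split(".", text)

# Escaping the pattern recovers the literal behaviour under regex splitting.
escaped_parts = re.split(re.escape("."), text)

print(literal_parts)     # ['fooojpgbar', 'jpg']
print(set(regex_parts))  # {''}
print(escaped_parts)     # ['fooojpgbar', 'jpg']
```

The empty-string result is why an explicit flag matters for file names: the same one-character pattern gives a useful split in one mode and a useless one in the other.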

attack68 · 2021-10-26T17:51:53Z

pandas/tests/strings/test_split_partition.py

@@ -34,6 +34,27 @@ def test_split(any_string_dtype):
    exp = Series([["a", "b", "c"], ["c", "d", "e"], np.nan, ["f", "g", "h"]])
    tm.assert_series_equal(result, exp)

+    # explicit regex = True split
+    values = Series("qweqwejpgqweqwe.jpg", dtype=any_string_dtype)


is there a less confusing test to work with?

This is the test from the original issue #43563 so I thought that it might be best to include the same string. Do you think it's better to use a simpler one, perhaps the ones I put in the examples section of the doc?

do you have tests from all the issues you are closing?

Yes, the test for #43563 is included, #32835 is a feature request, and #25549 is a doc issue.

This is the test from the original issue #43563 so I thought that it might be best to include the same string. Do you think it's better to use a simpler one, perhaps the ones I put in the examples section of the doc?

My philosophy is that you can use either the same test, or modified slightly or a better test depending upon whether it suits the purpose. Whilst a subjective opinion, the text qweqwejpgqweqwe.jpg is obfuscatingly confusing and a modification to this would make the test more readable yet still applicable.

Good point! It was a bit of a struggle to come up with a good test string haha. The best I came up with was "xxxjpgzzz.jpg". Let me know if you have better ideas
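
Why a string of this shape makes a useful test can be seen in a stdlib-only sketch. It assumes the separator under test is ".jpg" (the thread does not show the pattern) and uses `re.split` and `str.split` as stand-ins for the regex and literal code paths.

```python
import re

text = "xxxjpgzzz.jpg"
pat = ".jpg"  # assumed separator; not shown in the thread

# Regex interpretation: "." matches any character, so "xjpg" is also a separator.
as_regex = re.split(pat, text)

# Literal interpretation: only the real ".jpg" suffix is a separator.
as_literal = text.split(pat)

print(as_regex)    # ['xx', 'zzz', '']
print(as_literal)  # ['xxxjpgzzz', '']
```

Because the string contains "jpg" both with and without a preceding period, the two interpretations give visibly different results, which is what a regex=True versus regex=False test needs.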

jreback · 2021-10-28T01:54:42Z

pandas/core/strings/accessor.py

@@ -669,6 +669,16 @@ def cat(self, others=None, sep=None, na_rep=None, join="left"):
        * If ``True``, return DataFrame/MultiIndex expanding dimensionality.
        * If ``False``, return Series/Index, containing lists of strings.

+    regex : bool, default None
+        Determines whether to handle the pattern as a regular expression.


add a versionadded 1.4 tag


jreback · 2021-10-28T01:57:01Z

pandas/tests/strings/test_split_partition.py

@@ -34,6 +34,27 @@ def test_split(any_string_dtype):
    exp = Series([["a", "b", "c"], ["c", "d", "e"], np.nan, ["f", "g", "h"]])
    tm.assert_series_equal(result, exp)

+    # explicit regex = True split
+    values = Series("qweqwejpgqweqwe.jpg", dtype=any_string_dtype)


do you have tests from all the issues you are closing?

jreback

looks good. if you can do that small refactor in this PR would be great.

cc @simonjayhawkins @Dr-Irv if any comments

jreback · 2021-10-29T13:20:43Z

pandas/core/strings/object_array.py

+        n=-1,
+        expand=False,
+        regex: bool | None = None,
+    ):


would be nice to factor this logic of pattern stuff into a common function to share with str_replace

pandas/pandas/core/strings/object_array.py

Lines 152 to 160 in b0992ee

    if case is False:
        # add case flag, if provided
        flags |= re.IGNORECASE
    if regex or flags or callable(repl):
        if not isinstance(pat, re.Pattern):
            if regex is False:
                pat = re.escape(pat)
            pat = re.compile(pat, flags=flags)

Are you referring to this part? Unfortunately, str_replace and str_split handle arguments quite differently. str_replace has two additional arguments case and flags, while str_split does not. str_split handles the case when regex=None by using the weird logic with len(pat), while str_replace does not.

In a separate set of PRs, I can do the following for str_split:

1. Deprecate the value-dependent regex determination. Currently, when regex is None, a pattern with len(pat) == 1 is handled as a literal string and a pattern with len(pat) != 1 is handled as a regex.
2. Add case and flags arguments.
3. Share the pattern-handling logic with str_replace.

But for now, I think str_replace and str_split are not similar enough to share a common logic handling function.

while str_split does not. str_split handles the case when regex=None by using the weird logic with len(pat),

then shouldn't we add case and flags and make these consistent? what do we need to deprecate here?

yes there is weird logic by using len(pat) but this is new logic you are adding no?

I did not add the len(pat) logic - it has always been there. I purposely didn't cut it out to maintain current behavior. I think that we should remove it in the future.

pandas/pandas/core/strings/object_array.py

Lines 317 to 325 in 2fa2d5c

    if len(pat) == 1:
        if n is None or n == 0:
            n = -1
        f = lambda x: x.split(pat, n)
    else:
        if n is None or n == -1:
            n = 0
        regex = re.compile(pat)
        f = lambda x: regex.split(x, maxsplit=n)

In this PR, I am simply adding the regex flag, as requested by several issues. I can go ahead and add the case and flags if you think that is a good idea.
Thanks
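
The length-based dispatch quoted above can be tried on a single string with the sketch below. `legacy_split` is a hypothetical stand-alone function written for illustration, not pandas code; it reproduces only the quoted branch and leaves out any handling the snippet does not show.

```python
import re


def legacy_split(text, pat, n=-1):
    """Hypothetical single-string version of the quoted len(pat) branch."""
    if len(pat) == 1:
        # One-character patterns are literal; str.split uses -1 for "no limit".
        if n is None or n == 0:
            n = -1
        return text.split(pat, n)
    # Any other length is compiled as a regex; re.split uses 0 for "no limit".
    if n is None or n == -1:
        n = 0
    return re.compile(pat).split(text, maxsplit=n)


print(legacy_split("a.b.c", "."))      # ['a', 'b', 'c']
print(legacy_split("a.b.c", ".", n=1)) # ['a', 'b.c']
print(legacy_split("a1b22c", r"\d+"))  # ['a', 'b', 'c']
print(legacy_split("a.b.c", ".."))     # ['', '', 'c']
```

The first and last calls show the surprise under discussion: adding one character to the pattern silently switches "." from a literal period to a regex wildcard.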

I saw that you didn't change the current len(pat) logic, which I don't like either. Maybe it is better if we are adding an explicit regex argument to permit only True, False, and remove the len(pat) logic, which only confuses things.
Of course need a release note and a versionchanged note.

I'm happy to get rid of the len(pat) logic and thus the regex=None option, but I didn't want to introduce a breaking change.

If you guys want me to go ahead, what should the regex default be? I don't want to break anyone's existing code. For example, if we make regex default to True, it will break someone's code splitting on a period (pat=".")

@jreback what do you think about a breaking change here? I would be inclined to set the default to True, since it uses regex in every case where the pattern length is greater than 1, and in most cases where the pattern length is 1 a regex would give the same result as a literal string anyway, so it might only break in a few cases.

While I agree that removing this logic would be great, I'd be worried about changing the default to True without a deprecation. While most cases won't be affected like you mention, I'd imagine many users (who may not even know what a regex is) often split on punctuation like "." or "-". Many of the issues referenced here (or others for replace) stem from the user not realizing that the pattern is being treated as regex - I'd guess we'd see a bunch more of those if the default is changed

since this is a new method we could change things, but yeah maybe this is too much for now.

ok what i think we should do is this.

factor to a common method and use it here. when we deprecate this it will deprecate in both places (yes even though this is a new method it is fine). it's at least consistent.
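
One possible shape for such a common method, as a sketch only: `compile_pattern` is a hypothetical name, not an existing pandas function, and raising `ValueError` for a compiled pattern with `regex=False` is an assumption drawn from the commit notes in this thread rather than from the merged code.

```python
import re


def compile_pattern(pat, regex=True, case=True, flags=0):
    """Hypothetical shared helper: turn pat into a compiled pattern."""
    if isinstance(pat, re.Pattern):
        if regex is False:
            # Assumed behaviour: a compiled pattern cannot be treated as literal.
            raise ValueError("compiled pattern passed with regex=False")
        return pat
    if case is False:
        # Case-insensitivity is expressed as a regex flag.
        flags |= re.IGNORECASE
    if regex is False:
        # Literal matching: escape regex metacharacters before compiling.
        pat = re.escape(pat)
    return re.compile(pat, flags=flags)


print(compile_pattern(".", regex=False).split("a.b.c"))  # ['a', 'b', 'c']
print(compile_pattern("X", case=False).split("axbXc"))   # ['a', 'b', 'c']
```

Routing both split and replace through one function like this would mean a later deprecation only has to be implemented once.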

jreback · 2021-10-29T13:21:48Z

cc @attack68 if any comments

attack68 · 2021-10-29T16:12:05Z

lgtm

lithomas1

Generally LGTM.


attack68 · 2021-10-31T18:04:12Z

Before merging I think it would be worth just revisiting if we want to get rid of the pattern length 1 logic - its pretty ropey, and there is a good opportunity here.

simonjayhawkins · 2021-11-01T10:37:29Z

Before merging I think it would be worth just revisiting if we want to get rid of the pattern length 1 logic - its pretty ropey, and there is a good opportunity here.

Yes, once we have a regex kwarg, we could add a deprecation since we would have a path for users to change their code, but I don't think it needs to be done in this PR.

Some issues refer to consistency with the replace method, but mainly considering split here, I would probably like to see...

Firstly, Series.str.split to be compatible with Python str.split

So that would require that the default is to always treat sep as a literal and rename the n arg maxsplit

Secondly, add the functionality of re.split to the pandas method without affecting the above.

There are two ways to use the split capability of the Python re module, re.split(pattern, string, maxsplit=0, flags=0) and Pattern.split(string, maxsplit=0)

So I am happy that a regex kwarg is added (could perhaps be kwarg only) and also that the sep argument can now accept a compiled regex.

By accepting a compiled regex we are now implicitly adding the case and flags capability in this PR.

Adding flags kwarg (in a follow-up) would make pandas split with regex=True and a string passed to sep consistent with Python re.split(pattern, string, maxsplit=0, flags=0)

And also adding a case kwarg as a special case of flags would make this consistent with pandas replace
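
The two stdlib entry points mentioned above, and the way a compiled pattern carries its flags with it, can be sketched as follows. The sample text and variable names are illustrative.

```python
import re

text = "foo AND bar and baz"

# Flags baked into a compiled pattern, then Pattern.split(string, maxsplit=0).
compiled = re.compile(r"\s+and\s+", flags=re.IGNORECASE)
via_pattern = compiled.split(text)

# Same result through re.split(pattern, string, maxsplit=0, flags=0).
via_module = re.split(r"\s+and\s+", text, maxsplit=0, flags=re.IGNORECASE)

# maxsplit caps the number of splits; 0 means no limit.
first_only = compiled.split(text, maxsplit=1)

print(via_pattern)  # ['foo', 'bar', 'baz']
print(via_module)   # ['foo', 'bar', 'baz']
print(first_only)   # ['foo', 'bar and baz']
```

This is the sense in which accepting a compiled regex already gives callers case-insensitive and flagged splitting without separate keyword arguments.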

I think these points are covered by https://github.com/pandas-dev/pandas/pull/44185/files#r739415484 and can be done as a follow-up.

@attack68 anything else preventing merging this

Thanks @saehuihwang for the PR.

attack68 · 2021-11-01T16:07:59Z

no, think that's a great plan and lgtm


Dr-Irv

lgtm

jreback · 2021-11-04T00:40:54Z

thanks @saehuihwang very nice.

would like to refactor the inner code if possible to have all of the regex handling in one place if you are interested.

Saehui Hwang added 13 commits October 15, 2021 00:23

BUG: sort_index did not respect ignore_index when not sorting

f55c968

BUG: sort_index did not respect ignore_index when not sorting

e56f8fb

BUG: sort_index did not respect ignore_index when not sorting

282ef51

BUG: sort_index did not respect ignore_index when not sorting

2c5402e

Merge remote-tracking branch 'upstream/master' into sort_index

609a77f

moved test to frame test directory

837427b

parameterized over inplace and ignore_index

7523b1b

BUG fix split

e1a0aa7

ENH: added regex argument to Series.str.split

d7b3d8e

format change

20dc2a6

resolve conflict

03eaa90

resolve conflict

0b139f3

ENH: added regex argument to Series.str.split

1604915

Saehui Hwang added 3 commits October 25, 2021 19:15

changed whatsnew

8312d79

fixed mypy error

a82639c

more specific docs

76e6001

attack68 requested changes Oct 26, 2021


Saehui Hwang added 2 commits October 27, 2021 10:45

added example

2c43fb5

Merge remote-tracking branch 'upstream/master' into str_split

e95416d

jreback requested changes Oct 28, 2021


jreback added Enhancement Strings String extension data type and string data labels Oct 28, 2021

Saehui Hwang added 3 commits October 27, 2021 22:13

changed doc to match str_replace, moved tests to a new test func

2ed7980

changed test string to be readable

5f0d8df

changed test string to be readable

ba812a1

jreback approved these changes Oct 29, 2021


jreback added this to the 1.4 milestone Oct 29, 2021

lithomas1 reviewed Oct 30, 2021


added test for raises error when regex=False and pat is regex

e2da861

lithomas1 approved these changes Oct 31, 2021


Merge remote-tracking branch 'upstream/master' into str_split

2855fa8

Saehui Hwang added 3 commits November 1, 2021 21:53

Merge remote-tracking branch 'upstream/master' into str_split

ed37375

added test for explicit regex=True with compiled regex

057dcfb

got rid of unnecessary comma in doc string

ece00f1

simonjayhawkins reviewed Nov 2, 2021


Dr-Irv reviewed Nov 2, 2021


attack68 approved these changes Nov 2, 2021


Saehui Hwang added 2 commits November 2, 2021 10:39

added compiled regex example, changed logic so that becomes true when…

b6bbf3e

… passed in compiled regex

corrected docs

27ffee7

Dr-Irv approved these changes Nov 2, 2021


Merge remote-tracking branch 'upstream/master' into str_split

b97ebe9

simonjayhawkins approved these changes Nov 3, 2021


mzeitlin11 approved these changes Nov 3, 2021


jreback merged commit 669acb4 into pandas-dev:master Nov 4, 2021

MarcoGorelli mentioned this pull request Mar 23, 2022

DOC: remove regex argument from .str.rsplit docs #46488

Closed
