ENH: added regex argument to Series.str.split #44185

Merged
merged 29 commits into from
Nov 4, 2021
f55c968
BUG: sort_index did not respect ignore_index when not sorting
Oct 15, 2021
e56f8fb
BUG: sort_index did not respect ignore_index when not sorting
Oct 15, 2021
282ef51
BUG: sort_index did not respect ignore_index when not sorting
Oct 15, 2021
2c5402e
BUG: sort_index did not respect ignore_index when not sorting
Oct 15, 2021
609a77f
Merge remote-tracking branch 'upstream/master' into sort_index
Oct 17, 2021
837427b
moved test to frame test directory
Oct 17, 2021
7523b1b
parameterized over inplace and ignore_index
Oct 19, 2021
e1a0aa7
BUG fix split
Oct 25, 2021
d7b3d8e
ENH: added regex argument to Series.str.split
Oct 25, 2021
20dc2a6
format change
Oct 25, 2021
03eaa90
resolve conflict
Oct 25, 2021
0b139f3
resolve conflict
Oct 25, 2021
1604915
ENH: added regex argument to Series.str.split
Oct 26, 2021
8312d79
changed whatsnew
Oct 26, 2021
a82639c
fixed mypy error
Oct 26, 2021
76e6001
more specific docs
Oct 26, 2021
2c43fb5
added example
Oct 27, 2021
e95416d
Merge remote-tracking branch 'upstream/master' into str_split
Oct 28, 2021
2ed7980
changed doc to match str_replace, moved tests to a new test func
Oct 28, 2021
5f0d8df
changed test string to be readable
Oct 28, 2021
ba812a1
changed test string to be readable
Oct 28, 2021
e2da861
added test for raises error when regex=False and pat is regex
Oct 31, 2021
2855fa8
Merge remote-tracking branch 'upstream/master' into str_split
Oct 31, 2021
ed37375
Merge remote-tracking branch 'upstream/master' into str_split
Nov 2, 2021
057dcfb
added test for explicit regex=True with compiled regex
Nov 2, 2021
ece00f1
got rid of unnecessary comma in doc string
Nov 2, 2021
b6bbf3e
added compiled regex example, changed logic so that becomes true whe…
Nov 2, 2021
27ffee7
corrected docs
Nov 2, 2021
b97ebe9
Merge remote-tracking branch 'upstream/master' into str_split
Nov 2, 2021
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.4.0.rst
@@ -180,6 +180,7 @@ Other enhancements
 - :meth:`DataFrame.__pos__`, :meth:`DataFrame.__neg__` now retain ``ExtensionDtype`` dtypes (:issue:`43883`)
 - The error raised when an optional dependency can't be imported now includes the original exception, for easier investigation (:issue:`43882`)
 - Added :meth:`.ExponentialMovingWindow.sum` (:issue:`13297`)
+- :meth:`Series.str.split` now supports a ``regex`` argument that explicitly specifies whether the pattern is a regular expression. Default is ``None`` (:issue:`43563`, :issue:`32835`, :issue:`25549`)
 - :meth:`DataFrame.dropna` now accepts a single label as ``subset`` along with array-like (:issue:`41021`)
 -

90 changes: 75 additions & 15 deletions pandas/core/strings/accessor.py
@@ -659,11 +659,11 @@ def cat(self, others=None, sep=None, na_rep=None, join="left"):
     Split strings around given separator/delimiter.

     Splits the string in the Series/Index from the %(side)s,
-    at the specified delimiter string. Equivalent to :meth:`str.%(method)s`.
+    at the specified delimiter string.

     Parameters
     ----------
-    pat : str, optional
+    pat : str or compiled regex, optional
         String or regular expression to split on.
         If not specified, split on whitespace.
     n : int, default -1 (all)
@@ -672,14 +672,30 @@ def cat(self, others=None, sep=None, na_rep=None, join="left"):
     expand : bool, default False
         Expand the split strings into separate columns.

-        * If ``True``, return DataFrame/MultiIndex expanding dimensionality.
-        * If ``False``, return Series/Index, containing lists of strings.
+        - If ``True``, return DataFrame/MultiIndex expanding dimensionality.
+        - If ``False``, return Series/Index, containing lists of strings.
+
+    regex : bool, default None
+        Determines if the passed-in pattern is a regular expression:
+
+        - If ``True``, assumes the passed-in pattern is a regular expression
+        - If ``False``, treats the pattern as a literal string.
+        - If ``None`` and `pat` length is 1, treats `pat` as a literal string.
+        - If ``None`` and `pat` length is not 1, treats `pat` as a regular expression.
+        - Cannot be set to False if `pat` is a compiled regex
+
+        .. versionadded:: 1.4.0

     Returns
     -------
     Series, Index, DataFrame or MultiIndex
         Type matches caller unless ``expand=True`` (see Notes).

+    Raises
+    ------
+    ValueError
+        * if `regex` is False and `pat` is a compiled regex
+
     See Also
     --------
     Series.str.split : Split strings around given separator/delimiter.
@@ -702,6 +718,9 @@ def cat(self, others=None, sep=None, na_rep=None, join="left"):
     If using ``expand=True``, Series and Index callers return DataFrame and
     MultiIndex objects, respectively.

+    Use of `regex=False` with a `pat` as a compiled regex will raise
+    an error.
+
     Examples
     --------
     >>> s = pd.Series(
@@ -776,22 +795,63 @@ def cat(self, others=None, sep=None, na_rep=None, join="left"):
     1 https://docs.python.org/3/tutorial index.html
     2 NaN NaN

-    Remember to escape special characters when explicitly using regular
-    expressions.
+    Remember to escape special characters when explicitly using regular expressions.

-    >>> s = pd.Series(["1+1=2"])
-    >>> s
-    0 1+1=2
-    dtype: object
-    >>> s.str.split(r"\+|=", expand=True)
-    0 1 2
-    0 1 1 2
+    >>> s = pd.Series(["foo and bar plus baz"])
+    >>> s.str.split(r"and|plus", expand=True)
+    0 1 2
+    0 foo bar baz

     Regular expressions can be used to handle urls or file names.
+    When `pat` is a string and ``regex=None`` (the default), the given `pat` is compiled
+    as a regex only if ``len(pat) != 1``.
+
+    >>> s = pd.Series(['foojpgbar.jpg'])
+    >>> s.str.split(r".", expand=True)
+    0 1
+    0 foojpgbar jpg
+
+    >>> s.str.split(r"\.jpg", expand=True)
+    0 1
+    0 foojpgbar
+
+    When ``regex=True``, `pat` is interpreted as a regex
+
+    >>> s.str.split(r"\.jpg", regex=True, expand=True)
+    0 1
+    0 foojpgbar
+
+    A compiled regex can be passed as `pat`
+
+    >>> import re
+    >>> s.str.split(re.compile(r"\.jpg"), expand=True)
+    0 1
+    0 foojpgbar
+
+    When ``regex=False``, `pat` is interpreted as the string itself
+
+    >>> s.str.split(r"\.jpg", regex=False, expand=True)
+    0
+    0 foojpgbar.jpg
     """

     @Appender(_shared_docs["str_split"] % {"side": "beginning", "method": "split"})
     @forbid_nonstring_types(["bytes"])
-    def split(self, pat=None, n=-1, expand=False):
-        result = self._data.array._str_split(pat, n, expand)
+    def split(
+        self,
+        pat: str | re.Pattern | None = None,
+        n=-1,
+        expand=False,
+        *,
+        regex: bool | None = None,
+    ):
+        if regex is False and is_re(pat):
+            raise ValueError(
+                "Cannot use a compiled regex as replacement pattern with regex=False"
+            )
+        if is_re(pat):
+            regex = True
+        result = self._data.array._str_split(pat, n, expand, regex)
         return self._wrap_result(result, returns_string=expand, expand=expand)

     @Appender(_shared_docs["str_split"] % {"side": "end", "method": "rsplit"})
31 changes: 24 additions & 7 deletions pandas/core/strings/object_array.py
@@ -308,21 +308,38 @@ def f(x):

         return self._str_map(f)

-    def _str_split(self, pat=None, n=-1, expand=False):
+    def _str_split(
+        self,
+        pat: str | re.Pattern | None = None,
+        n=-1,
+        expand=False,
+        regex: bool | None = None,
+    ):
Contributor

would be nice to factor this logic of pattern stuff into a common function to share with str_replace

Contributor Author

@saehuihwang, Oct 29, 2021

        if case is False:
            # add case flag, if provided
            flags |= re.IGNORECASE
        if regex or flags or callable(repl):
            if not isinstance(pat, re.Pattern):
                if regex is False:
                    pat = re.escape(pat)
                pat = re.compile(pat, flags=flags)

Are you referring to this part? Unfortunately, str_replace and str_split handle arguments quite differently. str_replace has two additional arguments case and flags, while str_split does not. str_split handles the case when regex=None by using the weird logic with len(pat), while str_replace does not.

In a separate set of PRs, I can make the following changes to str_split:

  • Deprecate the value-dependent regex determination.
    Currently, when regex is None, a pattern with len(pat) == 1 is handled as a literal string and a pattern with len(pat) != 1 is handled as a regex.

  • Add case and flags arguments.

  • Share the pattern-handling logic with str_replace.

But for now, I think str_replace and str_split are not similar enough to share a common logic handling function.

Contributor

while str_split does not. str_split handles the case when regex=None by using the weird logic with len(pat),

then shouldn't we add case and flags and make these consistent? what do we need to deprecate here?

yes, there is weird logic using len(pat), but this is new logic you are adding, no?

Contributor Author

I did not add the len(pat) logic - it has always been there. I purposely didn't cut it out to maintain current behavior. I think that we should remove it in the future.

            if len(pat) == 1:
                if n is None or n == 0:
                    n = -1
                f = lambda x: x.split(pat, n)
            else:
                if n is None or n == -1:
                    n = 0
                regex = re.compile(pat)
                f = lambda x: regex.split(x, maxsplit=n)

In this PR, I am simply adding the regex flag, as requested by several issues. I can go ahead and add the case and flags if you think that is a good idea.
Thanks
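
For readers following the thread, the pre-existing dispatch quoted in this comment can be reproduced with the standard library alone. This is only a minimal sketch and not the pandas implementation: `legacy_split` is a hypothetical name, it works on a single string rather than an array, and the `pat is None` branch is folded in the way the surrounding diff handles it.

```python
import re


def legacy_split(value, pat=None, n=-1):
    """Hypothetical stand-in for the old, value-dependent dispatch."""
    if pat is None or len(pat) == 1:
        # Whitespace or a single-character pattern: plain str.split.
        if n is None or n == 0:
            n = -1
        return value.split(pat, n)
    # Any longer pattern is always compiled as a regular expression.
    if n is None or n == -1:
        n = 0
    return re.compile(pat).split(value, maxsplit=n)


# One character: the dot is literal.
print(legacy_split("xxxjpgzzz.jpg", "."))     # ['xxxjpgzzz', 'jpg']
# Four characters: the dot now matches any character.
print(legacy_split("xxxjpgzzz.jpg", ".jpg"))  # ['xx', 'zzz', '']
```

The second call is the surprise reported in the linked issues: the same `.` changes meaning once the pattern grows past one character.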

Contributor

I saw that you didn't change the current len(pat) logic, which I don't like either. Since we are adding an explicit regex argument, it may be better to permit only True and False and to remove the len(pat) logic, which only confuses things.
Of course, that would need a release note and a versionchanged note.

Contributor Author

I'm happy to get rid of the len(pat) logic and thus the regex=None option, but I didn't want to introduce a breaking change.

If you guys want me to go ahead, what should the regex default be? I don't want to break anyone's existing code. For example, if we make regex default to True, it will break someone's code splitting on a period (pat=".")
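
The `pat="."` concern can be illustrated without pandas. The sketch below uses only the standard library and assumes that a regex default would behave like `re.split` on each element, which is how the regex branch in this diff works.

```python
import re

text = "a.b.c"

literal = text.split(".")                 # current result for pat="."
as_regex = re.split(".", text)            # the same pat treated as a regex
escaped = re.split(re.escape("."), text)  # what users would need to write instead

print(literal)   # ['a', 'b', 'c']
print(as_regex)  # ['', '', '', '', '', '']
print(escaped)   # ['a', 'b', 'c']
```

As a regex, `.` matches every character, so every character becomes a separator and only empty strings remain.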

Contributor

@jreback what do you think about a breaking change here? I would be inclined to set the default to True, since a regex is already used whenever the pattern length is greater than 1, and in most cases where the pattern length is 1 a regex would give the same result as a literal string anyway, so it might only break in a few cases.

Member

While I agree that removing this logic would be great, I'd be worried about changing the default to True without a deprecation. While most cases won't be affected like you mention, I'd imagine many users (who may not even know what a regex is) often split on punctuation like "." or "-". Many of the issues referenced here (or others for replace) stem from the user not realizing that the pattern is being treated as regex - I'd guess we'd see a bunch more of those if the default is changed
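
To get a rough sense of how many single-character separators would be affected, one could compare a literal split with a regex split for each ASCII punctuation character. This is a standard-library sketch, not part of the PR; `classify` and `report` are hypothetical names, and the sample string `a<ch>b` is an arbitrary choice.

```python
import re
import string


def classify(ch):
    """Compare a literal split with a regex split for one character."""
    sample = f"a{ch}b"
    try:
        as_regex = re.split(ch, sample)
    except re.error:
        # The character is not a valid pattern on its own, e.g. "*".
        return "error"
    return "same" if as_regex == sample.split(ch) else "changed"


report = {"same": [], "changed": [], "error": []}
for ch in string.punctuation:
    report[classify(ch)].append(ch)

for outcome, chars in report.items():
    print(outcome, "".join(chars))
```

Characters such as `-` land in the "same" group, `.` lands in "changed", and `*` fails to compile at all, which matches the worry about users who split on punctuation.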

Contributor

since this is a new method we could change things, but yeah maybe this is too much for now.

ok what i think we should do is this.

factor to a common method and use it here. when we deprecate this it will deprecate in both places (yes even though this is a new method it is fine). it's at least consistent.
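
One possible shape for such a common method is sketched below. It is an assumption for illustration only: `resolve_split_pattern` and `split_one` are hypothetical names that do not exist in pandas, the error message is simplified, and the `case`/`flags` handling of str_replace is left out. The branching mirrors the `regex` / `len(pat)` rules added in this PR.

```python
import re


def resolve_split_pattern(pat, regex=None):
    """Hypothetical helper: return a str for literal splitting,
    or a compiled pattern for regex splitting."""
    if isinstance(pat, re.Pattern):
        if regex is False:
            raise ValueError("compiled regex passed with regex=False")
        return pat
    if regex is True:
        return re.compile(pat)
    if regex is False:
        return pat
    # regex is None: keep the legacy length-based rule.
    return pat if len(pat) == 1 else re.compile(pat)


def split_one(value, pat, n=-1, regex=None):
    """Split a single string using the resolved pattern."""
    resolved = resolve_split_pattern(pat, regex)
    if isinstance(resolved, re.Pattern):
        # re uses maxsplit=0 for "no limit".
        return resolved.split(value, maxsplit=0 if n in (None, -1) else n)
    # str.split uses -1 for "no limit".
    return value.split(resolved, -1 if n in (None, 0) else n)


print(split_one("xxxjpgzzz.jpg", r"\.jpg", regex=True))   # ['xxxjpgzzz', '']
print(split_one("xxxjpgzzz.jpg", r"\.jpg", regex=False))  # ['xxxjpgzzz.jpg']
```

Keeping the decision in one function means a later deprecation of the `regex=None` branch would only need to touch that function.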

         if pat is None:
             if n is None or n == 0:
                 n = -1
             f = lambda x: x.split(pat, n)
         else:
-            if len(pat) == 1:
-                if n is None or n == 0:
-                    n = -1
-                f = lambda x: x.split(pat, n)
+            new_pat: str | re.Pattern
+            if regex is True or isinstance(pat, re.Pattern):
+                new_pat = re.compile(pat)
+            elif regex is False:
+                new_pat = pat
+            # regex is None so link to old behavior #43563
             else:
+                if len(pat) == 1:
+                    new_pat = pat
+                else:
+                    new_pat = re.compile(pat)
+
+            if isinstance(new_pat, re.Pattern):
                 if n is None or n == -1:
                     n = 0
-                regex = re.compile(pat)
-                f = lambda x: regex.split(x, maxsplit=n)
+                f = lambda x: new_pat.split(x, maxsplit=n)
+            else:
+                if n is None or n == 0:
+                    n = -1
+                f = lambda x: x.split(pat, n)
         return self._str_map(f, dtype=object)

     def _str_rsplit(self, pat=None, n=-1):
39 changes: 39 additions & 0 deletions pandas/tests/strings/test_split_partition.py
@@ -1,4 +1,5 @@
 from datetime import datetime
+import re

 import numpy as np
 import pytest
@@ -35,6 +36,44 @@ def test_split(any_string_dtype):
     tm.assert_series_equal(result, exp)


+def test_split_regex(any_string_dtype):
+    # GH 43563
+    # explicit regex = True split
+    values = Series("xxxjpgzzz.jpg", dtype=any_string_dtype)
+    result = values.str.split(r"\.jpg", regex=True)
+    exp = Series([["xxxjpgzzz", ""]])
+    tm.assert_series_equal(result, exp)
+
+    # explicit regex = True split with compiled regex
+    regex_pat = re.compile(r".jpg")
+    values = Series("xxxjpgzzz.jpg", dtype=any_string_dtype)
+    result = values.str.split(regex_pat)
+    exp = Series([["xx", "zzz", ""]])
+    tm.assert_series_equal(result, exp)
+
+    # explicit regex = False split
+    result = values.str.split(r"\.jpg", regex=False)
+    exp = Series([["xxxjpgzzz.jpg"]])
+    tm.assert_series_equal(result, exp)
+
+    # non explicit regex split, pattern length == 1
+    result = values.str.split(r".")
+    exp = Series([["xxxjpgzzz", "jpg"]])
+    tm.assert_series_equal(result, exp)
+
+    # non explicit regex split, pattern length != 1
+    result = values.str.split(r".jpg")
+    exp = Series([["xx", "zzz", ""]])
+    tm.assert_series_equal(result, exp)
+
+    # regex=False with pattern compiled regex raises error
+    with pytest.raises(
+        ValueError,
+        match="Cannot use a compiled regex as replacement pattern with regex=False",
+    ):
+        values.str.split(regex_pat, regex=False)
+
+
 def test_split_object_mixed():
     mixed = Series(["a_b_c", np.nan, "d_e_f", True, datetime.today(), None, 1, 2.0])
     result = mixed.str.split("_")