-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: added regex argument to Series.str.split #44185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 16 commits
f55c968
e56f8fb
282ef51
2c5402e
609a77f
837427b
7523b1b
e1a0aa7
d7b3d8e
20dc2a6
03eaa90
0b139f3
1604915
8312d79
a82639c
76e6001
2c43fb5
e95416d
2ed7980
5f0d8df
ba812a1
e2da861
2855fa8
ed37375
057dcfb
ece00f1
b6bbf3e
27ffee7
b97ebe9
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -657,7 +657,7 @@ def cat(self, others=None, sep=None, na_rep=None, join="left"): | |
|
||
Parameters | ||
---------- | ||
pat : str, optional | ||
pat : str, or compiled regex optional | ||
String or regular expression to split on. | ||
If not specified, split on whitespace. | ||
n : int, default -1 (all) | ||
|
@@ -669,6 +669,18 @@ def cat(self, others=None, sep=None, na_rep=None, join="left"): | |
* If ``True``, return DataFrame/MultiIndex expanding dimensionality. | ||
* If ``False``, return Series/Index, containing lists of strings. | ||
|
||
regex : bool, default None | ||
Determines whether to handle the pattern as a regular expression. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. add a versionadded 1.4 tag There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. okay! |
||
If ``pat`` is a compiled regular expression, it is interpreted as a | ||
regular expression regardless of ``regex`` | ||
|
||
jreback marked this conversation as resolved.
Show resolved
Hide resolved
|
||
* If ``True``, assumes the passed-in pattern is a regular expression | ||
* If ``False``, treats the pattern as a literal string. | ||
* If ``None`` and the pattern length is 1, treats the pattern as a | ||
literal string. | ||
* If ``None`` and the pattern length is not 1, treats the pattern as | ||
a regular expression. | ||
|
||
Returns | ||
------- | ||
Series, Index, DataFrame or MultiIndex | ||
|
@@ -770,22 +782,44 @@ def cat(self, others=None, sep=None, na_rep=None, join="left"): | |
1 https://docs.python.org/3/tutorial index.html | ||
2 NaN NaN | ||
|
||
Remember to escape special characters when explicitly using regular | ||
expressions. | ||
Remember to escape special characters when explicitly using regular expressions. | ||
When `pat` is a string and ``regex=None`` (the default), the given `pat` is compiled | ||
as a regex only if ``len(pat) != 1``. | ||
|
||
>>> s = pd.Series(["1+1=2"]) | ||
>>> s | ||
0 1+1=2 | ||
dtype: object | ||
>>> s.str.split(r"\+|=", expand=True) | ||
0 1 2 | ||
0 1 1 2 | ||
>>> s = pd.Series(['fooojpgbar.jpg']) | ||
>>> s.str.split(r".", expand=True) | ||
0 1 | ||
0 fooojpgbar jpg | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not a fan of this new example, seems confusing, and not sure what difference it is showing. Can it be simplified to better demonstrate the new features? I quite like the original actually, although text based like:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hi @attack68, thanks for reviewing my pull request! I assumed that a lot of people use st.split to handle urls and file names, so I wanted to include a similar example that includes a ".xxx" at the end. The new set of examples also helps illustrate "." be used as a regex and as a regular string depending on the |
||
>>> s.str.split(r"\.jpg", expand=True) | ||
0 1 | ||
0 fooojpgbar | ||
>>> s.str.split(r".jpg", expand=True) | ||
0 1 2 | ||
0 foo bar | ||
|
||
When ``regex=True``, `pat` is interpreted as a regex | ||
|
||
>>> s.str.split(r"\.jpg", regex=True, expand=True) | ||
0 1 | ||
0 fooojpgbar | ||
|
||
When ``regex=False``, `pat` is interpreted as the string itself | ||
|
||
>>> s.str.split(r"\.jpg", regex=False, expand=True) | ||
0 | ||
0 fooojpgbar.jpg | ||
""" | ||
|
||
@Appender(_shared_docs["str_split"] % {"side": "beginning", "method": "split"}) | ||
@forbid_nonstring_types(["bytes"]) | ||
def split(self, pat=None, n=-1, expand=False): | ||
result = self._data.array._str_split(pat, n, expand) | ||
def split( | ||
self, | ||
pat: str | re.Pattern | None = None, | ||
n=-1, | ||
expand=False, | ||
regex: bool | None = None, | ||
jreback marked this conversation as resolved.
Show resolved
Hide resolved
|
||
): | ||
result = self._data.array._str_split(pat, n, expand, regex) | ||
return self._wrap_result(result, returns_string=expand, expand=expand) | ||
|
||
@Appender(_shared_docs["str_split"] % {"side": "end", "method": "rsplit"}) | ||
|
Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -308,21 +308,38 @@ def f(x): | |||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||
return self._str_map(f) | ||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||
def _str_split(self, pat=None, n=-1, expand=False): | ||||||||||||||||||||||||||||||||||||||
def _str_split( | ||||||||||||||||||||||||||||||||||||||
self, | ||||||||||||||||||||||||||||||||||||||
pat: str | re.Pattern | None = None, | ||||||||||||||||||||||||||||||||||||||
n=-1, | ||||||||||||||||||||||||||||||||||||||
expand=False, | ||||||||||||||||||||||||||||||||||||||
regex: bool | None = None, | ||||||||||||||||||||||||||||||||||||||
): | ||||||||||||||||||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. would be nice to factor this logic of pattern stuff into a common function to share with str_replace There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. pandas/pandas/core/strings/object_array.py Lines 152 to 160 in b0992ee
Are you referring to this part? Unfortunately, str_replace and str_split handle arguments quite differently. str_replace has two additional arguments In a set of another PRs, I can do the following to str_split
But for now, I think str_replace and str_split are not similar enough to share a common logic handling function. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
then shouldn't we ad yes there is weird logic by using There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I did not add the len(pat) logic - it has always been there. I purposely didn't cut it out to maintain current behavior. I think that we should remove it in the future. pandas/pandas/core/strings/object_array.py Lines 317 to 325 in 2fa2d5c
In this PR, I am simply adding the regex flag, as requested by several issues. I can go ahead and add the case and flags if you think that is a good idea.Thanks There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I saw that you didn't change the current There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm happy to get rid of the len(pat) logic and thus the If you guys want me to go ahead, what should the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jreback what do you think on breaking change here. I would be inclined to set the default to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. While I agree that removing this logic would be great, I'd be worried about changing the default to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. since this is a new method we could change things, but yeah maybe this is too much for now. ok what i think we should do is this. factor to a common method and use it here. when we deprecate this it will deprecate in both places (yes even though this is a new method it is fine). its at least consistent. |
||||||||||||||||||||||||||||||||||||||
if pat is None: | ||||||||||||||||||||||||||||||||||||||
if n is None or n == 0: | ||||||||||||||||||||||||||||||||||||||
n = -1 | ||||||||||||||||||||||||||||||||||||||
f = lambda x: x.split(pat, n) | ||||||||||||||||||||||||||||||||||||||
else: | ||||||||||||||||||||||||||||||||||||||
if len(pat) == 1: | ||||||||||||||||||||||||||||||||||||||
if n is None or n == 0: | ||||||||||||||||||||||||||||||||||||||
n = -1 | ||||||||||||||||||||||||||||||||||||||
f = lambda x: x.split(pat, n) | ||||||||||||||||||||||||||||||||||||||
new_pat: str | re.Pattern | ||||||||||||||||||||||||||||||||||||||
if regex is True or isinstance(pat, re.Pattern): | ||||||||||||||||||||||||||||||||||||||
new_pat = re.compile(pat) | ||||||||||||||||||||||||||||||||||||||
elif regex is False: | ||||||||||||||||||||||||||||||||||||||
new_pat = pat | ||||||||||||||||||||||||||||||||||||||
# regex is None so link to old behavior #43563 | ||||||||||||||||||||||||||||||||||||||
else: | ||||||||||||||||||||||||||||||||||||||
if len(pat) == 1: | ||||||||||||||||||||||||||||||||||||||
new_pat = pat | ||||||||||||||||||||||||||||||||||||||
else: | ||||||||||||||||||||||||||||||||||||||
new_pat = re.compile(pat) | ||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||
if isinstance(new_pat, re.Pattern): | ||||||||||||||||||||||||||||||||||||||
if n is None or n == -1: | ||||||||||||||||||||||||||||||||||||||
n = 0 | ||||||||||||||||||||||||||||||||||||||
regex = re.compile(pat) | ||||||||||||||||||||||||||||||||||||||
f = lambda x: regex.split(x, maxsplit=n) | ||||||||||||||||||||||||||||||||||||||
f = lambda x: new_pat.split(x, maxsplit=n) | ||||||||||||||||||||||||||||||||||||||
else: | ||||||||||||||||||||||||||||||||||||||
if n is None or n == 0: | ||||||||||||||||||||||||||||||||||||||
n = -1 | ||||||||||||||||||||||||||||||||||||||
f = lambda x: x.split(pat, n) | ||||||||||||||||||||||||||||||||||||||
return self._str_map(f, dtype=object) | ||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||
def _str_rsplit(self, pat=None, n=-1): | ||||||||||||||||||||||||||||||||||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -34,6 +34,27 @@ def test_split(any_string_dtype): | |
exp = Series([["a", "b", "c"], ["c", "d", "e"], np.nan, ["f", "g", "h"]]) | ||
tm.assert_series_equal(result, exp) | ||
|
||
# explicit regex = True split | ||
jreback marked this conversation as resolved.
Show resolved
Hide resolved
|
||
values = Series("qweqwejpgqweqwe.jpg", dtype=any_string_dtype) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is there a less confusing test to work with? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is the test from the original issue #43563 so I thought that it might be best to include the same string. Do you think it's better to use a simpler one, perhaps the ones I put in the examples section of the doc? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do you have tests from all the issues you are closing? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
My philosophy is that you can use either the same test, or modified slightly or a better test depending upon whether it suits the purpose. Whilst a subjective opinion, the text There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good point! It was a bit of a struggle to come up with a good test string haha. The best I came up with was "xxxjpgzzz.jpg". Let me know if you have better ideas |
||
result = values.str.split(r"\.jpg", regex=True) | ||
exp = Series([["qweqwejpgqweqwe", ""]]) | ||
tm.assert_series_equal(result, exp) | ||
|
||
# explicit regex = False split | ||
result = values.str.split(r"\.jpg", regex=False) | ||
exp = Series([["qweqwejpgqweqwe.jpg"]]) | ||
tm.assert_series_equal(result, exp) | ||
|
||
# non explicit regex split, pattern length == 1 | ||
result = values.str.split(r".") | ||
exp = Series([["qweqwejpgqweqwe", "jpg"]]) | ||
tm.assert_series_equal(result, exp) | ||
|
||
# non explicit regex split, pattern length != 1 | ||
result = values.str.split(r".jpg") | ||
exp = Series([["qweqw", "qweqwe", ""]]) | ||
tm.assert_series_equal(result, exp) | ||
|
||
jreback marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
def test_split_object_mixed(): | ||
mixed = Series(["a_b_c", np.nan, "d_e_f", True, datetime.today(), None, 1, 2.0]) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this! change has been made
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@attack68 sorry, I had missed the first comma. change has now been made according to your suggestion. Sorry for missing it earlier