Bug: fix Series.str.split when 'regex=None' for series having 'pd.ArrowDtype(pa.string())' dtype #58418
@@ -2296,6 +2296,16 @@ def test_str_split_pat_none(method):
    tm.assert_series_equal(result, expected)


def test_str_split_regex_none():
    # GH 58321
    ser = pd.Series(["230/270/270", "240-290-290"], dtype=ArrowDtype(pa.string()))
    result = ser.str.split(r"/|-", regex=None)
    expected = pd.Series(
        ArrowExtensionArray(pa.array([["230", "270", "270"], ["240", "290", "290"]]))
    )
    tm.assert_series_equal(result, expected)


def test_str_split():
    # GH 52401
    ser = pd.Series(["a1cbcb", "a2cbcb", None], dtype=ArrowDtype(pa.string()))

Review comment on test_str_split_regex_none():
Can you move this test to
Done in the new commit.
Thanks for the PR. I'm not sure this is the right fix, though. Do you see where the behavior deviates between the different string types? The current fix seems like it would apply a behavior change to all types.
Thanks for the review. The behavior deviates here: string[pyarrow] goes through pandas/pandas/core/strings/object_array.py (line 327 in a1fc8e8), while pd.ArrowDtype(pa.string()) goes through pandas/pandas/core/arrays/arrow/array.py (line 2571 in a1fc8e8).
The docstring of str.split says this about regex: "If None and pat length is not 1, treats pat as a regular expression." This behavior is implemented in the first _str_split, but not in the second _str_split. So I added this condition to the second _str_split to fix the issue.
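To make the docstring rule quoted in this thread concrete, here is a minimal pure-Python sketch of it. The helper names pat_is_regex and split_one are hypothetical and are not part of pandas; the sketch only illustrates the rule and does not reproduce either _str_split implementation.

```python
import re


def pat_is_regex(pat, regex=None):
    """Hypothetical helper: decide whether pat should be treated as a regex."""
    if isinstance(pat, re.Pattern):
        # A compiled pattern is always a regex.
        return True
    if regex is None:
        # Docstring rule: "If None and pat length is not 1,
        # treats pat as a regular expression."
        return len(pat) != 1
    return bool(regex)


def split_one(value, pat, regex=None):
    """Split a single string, following the rule above (illustrative only)."""
    if pat_is_regex(pat, regex):
        return re.split(pat, value)
    return value.split(pat)


# The pattern from the new test has length 3, so regex=None means "regex".
print(split_one("230/270/270", r"/|-"))
print(split_one("240-290-290", r"/|-"))
# With regex=False the same pattern is a literal string and never matches.
print(split_one("240-290-290", r"/|-", regex=False))
```

Under this rule the first two calls split on either separator, while the third returns the input unsplit, which is the kind of divergence the issue describes when one backend skips the regex=None branch.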
Ah OK, thanks, that is helpful. Is there a way to make these implementations look more alike? I see what you are trying to accomplish here, but it's hard to tell the corner cases where these may still diverge. Is there a reason why the implementations need to differ at all?
My initial intention was to make as few changes as possible.
To make it more coherent, I would rather set regex=True for the corner case before calling _str_split in the code below. Do you think it's OK?
pandas/pandas/core/strings/accessor.py, lines 911 to 913 in a1fc8e8
I moved the logic that determines whether pat is a regex outside, so that the two _str_split implementations look more alike. Could you review again?
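The refactor described in this thread can be sketched as follows: the caller resolves regex=None once, and every backend receives a definite boolean. All names here (split_accessor, object_backend_split, arrow_like_backend_split) are hypothetical stand-ins written in plain Python; they are not the pandas accessor or array classes, and the Arrow-like backend does not call pyarrow.

```python
import re


def split_accessor(backend_split, values, pat, regex=None):
    """Hypothetical accessor-level wrapper: resolve regex=None in one place."""
    if regex is None:
        # Same rule for every backend: only single-character patterns are literal.
        regex = isinstance(pat, re.Pattern) or len(pat) != 1
    return backend_split(values, pat, regex)


def object_backend_split(values, pat, regex):
    """Stand-in for an object-dtype backend; regex is always a bool here."""
    if regex:
        compiled = re.compile(pat)
        return [compiled.split(v) for v in values]
    return [v.split(pat) for v in values]


def arrow_like_backend_split(values, pat, regex):
    """Stand-in for an Arrow-backed backend; regex is always a bool here."""
    out = []
    for v in values:
        out.append(re.split(pat, v) if regex else v.split(pat))
    return out


data = ["230/270/270", "240-290-290"]
a = split_accessor(object_backend_split, data, r"/|-")
b = split_accessor(arrow_like_backend_split, data, r"/|-")
print(a == b)
```

Because neither backend has to interpret regex=None itself, there is no corner case left in which the two can disagree about whether pat is a regular expression.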