-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: na parameter for str.startswith and str.endswith not propagating for Series with categorical dtype #36249
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @asishm, can you add a 1.1.3 release note?
pandas/tests/test_strings.py
Outdated
values = Series(["om", np.nan, "foo_nom", "nom", "bar_foo", np.nan, "foo"]) | ||
# add category dtype parametrizations for GH-36241 | ||
@pytest.mark.parametrize("dtype", [None, "category"]) | ||
def test_startswith(self, dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we would want to parameterize over different fill values to test the behavior from the OP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parameterized over True, False
question here:
for string Series with dtype = object, passing strings into the na
parameter of these two methods causes it to replace the nan with the string - but for StringArrays it replaces the nans with bool(na)
. The current fix behaves as the former.
In [49]: pd.Series(['a', np.nan, 'b']).str.startswith('a', na='c')
Out[49]:
0 True
1 c
2 False
dtype: object
In [50]: pd.Series(['a', np.nan, 'b'], dtype='string').str.startswith('a', na='c')
Out[50]:
0 True
1 True
2 False
dtype: boolean
In [53]: pd.Series(['a', np.nan, 'b'], dtype='category').str.startswith('a', na='c')
Out[53]:
0 True
1 c
2 False
dtype: object
edits: copying the wrong cells!!
pandas/tests/test_strings.py
Outdated
@pytest.mark.parametrize("dtype", [None, "category"]) | ||
def test_startswith(self, dtype): | ||
values = Series( | ||
["om", np.nan, "foo_nom", "nom", "bar_foo", np.nan, "foo"], dtype=dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also pls tests with pd.NA as well (you can parameterize over the null value as well
def test_startswith(self): | ||
values = Series(["om", np.nan, "foo_nom", "nom", "bar_foo", np.nan, "foo"]) | ||
# add category dtype parametrizations for GH-36241 | ||
@pytest.mark.parametrize("dtype", [None, "category"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we want to parameterize over 'string' dtype as well rigth?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string arrays are already tested here
https://github.com/pandas-dev/pandas/blob/1.1.x/pandas/tests/test_strings.py#L3530-L3531
parametrizing over string dtype and na=True/False was making it a bit tricky as these methods return boolean series causing a dtype mismatch (boolean vs object)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback tried to do that changing the referenced tests slightly with
https://github.com/pandas-dev/pandas/pull/36249/files#diff-683f8dacd08abda4913680fafe3a4ea7R3573-R3576 , https://github.com/pandas-dev/pandas/pull/36249/files#diff-683f8dacd08abda4913680fafe3a4ea7R32-R33 and https://github.com/pandas-dev/pandas/pull/36249/files#diff-683f8dacd08abda4913680fafe3a4ea7R63-R64
def test_endswith(self): | ||
values = Series(["om", np.nan, "foo_nom", "nom", "bar_foo", np.nan, "foo"]) | ||
# add category dtype parametrizations for GH-36241 | ||
@pytest.mark.parametrize("dtype", [None, "category"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
pandas/tests/test_strings.py
Outdated
@@ -867,8 +871,12 @@ def test_startswith(self): | |||
) | |||
tm.assert_series_equal(rs, xp) | |||
|
|||
def test_endswith(self): | |||
values = Series(["om", np.nan, "foo_nom", "nom", "bar_foo", np.nan, "foo"]) | |||
# add category dtype parametrizations for GH-36241 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add inside the test
Co-authored-by: Daniel Saxton <[email protected]>
thanks @asishm |
@meeseeksdev backport 1.1.x |
…d str.endswith not propagating for Series with categorical dtype
…with not propagating for Series with categorical dtype (#36331) Co-authored-by: Asish Mahapatra <[email protected]>
… for Series with categorical dtype (pandas-dev#36249)
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
I parametrized dtype in existing tests. Should I create a separate test instead?