-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REF: more explicit dtypes in strings.accessor #41727
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/strings/accessor.py
Outdated
@@ -2145,7 +2157,7 @@ def startswith(self, pat, na=None): | |||
dtype: bool | |||
""" | |||
result = self._data.array._str_startswith(pat, na=na) | |||
return self._wrap_result(result, returns_string=False) | |||
return self._wrap_result(result, returns_string=False, returns_bool=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just change _wrap_result so we don't have 2 keywords? maybe rename to returns_dtype and pass a string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thats tough bc the end-result dtype depends on self._is_string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this again it really doesnt look worth the trouble to try to combine these args
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think either keywords should really be necessary here, the issue is that the arrays passed when expand is True are object arrays of lists/tuples
which method returns bool, where the dtype of the result is not bool?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which method returns bool, where the dtype of the result is not bool?
makes sense, ill give this a try
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g isnumeric
we can get object-dtype result of bools and nans
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the wrapped result would be object in this case? This is for a object dtype array? for StringArray/ArrowString array the return type should always be boolean. object dtype accessors do cast their results, except extract, so if the result is all bools and no nans, the result dtype should be np.bool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the result dtype should be np.bool
the result to be wrapped dtype should be np.bool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these conversions are handled in _str_map in ObjectStringArrayMixin
@@ -209,8 +212,12 @@ def _validate(data): | |||
# see _libs/lib.pyx for list of inferred types | |||
allowed_types = ["string", "empty", "bytes", "mixed", "mixed-integer"] | |||
|
|||
values = getattr(data, "values", data) # Series / Index | |||
values = getattr(values, "categories", values) # categorical / normal | |||
# TODO: avoid kludge for tests.extension.test_numpy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
umm cn you avoid this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can revert to use the getattr
pattern, either way is an anti-pattern
@simonjayhawkins was right, it was possible to remove the returns_bool keyword; green |
Experimenting with #40489 I found that making Series inference behavior more like Index behavior broke a bunch of strings tests. This makes the strings code robust to that potential change by being more explicit about dtypes.