Skip to content

REF: more explicit dtypes in strings.accessor #41727

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jun 9, 2021

Conversation

jbrockmendel
Copy link
Member

Experimenting with #40489 I found that making Series inference behavior more like Index behavior broke a bunch of strings tests. This makes the strings code robust to that potential change by being more explicit about dtypes.

@simonjayhawkins simonjayhawkins added Refactor Internal refactoring of code Strings String extension data type and string data labels May 30, 2021
@@ -2145,7 +2157,7 @@ def startswith(self, pat, na=None):
dtype: bool
"""
result = self._data.array._str_startswith(pat, na=na)
return self._wrap_result(result, returns_string=False)
return self._wrap_result(result, returns_string=False, returns_bool=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just change _wrap_result so we don't have 2 keywords? maybe rename to returns_dtype and pass a string?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats tough bc the end-result dtype depends on self._is_string

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this again it really doesnt look worth the trouble to try to combine these args

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think either keywords should really be necessary here, the issue is that the arrays passed when expand is True are object arrays of lists/tuples

which method returns bool, where the dtype of the result is not bool?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which method returns bool, where the dtype of the result is not bool?

makes sense, ill give this a try

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g isnumeric we can get object-dtype result of bools and nans

Copy link
Member

@simonjayhawkins simonjayhawkins Jun 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the wrapped result would be object in this case? This is for a object dtype array? for StringArray/ArrowString array the return type should always be boolean. object dtype accessors do cast their results, except extract, so if the result is all bools and no nans, the result dtype should be np.bool

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the result dtype should be np.bool

the result to be wrapped dtype should be np.bool

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these conversions are handled in _str_map in ObjectStringArrayMixin

@@ -209,8 +212,12 @@ def _validate(data):
# see _libs/lib.pyx for list of inferred types
allowed_types = ["string", "empty", "bytes", "mixed", "mixed-integer"]

values = getattr(data, "values", data) # Series / Index
values = getattr(values, "categories", values) # categorical / normal
# TODO: avoid kludge for tests.extension.test_numpy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm cn you avoid this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can revert to use the getattr pattern, either way is an anti-pattern

@jbrockmendel
Copy link
Member Author

@simonjayhawkins was right, it was possible to remove the returns_bool keyword; green

@jreback jreback added this to the 1.3 milestone Jun 9, 2021
@jreback jreback merged commit 2c1e656 into pandas-dev:master Jun 9, 2021
@jbrockmendel jbrockmendel deleted the ref-string-casting branch June 9, 2021 00:34
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Refactor Internal refactoring of code Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants