-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: more permissive conversion to StringDtype #33465
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 8 commits
9b90c3c
f94169e
681a211
f316e42
89ef931
053dae4
af0bf4d
c966562
914349a
08ff77a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -20,7 +20,7 @@ | |
from pandas.util._validators import validate_fillna_kwargs | ||
|
||
from pandas.core.dtypes.cast import maybe_cast_to_extension_array | ||
from pandas.core.dtypes.common import is_array_like, is_list_like | ||
from pandas.core.dtypes.common import is_array_like, is_list_like, pandas_dtype | ||
from pandas.core.dtypes.dtypes import ExtensionDtype | ||
from pandas.core.dtypes.generic import ABCIndexClass, ABCSeries | ||
from pandas.core.dtypes.missing import isna | ||
|
@@ -178,7 +178,7 @@ def _from_sequence(cls, scalars, dtype=None, copy=False): | |
---------- | ||
scalars : Sequence | ||
Each element will be an instance of the scalar type for this | ||
array, ``cls.dtype.type``. | ||
array, ``cls.dtype.type`` or be converted into this type in this method. | ||
dtype : dtype, optional | ||
Construct for this particular dtype. This should be a Dtype | ||
compatible with the ExtensionArray. | ||
|
@@ -451,6 +451,12 @@ def astype(self, dtype, copy=True): | |
array : ndarray | ||
NumPy ndarray with 'dtype' for its dtype. | ||
""" | ||
from pandas.core.arrays.string_ import StringDtype | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. so we atcually ever hit this base type? I think we override this everywhere There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. floats arrays are PandasArrays, and those get their astype from ExtensionsArray. >>> pd.array([1.5, 2.5])
<PandasArray>
[1.5, 2.5]
Length: 2, dtype: float64 Allowing any ExtensionArray by default to convert to StringArray seems reasonable to me (and if subclassers don't want that, they can make their own astype implementation disallowing StringArrays). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. kk, can you add a commment to this effect here (agree with your statement), but comment for future readers There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
All methods in this base class are there for subclasses to (potentially) use, so I don't think a comment about that is needed There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i understand I would move the impl that is currently in Decimal to here as it correctly handles the astype from the same type (whereas this one will coerce to a numpy array) |
||
|
||
dtype = pandas_dtype(dtype) | ||
if isinstance(dtype, StringDtype): # allow conversion to StringArrays | ||
return dtype.construct_array_type()._from_sequence(self, copy=False) | ||
|
||
return np.array(self, dtype=dtype, copy=copy) | ||
|
||
def isna(self) -> ArrayLike: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -152,15 +152,21 @@ class StringArray(PandasArray): | |
['This is', 'some text', <NA>, 'data.'] | ||
Length: 4, dtype: string | ||
|
||
Unlike ``object`` dtype arrays, ``StringArray`` doesn't allow non-string | ||
values. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it is important to keep the original intent of this part, in some way. As it was meant to explain that the StringArray will only contain strings (as opposed to object dtype, which is not strict here). This could maybe also shown with setting a non-string value ( There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How you changed it still doesn't explain the original intent, as mentioned above, IMO. I would explain both (only strings allowed + showing the automatic conversion) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok, I've tried it differently. |
||
Unlike arrays instantiated with ``dtype="object"``, ``StringArray`` | ||
will convert the values to strings. | ||
|
||
>>> pd.array(['1', 1], dtype="object") | ||
<PandasArray> | ||
['1', 1] | ||
Length: 2, dtype: object | ||
>>> pd.array(['1', 1], dtype="string") | ||
Traceback (most recent call last): | ||
... | ||
ValueError: StringArray requires an object-dtype ndarray of strings. | ||
<StringArray> | ||
['1', '1'] | ||
Length: 2, dtype: string | ||
|
||
However, instantiating StringArrays directly with non-strings will raise an error. | ||
|
||
For comparison methods, this returns a :class:`pandas.BooleanArray` | ||
For comparison methods, `StringArray` returns a :class:`pandas.BooleanArray`: | ||
|
||
>>> pd.array(["a", None, "c"], dtype="string") == "a" | ||
<BooleanArray> | ||
|
@@ -203,10 +209,15 @@ def _from_sequence(cls, scalars, dtype=None, copy=False): | |
# TODO: it would be nice to do this in _validate / lib.is_string_array | ||
# We are already doing a scan over the values there. | ||
na_values = isna(result) | ||
if na_values.any(): | ||
if result is scalars: | ||
# force a copy now, if we haven't already | ||
result = result.copy() | ||
has_nans = na_values.any() | ||
if has_nans and result is scalars: | ||
# force a copy now, if we haven't already | ||
result = result.copy() | ||
|
||
# convert to str, then to object to avoid dtype like '<U3', then insert na_value | ||
jreback marked this conversation as resolved.
Show resolved
Hide resolved
|
||
result = np.asarray(result, dtype=str) | ||
result = np.asarray(result, dtype="object") | ||
if has_nans: | ||
result[na_values] = StringDtype.na_value | ||
|
||
return cls(result) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this won't render (needs an ipython block here as well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.