Change in astype leads to no longer casting to numpy string dtypes #39945

Closed
wants to merge 2 commits into from
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.3.0.rst
@@ -234,7 +234,7 @@ Other API changes
 ^^^^^^^^^^^^^^^^^
 - Partially initialized :class:`CategoricalDtype` (i.e. those with ``categories=None`` objects will no longer compare as equal to fully initialized dtype objects.
 - Accessing ``_constructor_expanddim`` on a :class:`DataFrame` and ``_constructor_sliced`` on a :class:`Series` now raise an ``AttributeError``. Previously a ``NotImplementedError`` was raised (:issue:`38782`)
--
+- :meth:`DataFrame.astype` and :meth:`Series.astype` for ``bytes`` no longer casting to numpy string dtypes but instead casting to ``object`` dtype (:issue:`39566`)

 .. ---------------------------------------------------------------------------

2 changes: 1 addition & 1 deletion pandas/core/dtypes/cast.py
@@ -1268,7 +1268,7 @@ def soft_convert_objects(
             values = lib.maybe_convert_objects(
                 values, convert_datetime=datetime, convert_timedelta=timedelta
             )
-        except (OutOfBoundsDatetime, ValueError):
+        except OutOfBoundsDatetime:

[Review comment, Member] nice!

             return values

     if numeric and is_object_dtype(values.dtype):
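
The effect of narrowing the `except` clause in this hunk can be sketched in isolation. This is an illustration only: `FakeOutOfBounds`, `convert`, `soft_convert_old` and `soft_convert_new` are hypothetical stand-ins rather than pandas code, and the sketch assumes the out-of-bounds exception subclasses `ValueError`, in which case the old tuple effectively swallowed every `ValueError`.

```python
class FakeOutOfBounds(ValueError):
    """Hypothetical stand-in for an out-of-bounds datetime error."""


def convert(values, fail_with=None):
    """Hypothetical conversion step that can be forced to raise."""
    if fail_with is not None:
        raise fail_with("conversion failed")
    return [v.upper() for v in values]


def soft_convert_old(values, fail_with=None):
    # Old clause: any ValueError is swallowed and the input is returned as-is.
    try:
        return convert(values, fail_with)
    except (FakeOutOfBounds, ValueError):
        return values


def soft_convert_new(values, fail_with=None):
    # New clause: only the out-of-bounds case is swallowed;
    # any other ValueError now propagates to the caller.
    try:
        return convert(values, fail_with)
    except FakeOutOfBounds:
        return values
```

With the narrower clause, a plain `ValueError` raised during conversion surfaces to the caller instead of silently returning the unconverted input.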
9 changes: 7 additions & 2 deletions pandas/core/internals/array_manager.py
@@ -287,7 +287,9 @@ def apply(
                 if not ignore_failures:
                     raise
                 continue
-            # if not isinstance(applied, ExtensionArray):
+            if not isinstance(applied, ExtensionArray):
+                if issubclass(applied.dtype.type, (str, bytes)):
+                    applied = np.array(applied, dtype=object)

[Review comment, Member] I think we ought to have a sanitize_str_dtypes or something akin to sanitize_to_nanoseconds

[Review comment, Member] There is a _sanitize_str_dtypes, but not fully sure it does exactly what we want here

[Review comment, Member] This is now done as part of maybe_coerce_values, though that may be more heavy-weight than we want here

             #     # TODO not all EA operations return new EAs (eg astype)
             #     applied = array(applied)

[Review comment, Member] I think this commented out code was from the initial implementation at which point I only supported storing EAs in the ArrayManager, and not numpy.ndarrays. But now both are supported, so you can remove this code.

             result_arrays.append(applied)
@@ -413,7 +415,10 @@ def downcast(self) -> ArrayManager:
         return self.apply_with_block("downcast")

     def astype(self, dtype, copy: bool = False, errors: str = "raise") -> ArrayManager:
-        return self.apply("astype", dtype=dtype, copy=copy)  # , errors=errors)
+        # if issubclass(dtype, (str, bytes)):
+        #     dtype = "object"
+        y = self.apply("astype", dtype=dtype, copy=copy)  # , errors=errors)
+        return y

[Review comment, Member] can the commented-out code be removed?

     def convert(
         self,
2 changes: 1 addition & 1 deletion pandas/core/internals/blocks.py
@@ -2219,7 +2219,7 @@ class ObjectBlock(Block):

     @classmethod
     def _maybe_coerce_values(cls, values):
-        if issubclass(values.dtype.type, str):
+        if issubclass(values.dtype.type, (str, bytes)):
             values = np.array(values, dtype=object)
         return values
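
The check changed in this hunk can be tried on its own with plain NumPy. The sketch below is not the pandas method: `coerce_str_bytes_to_object` is a hypothetical standalone function that copies only the `issubclass` test and the `np.array(..., dtype=object)` call from the diff.

```python
import numpy as np


def coerce_str_bytes_to_object(values: np.ndarray) -> np.ndarray:
    """Hypothetical standalone copy of the check shown in the diff."""
    # np.str_ subclasses str and np.bytes_ subclasses bytes, so this
    # matches fixed-width "U" and "S" arrays but not numeric ones.
    if issubclass(values.dtype.type, (str, bytes)):
        values = np.array(values, dtype=object)
    return values


fixed_bytes = np.array([b"foo", b"bar", b"baz"])  # fixed-width bytes, "S3"
fixed_text = np.array(["foo", "bar", "baz"])      # fixed-width text, "U3"
integers = np.array([1, 2, 3])

coerced_bytes = coerce_str_bytes_to_object(fixed_bytes)
coerced_text = coerce_str_bytes_to_object(fixed_text)
untouched = coerce_str_bytes_to_object(integers)
```

Before the change, only the `str` case was matched, so a fixed-width bytes array would have passed through this check with its `S` dtype intact.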

9 changes: 5 additions & 4 deletions pandas/tests/frame/methods/test_astype.py
@@ -644,7 +644,8 @@ def test_astype_dt64_to_string(self, frame_or_series, tz_naive_fixture, request)
         alt = obj.astype(str)
         assert np.all(alt.iloc[1:] == result.iloc[1:])

-    def test_astype_bytes(self):
-        # GH#39474
-        result = DataFrame(["foo", "bar", "baz"]).astype(bytes)
-        assert result.dtypes[0] == np.dtype("S3")
+    @pytest.mark.parametrize("dtype", [bytes, np.string_, np.bytes_])
+    def test_astype_bytes(self, dtype):
+        # GH#39566
+        result = DataFrame(["foo", "bar", "baz"]).astype(dtype)
+        assert result.dtypes[0] == "object"
7 changes: 4 additions & 3 deletions pandas/tests/series/methods/test_astype.py
@@ -344,10 +344,11 @@ def test_astype_unicode(self):
             reload(sys)
             sys.setdefaultencoding(former_encoding)

-    def test_astype_bytes(self):
+    @pytest.mark.parametrize("dtype", [bytes, np.string_, np.bytes_])
+    def test_astype_bytes(self, dtype):
         # GH#39474
-        result = Series(["foo", "bar", "baz"]).astype(bytes)
-        assert result.dtypes == np.dtype("S3")
+        result = Series(["foo", "bar", "baz"]).astype(dtype)
+        assert result.dtypes == "object"


 class TestAstypeCategorical:
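
The two expected dtypes in these tests can be reproduced at the NumPy level, without pandas. This is a minimal sketch of what the old and new assertions compare against, not a statement about what any given pandas release returns. `np.string_` is left out because that alias was removed in NumPy 2.0, where `np.bytes_` is the remaining name.

```python
import numpy as np

words = np.array(["foo", "bar", "baz"])

# Plain NumPy casts to a fixed-width bytes dtype sized to the longest
# element; this is the dtype the old assertions expected.
as_bytes = words.astype(bytes)
as_np_bytes = words.astype(np.bytes_)

# The dtype the new assertions expect. A NumPy dtype compares equal to
# its string name, which is why `== "object"` works in the tests.
as_object = as_bytes.astype(object)
```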