BUG: astype to string modifying input array inplace #51073

Merged 5 commits on Feb 1, 2023. Showing changes from 2 commits.
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.0.0.rst
@@ -1021,6 +1021,7 @@ Conversion
- Bug where any :class:`ExtensionDtype` subclass with ``kind="M"`` would be interpreted as a timezone type (:issue:`34986`)
- Bug in :class:`.arrays.ArrowExtensionArray` that would raise ``NotImplementedError`` when passed a sequence of strings or binary (:issue:`49172`)
- Bug in :meth:`Series.astype` raising ``pyarrow.ArrowInvalid`` when converting from a non-pyarrow string dtype to a pyarrow numeric type (:issue:`50430`)
- Bug in :meth:`DataFrame.astype` modifying input array inplace when converting to ``string`` and ``copy=False`` (:issue:`51073`)
- Bug in :meth:`Series.to_numpy` converting to NumPy array before applying ``na_value`` (:issue:`48951`)
- Bug in :meth:`DataFrame.astype` not copying data when converting to pyarrow dtype (:issue:`50984`)
- Bug in :func:`to_datetime` was not respecting ``exact`` argument when ``format`` was an ISO8601 format (:issue:`12649`)
6 changes: 6 additions & 0 deletions pandas/core/arrays/string_.py
@@ -357,6 +357,12 @@ def _from_sequence(cls, scalars, *, dtype: Dtype | None = None, copy: bool = Fal

        else:
            # convert non-na-likes to str, and nan-likes to StringDtype().na_value
            if is_object_dtype(scalars):
                # copy if we get object dtype with non-string values to avoid
                # modifying input inplace
                inferred = lib.infer_dtype(scalars, skipna=False)
                if inferred != "string":
                    copy = True
            result = lib.ensure_string_array(scalars, na_value=libmissing.NA, copy=copy)
Member:

Can we do this fix in ensure_string_array? I've been hoping to simplify it so that it always takes an ndarray.

Member Author:

Yep, sure. I was not sure if this was intended or not. Moved it to Cython and improved performance as well.


        # Manually creating new array avoids the validation step in the __init__, so is
7 changes: 7 additions & 0 deletions pandas/core/arrays/string_arrow.py
@@ -135,6 +135,13 @@ def _from_sequence(cls, scalars, dtype: Dtype | None = None, copy: bool = False)
            result = lib.ensure_string_array(result, copy=copy, convert_na_value=False)
            return cls(pa.array(result, mask=na_values, type=pa.string()))

        if is_object_dtype(scalars):
            # copy if we get object dtype with non-string values to avoid
            # modifying input inplace
            inferred = lib.infer_dtype(scalars, skipna=False)
            if inferred != "string":
                copy = True

        # convert non-na-likes to str
        result = lib.ensure_string_array(scalars, copy=copy)
        return cls(pa.array(result, type=pa.string(), from_pandas=True))
10 changes: 10 additions & 0 deletions pandas/tests/frame/methods/test_astype.py
@@ -879,3 +879,13 @@ def test_astype_copies(dtype):
    df.iloc[0, 0] = 100
    expected = DataFrame({"a": [1, 2, 3]}, dtype="int64[pyarrow]")
    tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize("val", [None, 1, 1.5, np.nan, NaT])
def test_astype_to_string_not_modifying_input(string_storage, val):
    # GH#51073
    df = DataFrame({"a": ["a", "b", val]})
    expected = df.copy()
    with option_context("mode.string_storage", string_storage):
        df.astype("string", copy=False)
    tm.assert_frame_equal(df, expected)
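

The copy decision added to both `_from_sequence` methods can be illustrated without pandas. The sketch below is only an approximation under stated assumptions: `ensure_strings` is a hypothetical helper written for this illustration, not the pandas or Cython `ensure_string_array`, it works on plain lists rather than object-dtype ndarrays, and it leaves out NA handling entirely. It shows the idea of forcing a copy whenever the input holds non-string values, so that the conversion never writes into the caller's data.

```python
def ensure_strings(values, copy=False):
    """Hypothetical stand-in for a string-coercion helper (not the pandas API)."""
    # Mirror the idea of the fix: if any element is not already a str,
    # the conversion would have to write into the container, so copy first.
    if not copy and any(not isinstance(v, str) for v in values):
        copy = True

    result = list(values) if copy else values

    # Convert the remaining non-string elements in the (possibly copied) list.
    for i, v in enumerate(result):
        if not isinstance(v, str):
            result[i] = str(v)
    return result


original = ["a", "b", 1]
converted = ensure_strings(original, copy=False)
assert original == ["a", "b", 1]       # the caller's data is untouched
assert converted == ["a", "b", "1"]

all_strings = ["a", "b"]
# Nothing needs converting, so copy=False can safely return the input itself.
assert ensure_strings(all_strings, copy=False) is all_strings
```

The trade-off matches the one in the diff: an input that already contains only strings keeps the zero-copy path, while mixed input pays for one copy in exchange for leaving the original unchanged.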