Skip to content

API: allow nan-likes in StringArray constructor #41412

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 41 commits into from
Closed
Show file tree
Hide file tree
Changes from 3 commits
Commits
Show all changes
41 commits
Select commit Hold shift + click to select a range
3e1784d
API: allow nan-likes in StringArray constructor
lithomas1 May 10, 2021
96ff1da
Revert weird changes & Fix stuff
lithomas1 May 11, 2021
418e1d2
Remove failing test
lithomas1 May 11, 2021
25a6c4d
Changes from code review
lithomas1 May 19, 2021
47d68f7
Merge branch 'master' of https://github.com/pandas-dev/pandas into st…
lithomas1 May 19, 2021
8257dbd
typo
lithomas1 May 20, 2021
922436a
Update lib.pyi
lithomas1 May 21, 2021
2f28086
Update lib.pyx
lithomas1 May 29, 2021
3ee2198
Update lib.pyx
lithomas1 May 29, 2021
9426a52
Merge branch 'master' of https://github.com/pandas-dev/pandas into st…
lithomas1 May 30, 2021
3ee55f2
Updates
lithomas1 May 30, 2021
fe4981a
Update lib.pyx
lithomas1 May 30, 2021
a66948a
Update lib.pyx
lithomas1 May 30, 2021
e852719
Update lib.pyx
lithomas1 May 31, 2021
91b73bb
disallow invalid nans in stringarray constructor
lithomas1 Jun 4, 2021
42ec3e4
Merge branch 'master' into stringarray-nan
lithomas1 Jun 4, 2021
41f49d2
add to _from_sequence and fixes
lithomas1 Jun 4, 2021
62cc5be
address code review
lithomas1 Jun 4, 2021
29909f3
Merge branch 'master' into stringarray-nan
lithomas1 Jun 4, 2021
153b6b4
Fix failures
lithomas1 Jun 5, 2021
b27a839
maybe fix benchmarks?
lithomas1 Jun 5, 2021
ed5b953
Partially address code review
lithomas1 Jun 5, 2021
caa5705
Test coerce=False
lithomas1 Jun 6, 2021
2d75031
move benchmarks
lithomas1 Jun 7, 2021
52a00d1
accidental formatting changes
lithomas1 Jun 7, 2021
8dc0b66
Fix
lithomas1 Jun 8, 2021
1bacaed
Merge branch 'master' into stringarray-nan
lithomas1 Jun 8, 2021
66be087
missing import from conflict
lithomas1 Jun 8, 2021
7b058cd
Merge branch 'master' into stringarray-nan
lithomas1 Jun 9, 2021
1be1bdf
Merge branch 'pandas-dev:master' into stringarray-nan
lithomas1 Jun 22, 2021
3c57094
remove old whatsnew
lithomas1 Jul 21, 2021
03738a9
Merge branch 'master' of https://github.com/pandas-dev/pandas into st…
lithomas1 Jul 21, 2021
12351de
move whatsnew
lithomas1 Jul 21, 2021
889829a
Merge branch 'master' into stringarray-nan
lithomas1 Oct 4, 2021
358000f
typo
lithomas1 Oct 4, 2021
c649b1d
Merge branch 'master' into stringarray-nan
lithomas1 Oct 16, 2021
5e5aa9c
Merge branch 'master' into stringarray-nan
lithomas1 Nov 27, 2021
eb7d8f2
Merge branch 'master' into stringarray-nan
lithomas1 Dec 18, 2021
2426319
Merge branch 'master' into stringarray-nan
lithomas1 Dec 27, 2021
20817a7
address comments
lithomas1 Dec 27, 2021
33d8f9a
accept any float nan w/ util.is_nan
lithomas1 Dec 27, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.3.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -612,6 +612,7 @@ Other API changes
- Partially initialized :class:`CategoricalDtype` (i.e. those with ``categories=None`` objects will no longer compare as equal to fully initialized dtype objects.
- Accessing ``_constructor_expanddim`` on a :class:`DataFrame` and ``_constructor_sliced`` on a :class:`Series` now raise an ``AttributeError``. Previously a ``NotImplementedError`` was raised (:issue:`38782`)
- Added new ``engine`` and ``**engine_kwargs`` parameters to :meth:`DataFrame.to_sql` to support other future "SQL engines". Currently we still only use ``SQLAlchemy`` under the hood, but more engines are planned to be supported such as ``turbodbc`` (:issue:`36893`)
- :class:`StringArray` now accepts nan-likes(``None``, ``nan``, ``NaT``, ``NA``, Decimal("NaN")) in its constructor in addition to strings.

Build
=====
Expand Down
1 change: 1 addition & 0 deletions pandas/_libs/lib.pyi
Original file line number Diff line number Diff line change
Expand Up @@ -138,6 +138,7 @@ def ensure_string_array(
arr,
na_value: object = np.nan,
convert_na_value: bool = True,
coerce: bool = True,
copy: bool = True,
skipna: bool = True,
) -> np.ndarray: ... # np.ndarray[object]
Expand Down
17 changes: 11 additions & 6 deletions pandas/_libs/lib.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -679,11 +679,14 @@ cpdef ndarray[object] ensure_string_array(
arr,
object na_value=np.nan,
bint convert_na_value=True,
bint coerce=True,
bint copy=True,
bint skipna=True,
):
"""
Returns a new numpy array with object dtype and only strings and na values.
Checks that all elements in numpy are string or null and returns a new numpy array
with object dtype and only strings and na values if so. Otherwise,
raise a ValueError.

Parameters
----------
Expand All @@ -693,6 +696,9 @@ cpdef ndarray[object] ensure_string_array(
The value to use for na. For example, np.nan or pd.NA.
convert_na_value : bool, default True
If False, existing na values will be used unchanged in the new array.
coerce : bool, default True
Whether to coerce non-null non-string elements to strings.
Will raise ValueError otherwise.
copy : bool, default True
Whether to ensure that a new array is returned.
skipna : bool, default True
Expand Down Expand Up @@ -724,7 +730,10 @@ cpdef ndarray[object] ensure_string_array(
continue

if not checknull(val):
result[i] = str(val)
if coerce:
result[i] = str(val)
else:
raise ValueError("Non-string element encountered in array.")
else:
if convert_na_value:
val = na_value
Expand Down Expand Up @@ -1835,10 +1844,6 @@ cdef class StringValidator(Validator):
cdef inline bint is_array_typed(self) except -1:
return issubclass(self.dtype.type, np.str_)

cdef bint is_valid_null(self, object value) except -1:
# We deliberately exclude None / NaN here since StringArray uses NA
return value is C_NA


cpdef bint is_string_array(ndarray values, bint skipna=False):
cdef:
Expand Down
26 changes: 21 additions & 5 deletions pandas/core/arrays/string_.py
Original file line number Diff line number Diff line change
Expand Up @@ -144,11 +144,18 @@ class StringArray(PandasArray):
.. warning::

Currently, this expects an object-dtype ndarray
where the elements are Python strings or :attr:`pandas.NA`.
where the elements are Python strings
or nan-likes(``None``, ``nan``, ``NaT``, ``NA``, Decimal("NaN")).
This may change without warning in the future. Use
:meth:`pandas.array` with ``dtype="string"`` for a stable way of
creating a `StringArray` from any sequence.

.. versionchanged:: 1.3

StringArray now accepts nan-likes in the constructor in addition
to strings, whereas it only accepted strings and :attr:`pandas.NA`
before.

copy : bool, default False
Whether to copy the array of data.

Expand Down Expand Up @@ -208,21 +215,30 @@ def __init__(self, values, copy=False):
values = extract_array(values)

super().__init__(values, copy=copy)
if not isinstance(values, type(self)):
self._validate()
# error: Incompatible types in assignment (expression has type "StringDtype",
# variable has type "PandasDtype")
NDArrayBacked.__init__(self, self._ndarray, StringDtype())
if not isinstance(values, type(self)):
self._validate()

def _validate(self):
"""Validate that we only store NA or strings."""
if len(self._ndarray) and not lib.is_string_array(self._ndarray, skipna=True):
raise ValueError("StringArray requires a sequence of strings or pandas.NA")
if self._ndarray.dtype != "object":
raise ValueError(
"StringArray requires a sequence of strings or pandas.NA. Got "
f"'{self._ndarray.dtype}' dtype instead."
)
try:
lib.ensure_string_array(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this going through ensure_string_array instead of e.g. is_string_array? For the latter, the ravel would be unnecessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_string_array will not convert the other nans(None, np.nan, etc.) to the correct na_value of pd.NA. FWIW, switching to ensure_string_array will also align us with _from_sequence.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does any of this become simpler if done in conjunction with the edits in #45057

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to split this one up.

I'm thinking of maybe sticking to is_string_array and then doing another pass over the data to convert the not pd.NA nans, as a quick short term fix to unblock you. This will probably give a perf regression, but since is_string_array got sped up in #44495 on master(not 1.3.x), all I have to do is make the perf regression less than the perf improvement there, so that the regression is not user visible.

self._ndarray, na_value=StringDtype.na_value, coerce=False, copy=False
),
NDArrayBacked.__init__(
self,
self._ndarray,
StringDtype(),
)
except ValueError:
raise ValueError("StringArray requires a sequence of strings or pandas.NA")

@classmethod
def _from_sequence(cls, scalars, *, dtype: Dtype | None = None, copy=False):
Expand Down
13 changes: 6 additions & 7 deletions pandas/tests/arrays/string_/test_string.py
Original file line number Diff line number Diff line change
Expand Up @@ -296,14 +296,13 @@ def test_constructor_raises(cls):
with pytest.raises(ValueError, match=msg):
cls(np.array([]))

with pytest.raises(ValueError, match=msg):
cls(np.array(["a", np.nan], dtype=object))

with pytest.raises(ValueError, match=msg):
cls(np.array(["a", None], dtype=object))

with pytest.raises(ValueError, match=msg):
cls(np.array(["a", pd.NaT], dtype=object))
@pytest.mark.parametrize("na", [np.nan, pd.NaT, None, pd.NA])
def test_constructor_nan_like(na):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize over list as well (assume same result).

expected = pd.arrays.StringArray(np.array(["a", pd.NA]))
tm.assert_extension_array_equal(
pd.arrays.StringArray(np.array(["a", na], dtype="object")), expected
)


@pytest.mark.parametrize("copy", [True, False])
Expand Down
7 changes: 4 additions & 3 deletions pandas/tests/dtypes/test_inference.py
Original file line number Diff line number Diff line change
Expand Up @@ -1376,11 +1376,12 @@ def test_is_string_array(self):
assert lib.is_string_array(
np.array(["foo", "bar", pd.NA], dtype=object), skipna=True
)
# NaN is not valid for string array, just NA
assert not lib.is_string_array(
assert lib.is_string_array(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC from #41412 (comment), we now call ensure_string_array in the constructor whereas previously is_string_array was called. so why is lib.is_string_array changing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lib.is_string_array is used for inferring dtypes elsewhere I think, and we want ndarrays with strings and np.nan/None to be inferred as string in addition to ndarrays with only strings/pd.NA.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right. so is is_string_array called before passing the values to the StringArray constructor? The reason I ask is that a reasonably contained behavior change to StringArray could be backported to 1.3.x during the rc period

will hopefully get a chance to look in more detail soon, but if you could summarize why it's changing that'll be great.

Copy link
Member Author

@lithomas1 lithomas1 Jun 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it isn't called anymore in this PR, but it should be changed(allow np.nan/None) to be consistent with what we are accepting in the constructor now.

The actual change is just a one-liner here, so should be pretty harmless. https://github.com/pandas-dev/pandas/pull/41412/files#diff-c61205b9d6ae5756840d1ed6915157fe9d99aa33b29a762f821e9fe38ab0277cR1900

np.array(["foo", "bar", np.nan], dtype=object), skipna=True
)

assert not lib.is_string_array(
np.array(["foo", "bar", np.nan], dtype=object), skipna=False
)
assert not lib.is_string_array(np.array([1, 2]))

def test_to_object_array_tuples(self):
Expand Down