Skip to content

ENH: Implement _from_sequence_of_strings for BooleanArray #31159

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
Jan 23, 2020
Merged
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.0.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -144,7 +144,7 @@ type dedicated to boolean data that can hold missing values. The default
``bool`` data type based on a bool-dtype NumPy array, the column can only hold
``True`` or ``False``, and not missing values. This new :class:`~arrays.BooleanArray`
can store missing values as well by keeping track of this in a separate mask.
(:issue:`29555`, :issue:`30095`)
(:issue:`29555`, :issue:`30095`, :issue:`31131`)

.. ipython:: python

Expand Down
19 changes: 18 additions & 1 deletion pandas/core/arrays/boolean.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
import numbers
from typing import TYPE_CHECKING, Any, Tuple, Type
from typing import TYPE_CHECKING, Any, List, Optional, Tuple, Type
import warnings

import numpy as np
Expand Down Expand Up @@ -286,6 +286,23 @@ def _from_sequence(cls, scalars, dtype=None, copy: bool = False):
values, mask = coerce_to_array(scalars, copy=copy)
return BooleanArray(values, mask)

@classmethod
def _from_sequence_of_strings(
cls, strings: List[str], dtype: Optional[str] = None, copy: bool = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the typings are not fully correct. dtype can be a dtype object, string an ndarray.
Now since none of the other _from_... methods are typed, I wouldn't care about that too much on this PR

):
def map_string(s):
if isna(s):
return s
elif s in ["True", "true", "1"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These values differ from _libs.parsers._true_values and _false_values. We should be consistent with those.

cdef:
    object _true_values = [b'True', b'TRUE', b'true']
    object _false_values = [b'False', b'FALSE', b'false']

return True
elif s in ["False", "false", "0"]:
return False
else:
return s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point, we should probably make a more performant implementation for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, any thoughts on what such an implementation would look like?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, perf isn't worth worrying about too much until we can push the actual parsing down to the C reader. As is, we've already built up a list of Python strings in memory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is actually not too hard to do in parsers.pyx; this is generically handled in python code rather than in cython.


scalars = [map_string(x) for x in strings]
return cls._from_sequence(scalars, dtype, copy)

def _values_for_factorize(self) -> Tuple[np.ndarray, Any]:
data = self._data.astype("int8")
data[self._mask] = -1
Expand Down
11 changes: 11 additions & 0 deletions pandas/tests/arrays/test_boolean.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
import io
import operator

import numpy as np
Expand Down Expand Up @@ -251,6 +252,16 @@ def test_coerce_to_numpy_array():
np.array(arr, dtype="bool")


@pytest.mark.parametrize("na_value", [None, np.nan, pd.NA])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is a need to parametrize over de na values. The parser always gives data with NaNs

def test_to_boolean_array_from_strings(na_value):
result = BooleanArray._from_sequence_of_strings(["True", "False", na_value])
expected = BooleanArray(
np.array([True, False, False]), np.array([False, False, True])
)

tm.assert_extension_array_equal(result, expected)


def test_repr():
df = pd.DataFrame({"A": pd.array([True, False, None], dtype="boolean")})
expected = " A\n0 True\n1 False\n2 <NA>"
Expand Down
16 changes: 16 additions & 0 deletions pandas/tests/io/parser/test_dtypes.py
Original file line number Diff line number Diff line change
Expand Up @@ -550,3 +550,19 @@ def test_numeric_dtype(all_parsers, dtype):

result = parser.read_csv(StringIO(data), header=None, dtype=dtype)
tm.assert_frame_equal(expected, result)


@pytest.mark.parametrize("null_string", ["NaN", "nan", "NA", "null", "NULL", ""])
def test_boolean_dtype(all_parsers, null_string):
parser = all_parsers
data = f"a,b\nTrue,False\nTrue,{null_string}\n"

result = parser.read_csv(StringIO(data), dtype="boolean")
expected = pd.DataFrame(
{
"a": pd.array([True, True], dtype="boolean"),
"b": pd.array([False, None], dtype="boolean"),
}
)

tm.assert_frame_equal(result, expected)