Skip to content

ENH: Implement _from_sequence_of_strings for BooleanArray #31159

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
Jan 23, 2020
Merged
13 changes: 13 additions & 0 deletions pandas/core/arrays/boolean.py
Original file line number Diff line number Diff line change
Expand Up @@ -286,6 +286,19 @@ def _from_sequence(cls, scalars, dtype=None, copy: bool = False):
values, mask = coerce_to_array(scalars, copy=copy)
return BooleanArray(values, mask)

@classmethod
def _from_sequence_of_strings(cls, strings, dtype=None, copy=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type strings (List[str]), dtype and copy

def map_string(s):
if s in ["True", "true", "1"]:
return True
elif s in ["False", "false", "0"]:
return False
else:
return s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point, we should probably make a more performant implementation for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, any thoughts on what such an implementation would look like?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, perf isn't worth worrying about too much until we can push the actual parsing down to the C reader. As is, we've already built up a list of Python strings in memory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is actually not too hard to do in parsers.pyx; this is generically handled in python code rather than in cython.


scalars = [map_string(x) for x in strings]
return cls._from_sequence(scalars, dtype, copy)

def _values_for_factorize(self) -> Tuple[np.ndarray, Any]:
data = self._data.astype("int8")
data[self._mask] = -1
Expand Down
7 changes: 7 additions & 0 deletions pandas/tests/arrays/test_boolean.py
Original file line number Diff line number Diff line change
Expand Up @@ -251,6 +251,13 @@ def test_coerce_to_numpy_array():
np.array(arr, dtype="bool")


def test_to_boolean_array_from_strings():
result = BooleanArray._from_sequence_of_strings(["True", "False"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add in null values e.g. '' and NaN i think would work (as strings).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this specific method breaks when passed other kinds of strings. I think (correct me if I'm wrong) the null strings get handled upstream by read_csv according to na_values.

I could add in a None if you think that makes sense, or add additional null strings to the read_csv test (could parametrize over different NA strings)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should handle nulls as well (they will be converted in the Boolean constructor), if not, then this is broken too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, the csv parser already handles the different null string representations. So testing here just with np.nan or None is fine.

expected = BooleanArray(np.array([True, False]), np.array([False, False]))

tm.assert_extension_array_equal(result, expected)


def test_repr():
df = pd.DataFrame({"A": pd.array([True, False, None], dtype="boolean")})
expected = " A\n0 True\n1 False\n2 <NA>"
Expand Down