-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Implement _from_sequence_of_strings for BooleanArray #31159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 2 commits
f7fdf45
9e57350
eb591cd
87ac09b
51b6ac9
6ffb018
9ddc5b1
4dd64b8
d28a6de
978f22d
19f9c18
f851b83
29dbfa1
0481153
be0731b
fdec55d
e2656cb
6d06b84
604b862
f2db6ed
9e35b63
71bafbf
184a9be
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -286,6 +286,19 @@ def _from_sequence(cls, scalars, dtype=None, copy: bool = False): | |
values, mask = coerce_to_array(scalars, copy=copy) | ||
return BooleanArray(values, mask) | ||
|
||
@classmethod | ||
def _from_sequence_of_strings(cls, strings, dtype=None, copy=False): | ||
def map_string(s): | ||
if s in ["True", "true", "1"]: | ||
return True | ||
elif s in ["False", "false", "0"]: | ||
return False | ||
else: | ||
return s | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. At some point, we should probably make a more performant implementation for this. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed, any thoughts on what such an implementation would look like? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. IMO, perf isn't worth worrying about too much until we can push the actual parsing down to the C reader. As is, we've already built up a list of Python strings in memory. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is actually not too hard to do in parsers.pyx; this is generically handled in python code rather than in cython. |
||
|
||
scalars = [map_string(x) for x in strings] | ||
return cls._from_sequence(scalars, dtype, copy) | ||
|
||
def _values_for_factorize(self) -> Tuple[np.ndarray, Any]: | ||
data = self._data.astype("int8") | ||
data[self._mask] = -1 | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -251,6 +251,13 @@ def test_coerce_to_numpy_array(): | |
np.array(arr, dtype="bool") | ||
|
||
|
||
def test_to_boolean_array_from_strings(): | ||
result = BooleanArray._from_sequence_of_strings(["True", "False"]) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can you add in null values e.g. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It looks like this specific method breaks when passed other kinds of strings. I think (correct me if I'm wrong) the null strings get handled upstream by I could add in a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it should handle nulls as well (they will be converted in the Boolean constructor), if not, then this is broken too. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Indeed, the csv parser already handles the different null string representations. So testing here just with
dsaxton marked this conversation as resolved.
Show resolved
Hide resolved
|
||
expected = BooleanArray(np.array([True, False]), np.array([False, False])) | ||
|
||
tm.assert_extension_array_equal(result, expected) | ||
|
||
|
||
def test_repr(): | ||
df = pd.DataFrame({"A": pd.array([True, False, None], dtype="boolean")}) | ||
expected = " A\n0 True\n1 False\n2 <NA>" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type strings (List[str]), dtype and copy