-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Implement _from_sequence_of_strings for BooleanArray #31159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 15 commits
f7fdf45
9e57350
eb591cd
87ac09b
51b6ac9
6ffb018
9ddc5b1
4dd64b8
d28a6de
978f22d
19f9c18
f851b83
29dbfa1
0481153
be0731b
fdec55d
e2656cb
6d06b84
604b862
f2db6ed
9e35b63
71bafbf
184a9be
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -251,6 +251,16 @@ def test_coerce_to_numpy_array(): | |
np.array(arr, dtype="bool") | ||
|
||
|
||
@pytest.mark.parametrize("na_value", [None, np.nan, pd.NA]) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think there is a need to parametrize over de na values. The parser always gives data with NaNs |
||
def test_to_boolean_array_from_strings(na_value): | ||
result = BooleanArray._from_sequence_of_strings(["True", "False", na_value]) | ||
expected = BooleanArray( | ||
np.array([True, False, False]), np.array([False, False, True]) | ||
) | ||
|
||
tm.assert_extension_array_equal(result, expected) | ||
|
||
|
||
def test_repr(): | ||
df = pd.DataFrame({"A": pd.array([True, False, None], dtype="boolean")}) | ||
expected = " A\n0 True\n1 False\n2 <NA>" | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -550,3 +550,21 @@ def test_numeric_dtype(all_parsers, dtype): | |
|
||
result = parser.read_csv(StringIO(data), header=None, dtype=dtype) | ||
tm.assert_frame_equal(expected, result) | ||
|
||
|
||
@pytest.mark.parametrize("na_string", ["NaN", "nan", "NA", "null", "NULL", ""]) | ||
@pytest.mark.parametrize("true_string", ["True", "TRUE", "true"]) | ||
@pytest.mark.parametrize("false_string", ["False", "FALSE", "false"]) | ||
def test_boolean_dtype(all_parsers, na_string, true_string, false_string): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This parametrization generates a ton of test cases which increases pytest collection / run time. Can you inline all of them in a single test? data = "\n".join(["a", "true", "TRUE", "True", "false", "FALSE", "False", "NaN", "nan", ...]) and then make an assert? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you also test with a "wrong" string value to ensure a proper error message is raised? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That makes a lot more sense than what I was doing. Regarding the assert, something like
(separating them out to make a failure more meaningful / make it a little easier to read)? I dropped the empty string for now because it seems to get ignored in a single column CSV. |
||
parser = all_parsers | ||
data = f"a,b\n{true_string},{false_string}\nTrue,{na_string}\n" | ||
|
||
result = parser.read_csv(StringIO(data), dtype="boolean") | ||
expected = pd.DataFrame( | ||
{ | ||
"a": pd.array([True, True], dtype="boolean"), | ||
"b": pd.array([False, None], dtype="boolean"), | ||
} | ||
) | ||
|
||
tm.assert_frame_equal(result, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point, we should probably make a more performant implementation for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, any thoughts on what such an implementation would look like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, perf isn't worth worrying about too much until we can push the actual parsing down to the C reader. As is, we've already built up a list of Python strings in memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is actually not too hard to do in parsers.pyx; this is generically handled in python code rather than in cython.