-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Implement _from_sequence_of_strings for BooleanArray #31159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a test with this in read_csv
also you can add onto the existing whatsnew note for boolean (just add the issue number)
pandas/tests/arrays/test_boolean.py
Outdated
@@ -251,6 +251,13 @@ def test_coerce_to_numpy_array(): | |||
np.array(arr, dtype="bool") | |||
|
|||
|
|||
def test_to_boolean_array_from_strings(): | |||
result = BooleanArray._from_sequence_of_strings(["True", "False"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add in null values e.g. ''
and NaN
i think would work (as strings).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this specific method breaks when passed other kinds of strings. I think (correct me if I'm wrong) the null strings get handled upstream by read_csv
according to na_values
.
I could add in a None
if you think that makes sense, or add additional null strings to the read_csv
test (could parametrize over different NA strings)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should handle nulls as well (they will be converted in the Boolean constructor), if not, then this is broken too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, the csv parser already handles the different null string representations. So testing here just with np.nan
or None is fine.
pandas/core/arrays/boolean.py
Outdated
@@ -286,6 +286,19 @@ def _from_sequence(cls, scalars, dtype=None, copy: bool = False): | |||
values, mask = coerce_to_array(scalars, copy=copy) | |||
return BooleanArray(values, mask) | |||
|
|||
@classmethod | |||
def _from_sequence_of_strings(cls, strings, dtype=None, copy=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type strings (List[str]), dtype and copy
pandas/tests/arrays/test_boolean.py
Outdated
@@ -251,6 +251,13 @@ def test_coerce_to_numpy_array(): | |||
np.array(arr, dtype="bool") | |||
|
|||
|
|||
def test_to_boolean_array_from_strings(): | |||
result = BooleanArray._from_sequence_of_strings(["True", "False"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it should handle nulls as well (they will be converted in the Boolean constructor), if not, then this is broken too.
pandas/tests/arrays/test_boolean.py
Outdated
tm.assert_extension_array_equal(result, expected) | ||
|
||
|
||
def test_boolean_from_csv(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to go in the pandas/tests/io/parsers/test_dtypes.py
pls follow the existing patterns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
pandas/core/arrays/boolean.py
Outdated
elif s in ["False", "false", "0"]: | ||
return False | ||
else: | ||
return s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point, we should probably make a more performant implementation for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, any thoughts on what such an implementation would look like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, perf isn't worth worrying about too much until we can push the actual parsing down to the C reader. As is, we've already built up a list of Python strings in memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is actually not too hard to do in parsers.pyx; this is generically handled in python code rather than in cython.
pandas/tests/arrays/test_boolean.py
Outdated
@@ -251,6 +251,13 @@ def test_coerce_to_numpy_array(): | |||
np.array(arr, dtype="bool") | |||
|
|||
|
|||
def test_to_boolean_array_from_strings(): | |||
result = BooleanArray._from_sequence_of_strings(["True", "False"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, the csv parser already handles the different null string representations. So testing here just with np.nan
or None is fine.
lgtm unless @jorisvandenbossche has more comments. (should make an issue to review performance of the string construction, e.g. an asv would asnwer if perf is an issue here) |
pandas/core/arrays/boolean.py
Outdated
@@ -286,6 +286,23 @@ def _from_sequence(cls, scalars, dtype=None, copy: bool = False): | |||
values, mask = coerce_to_array(scalars, copy=copy) | |||
return BooleanArray(values, mask) | |||
|
|||
@classmethod | |||
def _from_sequence_of_strings( | |||
cls, strings: List[str], dtype: Optional[str] = None, copy: bool = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the typings are not fully correct. dtype
can be a dtype object, string
an ndarray.
Now since none of the other _from_...
methods are typed, I wouldn't care about that too much on this PR
pandas/core/arrays/boolean.py
Outdated
def map_string(s): | ||
if isna(s): | ||
return s | ||
elif s in ["True", "true", "1"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These values differ from _libs.parsers._true_values
and _false_values. We should be consistent with those.
cdef:
object _true_values = [b'True', b'TRUE', b'true']
object _false_values = [b'False', b'FALSE', b'false']
pandas/core/arrays/boolean.py
Outdated
elif s in ["False", "false", "0"]: | ||
return False | ||
else: | ||
return s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, perf isn't worth worrying about too much until we can push the actual parsing down to the C reader. As is, we've already built up a list of Python strings in memory.
@pytest.mark.parametrize("na_string", ["NaN", "nan", "NA", "null", "NULL", ""]) | ||
def test_boolean_dtype(all_parsers, na_string): | ||
parser = all_parsers | ||
data = f"a,b\nTrue,False\nTrue,{na_string}\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you assert that all the true / false strings are testsed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll parametrize this over the different strings as well
@pytest.mark.parametrize("na_string", ["NaN", "nan", "NA", "null", "NULL", ""]) | ||
@pytest.mark.parametrize("true_string", ["True", "TRUE", "true"]) | ||
@pytest.mark.parametrize("false_string", ["False", "FALSE", "false"]) | ||
def test_boolean_dtype(all_parsers, na_string, true_string, false_string): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This parametrization generates a ton of test cases which increases pytest collection / run time. Can you inline all of them in a single test?
data = "\n".join(["a", "true", "TRUE", "True", "false", "FALSE", "False", "NaN", "nan", ...])
and then make an assert?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also test with a "wrong" string value to ensure a proper error message is raised?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes a lot more sense than what I was doing. Regarding the assert, something like
assert all([s in data for s in ["True", "TRUE", "true"]])
assert all([s in data for s in ["False", "FALSE", "false"]])
assert all([s in data for s in ["NaN", "nan", "NA", "null", "NULL"]])
(separating them out to make a failure more meaningful / make it a little easier to read)? I dropped the empty string for now because it seems to get ignored in a single column CSV.
There are actually also options in read_csv to specify the true/false strings. I suppose we are OK to ignore that? Or can we pass that through somehow? |
I’m ok with ignoring that for now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also check what you get when you have an invalid string in your column? (if it is a decent error message)
pandas/tests/arrays/test_boolean.py
Outdated
@@ -251,6 +251,16 @@ def test_coerce_to_numpy_array(): | |||
np.array(arr, dtype="bool") | |||
|
|||
|
|||
@pytest.mark.parametrize("na_value", [None, np.nan, pd.NA]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is a need to parametrize over de na values. The parser always gives data with NaNs
@pytest.mark.parametrize("na_string", ["NaN", "nan", "NA", "null", "NULL", ""]) | ||
@pytest.mark.parametrize("true_string", ["True", "TRUE", "true"]) | ||
@pytest.mark.parametrize("false_string", ["False", "FALSE", "false"]) | ||
def test_boolean_dtype(all_parsers, na_string, true_string, false_string): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also test with a "wrong" string value to ensure a proper error message is raised?
|
||
assert all(s in data for s in ["True", "TRUE", "true"]) | ||
assert all(s in data for s in ["False", "FALSE", "false"]) | ||
assert all(s in data for s in ["NaN", "nan", "NA", "null", "NULL"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are those asserts needed? (I think it is clear they are in the data, as you generated it yourself?)
I think the comparison with result of read_csv below is good enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
Right now the error message that gets raised is |
Yes, I think it is easy to raise a more informative ValueError from the last |
Co-Authored-By: Joris Van den Bossche <[email protected]>
…s for BooleanArray
black pandas