ENH: Implement _from_sequence_of_strings for BooleanArray #31159

dsaxton · 2020-01-20T18:30:01Z

closes ENH: Implement CSV reading for BooleanArray #31131
tests added / passed
passes black pandas
whatsnew entry

jreback

can you add a test with this in read_csv

also you can add onto the existing whatsnew note for boolean (just add the issue number)

jreback · 2020-01-20T20:05:49Z

pandas/tests/arrays/test_boolean.py

@@ -251,6 +251,13 @@ def test_coerce_to_numpy_array():
        np.array(arr, dtype="bool")


+def test_to_boolean_array_from_strings():
+    result = BooleanArray._from_sequence_of_strings(["True", "False"])


can you add in null values e.g. '' and NaN i think would work (as strings).

It looks like this specific method breaks when passed other kinds of strings. I think (correct me if I'm wrong) the null strings get handled upstream by read_csv according to na_values.

I could add in a None if you think that makes sense, or add additional null strings to the read_csv test (could parametrize over different NA strings)?

it should handle nulls as well (they will be converted in the Boolean constructor), if not, then this is broken too.

Indeed, the csv parser already handles the different null string representations. So testing here just with np.nan or None is fine.

jreback · 2020-01-20T22:38:55Z

pandas/core/arrays/boolean.py

@@ -286,6 +286,19 @@ def _from_sequence(cls, scalars, dtype=None, copy: bool = False):
        values, mask = coerce_to_array(scalars, copy=copy)
        return BooleanArray(values, mask)

+    @classmethod
+    def _from_sequence_of_strings(cls, strings, dtype=None, copy=False):


can you type strings (List[str]), dtype and copy

jreback · 2020-01-20T22:40:04Z

pandas/tests/arrays/test_boolean.py

@@ -251,6 +251,13 @@ def test_coerce_to_numpy_array():
        np.array(arr, dtype="bool")


+def test_to_boolean_array_from_strings():
+    result = BooleanArray._from_sequence_of_strings(["True", "False"])


it should handle nulls as well (they will be converted in the Boolean constructor), if not, then this is broken too.

jreback · 2020-01-20T22:40:59Z

pandas/tests/arrays/test_boolean.py

+    tm.assert_extension_array_equal(result, expected)
+
+
+def test_boolean_from_csv():


this needs to go in pandas/tests/io/parser/test_dtypes.py

pls follow the existing patterns

pep8speaks · 2020-01-20T23:14:22Z

Hello @dsaxton! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-23 17:13:40 UTC

jorisvandenbossche

Thanks for the PR!

jorisvandenbossche · 2020-01-20T23:10:56Z

pandas/core/arrays/boolean.py

+            elif s in ["False", "false", "0"]:
+                return False
+            else:
+                return s


At some point, we should probably make a more performant implementation for this.

Agreed, any thoughts on what such an implementation would look like?

IMO, perf isn't worth worrying about too much until we can push the actual parsing down to the C reader. As is, we've already built up a list of Python strings in memory.

this is actually not too hard to do in parsers.pyx; this is generically handled in python code rather than in cython.

jorisvandenbossche · 2020-01-20T23:13:06Z

pandas/tests/arrays/test_boolean.py

@@ -251,6 +251,13 @@ def test_coerce_to_numpy_array():
        np.array(arr, dtype="bool")


+def test_to_boolean_array_from_strings():
+    result = BooleanArray._from_sequence_of_strings(["True", "False"])


Indeed, the csv parser already handles the different null string representations. So testing here just with np.nan or None is fine.

jreback · 2020-01-20T23:30:43Z

lgtm unless @jorisvandenbossche has more comments. (should make an issue to review performance of the string construction, e.g. an asv would answer if perf is an issue here)

jorisvandenbossche · 2020-01-20T23:42:08Z

pandas/core/arrays/boolean.py

@@ -286,6 +286,23 @@ def _from_sequence(cls, scalars, dtype=None, copy: bool = False):
        values, mask = coerce_to_array(scalars, copy=copy)
        return BooleanArray(values, mask)

+    @classmethod
+    def _from_sequence_of_strings(
+        cls, strings: List[str], dtype: Optional[str] = None, copy: bool = False


the typings are not fully correct. dtype can be a dtype object, and strings can be an ndarray.
Now since none of the other _from_... methods are typed, I wouldn't care about that too much on this PR

TomAugspurger · 2020-01-21T20:44:08Z

pandas/core/arrays/boolean.py

+        def map_string(s):
+            if isna(s):
+                return s
+            elif s in ["True", "true", "1"]:


These values differ from _libs.parsers._true_values and _false_values. We should be consistent with those.

cdef:
    object _true_values = [b'True', b'TRUE', b'true']
    object _false_values = [b'False', b'FALSE', b'false']

TomAugspurger · 2020-01-21T20:45:11Z

pandas/core/arrays/boolean.py

+            elif s in ["False", "false", "0"]:
+                return False
+            else:
+                return s


IMO, perf isn't worth worrying about too much until we can push the actual parsing down to the C reader. As is, we've already built up a list of Python strings in memory.

TomAugspurger · 2020-01-21T20:45:48Z

LGTM aside from https://github.com/pandas-dev/pandas/pull/31159/files/29dbfa1165311260ffd54afbc6f8208ee27405c6#diff-20faa368ac513e325ac07ec72a8f93c7

TomAugspurger · 2020-01-21T21:26:31Z

pandas/tests/io/parser/test_dtypes.py

+@pytest.mark.parametrize("na_string", ["NaN", "nan", "NA", "null", "NULL", ""])
+def test_boolean_dtype(all_parsers, na_string):
+    parser = all_parsers
+    data = f"a,b\nTrue,False\nTrue,{na_string}\n"


Can you assert that all the true / false strings are tested?

I'll parametrize this over the different strings as well

TomAugspurger · 2020-01-21T22:09:22Z

pandas/tests/io/parser/test_dtypes.py

+@pytest.mark.parametrize("na_string", ["NaN", "nan", "NA", "null", "NULL", ""])
+@pytest.mark.parametrize("true_string", ["True", "TRUE", "true"])
+@pytest.mark.parametrize("false_string", ["False", "FALSE", "false"])
+def test_boolean_dtype(all_parsers, na_string, true_string, false_string):


This parametrization generates a ton of test cases which increases pytest collection / run time. Can you inline all of them in a single test?

data = "\n".join(["a", "true", "TRUE", "True", "false", "FALSE", "False", "NaN", "nan", ...])

and then make an assert?

Can you also test with a "wrong" string value to ensure a proper error message is raised?

That makes a lot more sense than what I was doing. Regarding the assert, something like

assert all([s in data for s in ["True", "TRUE", "true"]])
assert all([s in data for s in ["False", "FALSE", "false"]])
assert all([s in data for s in ["NaN", "nan", "NA", "null", "NULL"]])

(separating them out to make a failure more meaningful / make it a little easier to read)? I dropped the empty string for now because it seems to get ignored in a single column CSV.
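The single-test layout discussed above can be sketched roughly as follows. This is only an illustration, not the test that was merged: it assumes pandas 1.0 or later (where the "boolean" dtype exists), uses the default C parser rather than the all_parsers fixture, and leaves out the empty string for the reason given above.

```python
import io

import pandas as pd

true_strings = ["True", "TRUE", "true"]
false_strings = ["False", "FALSE", "false"]
na_strings = ["NaN", "nan", "NA", "null", "NULL"]

# One single-column CSV: the header "a", then every string under test.
data = "\n".join(["a"] + true_strings + false_strings + na_strings)

result = pd.read_csv(io.StringIO(data), dtype="boolean")

# Each group of strings should collapse to True, False, or a missing value.
expected = pd.DataFrame(
    {"a": pd.array([True] * 3 + [False] * 3 + [None] * 5, dtype="boolean")}
)
pd.testing.assert_frame_equal(result, expected)
```

Comparing the whole frame against one expected array keeps everything in a single collected test case while still covering each spelling.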

jorisvandenbossche · 2020-01-21T22:11:36Z

There are actually also options in read_csv to specify the true/false strings. I suppose we are OK to ignore that? Or can we pass that through somehow?

TomAugspurger · 2020-01-21T22:49:01Z

I’m ok with ignoring that for now
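For context, the read_csv options referred to here are presumably true_values and false_values (the thread does not name them). A small sketch of how they behave for ordinary bool columns, with made-up sample data:

```python
import io

import pandas as pd

# Made-up sample data; "Yes" and "No" are not recognised as booleans by default.
data = "a,b,c\n1,Yes,2\n3,No,4"

df = pd.read_csv(io.StringIO(data), true_values=["Yes"], false_values=["No"])
print(df["b"].tolist())  # [True, False]
```

This only shows the default NumPy bool path. Whether these options are passed through to BooleanArray is the question the thread leaves open.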

jorisvandenbossche

Can you also check what you get when you have an invalid string in your column? (if it is a decent error message)

jorisvandenbossche · 2020-01-21T22:13:18Z

pandas/tests/arrays/test_boolean.py

@@ -251,6 +251,16 @@ def test_coerce_to_numpy_array():
        np.array(arr, dtype="bool")


+@pytest.mark.parametrize("na_value", [None, np.nan, pd.NA])


I don't think there is a need to parametrize over the NA values. The parser always gives data with NaNs.

jorisvandenbossche · 2020-01-21T22:16:20Z

pandas/tests/io/parser/test_dtypes.py

+@pytest.mark.parametrize("na_string", ["NaN", "nan", "NA", "null", "NULL", ""])
+@pytest.mark.parametrize("true_string", ["True", "TRUE", "true"])
+@pytest.mark.parametrize("false_string", ["False", "FALSE", "false"])
+def test_boolean_dtype(all_parsers, na_string, true_string, false_string):


Can you also test with a "wrong" string value to ensure a proper error message is raised?

jorisvandenbossche · 2020-01-23T08:10:04Z

pandas/tests/io/parser/test_dtypes.py

+
+    assert all(s in data for s in ["True", "TRUE", "true"])
+    assert all(s in data for s in ["False", "FALSE", "false"])
+    assert all(s in data for s in ["NaN", "nan", "NA", "null", "NULL"])


Are those asserts needed? (I think it is clear they are in the data, as you generated it yourself?)
I think the comparison with result of read_csv below is good enough

dsaxton · 2020-01-23T14:16:25Z

Can you also check what you get when you have an invalid string in your column? (if it is a decent error message)

Right now the error message that gets raised is TypeError: Need to pass bool-like values from inside BooleanArray._from_sequence. I think this should be a ValueError, maybe raised from map_string (instead of returning any string that doesn't match one of the true / false ones)?

pandas/tests/arrays/test_boolean.py

jorisvandenbossche · 2020-01-23T15:41:06Z

I think this should be a ValueError, maybe raised from map_string (instead of returning any string that doesn't match one of the true / false ones)?

Yes, I think it is easy to raise a more informative ValueError from the last else clause in map_string
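A minimal standalone sketch of the idea, assuming the true / false strings quoted earlier from _libs.parsers. The helper is_missing is a hypothetical stand-in for pandas' isna, and the error message wording is illustrative only:

```python
import math

# String sets taken from the _true_values / _false_values quoted in the review.
TRUE_STRINGS = {"True", "TRUE", "true"}
FALSE_STRINGS = {"False", "FALSE", "false"}


def is_missing(value):
    # Hypothetical stand-in for pandas' isna: None and float NaN count as missing.
    return value is None or (isinstance(value, float) and math.isnan(value))


def map_string(s):
    # Missing values pass through unchanged so the array constructor can mask them.
    if is_missing(s):
        return s
    if s in TRUE_STRINGS:
        return True
    if s in FALSE_STRINGS:
        return False
    # Any other string is rejected here instead of being returned as-is.
    raise ValueError(f"{s} cannot be cast to bool")


scalars = [map_string(x) for x in ["True", "false", None]]
print(scalars)  # [True, False, None]
```

Raising in the final branch means an invalid string fails early with a message that names the offending value, rather than surfacing later as a TypeError from the array constructor.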


Daniel Saxton added 2 commits January 20, 2020 12:06

Add test

f7fdf45

Implement _from_sequence_of_strings

9e57350

dsaxton requested review from TomAugspurger and jorisvandenbossche January 20, 2020 18:30

jreback requested changes Jan 20, 2020


jreback added the ExtensionArray (Extending pandas with custom dtypes or arrays.) and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) labels Jan 20, 2020

Daniel Saxton added 2 commits January 20, 2020 15:08

Add read_csv test

eb591cd

Add to release note

87ac09b

jreback requested changes Jan 20, 2020


Daniel Saxton added 2 commits January 20, 2020 17:12

Check for pd.NA

51b6ac9

Update tests

6ffb018

Daniel Saxton added 2 commits January 20, 2020 17:19

Blacken

9ddc5b1

Type arguments

4dd64b8

jorisvandenbossche reviewed Jan 20, 2020


Use optional type

d28a6de

jreback added this to the 1.0.0 milestone Jan 20, 2020

jreback approved these changes Jan 20, 2020


jorisvandenbossche reviewed Jan 20, 2020


Daniel Saxton added 4 commits January 20, 2020 17:50

Don't import io

978f22d

Don't type dtype

19f9c18

Nit

f851b83

Don't import Optional

29dbfa1

TomAugspurger reviewed Jan 21, 2020


Change Boolean strings

0481153

TomAugspurger reviewed Jan 21, 2020


Parametrize test over true / false strings

be0731b

TomAugspurger reviewed Jan 21, 2020


Remove test parameterization

fdec55d

Fix linting

e2656cb

jorisvandenbossche reviewed Jan 23, 2020


Daniel Saxton added 3 commits January 23, 2020 08:16

Take out assertions

6d06b84

Take out parameterization

604b862

Blacken

f2db6ed

jorisvandenbossche reviewed Jan 23, 2020



dsaxton and others added 2 commits January 23, 2020 09:41

Update pandas/tests/arrays/test_boolean.py

9e35b63

Co-Authored-By: Joris Van den Bossche <[email protected]>

Update tests and raise invalid string error

71bafbf

TomAugspurger approved these changes Jan 23, 2020


Fix test

184a9be

jorisvandenbossche approved these changes Jan 23, 2020


TomAugspurger merged commit d0d93db into pandas-dev:master Jan 23, 2020

meeseeksmachine mentioned this pull request Jan 23, 2020

Backport PR #31159 on branch 1.0.x (ENH: Implement _from_sequence_of_strings for BooleanArray) #31261

Merged

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Jan 23, 2020

Backport PR pandas-dev#31159: ENH: Implement _from_sequence_of_string…

bc6b36f

…s for BooleanArray

WillAyd pushed a commit that referenced this pull request Jan 23, 2020

Backport PR #31159: ENH: Implement _from_sequence_of_strings for Bool…

29edc79

…eanArray (#31261)

dsaxton deleted the bool-frm-str branch January 23, 2020 21:56
