Skip to content

ENH: Implement _from_sequence_of_strings for BooleanArray #31159

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
Jan 23, 2020
Merged

ENH: Implement _from_sequence_of_strings for BooleanArray #31159

merged 23 commits into from
Jan 23, 2020

Conversation

dsaxton
Copy link
Member

@dsaxton dsaxton commented Jan 20, 2020

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test with this in read_csv

also you can add onto the existing whatsnew note for boolean (just add the issue number)

@@ -251,6 +251,13 @@ def test_coerce_to_numpy_array():
np.array(arr, dtype="bool")


def test_to_boolean_array_from_strings():
result = BooleanArray._from_sequence_of_strings(["True", "False"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add in null values e.g. '' and NaN i think would work (as strings).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this specific method breaks when passed other kinds of strings. I think (correct me if I'm wrong) the null strings get handled upstream by read_csv according to na_values.

I could add in a None if you think that makes sense, or add additional null strings to the read_csv test (could parametrize over different NA strings)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should handle nulls as well (they will be converted in the Boolean constructor), if not, then this is broken too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, the csv parser already handles the different null string representations. So testing here just with np.nan or None is fine.

@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jan 20, 2020
@@ -286,6 +286,19 @@ def _from_sequence(cls, scalars, dtype=None, copy: bool = False):
values, mask = coerce_to_array(scalars, copy=copy)
return BooleanArray(values, mask)

@classmethod
def _from_sequence_of_strings(cls, strings, dtype=None, copy=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type strings (List[str]), dtype and copy

@@ -251,6 +251,13 @@ def test_coerce_to_numpy_array():
np.array(arr, dtype="bool")


def test_to_boolean_array_from_strings():
result = BooleanArray._from_sequence_of_strings(["True", "False"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should handle nulls as well (they will be converted in the Boolean constructor), if not, then this is broken too.

tm.assert_extension_array_equal(result, expected)


def test_boolean_from_csv():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to go in the pandas/tests/io/parsers/test_dtypes.py

pls follow the existing patterns

@pep8speaks
Copy link

pep8speaks commented Jan 20, 2020

Hello @dsaxton! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-23 17:13:40 UTC

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

elif s in ["False", "false", "0"]:
return False
else:
return s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point, we should probably make a more performant implementation for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, any thoughts on what such an implementation would look like?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, perf isn't worth worrying about too much until we can push the actual parsing down to the C reader. As is, we've already built up a list of Python strings in memory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is actually not too hard to do in parsers.pyx; this is generically handled in python code rather than in cython.

@@ -251,6 +251,13 @@ def test_coerce_to_numpy_array():
np.array(arr, dtype="bool")


def test_to_boolean_array_from_strings():
result = BooleanArray._from_sequence_of_strings(["True", "False"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, the csv parser already handles the different null string representations. So testing here just with np.nan or None is fine.

@jreback jreback added this to the 1.0.0 milestone Jan 20, 2020
@jreback
Copy link
Contributor

jreback commented Jan 20, 2020

lgtm unless @jorisvandenbossche has more comments. (should make an issue to review performance of the string construction, e.g. an asv would asnwer if perf is an issue here)

@@ -286,6 +286,23 @@ def _from_sequence(cls, scalars, dtype=None, copy: bool = False):
values, mask = coerce_to_array(scalars, copy=copy)
return BooleanArray(values, mask)

@classmethod
def _from_sequence_of_strings(
cls, strings: List[str], dtype: Optional[str] = None, copy: bool = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the typings are not fully correct. dtype can be a dtype object, string an ndarray.
Now since none of the other _from_... methods are typed, I wouldn't care about that too much on this PR

def map_string(s):
if isna(s):
return s
elif s in ["True", "true", "1"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These values differ from _libs.parsers._true_values and _false_values. We should be consistent with those.

cdef:
    object _true_values = [b'True', b'TRUE', b'true']
    object _false_values = [b'False', b'FALSE', b'false']

elif s in ["False", "false", "0"]:
return False
else:
return s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, perf isn't worth worrying about too much until we can push the actual parsing down to the C reader. As is, we've already built up a list of Python strings in memory.

@pytest.mark.parametrize("na_string", ["NaN", "nan", "NA", "null", "NULL", ""])
def test_boolean_dtype(all_parsers, na_string):
parser = all_parsers
data = f"a,b\nTrue,False\nTrue,{na_string}\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you assert that all the true / false strings are testsed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll parametrize this over the different strings as well

@pytest.mark.parametrize("na_string", ["NaN", "nan", "NA", "null", "NULL", ""])
@pytest.mark.parametrize("true_string", ["True", "TRUE", "true"])
@pytest.mark.parametrize("false_string", ["False", "FALSE", "false"])
def test_boolean_dtype(all_parsers, na_string, true_string, false_string):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parametrization generates a ton of test cases which increases pytest collection / run time. Can you inline all of them in a single test?

data = "\n".join(["a", "true", "TRUE", "True", "false", "FALSE", "False", "NaN", "nan", ...])

and then make an assert?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also test with a "wrong" string value to ensure a proper error message is raised?

Copy link
Member Author

@dsaxton dsaxton Jan 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes a lot more sense than what I was doing. Regarding the assert, something like

assert all([s in data for s in ["True", "TRUE", "true"]])
assert all([s in data for s in ["False", "FALSE", "false"]])
assert all([s in data for s in ["NaN", "nan", "NA", "null", "NULL"]])

(separating them out to make a failure more meaningful / make it a little easier to read)? I dropped the empty string for now because it seems to get ignored in a single column CSV.

@jorisvandenbossche
Copy link
Member

There are actually also options in read_csv to specify the true/false strings. I suppose we are OK to ignore that? Or can we pass that through somehow?

@TomAugspurger
Copy link
Contributor

I’m ok with ignoring that for now

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also check what you get when you have an invalid string in your column? (if it is a decent error message)

@@ -251,6 +251,16 @@ def test_coerce_to_numpy_array():
np.array(arr, dtype="bool")


@pytest.mark.parametrize("na_value", [None, np.nan, pd.NA])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is a need to parametrize over de na values. The parser always gives data with NaNs

@pytest.mark.parametrize("na_string", ["NaN", "nan", "NA", "null", "NULL", ""])
@pytest.mark.parametrize("true_string", ["True", "TRUE", "true"])
@pytest.mark.parametrize("false_string", ["False", "FALSE", "false"])
def test_boolean_dtype(all_parsers, na_string, true_string, false_string):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also test with a "wrong" string value to ensure a proper error message is raised?


assert all(s in data for s in ["True", "TRUE", "true"])
assert all(s in data for s in ["False", "FALSE", "false"])
assert all(s in data for s in ["NaN", "nan", "NA", "null", "NULL"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are those asserts needed? (I think it is clear they are in the data, as you generated it yourself?)
I think the comparison with result of read_csv below is good enough

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

@dsaxton
Copy link
Member Author

dsaxton commented Jan 23, 2020

Can you also check what you get when you have an invalid string in your column? (if it is a decent error message)

Right now the error message that gets raised is TypeError: Need to pass bool-like values from inside BooleanArray._from_sequence. I think this should be a ValueError, maybe raised from map_string (instead of returning any string that doesn't match one of the true / false ones)?

@jorisvandenbossche
Copy link
Member

I think this should be a ValueError, maybe raised from map_string (instead of returning any string that doesn't match one of the true / false ones)?

Yes, I think it is easy to raise a more informative ValueError from the last else clause in map_string

@TomAugspurger TomAugspurger merged commit d0d93db into pandas-dev:master Jan 23, 2020
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Jan 23, 2020
@dsaxton dsaxton deleted the bool-frm-str branch January 23, 2020 21:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Implement CSV reading for BooleanArray
5 participants