BUG: read_csv raising ValueError for tru_values/false_values and boolean dtype #39012

phofl · 2021-01-07T00:06:13Z

closes read_csv cannot use dtype and true_values/false_values #34655
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

I am not really experienced with Cython, so I would appreciate feedback on the switching function. This switching was not done previously for the EA boolean dtype, which is why it was failing before.
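For context, the call pattern this PR targets looks roughly like the following. This is a minimal sketch that assumes a pandas version containing this fix (the PR is on the 1.3 milestone); the column name `flag` and the CSV content are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV content with custom spellings for the two boolean states.
csv_data = "flag\nyes\nno\nyes\n"

df = pd.read_csv(
    StringIO(csv_data),
    dtype={"flag": "boolean"},  # nullable boolean extension dtype
    true_values=["yes"],
    false_values=["no"],
)

# Before the fix this combination raised ValueError; afterwards the
# column is parsed into the nullable boolean dtype.
print(df["flag"].dtype)
print(df["flag"].tolist())
```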

jreback

I would instead allow passing the true/false values into BooleanArray._from_sequence_of_strings (and if they aren't passed, use what is there now), and handle the conversion there rather than doing it this way.
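The suggestion can be sketched in plain Python, independent of pandas internals. The function name `bools_from_strings` and the module-level default sets are hypothetical; the true-value defaults mirror the `_TRUE_VALUES` set shown in the diff later in this thread, and the false-value defaults are an assumed mirror image of it.

```python
from typing import Iterable, List, Optional

# Illustrative defaults only. DEFAULT_FALSE is an assumption, not taken
# from the PR diff.
DEFAULT_TRUE = {"True", "TRUE", "true", "1", "1.0"}
DEFAULT_FALSE = {"False", "FALSE", "false", "0", "0.0"}


def bools_from_strings(
    strings: Iterable[Optional[str]],
    true_values: Optional[Iterable[str]] = None,
    false_values: Optional[Iterable[str]] = None,
) -> List[Optional[bool]]:
    """Map strings to True/False, keeping None as a missing value."""
    # Extend the defaults with the caller's values; union() returns a new
    # set, so the module-level defaults are never modified in place.
    true_set = DEFAULT_TRUE.union(true_values or [])
    false_set = DEFAULT_FALSE.union(false_values or [])

    result: List[Optional[bool]] = []
    for s in strings:
        if s is None:
            result.append(None)
        elif s in true_set:
            result.append(True)
        elif s in false_set:
            result.append(False)
        else:
            raise ValueError(f"{s} cannot be cast to bool")
    return result


converted = bools_from_strings(
    ["yes", "no", None, "true"], true_values=["yes"], false_values=["no"]
)
print(converted)  # [True, False, None, True]
```

When no extra values are passed, the function falls back to the defaults alone, which matches the "use what is there now" part of the suggestion.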

phofl · 2021-01-07T00:24:00Z

Hm this sounds good. Will try this tomorrow or Friday. Thx.

phofl · 2021-01-07T21:03:21Z

This is way better. To avoid special-casing the boolean case, I added **kwargs to _from_sequence_of_strings. Is there another way you would prefer?

jorisvandenbossche · 2021-01-08T14:02:22Z

Passing such additional keywords to _from_sequence_of_strings is not backwards compatible for externally defined EAs if they do not accept kwargs.
So I would maybe keep a special-case check for the boolean dtype in _parsers.pyx, and only pass the keywords to BooleanArray._from_sequence_of_strings.

(An alternative could be to try/except: catch the error, then try again without passing the additional keywords, potentially raising a warning so that external EAs can update to accept **kwargs.)
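The try/except alternative could look roughly like this. It is only a sketch: `LegacyArray`, `UpdatedArray`, and `construct_with_fallback` are hypothetical stand-ins, not pandas classes or functions, and the choice of `FutureWarning` is an assumption.

```python
import warnings


class LegacyArray:
    """Hypothetical third-party array whose constructor takes no extra keywords."""

    @classmethod
    def _from_sequence_of_strings(cls, strings, *, dtype=None, copy=False):
        return list(strings)


class UpdatedArray:
    """Hypothetical array that already accepts additional keywords."""

    @classmethod
    def _from_sequence_of_strings(cls, strings, *, dtype=None, copy=False, **kwargs):
        true_values = set(kwargs.get("true_values", ()))
        return [s in true_values for s in strings]


def construct_with_fallback(array_type, strings, **extra):
    """Try passing the extra keywords; retry without them on TypeError."""
    try:
        return array_type._from_sequence_of_strings(strings, **extra)
    except TypeError:
        # Caveat: this also catches TypeErrors raised inside the method body,
        # not only the "unexpected keyword argument" case.
        warnings.warn(
            f"{array_type.__name__}._from_sequence_of_strings does not accept "
            "additional keywords; consider adding **kwargs to its signature.",
            FutureWarning,
            stacklevel=2,
        )
        return array_type._from_sequence_of_strings(strings)


print(construct_with_fallback(UpdatedArray, ["a", "b"], true_values=["a"]))
```

The broad `except TypeError` is the main weakness of this approach, which is one reason an explicit special case for the boolean dtype may be preferable.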

jreback · 2021-01-08T14:06:59Z

> Passing such additional keywords to _from_sequence_of_strings is not backwards compatible for externally defined EAs if they do not accept kwargs.
> So I would maybe keep a special-case check for the boolean dtype in _parsers.pyx, and only pass the keywords to BooleanArray._from_sequence_of_strings.
>
> (An alternative could be to try/except: catch the error, then try again without passing the additional keywords, potentially raising a warning so that external EAs can update to accept **kwargs.)

we don't have these kinds of guarantees on EA. I think we should accept them.

jorisvandenbossche · 2021-01-08T14:18:42Z

_from_sequence_of_strings is a public method of the ExtensionArray Interface, documented through means of the base class implementation:

pandas/pandas/core/arrays/base.py

Lines 213 to 237 in 510bc20

    
    @classmethod
    def _from_sequence_of_strings(
        cls, strings, *, dtype: Optional[Dtype] = None, copy=False
    ):
        """
        Construct a new ExtensionArray from a sequence of strings.

        .. versionadded:: 0.24.0

        Parameters
        ----------
        strings : Sequence
            Each element will be an instance of the scalar type for this
            array, ``cls.dtype.type``.
        dtype : dtype, optional
            Construct for this particular dtype. This should be a Dtype
            compatible with the ExtensionArray.
        copy : bool, default False
            If True, copy the underlying data.

        Returns
        -------
        ExtensionArray
        """
        raise AbstractMethodError(cls)

(and in the online API docs)

If we can easily avoid breaking third-party EA implementations, then we should do that.

jreback · 2021-01-08T14:30:16Z

> _from_sequence_of_strings is a public method of the ExtensionArray Interface, documented through means of the base class implementation:
>
> pandas/pandas/core/arrays/base.py, lines 213 to 237 in 510bc20
>
> (and in the online API docs)
>
> If we can easily avoid breaking third-party EA implementations, then we should do that.

I think we need to in this case (for BooleanArray), to make a proper API. Yes, I agree we can avoid breaking all others (see my comment above).

jorisvandenbossche · 2021-01-08T14:34:53Z

Yep, no problem with adding it to BooleanArray; that's our own EA implementation, so there we can add keywords however we want.
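The agreed direction, extra keywords only for the boolean array and the documented signature for everything else, can be sketched as follows. All class and function names here (`ExtensionArrayBase`, `ThirdPartyArray`, `BooleanLikeArray`, `convert_column`) are hypothetical stand-ins rather than pandas code.

```python
class ExtensionArrayBase:
    """Stand-in for the documented interface: no extra keywords."""

    @classmethod
    def _from_sequence_of_strings(cls, strings, *, dtype=None, copy=False):
        raise NotImplementedError


class ThirdPartyArray(ExtensionArrayBase):
    """Hypothetical external EA that follows the documented signature."""

    @classmethod
    def _from_sequence_of_strings(cls, strings, *, dtype=None, copy=False):
        return [s.upper() for s in strings]


class BooleanLikeArray(ExtensionArrayBase):
    """Hypothetical in-house array that accepts the two extra keywords."""

    @classmethod
    def _from_sequence_of_strings(
        cls, strings, *, dtype=None, copy=False, true_values=None, false_values=None
    ):
        true_set = set(true_values or ())
        false_set = set(false_values or ())
        out = []
        for s in strings:
            if s in true_set:
                out.append(True)
            elif s in false_set:
                out.append(False)
            else:
                raise ValueError(f"{s} cannot be cast to bool")
        return out


def convert_column(array_type, strings, true_values, false_values):
    """Pass the boolean keywords only to the array type known to accept them."""
    if issubclass(array_type, BooleanLikeArray):
        return array_type._from_sequence_of_strings(
            strings, true_values=true_values, false_values=false_values
        )
    # Every other array type is called with the documented signature only,
    # so existing third-party implementations keep working unchanged.
    return array_type._from_sequence_of_strings(strings)


print(convert_column(BooleanLikeArray, ["y", "n"], ["y"], ["n"]))  # [True, False]
print(convert_column(ThirdPartyArray, ["y"], ["y"], ["n"]))  # ['Y']
```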

jorisvandenbossche

Thanks for the updates!

jreback · 2021-01-08T22:47:37Z

pandas/core/arrays/boolean.py

@@ -257,6 +257,8 @@ class BooleanArray(BaseMaskedArray):

    # The value used to fill '_data' to avoid upcasting
    _internal_fill_value = False
+    _TRUE_VALUES = {"True", "TRUE", "true", "1", "1.0"}


These could also be module level (doesn't matter). Note that in the parser itself we only have

    object _true_values = [b'True', b'TRUE', b'true']
    object _false_values = [b'False', b'FALSE', b'false']

but yeah, I think we agreed the 1/1.0 are fine.

In this case let's keep it there

jreback

small comment

jreback

lgtm

jreback · 2021-01-09T22:18:27Z

thanks @phofl very nice!

phofl added 5 commits January 7, 2021 00:50

BUG: read_csv raising ValueError for tru_values/false_values and bool…

1ffe298

…ean dtype

Add tests

44a9a55

Fix bug for c engine

1428478

Add whatsnew

8404740

Fix function header

0a89217

phofl added the ExtensionArray (extending pandas with custom dtypes or arrays) and IO CSV (read_csv, to_csv) labels Jan 7, 2021

jreback requested changes Jan 7, 2021

Do switch in boolean

546754d

Change function signature

f9ab895

jreback changed the title ~~BUG: read_csv raising ValueError for tru_values/false_values and boolean dtyp~~ BUG: read_csv raising ValueError for tru_values/false_values and boolean dtype Jan 8, 2021

jreback requested changes Jan 8, 2021

Make bool call explicit

a008073

jorisvandenbossche reviewed Jan 8, 2021

phofl added 3 commits January 8, 2021 22:33

Do not update in place

1148833

Make private

aaf2977

Fix mypy issues

be7b3b8

jreback reviewed Jan 8, 2021

Simplify code

4df4ee6

jreback reviewed Jan 8, 2021

jreback requested changes Jan 8, 2021

jreback added this to the 1.3 milestone Jan 8, 2021

jreback approved these changes Jan 8, 2021

jreback merged commit e7ac30d into pandas-dev:master Jan 9, 2021

phofl deleted the 34655 branch January 9, 2021 22:22

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021

BUG: read_csv raising ValueError for tru_values/false_values and bool…

43b0298

…ean dtype (pandas-dev#39012)
