Skip to content

BUG: read_csv raising ValueError for tru_values/false_values and boolean dtype #39012

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Jan 9, 2021

Conversation

phofl
Copy link
Member

@phofl phofl commented Jan 7, 2021

I am not really experienced with cython, so I would appreciate feedback on the switiching function. This was not done previously in case of ea boolean dtype, hence why this was failing before.

@phofl phofl added ExtensionArray Extending pandas with custom dtypes or arrays. IO CSV read_csv, to_csv labels Jan 7, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would instead allow passing in the true/false values into BooleanArrary._from_sequence_from_strings (and if they aren't passed, then use what is there now). and handle the conversion there rather than doing it this way.

@phofl
Copy link
Member Author

phofl commented Jan 7, 2021

Hm this sounds good. Will try this tomorrow or Friday. Thx.

@phofl
Copy link
Member Author

phofl commented Jan 7, 2021

This is way better. To avoid special casing the boolean case, I added **kwargs to _from_sequence_of_strings, Is there another way you would prefer?

@jorisvandenbossche
Copy link
Member

Passing such additional keywords to _from_sequence_of_strings is not backwards compatible for externally defined EAs, if they do not accepts kwargs.
So I would maybe keep a special case check for boolean dtype in _parsers.pyx, and only pass the keywords to BooleanArray._from_sequence_of_strings.

(alternative could be to try/except, catch the error, and then try again without passing the additional keywords (and potentially raising a warning so that external EAs can update to accept **kwargs)

@jreback jreback changed the title BUG: read_csv raising ValueError for tru_values/false_values and boolean dtyp BUG: read_csv raising ValueError for tru_values/false_values and boolean dtype Jan 8, 2021
@jreback
Copy link
Contributor

jreback commented Jan 8, 2021

Passing such additional keywords to _from_sequence_of_strings is not backwards compatible for externally defined EAs, if they do not accepts kwargs.
So I would maybe keep a special case check for boolean dtype in _parsers.pyx, and only pass the keywords to BooleanArray._from_sequence_of_strings.

(alternative could be to try/except, catch the error, and then try again without passing the additional keywords (and potentially raising a warning so that external EAs can update to accept **kwargs)

we don't have these kinds of guarantees on EA. I think we should accept them.

@jorisvandenbossche
Copy link
Member

_from_sequence_of_strings is a public method of the ExtensionArray Interface, documented through means of the base class implementation:

@classmethod
def _from_sequence_of_strings(
cls, strings, *, dtype: Optional[Dtype] = None, copy=False
):
"""
Construct a new ExtensionArray from a sequence of strings.
.. versionadded:: 0.24.0
Parameters
----------
strings : Sequence
Each element will be an instance of the scalar type for this
array, ``cls.dtype.type``.
dtype : dtype, optional
Construct for this particular dtype. This should be a Dtype
compatible with the ExtensionArray.
copy : bool, default False
If True, copy the underlying data.
Returns
-------
ExtensionArray
"""
raise AbstractMethodError(cls)

(and in the online API docs)

If we can easily avoid breaking third-party EA implementations, then we should do that.

@jreback
Copy link
Contributor

jreback commented Jan 8, 2021

_from_sequence_of_strings is a public method of the ExtensionArray Interface, documented through means of the base class implementation:

@classmethod
def _from_sequence_of_strings(
cls, strings, *, dtype: Optional[Dtype] = None, copy=False
):
"""
Construct a new ExtensionArray from a sequence of strings.
.. versionadded:: 0.24.0
Parameters
----------
strings : Sequence
Each element will be an instance of the scalar type for this
array, ``cls.dtype.type``.
dtype : dtype, optional
Construct for this particular dtype. This should be a Dtype
compatible with the ExtensionArray.
copy : bool, default False
If True, copy the underlying data.
Returns
-------
ExtensionArray
"""
raise AbstractMethodError(cls)

(and in the online API docs)

If we can easily avoid breaking third-party EA implementations, then we should do that.

I think we need to in this case (for BooleanArray) to make a proper api. yes I agree we can avoid breaking all others (see above my comment).

@jorisvandenbossche
Copy link
Member

Yep, no problem for adding it to BooleanArray, that's our own EA implementation, there we can add keyword however we want.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates!

@@ -257,6 +257,8 @@ class BooleanArray(BaseMaskedArray):

# The value used to fill '_data' to avoid upcasting
_internal_fill_value = False
_TRUE_VALUES = {"True", "TRUE", "true", "1", "1.0"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these could also be module level (doesn't matter), note in the parser itself we only have

    object _true_values = [b'True', b'TRUE', b'true']
    object _false_values = [b'False', b'FALSE', b'false']

but yeah i think we agreed the 1/1.0 are fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case let's keep it there

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comment

@jreback jreback added this to the 1.3 milestone Jan 8, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@jreback jreback merged commit e7ac30d into pandas-dev:master Jan 9, 2021
@jreback
Copy link
Contributor

jreback commented Jan 9, 2021

thanks @phofl very nice!

@phofl phofl deleted the 34655 branch January 9, 2021 22:22
luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read_csv cannot use dtype and true_values/false_values
3 participants