ENH: Implement CSV reading for BooleanArray #31131

Closed
koizumihiroo opened this issue Jan 19, 2020 · 2 comments · Fixed by #31159
Labels: ExtensionArray (extending pandas with custom dtypes or arrays); IO CSV (read_csv, to_csv)

koizumihiroo commented Jan 19, 2020

Code Sample, a copy-pastable example if possible

In 1.0.0rc0, pd.read_csv raises the following error when reading NA values into a column with dtype 'boolean':

>>> import io
>>> import pandas as pd

>>> txt = """
X1,X2,X3
1,a,True
2,b,False
NA,NA,NA
"""
>>> df1 = pd.read_csv(io.StringIO(txt), dtype={'X3': 'boolean'}) # `pd.BooleanDtype()` also fails
Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1191, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "/usr/local/lib/python3.7/site-packages/pandas/core/arrays/base.py", line 232, in _from_sequence_of_strings
    raise AbstractMethodError(cls)
pandas.errors.AbstractMethodError: This method must be defined in the concrete class type

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 951, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1083, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1114, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1194, in pandas._libs.parsers.TextReader._convert_with_dtype
NotImplementedError: Extension Array: <class 'pandas.core.arrays.boolean.BooleanArray'> must implement _from_sequence_of_strings in order to be used in parser methods

By contrast, the "Int64" and "string" dtypes parse NA correctly:

>>> df2 = pd.read_csv(io.StringIO(txt), dtype={'X1': 'Int64', 'X2': 'string'})
>>> df2
     X1    X2     X3
0     1     a   True
1     2     b  False
2  <NA>  <NA>    NaN
>>> df2.dtypes
X1     Int64
X2    string
X3    object
dtype: object

Problem description

pd.read_csv does not parse the NA literal when the column dtype is 'boolean'.

Expected Output

>>> df3 = pd.read_csv(io.StringIO(txt), dtype={'X3': 'boolean'})
>>> df3
    X1   X2     X3
0  1.0    a   True
1  2.0    b  False
2  NaN  NaN   <NA>
>>> df3.dtypes
X1    float64
X2     object
X3    boolean
dtype: object

# and
>>> df4 = pd.read_csv(io.StringIO(txt), dtype={'X1': 'Int64', 'X2': 'string', 'X3': 'boolean'})
>>> df4
     X1    X2     X3
0     1     a   True
1     2     b  False
2  <NA>  <NA>   <NA>
>>> df4.dtypes
X1      Int64
X2     string
X3    boolean
dtype: object

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.9.184-linuxkit
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.0rc0
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 44.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

koizumihiroo changed the title from "pd.read_csv cannot recognize NA for dtype="boolean" in 1.0.0rc0" to "pd.read_csv cannot recognize NA for dtype='boolean' in 1.0.0rc0" on Jan 19, 2020
@sach99in
@koizumihiroo try this
df = pd.read_csv(io.StringIO(txt), dtype={'X3': 'bool'})

@TomAugspurger
Contributor

@koizumihiroo We need to implement BooleanArray._from_sequence_of_strings. Are you interested in doing that?
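
For illustration, here is a rough sketch of the kind of conversion such a method would have to perform: map each raw CSV string to True, False, or a missing marker. This is not the pandas implementation. The helper name `parse_boolean_strings` and the sets of accepted spellings are assumptions made for this example only.

```python
from typing import List, Optional, Sequence

# Hypothetical spellings for this sketch; the values a real parser
# accepts may differ.
TRUE_VALUES = {"True", "TRUE", "true"}
FALSE_VALUES = {"False", "FALSE", "false"}
NA_VALUES = {"", "NA", "NaN", "<NA>"}


def parse_boolean_strings(strings: Sequence[Optional[str]]) -> List[Optional[bool]]:
    """Convert raw string tokens to True / False / None (missing)."""
    result: List[Optional[bool]] = []
    for token in strings:
        if token is None or token in NA_VALUES:
            result.append(None)  # missing value
        elif token in TRUE_VALUES:
            result.append(True)
        elif token in FALSE_VALUES:
            result.append(False)
        else:
            # Anything else is not a recognizable boolean literal.
            raise ValueError(f"{token!r} cannot be cast to bool")
    return result


print(parse_boolean_strings(["True", "False", "NA"]))
```

A list produced this way could then be wrapped with `pd.array(values, dtype="boolean")` to obtain a nullable boolean array.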

@sach99in bool and boolean are different dtypes.
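
To make the distinction concrete, the sketch below reads the same data with both dtypes. It assumes a pandas version that already includes the fix referenced at the top of this issue (#31159) and the default C parser engine. The NumPy-backed 'bool' dtype cannot represent a missing value, so the parser raises; the nullable 'boolean' dtype keeps the NA.

```python
import io

import pandas as pd

txt = "X1,X2,X3\n1,a,True\n2,b,False\nNA,NA,NA\n"

# 'bool' is the NumPy dtype: it has no way to store a missing value,
# so read_csv raises a ValueError for the NA row.
try:
    pd.read_csv(io.StringIO(txt), dtype={"X3": "bool"})
    bool_error = None
except ValueError as exc:
    bool_error = exc

# 'boolean' is the nullable extension dtype: missing entries become pd.NA.
# (Assumes a pandas release that contains the fix for this issue.)
df = pd.read_csv(io.StringIO(txt), dtype={"X3": "boolean"})

print(type(bool_error).__name__)
print(df["X3"].dtype)
print(df["X3"].isna().tolist())
```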

TomAugspurger changed the title from "pd.read_csv cannot recognize NA for dtype='boolean' in 1.0.0rc0" to "Implement CSV reading for BooleanArray" on Jan 19, 2020
TomAugspurger added this to the Contributions Welcome milestone on Jan 19, 2020
TomAugspurger added the ExtensionArray and IO CSV labels on Jan 19, 2020
jorisvandenbossche changed the title from "Implement CSV reading for BooleanArray" to "ENH: Implement CSV reading for BooleanArray" on Jan 19, 2020
jreback changed the milestone from Contributions Welcome to 1.0.0 on Jan 20, 2020