ENH: Implement CSV reading for BooleanArray #31131

koizumihiroo · 2020-01-19T03:01:14Z

Code Sample, a copy-pastable example if possible

In 1.0.0rc0, pd.read_csv raises the following error when a column containing NA is read with dtype='boolean':

>>> import io
>>> import pandas as pd

>>> txt = """
X1,X2,X3
1,a,True
2,b,False
NA,NA,NA
"""
>>> df1 = pd.read_csv(io.StringIO(txt), dtype={'X3': 'boolean'}) # `pd.BooleanDtype()` also fails
Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1191, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "/usr/local/lib/python3.7/site-packages/pandas/core/arrays/base.py", line 232, in _from_sequence_of_strings
    raise AbstractMethodError(cls)
pandas.errors.AbstractMethodError: This method must be defined in the concrete class type

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 951, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1083, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1114, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1194, in pandas._libs.parsers.TextReader._convert_with_dtype
NotImplementedError: Extension Array: <class 'pandas.core.arrays.boolean.BooleanArray'> must implement _from_sequence_of_strings in order to be used in parser methods

By contrast, "Int64" and "string" columns containing NA are parsed correctly:

>>> df2 = pd.read_csv(io.StringIO(txt), dtype={'X1': 'Int64', 'X2': 'string'})
>>> df2
     X1    X2     X3
0     1     a   True
1     2     b  False
2  <NA>  <NA>    NaN
>>> df2.dtypes
X1     Int64
X2    string
X3    object
dtype: object

Problem description

pd.read_csv does not parse the NA literal when the column dtype is 'boolean'.

Expected Output

>>> df3 = pd.read_csv(io.StringIO(txt), dtype={'X3': 'boolean'})
>>> df3
    X1   X2     X3
0  1.0    a   True
1  2.0    b  False
2  NaN  NaN   <NA>
>>> df3.dtypes
X1    float64
X2     object
X3    boolean
dtype: object

# and
>>> df4 = pd.read_csv(io.StringIO(txt), dtype={'X1': 'Int64', 'X2': 'string', 'X3': 'boolean'})
>>> df4
     X1    X2     X3
0     1     a   True
1     2     b  False
2  <NA>  <NA>   <NA>
>>> df4.dtypes
X1     Int64
X2    string
X3   boolean
dtype: object

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.9.184-linuxkit
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.0rc0
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 44.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None


sach99in · 2020-01-19T13:48:58Z

@koizumihiroo try this
df = pd.read_csv(io.StringIO(txt), dtype={'X3': 'bool'})

TomAugspurger · 2020-01-19T16:02:15Z

@koizumihiroo We need to implement BooleanArray._from_sequence_of_strings. Are you interested in doing that?
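
A rough sketch of the kind of conversion such a method would need to perform, not the code that was eventually merged: each string is mapped to a value plus a mask flag, mirroring how a BooleanArray is built from a values array and a mask array. The helper name `parse_boolean_strings`, the accepted spellings, and the use of `None` to mark a missing entry are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed spellings; the set a real parser accepts may differ.
TRUE_STRINGS = {"True", "TRUE", "true"}
FALSE_STRINGS = {"False", "FALSE", "false"}


def parse_boolean_strings(
    strings: Sequence[Optional[str]],
) -> Tuple[List[bool], List[bool]]:
    """Hypothetical helper: convert strings to (values, mask).

    mask[i] is True where the entry is missing; values[i] is then
    only a placeholder and carries no meaning.
    """
    values: List[bool] = []
    mask: List[bool] = []
    for s in strings:
        if s is None:  # assumption: missing entries arrive as None
            values.append(False)  # placeholder under the mask
            mask.append(True)
        elif s in TRUE_STRINGS:
            values.append(True)
            mask.append(False)
        elif s in FALSE_STRINGS:
            values.append(False)
            mask.append(False)
        else:
            raise ValueError(f"{s!r} cannot be cast to bool")
    return values, mask


values, mask = parse_boolean_strings(["True", "False", None])
print(values)  # [True, False, False]
print(mask)    # [False, False, True]
```

Keeping the mask separate from the values is what lets a missing entry stay distinct from False.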

@sach99in bool and boolean are different dtypes.
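
To illustrate the distinction, here is a small sketch assuming pandas 1.0 or later: the NumPy-backed `bool` dtype has no way to hold a missing value, while the nullable `boolean` extension dtype stores it as `pd.NA`. The variable names are illustrative only.

```python
import pandas as pd

# NumPy-backed "bool": every element is True or False.
plain = pd.Series([True, False])
print(plain.dtype)  # bool

# With a missing value and no explicit dtype, pandas falls back to object.
fallback = pd.Series([True, False, None])
print(fallback.dtype)  # object

# Nullable "boolean" extension dtype: the missing value is kept as pd.NA.
nullable = pd.Series([True, False, None], dtype="boolean")
print(nullable.dtype)  # boolean
print(nullable.isna().tolist())  # [False, False, True]
```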

koizumihiroo changed the title ~~pd.read_csv cannot recognize NA for dtype="boolean" in 1.0.0rc0~~ pd.read_csv cannot recognize NA for dtype='boolean' in 1.0.0rc0 Jan 19, 2020

TomAugspurger changed the title ~~pd.read_csv cannot recognize NA for dtype='boolean' in 1.0.0rc0~~ Implement CSV reading for BooleanArray Jan 19, 2020

TomAugspurger added this to the Contributions Welcome milestone Jan 19, 2020

TomAugspurger added the ExtensionArray (Extending pandas with custom dtypes or arrays) and IO CSV (read_csv, to_csv) labels Jan 19, 2020

jorisvandenbossche changed the title ~~Implement CSV reading for BooleanArray~~ ENH: Implement CSV reading for BooleanArray Jan 19, 2020

dsaxton mentioned this issue Jan 20, 2020

ENH: Implement _from_sequence_of_strings for BooleanArray #31159

Merged

4 tasks

jreback modified the milestones: Contributions Welcome, 1.0.0 Jan 20, 2020

TomAugspurger closed this as completed in #31159 Jan 23, 2020
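
For readers who want to confirm the behaviour on their own install, a minimal check, assuming a pandas version that includes #31159 (the issue was moved to the 1.0.0 milestone). It reuses the sample data from the report and inspects only the X3 column, since the display of the other columns can vary between versions.

```python
import io

import pandas as pd

txt = """
X1,X2,X3
1,a,True
2,b,False
NA,NA,NA
"""

# Request the nullable boolean dtype for X3 only.
df = pd.read_csv(io.StringIO(txt), dtype={"X3": "boolean"})

print(df["X3"].dtype)            # boolean
print(df["X3"].isna().tolist())  # [False, False, True]
```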
