BUG: Reading fails when `dtype` is defined with `bool[pyarrow]` #53390

LeonLuttenberger · 2023-05-25T17:17:52Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({"col": [True, False, True]})
df.to_csv("example.csv", index=False)

df2 = pd.read_csv("example.csv", dtype={"col": "bool[pyarrow]"})

Issue Description

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/io/parsers/readers.py:912, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
    899 kwds_defaults = _refine_defaults_read(
    900     dialect,
    901     delimiter,
   (...)
    908     dtype_backend=dtype_backend,
    909 )
    910 kwds.update(kwds_defaults)
--> 912 return _read(filepath_or_buffer, kwds)

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/io/parsers/readers.py:583, in _read(filepath_or_buffer, kwds)
    580     return parser
    582 with parser:
--> 583     return parser.read(nrows)

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1704, in TextFileReader.read(self, nrows)
   1697 nrows = validate_integer("nrows", nrows)
   1698 try:
   1699     # error: "ParserBase" has no attribute "read"
   1700     (
   1701         index,
   1702         columns,
   1703         col_dict,
-> 1704     ) = self._engine.read(  # type: ignore[attr-defined]
   1705         nrows
   1706     )
   1707 except Exception:
   1708     self.close()

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py:234, in CParserWrapper.read(self, nrows)
    232 try:
    233     if self.low_memory:
--> 234         chunks = self._reader.read_low_memory(nrows)
    235         # destructive to chunks
    236         data = _concatenate_chunks(chunks)

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:812, in pandas._libs.parsers.TextReader.read_low_memory()

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:889, in pandas._libs.parsers.TextReader._read_rows()

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:1034, in pandas._libs.parsers.TextReader._convert_column_data()

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:1073, in pandas._libs.parsers.TextReader._convert_tokens()

File ~/opt/anaconda3/lib/python3.9/site-packages/pandas/_libs/parsers.pyx:1173, in pandas._libs.parsers.TextReader._convert_with_dtype()

TypeError: _from_sequence_of_strings() got an unexpected keyword argument 'true_values'
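
The error message suggests that the parser forwards boolean-parsing options to a constructor that does not accept them. The following is a minimal sketch of that kind of mismatch, not the actual pandas source: `ArrowLikeArray`, `convert_bool_column`, and `try_convert` are hypothetical names, and the keyword-only signature is an assumption made for illustration.

```python
class ArrowLikeArray:
    """Hypothetical stand-in for an extension array type; not pandas code."""

    def __init__(self, values):
        self.values = list(values)

    @classmethod
    def _from_sequence_of_strings(cls, strings, *, dtype=None, copy=False):
        # Assumed signature: only dtype and copy are accepted,
        # so any other keyword argument is rejected by Python itself.
        return cls(strings)


def convert_bool_column(array_type, tokens, true_values, false_values):
    """Hypothetical parser step that forwards boolean-parsing options."""
    return array_type._from_sequence_of_strings(
        tokens,
        dtype="bool",
        true_values=true_values,
        false_values=false_values,
    )


def try_convert():
    """Run the conversion and return the error text, or None on success."""
    try:
        convert_bool_column(
            ArrowLikeArray, ["True", "False", "True"], ["True"], ["False"]
        )
    except TypeError as exc:
        return str(exc)
    return None


error_message = try_convert()
print(error_message)
```

In this sketch the call fails before any parsing happens, because the extra keywords never match the constructor's signature. That is consistent with the traceback ending in `_convert_with_dtype()`.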

Expected Behavior

In the above example, I would expect the CSV data to be loaded into a DataFrame with PyArrow-backed types. This works when the PyArrow-backed type is a string, int, or float. However, it raises a TypeError when bool[pyarrow] is specified.

Installed Versions

INSTALLED VERSIONS

commit : 37ea63d
python : 3.9.13.final.0
python-bits : 64
OS : Darwin
OS-release : 22.4.0
Version : Darwin Kernel Version 22.4.0: Mon Mar 6 21:00:41 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T8103
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.1
numpy : 1.24.2
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.6.1
pip : 23.1.2
Cython : None
pytest : 7.3.1
hypothesis : None
sphinx : 6.1.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : 1.0.3
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.12.0
pandas_datareader: None
bs4 : 4.12.2
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2023.4.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 12.0.0
pyreadstat : None
pyxlsb : None
s3fs : 0.4.2
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

### What changes were proposed in this pull request?
The PR aims to upgrade `pandas` from 2.0.2 to 2.0.3.

### Why are the changes needed?
1. The new version brings some bug fixes, e.g.:
- Bug in DataFrame.convert_dtype() and Series.convert_dtype() when trying to convert [ArrowDtype](https://pandas.pydata.org/docs/reference/api/pandas.ArrowDtype.html#pandas.ArrowDtype) with dtype_backend="nullable_numpy" ([GH53648](pandas-dev/pandas#53648))
- Bug in [read_csv()](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv) when defining dtype with bool[pyarrow] for the "c" and "python" engines ([GH53390](pandas-dev/pandas#53390))
2. Release notes: https://pandas.pydata.org/docs/whatsnew/v2.0.3.html

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes #41812 from panbingkun/SPARK-44267.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>

LeonLuttenberger added the Bug and Needs Triage labels May 25, 2023

mroeschke mentioned this issue May 25, 2023

BUG: read_csv with dtype=bool[pyarrow] #53391

Merged

lithomas1 added the IO CSV and Arrow labels and removed the Needs Triage label May 26, 2023

lithomas1 added this to the 2.0.2 milestone May 26, 2023

datapythonista modified the milestones: 2.0.2, 2.0.3 May 26, 2023

mroeschke closed this as completed in #53391 May 31, 2023

panbingkun mentioned this issue Jul 1, 2023

[SPARK-44267][PS][INFRA] Upgrade pandas to 2.0.3 apache/spark#41812

Closed
