BUG: pd.read_csv does not work with nullable_dtype coercion #52594

MCRE-BE · 2023-04-11T14:30:46Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from io import StringIO
from textwrap import dedent

import pandas as pd

csv = dedent("""
    Site;Weight MD;CodeWeight;MAX SM;Pallet Equ.;Crate Equ.
    BW08;0,24;2;999;0,03125;0,14286
    BW08;0,24;2;999;0,03125;0,14286
    BW08;0,24;2;999;0,03125;0,14286
    BW01;0;0;999;0,00625;1
""")[1:]
csv_param = {
    'decimal' : ',',
    'sep' : ';',
    'encoding' : "utf-8"
}

# Examples that fail
layout = {
    'Site': 'string',
    'Weight MD': "Float64",
    'CodeWeight': "UInt8",
    'MAX SM': "Float64",
}
pd.read_csv(StringIO(csv), usecols=layout.keys(), dtype=layout, **csv_param)

layout = {
    'Site': 'string',
    'Weight MD': pd.Float64Dtype,
    'CodeWeight': pd.UInt8Dtype,
    'MAX SM': pd.Int64Dtype,
}
pd.read_csv(StringIO(csv), usecols=layout.keys(), dtype=layout, **csv_param)

layout = {
    'Site': 'string[pyarrow]',
    'Weight MD': 'double[pyarrow]',
    'CodeWeight': 'int64[pyarrow]',
    'MAX SM': 'int64[pyarrow]',
}
pd.read_csv(StringIO(csv), usecols=layout.keys(), dtype=layout, **csv_param)

Error thrown:

ValueError: Unable to parse string "0,24" at position 0

But the read succeeds when no explicit dtype is passed and dtype_backend is used instead:

data = pd.read_csv(StringIO(csv), usecols=layout.keys(), dtype_backend="numpy_nullable", **csv_param)
data.dtypes

Site          string[pyarrow]
Weight MD             Float64
CodeWeight              Int64
MAX SM                  Int64
dtype: object

Issue Description

I think this is related to #49146: dtype coercion does not work with nullable dtypes.

Expected Behavior

read_csv should be able to read the columns with the requested dtypes.

Installed Versions

INSTALLED VERSIONS

commit : 478d340
python : 3.10.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : AMD64 Family 23 Model 24 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_Netherlands.1252

pandas : 2.0.0
numpy : 1.23.5
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.6.1
pip : 23.0.1
Cython : None
pytest : 7.2.2
hypothesis : None
sphinx : 6.1.3
blosc : None
feather : None
xlsxwriter : 3.0.9
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.12.0
pandas_datareader: 0.10.0
bs4 : 4.12.1
bottleneck : None
brotli :
fastparquet : None
fsspec : 2023.3.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.0
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : 1.0.10
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
zstandard : 0.19.0
tzdata : 2023.3
qtpy : 2.3.1
pyqt5 : None


phofl · 2023-04-11T21:28:53Z

Hi, thanks for your report. I think we have an open issue about this (this looks familiar to me). Could you double check the issue tracker?

MCRE-BE · 2023-04-12T04:41:11Z

Hey

Searching for "pd.read_csv", I went back to October 2022 without finding any mention of bugs related to dtype coercion.

What seems similar:
#52086 : this one is the closest (but it has a horrible title). It seems that @hxy450 is working on it? But the bug has been open for more than 3 weeks.
#49146 : this mentions a solved bug, but it seems the fix does not work (?)

What I found that could be related:
#52301 : Likely it's a bug with how we pass some kwargs to an engine
#52266 : Seems closely related, as I don't understand why the comma is not being recognized with nullable dtypes

If #52086 is the same, we can close either this one or the other (sorry for the duplicate in that case, but I didn't open every single bug to read its content 😄)

MCRE-BE · 2023-07-27T09:38:36Z

The bug is still present in 2.0.3.

MCRE-BE · 2024-08-28T05:57:48Z

Bug is still present in 2.2.2

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.12.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : AMD64 Family 23 Model 24 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Dutch_Belgium.1252

pandas : 2.2.2
numpy : 2.0.1
pytz : 2024.1
dateutil : 2.9.0
setuptools : 72.2.0
pip : 24.2
Cython : None
pytest : 8.3.2
hypothesis : None
sphinx : 7.4.7
blosc : None
feather : None
xlsxwriter : 3.2.0
lxml.etree : 5.3.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.21.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.4.0
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.6.1
gcsfs : None
matplotlib : 3.9.1
numba : 0.60.0
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
pyarrow : 17.0.0
pyreadstat : None
python-calamine : None
pyxlsb : 1.0.10
s3fs : None
scipy : 1.14.1
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
zstandard : 0.23.0
tzdata : 2024.1
qtpy : 2.4.1
pyqt5 : None

aureliobarbosa · 2024-09-12T13:20:54Z

I investigated this issue further on main (commit 16b7288).

If you specify the engine explicitly, you find that the c engine cannot parse the data when decimal is anything other than . and dtype=pd.Float64Dtype at the same time (dtype="Float64" gives the same result).
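A minimal sketch of how the engine comparison described above could be reproduced. The two-column CSV and the helper name probe are made up for illustration; the snippet only reports what each combination does on the installed pandas version and does not assume a particular outcome for the c engine with "Float64".

```python
from io import StringIO

import pandas as pd

# Hypothetical two-column sample using ";" as separator and "," as decimal mark.
CSV_TEXT = "a;b\n0,24;1\n0,5;2\n"


def probe(engine: str, dtype: str) -> str:
    """Return the resulting dtype of column "a", or the ValueError message."""
    try:
        frame = pd.read_csv(
            StringIO(CSV_TEXT),
            sep=";",
            decimal=",",
            dtype={"a": dtype},
            engine=engine,
        )
    except ValueError as exc:
        return f"ValueError: {exc}"
    return str(frame["a"].dtype)


# Compare the NumPy float dtype with the nullable one on both engines.
for engine in ("c", "python"):
    for dtype in ("float64", "Float64"):
        print(engine, dtype, "->", probe(engine, dtype))
```

On an affected version, the c / Float64 combination is the one expected to print the "Unable to parse string" error from the report.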

Regarding the test suite, there is a single test function that tests the c engine with the 'delimiter' parameter, and it can be run in a development environment with

> pytest pandas/tests/io/parser/test_c_parser_only.py::test_1000_sep_with_decimal

Within this test, adding the argument dtype=pd.Float64Dtype to the read_csv() call produces exactly the same error. In my opinion this issue and #52086 are the same, while the other issues @MCRE-BE mentioned relate to different problems in the code. #52086 provides a better minimal reproducible example.

If relevant, I can open a PR with a test for this issue (marked as failing).

Anyway, one workaround is to use any engine besides c.
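A sketch of that workaround, assuming the python engine is acceptable for the file size at hand (it is generally slower than c). The shortened CSV below is adapted from the reproducible example in the report; only engine="python" is added to the original call.

```python
from io import StringIO

import pandas as pd

# Shortened version of the CSV from the report.
csv_text = "Site;Weight MD;CodeWeight\nBW08;0,24;2\nBW01;0;0\n"

layout = {
    "Site": "string",
    "Weight MD": "Float64",
    "CodeWeight": "UInt8",
}

# Same call as in the report, but with the python engine selected explicitly.
data = pd.read_csv(
    StringIO(csv_text),
    sep=";",
    decimal=",",
    usecols=list(layout),
    dtype=layout,
    engine="python",
)
print(data.dtypes)
```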

MCRE-BE added the Bug and Needs Triage labels on Apr 11, 2023

phofl added the IO CSV label and removed the Needs Triage label on Apr 11, 2023
