
BUG: pd.read_csv does not work with nullable_dtype coercion #52594


Open
2 of 3 tasks
MCRE-BE opened this issue Apr 11, 2023 · 5 comments
Labels
Bug IO CSV read_csv, to_csv

Comments


MCRE-BE commented Apr 11, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from io import StringIO
from textwrap import dedent

import pandas as pd

# Semicolon-separated data that uses a comma as the decimal separator
csv = dedent("""
    Site;Weight MD;CodeWeight;MAX SM;Pallet Equ.;Crate Equ.
    BW08;0,24;2;999;0,03125;0,14286
    BW08;0,24;2;999;0,03125;0,14286
    BW08;0,24;2;999;0,03125;0,14286
    BW01;0;0;999;0,00625;1
""")[1:]
csv_param = {
    'decimal': ',',
    'sep': ';',
    'encoding': "utf-8",
}

# Examples that fail

# 1. Nullable dtypes given as strings
layout = {
    'Site': 'string',
    'Weight MD': "Float64",
    'CodeWeight': "UInt8",
    'MAX SM': "Float64",
}
pd.read_csv(StringIO(csv), usecols=layout.keys(), dtype=layout, **csv_param)

# 2. Nullable dtypes given as pandas dtype classes
layout = {
    'Site': 'string',
    'Weight MD': pd.Float64Dtype,
    'CodeWeight': pd.UInt8Dtype,
    'MAX SM': pd.Int64Dtype,
}
pd.read_csv(StringIO(csv), usecols=layout.keys(), dtype=layout, **csv_param)

# 3. pyarrow-backed dtypes given as strings
layout = {
    'Site': 'string[pyarrow]',
    'Weight MD': 'double[pyarrow]',
    'CodeWeight': 'int64[pyarrow]',
    'MAX SM': 'int64[pyarrow]',
}
pd.read_csv(StringIO(csv), usecols=layout.keys(), dtype=layout, **csv_param)

Error thrown:

ValueError: Unable to parse string "0,24" at position 0

But the same data can be read when the dtypes are inferred through dtype_backend instead:

data = pd.read_csv(StringIO(csv), usecols=layout.keys(), dtype_backend="numpy_nullable", **csv_param)
data.dtypes

Site          string[pyarrow]
Weight MD             Float64
CodeWeight              Int64
MAX SM                  Int64
dtype: object
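
As a possible stopgap built on the read that works, here is a minimal sketch: let read_csv infer nullable dtypes with dtype_backend="numpy_nullable", then cast the columns whose requested dtype differs from the inferred one. The shortened CSV and the names CSV_TEXT, inferred and requested are illustrative assumptions, and this approach was not proposed in the thread itself.

```python
from io import StringIO

import pandas as pd

# Shortened, illustrative version of the data in the report
CSV_TEXT = (
    "Site;Weight MD;CodeWeight;MAX SM\n"
    "BW08;0,24;2;999\n"
    "BW01;0;0;999\n"
)

# Step 1: let pandas infer nullable dtypes instead of passing dtype=...
inferred = pd.read_csv(
    StringIO(CSV_TEXT),
    sep=";",
    decimal=",",
    dtype_backend="numpy_nullable",
)

# Step 2: cast only the columns whose inferred dtype is not the requested one
requested = {"CodeWeight": "UInt8", "MAX SM": "Float64"}
data = inferred.astype(requested)

print(data.dtypes)
```

Inference decides the intermediate dtype here, so a column such as MAX SM is first read as an integer column and only becomes Float64 in the cast.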

Issue Description

I think this is related to #49146: when a nullable dtype is requested explicitly, read_csv does not apply the decimal=',' setting and fails to parse values such as "0,24".

Expected Behavior

read_csv should be able to read the columns with the requested dtypes.

Installed Versions

INSTALLED VERSIONS

commit : 478d340
python : 3.10.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : AMD64 Family 23 Model 24 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_Netherlands.1252

pandas : 2.0.0
numpy : 1.23.5
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.6.1
pip : 23.0.1
Cython : None
pytest : 7.2.2
hypothesis : None
sphinx : 6.1.3
blosc : None
feather : None
xlsxwriter : 3.0.9
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.12.0
pandas_datareader: 0.10.0
bs4 : 4.12.1
bottleneck : None
brotli :
fastparquet : None
fsspec : 2023.3.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.0
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : 1.0.10
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
zstandard : 0.19.0
tzdata : 2023.3
qtpy : 2.3.1
pyqt5 : None

MCRE-BE added the Bug and Needs Triage labels Apr 11, 2023
phofl (Member) commented Apr 11, 2023

Hi, thanks for your report. I think we have an open issue about this (this looks familiar to me). Could you double check the issue tracker?

phofl added the IO CSV label and removed the Needs Triage label Apr 11, 2023
MCRE-BE (Author) commented Apr 12, 2023

Hey,

Searching for "pd.read_csv", I went back to October 2022 without finding any mention of bugs related to dtype coercion.

What seems similar:
#52086: this one is the closest (but with a horrible bug name). It seems that @hxy450 is working on it? But the bug has been open for more than 3 weeks.
#49146: this mentions a solved bug, but it seems the fix does not work (?)

What I can find that could be related:
#52301: likely a bug in how we pass some kwargs to an engine
#52266: seems closely related, as I don't understand why the comma is not being recognized with nullable dtypes

If #52086 is the same, we can close either this one or the other (sorry for the duplicate in that case, but I didn't open every single bug to read its content 😄)

MCRE-BE (Author) commented Jul 27, 2023

The bug is still present in 2.0.3.

MCRE-BE (Author) commented Aug 28, 2024

The bug is still present in 2.2.2.

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.12.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : AMD64 Family 23 Model 24 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Dutch_Belgium.1252

pandas : 2.2.2
numpy : 2.0.1
pytz : 2024.1
dateutil : 2.9.0
setuptools : 72.2.0
pip : 24.2
Cython : None
pytest : 8.3.2
hypothesis : None
sphinx : 7.4.7
blosc : None
feather : None
xlsxwriter : 3.2.0
lxml.etree : 5.3.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.21.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.4.0
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.6.1
gcsfs : None
matplotlib : 3.9.1
numba : 0.60.0
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
pyarrow : 17.0.0
pyreadstat : None
python-calamine : None
pyxlsb : 1.0.10
s3fs : None
scipy : 1.14.1
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
zstandard : 0.23.0
tzdata : 2024.1
qtpy : 2.4.1
pyqt5 : None

aureliobarbosa (Contributor) commented Sep 12, 2024

I investigated this issue further on main (commit 16b7288).

If you specify the engine explicitly, you can see that the c engine fails to parse the data when decimal is anything other than "." and dtype=pd.Float64Dtype is passed at the same time (dtype="Float64" gives the same result).
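
A minimal sketch of that comparison, assuming a shortened two-row CSV and a hypothetical helper named try_engine (neither comes from the pandas code base). It only reports what each engine does on the installed pandas version, so the outcome for the c engine may differ between releases.

```python
from io import StringIO

import pandas as pd

# Shortened, illustrative data with a comma as the decimal separator
CSV_TEXT = "Site;Weight MD\nBW08;0,24\nBW01;0\n"


def try_engine(engine):
    """Read CSV_TEXT with the given engine and describe the outcome."""
    try:
        df = pd.read_csv(
            StringIO(CSV_TEXT),
            sep=";",
            decimal=",",
            dtype={"Weight MD": "Float64"},
            engine=engine,
        )
    except Exception as exc:  # broad on purpose: this is only a diagnostic
        return f"failed: {exc}"
    return f"ok: {df['Weight MD'].dtype}"


# Run the same read once per engine and collect the outcomes
results = {engine: try_engine(engine) for engine in ("c", "python")}
for engine, outcome in results.items():
    print(f"{engine:>6}: {outcome}")
```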

Regarding the test suite, there is a single test function that exercises the c engine with the 'delimiter' parameter, and it can be run in a development environment with

> pytest pandas/tests/io/parser/test_c_parser_only.py::test_1000_sep_with_decimal

Within this test, adding the argument dtype=pd.Float64Dtype to the read_csv() call produces exactly the same error. In my opinion this issue and #52086 are the same, while the other issues @MCRE-BE mentioned relate to different problems in the code. #52086 provides a better minimal reproducible example.

If relevant, I can open a PR with a test for this issue (marked as failing).

Anyway, one workaround is to use any engine besides c.
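
A minimal sketch of that workaround, assuming a shortened version of the reported data and the first layout from the report. The name CSV_TEXT is illustrative, and engine="python" is just one of the non-c engines that could be tried.

```python
from io import StringIO

import pandas as pd

# Shortened, illustrative version of the data in the report
CSV_TEXT = (
    "Site;Weight MD;CodeWeight;MAX SM\n"
    "BW08;0,24;2;999\n"
    "BW01;0;0;999\n"
)

# Requested nullable dtypes, as in the first failing example
layout = {
    "Site": "string",
    "Weight MD": "Float64",
    "CodeWeight": "UInt8",
    "MAX SM": "Float64",
}

# Same call as in the report, but with the python engine selected explicitly
df = pd.read_csv(
    StringIO(CSV_TEXT),
    sep=";",
    decimal=",",
    dtype=layout,
    engine="python",
)

print(df.dtypes)
```

The python engine is generally slower than the c engine, so this trade-off matters mostly for large files.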


3 participants