Usefulness of error_bad_lines when dtypes are defined #20573

lgmoneda · 2018-04-01T21:06:59Z

Code Sample, a copy-pastable example if possible

import pandas as pd
from numpy import dtype

# Create the sample data; the row with ID 1 has a stray comma inside X2
with open('data.csv', 'w') as file:
    file.write("ID,X1,X2,X3\n")
    file.write("0,1,Amigo,3\n")
    file.write("1,1,Inimigo, amor,9\n")
    file.write("2,1,Cowboy,42\n")

dtypes = {"ID": dtype("int64"),
         "X1": dtype("int64"),
         "X2": dtype("O"),
         "X3": dtype("int64")}

print("Load df with no params: ", end="")
try:
    df = pd.read_csv("data.csv")
    print("Success")
except Exception:
    print("Fail")

print("Load df with error bad lines: ", end="")
try:
    df = pd.read_csv("data.csv", error_bad_lines=False)
    print("Success")
except Exception:
    print("Fail")

print("Load df with error bad lines and dtypes: ", end="")
try:
    df = pd.read_csv("data.csv", error_bad_lines=False, dtype=dtypes)
    print("Success")
except Exception:
    print("Fail")

Problem description

error_bad_lines is very useful for dealing with stray commas inside the data, which split a single column into two. But when dtype is defined, the type of each column seems to be checked before the problematic row is skipped, so the load fails instead of the row being dropped.

I'd argue that the row should be skipped before the dtype is checked, because a problematic row has its values shifted into the wrong columns, so its dtypes are bound to be wrong.
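
To illustrate the ordering I have in mind, here is a minimal sketch that uses only the standard library csv module. It is not how pandas is implemented; the function name read_skipping_bad_rows and the CONVERTERS mapping are hypothetical stand-ins for read_csv and the dtype argument. Rows with the wrong number of fields are dropped first, and only the surviving rows are converted.

```python
import csv
import io

# Same sample data as in the report; the row with ID 1 has an extra comma.
RAW = "ID,X1,X2,X3\n0,1,Amigo,3\n1,1,Inimigo, amor,9\n2,1,Cowboy,42\n"

# Hypothetical per-column converters standing in for the dtype mapping.
CONVERTERS = {"ID": int, "X1": int, "X2": str, "X3": int}


def read_skipping_bad_rows(text, converters):
    """Drop rows with the wrong field count, then convert the survivors."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows, skipped = [], []
    for fields in reader:
        if len(fields) != len(header):
            # Record the input line number and the field count, then move on
            # without attempting any type conversion on the malformed row.
            skipped.append((reader.line_num, len(fields)))
            continue
        rows.append(
            {name: converters[name](value) for name, value in zip(header, fields)}
        )
    return rows, skipped


rows, skipped = read_skipping_bad_rows(RAW, CONVERTERS)
print(rows)
print(skipped)
```

With this ordering the int converter for X3 never sees the text " amor", so the bad row cannot cause a conversion error.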

Expected Output

It should skip the problematic row even when dtype is passed as param.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 17.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_BR.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: 3.0.5
pip: 9.0.1
setuptools: 36.5.0
Cython: None
numpy: 1.14.2
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.14
pymysql: None
psycopg2: 2.7.3.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: 0.1.4
fastparquet: None
pandas_gbq: None
pandas_datareader: None

update: I've updated my pandas version.

WillAyd · 2018-04-01T21:32:32Z

When submitting a bug report, please try to make the reproducible example as minimal as possible. FWIW I've done that below, but I am not getting any errors on the master branch. I see you are using pandas 0.20.3 - have you at least tried the latest release, 0.22?

>>> import io
>>> import numpy as np
>>> import pandas as pd
>>> buf = io.StringIO("ID,X1,X2,X3\n0,1,Amigo,3\n1,1,Inimigo, amor,9\n2,1,Cowboy,42\n")
>>> pd.read_csv(buf, error_bad_lines=False)
b'Skipping line 3: expected 4 fields, saw 5\n'
   ID  X1      X2  X3
0   0   1   Amigo   3
1   2   1  Cowboy  42

>>> buf.seek(0)
>>> dtypes = {"X3": np.intp}
>>> pd.read_csv(buf, error_bad_lines=False, dtype=dtypes)
b'Skipping line 3: expected 4 fields, saw 5\n'
   ID  X1      X2  X3
0   0   1   Amigo   3
1   2   1  Cowboy  42

INSTALLED VERSIONS

commit: a1c5e51
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0.dev0+725.ga1c5e5129
pytest: 3.4.1
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.27.3
numpy: 1.14.1
scipy: 1.0.0
pyarrow: 0.8.0
xarray: 0.10.0
IPython: 6.2.1
sphinx: 1.7.0
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.1.2
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: 0.1.3
fastparquet: 0.1.4
pandas_gbq: None
pandas_datareader: None

lgmoneda · 2018-04-01T22:05:12Z

In fact, I had also tested it with 0.21.1 before. But it does indeed work as expected in 0.22.0.

Thanks

lgmoneda closed this as completed Apr 1, 2018

weichslgartner mentioned this issue Jul 31, 2018

error_bad_lines is ignored if names argument is used in read_csv function #22144

Closed

gfyoung added the Error Reporting (Incorrect or improved errors from pandas) label Aug 3, 2018

gfyoung added this to the No action milestone Aug 3, 2018
