Usefulness of error_bad_lines when dtypes are defined #20573

Closed
lgmoneda opened this issue Apr 1, 2018 · 2 comments
Labels
Error Reporting Incorrect or improved errors from pandas

Comments

@lgmoneda

lgmoneda commented Apr 1, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd
from numpy import dtype

# Create the sample data
with open('data.csv', 'w') as file:
    file.write("ID,X1,X2,X3\n")
    file.write("0,1,Amigo,3\n")
    # The stray comma in "Inimigo, amor" gives this row 5 fields instead of 4
    file.write("1,1,Inimigo, amor,9\n")
    file.write("2,1,Cowboy,42\n")

dtypes = {"ID": dtype("int64"),
         "X1": dtype("int64"),
         "X2": dtype("O"),
         "X3": dtype("int64")}

print("Load df with no params: ", end="")
try:
    df = pd.read_csv("data.csv")
    print("Success")
except Exception:
    print("Fail")

print("Load df with error bad lines: ", end="")
try:
    df = pd.read_csv("data.csv", error_bad_lines=False)
    print("Success")
except Exception:
    print("Fail")

print("Load df with error bad lines and dtypes: ", end="")
try:
    df = pd.read_csv("data.csv", error_bad_lines=False, dtype=dtypes)
    print("Success")
except Exception:
    print("Fail")

Problem description

The problem is that error_bad_lines is very useful for dealing with stray commas inside the data that split a single column into two. But when dtype is defined, pandas checks the type of each column before skipping the problematic row, so the shifted values do not match the declared dtypes and the read fails.

I'd argue that the row should be skipped before the dtype is checked, because a problematic row has its values shifted into the wrong columns, so its dtypes are bound to be wrong.

Expected Output

It should skip the problematic row even when dtype is passed as param.
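
To make the requested ordering concrete, here is a minimal sketch that uses only the standard library. It is not pandas' parser, and `read_skipping_bad_rows` and `CONVERTERS` are hypothetical names made up for illustration. It drops any row whose field count differs from the header first, and only then applies the per-column type conversion.

```python
import csv
import io

# Same sample data as in the report; the row with ID 1 has a stray comma,
# so it splits into 5 fields instead of 4.
SAMPLE = "ID,X1,X2,X3\n0,1,Amigo,3\n1,1,Inimigo, amor,9\n2,1,Cowboy,42\n"

# Hypothetical per-column converters standing in for the dtype mapping.
CONVERTERS = {"ID": int, "X1": int, "X2": str, "X3": int}


def read_skipping_bad_rows(text, converters):
    """Drop rows whose field count differs from the header, then convert types."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows, skipped = [], []
    # The header is line 1, so data rows start at line 2.
    for line_no, fields in enumerate(reader, start=2):
        if len(fields) != len(header):
            # Skip first: a malformed row never reaches the type conversion.
            skipped.append(line_no)
            continue
        rows.append(
            {name: converters[name](value) for name, value in zip(header, fields)}
        )
    return rows, skipped


rows, skipped = read_skipping_bad_rows(SAMPLE, CONVERTERS)
print(rows)
print(skipped)
```

With this ordering the malformed line is reported as skipped and the remaining rows convert cleanly, which is the behaviour I would expect from error_bad_lines=False combined with dtype.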

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 17.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_BR.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: 3.0.5
pip: 9.0.1
setuptools: 36.5.0
Cython: None
numpy: 1.14.2
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.14
pymysql: None
psycopg2: 2.7.3.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: 0.1.4
fastparquet: None
pandas_gbq: None
pandas_datareader: None

update: I've updated my pandas version.

@WillAyd
Member

WillAyd commented Apr 1, 2018

When submitting a bug report, please try to make the reproducible example as minimal as possible. FWIW I've done that below, but I am not getting any errors on the master branch. I see you are using pandas 0.20.3; have you at least tried the latest release, 0.22?

>>> import io
>>> import numpy as np
>>> import pandas as pd
>>> buf = io.StringIO("ID,X1,X2,X3\n0,1,Amigo,3\n1,1,Inimigo, amor,9\n2,1,Cowboy,42\n")
>>> pd.read_csv(buf, error_bad_lines=False)
b'Skipping line 3: expected 4 fields, saw 5\n'
   ID  X1      X2  X3
0   0   1   Amigo   3
1   2   1  Cowboy  42

>>> buf.seek(0)
>>> dtypes = {"X3": np.intp}
>>> pd.read_csv(buf, error_bad_lines=False, dtype=dtypes)
b'Skipping line 3: expected 4 fields, saw 5\n'
   ID  X1      X2  X3
0   0   1   Amigo   3
1   2   1  Cowboy  42

INSTALLED VERSIONS

commit: a1c5e51
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0.dev0+725.ga1c5e5129
pytest: 3.4.1
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.27.3
numpy: 1.14.1
scipy: 1.0.0
pyarrow: 0.8.0
xarray: 0.10.0
IPython: 6.2.1
sphinx: 1.7.0
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.1.2
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: 0.1.3
fastparquet: 0.1.4
pandas_gbq: None
pandas_datareader: None

@lgmoneda
Author

lgmoneda commented Apr 1, 2018

In fact, I had also tested it with 0.21.1 before. But it does work as expected in 0.22.0.

Thanks

@lgmoneda lgmoneda closed this as completed Apr 1, 2018
@gfyoung gfyoung added the Error Reporting Incorrect or improved errors from pandas label Aug 3, 2018
@gfyoung gfyoung added this to the No action milestone Aug 3, 2018
3 participants