error_bad_lines is ignored if names argument is used in read_csv function #22144

weichslgartner · 2018-07-31T08:36:21Z

Code Sample, a copy-pastable example if possible

# Example taken from #20573
import io
import pandas as pd

# The second line has 5 fields instead of 4, so it should be skipped.
buf = io.StringIO("0,1,Amigo,3\n1,1,Inimigo,amigo,9\n2,1,Cowboy,42\n")
names = ['ID', 'X1', 'X2', 'X3']
dtypes = {"X3": int}
pd.read_csv(buf, names=names, error_bad_lines=False, dtype=dtypes, header=None)

Problem description

The bad-lines option (error_bad_lines=False) is ignored when the names argument is used.
When names is omitted, everything works fine with pandas 0.23.3 (see issue #20573), but when names is passed, a ValueError is raised (ValueError: invalid literal for int() with base 10: 'amigo').

Expected Output

b'Skipping line 3: expected 4 fields, saw 5\n'

ID	X1	X2	X3
0	1	Amigo	3
2	1	Cowboy	42
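
The expected behaviour can be illustrated without pandas. The following is only a minimal sketch built on the standard `csv` module; `read_rows_skipping_bad` is a hypothetical helper name, not a pandas function, and it assumes that "bad" simply means "field count differs from `len(names)`".

```python
import csv
import io


def read_rows_skipping_bad(buf, names):
    """Return (good_rows, warnings), skipping rows with a wrong field count.

    Hypothetical helper for illustration only; this is not pandas' parser.
    """
    good_rows, warnings = [], []
    for line_no, fields in enumerate(csv.reader(buf), start=1):
        if len(fields) != len(names):
            # Report the bad line and move on instead of raising.
            warnings.append(
                f"Skipping line {line_no}: expected {len(names)} fields, "
                f"saw {len(fields)}"
            )
            continue
        good_rows.append(dict(zip(names, fields)))
    return good_rows, warnings


buf = io.StringIO("0,1,Amigo,3\n1,1,Inimigo,amigo,9\n2,1,Cowboy,42\n")
names = ["ID", "X1", "X2", "X3"]
rows, warnings = read_rows_skipping_bad(buf, names)
print(warnings)
print([row["X3"] for row in rows])
```

In this sketch the malformed row is reported as line 2, because the buffer has no header line, and only the rows for Amigo and Cowboy are kept.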

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.3
pytest: 3.4.2
pip: 9.0.2
setuptools: 38.5.2
Cython: 0.27.3
numpy: 1.13.3
scipy: 1.0.0
pyarrow: 0.8.0
xarray: None
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.0
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.5
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.4
pandas_gbq: None
pandas_datareader: None


WillAyd · 2018-08-01T05:03:52Z

Thanks for the report - investigation and PRs are always welcome!

louis-red · 2018-08-02T08:07:53Z

I'll try to have a look at this one!

louis-red · 2018-08-09T22:20:46Z

This error is giving me headaches ^^
But here is what I have gathered so far:

- It works fine with engine='python'.
- Actually, your example fails even with engine='python', but only because the malformed line is the second line: in that case, the second line is taken as indicative of the number of columns plus an "implicit index", and it is never skipped.

So for this case, our options are:

- deactivate the "implicit index" lookup;
- better document the "implicit index" lookup, so that users pay special attention to the first rows;
- extend the "implicit index" lookup to more than one line.

The error persists with engine='c', though.

WillAyd · 2018-08-09T22:27:38Z

@louis-red thanks for the summary as it is very much appreciated! I'm not sure I follow what you are referring to with the "implicit index" though - any chance you can link to the code base where you were seeing the error?

louis-red · 2018-08-13T18:57:57Z

Of course @WillAyd, sorry for the delay.

Where the "implicit index" detection is defined and performed:

pandas/pandas/io/parsers.py, line 2833 in eb0ac54:

1) Look for implicit index: there are more columns

Where it is used to suppress the error:

pandas/pandas/io/parsers.py, line 2892 in eb0ac54:

if self._implicit_index:

We get unexpected results when specifying index_col=False, because it disables the detection of overly long lines altogether:

pandas/pandas/io/parsers.py, line 2901 in eb0ac54:

self.index_col is not False and

Summary of the outputs for each input:

import io
import pandas as pd

names = ['ID', 'X1', 'X2', 'X3']

# Malformed line on line 3
s = """0,1,2,3
1,2,3,4
4,3,2,1,5"""
pd.read_csv(io.StringIO(s), names=names, error_bad_lines=False, header=None,
            engine='python')
# Output: good
#    ID  X1  X2  X3
# 0   0   1   2   3
# 1   1   2   3   4
# Skipping line 3: Expected 4 fields in line 3, saw 5

# Malformed line on line 2
s = """0,1,2,3
1,2,3,4,5
4,3,2,1"""
pd.read_csv(io.StringIO(s), names=names, error_bad_lines=False, header=None,
            engine='python')
# Output: good or bad?
#    ID  X1  X2   X3
# 0   1   2   3  NaN
# 1   2   3   4  5.0
# 4   3   2   1  NaN

# index_col=False and malformed line on line 2 (or 3, it doesn't matter)
s = """0,1,2,3
1,2,3,4,5
4,3,2,1"""
pd.read_csv(io.StringIO(s), names=names, error_bad_lines=False, header=None,
            engine='python',
            index_col=False)
# Output: bad (the malformed line is kept and its extra field is dropped)
#    ID  X1  X2  X3
# 0   0   1   2   3
# 1   1   2   3   4
# 2   4   3   2   1
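
The second case (malformed line on line 2) may be easier to follow with a toy model of the "implicit index" behaviour. The sketch below is not pandas code: `split_implicit_index` is a hypothetical helper, and the rule it applies (take the surplus field count of the second line as the index width, then pad short rows) is only an assumption inferred from the outputs in this comment.

```python
def split_implicit_index(rows, names, probe_line=1):
    """Toy model: treat surplus leading fields as an implicit index.

    Assumption (not pandas' actual code): the implicit index is as wide as
    the number of extra fields on the probed line (the second line here).
    """
    width = max(len(rows[probe_line]) - len(names), 0)
    result = []
    for row in rows:
        index, data = row[:width], row[width:]
        # Pad short rows with None (pandas would display NaN).
        data = data + [None] * (len(names) - len(data))
        result.append((tuple(index), dict(zip(names, data))))
    return result


names = ["ID", "X1", "X2", "X3"]
rows = [[0, 1, 2, 3], [1, 2, 3, 4, 5], [4, 3, 2, 1]]
for index, record in split_implicit_index(rows, names):
    print(index, record)
```

Under that assumption, the first field of every row becomes the index and the two 4-field rows are padded, which matches the "Good or Bad ?" table: the malformed line is the only one that fills all four named columns.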

WillAyd · 2018-08-15T21:26:19Z

@louis-red thanks I think I follow. You can always try removing that conditional to see what breaks; perhaps it is even errant in the first place.

A PR would certainly be welcome if you could get it to pass tests locally

louis-red · 2018-08-15T21:48:33Z

@WillAyd for issues like this that are only tangentially related to an open GitHub issue, what is the best practice for referencing them in a PR? Should I open a new issue clearly outlining the particular problem, or is that not necessary?

WillAyd · 2018-08-15T22:02:07Z

Sorry, I thought you were addressing the issue with the names argument as outlined by the OP. If that's not the case, then yes, open a separate issue.

WillAyd added the Bug and IO CSV (read_csv, to_csv) labels Aug 1, 2018

dargueta mentioned this issue Apr 26, 2019

read_csv() crashes if engine='c', header=None, and 2+ extra columns #26218

Closed

This was referenced Apr 18, 2021

Detect Parsing errors in read_csv first row with index_col=False #40629

Closed

BUG: read_csv not erroring on a bad line with extra columns #40333

Closed

lithomas1 mentioned this issue Jun 2, 2021

QST: Inconsistent behaviour in checking number of fields per row while read_csv() #41754

Closed

phofl mentioned this issue Nov 28, 2021

BUG: read_csv not recognizing bad lines with names given #44646

Merged


jreback added this to the 1.4 milestone Nov 28, 2021

jreback closed this as completed in #44646 Nov 28, 2021

rebecca-palmer mentioned this issue Jan 9, 2023

Skip bad lines in CDEC get_stations; and ghcn_daily test failure fix ulmo-dev/ulmo#214

Merged
