error_bad_lines is ignored if names argument is used in read_csv function #22144

Closed
weichslgartner opened this issue Jul 31, 2018 · 8 comments · Fixed by #44646
Labels
Bug IO CSV read_csv, to_csv

Comments

@weichslgartner

weichslgartner commented Jul 31, 2018

Code Sample, a copy-pastable example if possible

# Example taken from #20573
import io
import pandas as pd

# The second row has 5 fields, while only 4 names are given.
buf = io.StringIO("0,1,Amigo,3\n1,1,Inimigo,amigo,9\n2,1,Cowboy,42\n")
names = ['ID', 'X1', 'X2', 'X3']
dtypes = {"X3": int}
pd.read_csv(buf, names=names, error_bad_lines=False, dtype=dtypes, header=None)

Problem description

The error_bad_lines=False option is ignored when the names argument is used.
Without names, everything works fine with pandas 0.23.3 (see issue #20573), but with names a ValueError is raised (ValueError: invalid literal for int() with base 10: 'amigo').

Expected Output

b'Skipping line 3: expected 4 fields, saw 5\n'

ID X1 X2 X3
0 1 Amigo 3
2 1 Cowboy 42
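
For illustration, the expected skipping behaviour can be sketched with the standard-library csv module alone. This is a minimal sketch and not pandas' implementation: the helper name read_rows_skipping_bad is hypothetical, and it assumes a line counts as "bad" when it has more fields than there are names.

```python
import csv
import io


def read_rows_skipping_bad(buf, names):
    """Return (rows, warnings); rows with more fields than names are skipped.

    Hypothetical helper for illustration only, not part of pandas.
    """
    rows, warnings = [], []
    for line_no, fields in enumerate(csv.reader(buf), start=1):
        if len(fields) > len(names):
            # Record a message instead of raising, then drop the line.
            warnings.append(
                f"Skipping line {line_no}: expected {len(names)} fields, "
                f"saw {len(fields)}"
            )
            continue
        rows.append(dict(zip(names, fields)))
    return rows, warnings


buf = io.StringIO("0,1,Amigo,3\n1,1,Inimigo,amigo,9\n2,1,Cowboy,42\n")
rows, warnings = read_rows_skipping_bad(buf, ["ID", "X1", "X2", "X3"])
```

The sketch counts physical lines starting at 1, so for this header-less input it reports the malformed row as line 2.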

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.3
pytest: 3.4.2
pip: 9.0.2
setuptools: 38.5.2
Cython: 0.27.3
numpy: 1.13.3
scipy: 1.0.0
pyarrow: 0.8.0
xarray: None
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.0
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.5
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.4
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Aug 1, 2018

Thanks for the report - investigation and PRs are always welcome!

WillAyd added the Bug and IO CSV (read_csv, to_csv) labels on Aug 1, 2018

@louis-red
Contributor

I will try to have a look at this one!

@louis-red
Contributor

louis-red commented Aug 9, 2018

This error is giving me headaches ^^
But from what I have gathered so far:

  • it works fine with engine='python'
  • actually, your example fails even with engine='python', but only because the malformed line is the second line: in that case, the second line is taken as indicative of the number of columns plus an "implicit index", and it is not skipped no matter what.

So for this case, our choices are:

  • deactivate the "implicit index" lookup
  • better document the "implicit index" lookup, so that users pay special attention to the first rows
  • extend the "implicit index" lookup to more than one line.

The error persists with engine='c', though.

@WillAyd
Member

WillAyd commented Aug 9, 2018

@louis-red thanks for the summary as it is very much appreciated! I'm not sure I follow what you are referring to with the "implicit index" though - any chance you can link to the code base where you were seeing the error?

@louis-red
Contributor

louis-red commented Aug 13, 2018

Of course @WillAyd, sorry for the delay.

We get unexpected results here when specifying index_col=False, because it completely disables the check for lines that are too long:

self.index_col is not False and

Summary of the outputs for different inputs:

import io
import pandas as pd

names = ['ID', 'X1', 'X2', 'X3']

# Malformed line on line 3
s = """0,1,2,3
1,2,3,4
4,3,2,1,5"""
pd.read_csv(io.StringIO(s), names=names, error_bad_lines=False, header=None,
            engine='python')
# Output: good
#    ID  X1  X2  X3
# 0   0   1   2   3
# 1   1   2   3   4
# Skipping line 3: Expected 4 fields in line 3, saw 5

# Malformed line on line 2
s = """0,1,2,3
1,2,3,4,5
4,3,2,1"""
pd.read_csv(io.StringIO(s), names=names, error_bad_lines=False, header=None,
            engine='python')
# Output: good or bad?
#    ID  X1  X2   X3
# 0   1   2   3  NaN
# 1   2   3   4  5.0
# 4   3   2   1  NaN

# index_col=False and malformed line on line 2 (or 3, it doesn't matter)
s = """0,1,2,3
1,2,3,4,5
4,3,2,1"""
pd.read_csv(io.StringIO(s), names=names, error_bad_lines=False, header=None,
            engine='python',
            index_col=False)
# Output: bad (the extra field is silently dropped instead of the line being skipped)
#    ID  X1  X2  X3
# 0   0   1   2   3
# 1   1   2   3   4
# 2   4   3   2   1
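
The second and third outputs above are consistent with the following simplified model of the "implicit index" lookup. It is a sketch based only on the behaviour described in this thread, not a copy of the pandas parser: implicit_index_cols and split_rows are hypothetical names, and the line that gets probed is passed in explicitly as an assumption.

```python
def implicit_index_cols(probe_fields, names, index_col=None):
    """Count leading columns treated as an implicit index.

    Assumption: extra fields on the probed line are read as index columns,
    unless the caller passed index_col=False.
    """
    if index_col is False:
        return 0
    return max(len(probe_fields) - len(names), 0)


def split_rows(rows, names, n_index):
    """Split each row into (index, values), padding short rows with None
    and truncating long ones to len(names)."""
    out = []
    for fields in rows:
        index = tuple(fields[:n_index])
        values = fields[n_index:n_index + len(names)]
        values += [None] * (len(names) - len(values))
        out.append((index, values))
    return out


names = ["ID", "X1", "X2", "X3"]
rows = [[0, 1, 2, 3], [1, 2, 3, 4, 5], [4, 3, 2, 1]]

# Probing the malformed second line yields one implicit index column,
# so every row gives up its first field to the index.
n_index = implicit_index_cols(rows[1], names)
shifted = split_rows(rows, names, n_index)

# With index_col=False the lookup is disabled and the fifth field is dropped.
n_plain = implicit_index_cols(rows[1], names, index_col=False)
plain = split_rows(rows, names, n_plain)
```

In this model, None plays the role of NaN: the short rows are padded in the first case, and the long row is truncated in the second.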

@WillAyd
Member

WillAyd commented Aug 15, 2018

@louis-red thanks I think I follow. You can always try removing that conditional to see what breaks; perhaps it is even errant in the first place.

A PR would certainly be welcome if you could get it to pass tests locally.

@louis-red
Contributor

@WillAyd for issues like this one that are only tangentially related to an open GitHub issue, what is the best practice for referencing them in a PR? Should I open a new issue that clearly outlines the particular problem, or is that not necessary?

@WillAyd
Member

WillAyd commented Aug 15, 2018

Sorry, I thought you were addressing the issue with the names argument as outlined by the OP. If that's not the case then yes, open a separate issue.
