read_csv() crashes if engine='c', header=None, and 2+ extra columns #26218

Closed
dargueta opened this issue Apr 26, 2019 · 6 comments · Fixed by #32839
Labels
IO CSV read_csv, to_csv

@dargueta
Contributor

Code Sample, a copy-pastable example if possible

import io
import pandas as pd

# Create a CSV with a single row:
stream = io.StringIO(u'foo,bar,baz,bam,blah')

# Succeeds, returns a DataFrame with four columns and ignores the fifth.
pd.read_csv(
    stream,
    header=None,
    names=['one', 'two', 'three', 'four'],
    index_col=False,
    engine='c'
)

# Change the engine to 'python' and you get the same result.
stream.seek(0)
pd.read_csv(
    stream,
    header=None,
    names=['one', 'two', 'three', 'four'],
    index_col=False,
    engine='python'
)

# Succeeds, returns a DataFrame with three columns and ignores the extra two.
stream.seek(0)
pd.read_csv(
    stream,
    header=None,
    names=['one', 'two', 'three'],
    index_col=False,
    engine='python'
)


# Change the engine to 'c' and it crashes:
stream.seek(0)
pd.read_csv(
    stream,
    header=None,
    names=['one', 'two', 'three'],
    index_col=False,
    engine='c'
)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tux/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/tux/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/pandas/io/parsers.py", line 435, in _read
    data = parser.read(nrows)
  File "/Users/tux/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/pandas/io/parsers.py", line 1139, in read
    ret = self._engine.read(nrows)
  File "/Users/tux/.pyenv/versions/2.7.15/envs/gds/lib/python2.7/site-packages/pandas/io/parsers.py", line 1995, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 991, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1067, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1387, in pandas._libs.parsers.TextReader._get_column_name
IndexError: list index out of range

Problem description

When read_csv() is given names, its normal behavior is to ignore extra columns. However, if the CSV has two or more extra columns and engine='c', it raises an IndexError. The exact same CSV is read correctly with engine='python'.

Expected Output

Behavior of reading a CSV with the C and Python engines should be identical.
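
A minimal sketch of how the two engines could be compared side by side on the same input. The helper name `compare_engines` and its `read_csv` parameter are hypothetical (not part of pandas); the parameter exists only so the parser can be swapped out, and it defaults to `pandas.read_csv`.

```python
import io


def compare_engines(text, read_csv=None, engines=("c", "python"), **kwargs):
    """Parse the same CSV text once per engine and record what happened.

    `read_csv` is a hypothetical injection point; by default the real
    pandas.read_csv is used.
    """
    if read_csv is None:
        import pandas as pd  # imported lazily, only when no parser is supplied

        read_csv = pd.read_csv

    outcomes = {}
    for engine in engines:
        try:
            # A fresh buffer per engine, so no seek(0) is needed between runs.
            frame = read_csv(io.StringIO(text), engine=engine, **kwargs)
            outcomes[engine] = ("ok", tuple(frame.shape))
        except Exception as exc:
            # Record the failure instead of aborting the comparison.
            outcomes[engine] = ("error", type(exc).__name__)
    return outcomes
```

Calling `compare_engines('foo,bar,baz,bam,blah', header=None, names=['one', 'two', 'three'], index_col=False)` on an affected version should, going by the traceback above, report an error for 'c' and a successful parse for 'python'. Identical entries for both engines would indicate the expected behavior.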

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.15.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.24.2
pytest: 4.3.1
pip: 19.0.3
setuptools: 41.0.1
Cython: 0.29.5
numpy: 1.14.5
scipy: 1.1.0
pyarrow: 0.10.0
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.14
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.0.15
pymysql: None
psycopg2: 2.7.3.2 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: 0.1.6
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: 0.2.1

I tested this on Python 3.7.3 as well, but for some reason pd.show_versions() keeps blowing up with:

Assertion failed: (PassInf && "Expected all immutable passes to be initialized"), function addImmutablePass, file /Users/buildbot/miniconda3/conda-bld/llvmdev_1545076115094/work/lib/IR/LegacyPassManager.cpp, line 812.
Abort trap: 6 (core dumped)
@WillAyd
Member

WillAyd commented Apr 26, 2019

Specifying the columns to use here would give you the result you want:

stream.seek(0)
pd.read_csv(
    stream,
    header=None,
    names=['one', 'two', 'three'],
    index_col=False,
    engine='c',
    usecols=[0, 1, 2]
)

Behavior should be consistent, though I'm not sure dropping the remaining columns is really the correct approach regardless. Let's see what others think.
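
The usecols workaround above could be generalized by deriving the column positions from the length of names instead of hard-coding them. This is only a sketch: `leading_usecols` and `read_leading_columns` are hypothetical helper names, and it assumes the wanted columns are always the leading ones.

```python
def leading_usecols(names):
    """Positions of the first len(names) columns, for read_csv(usecols=...)."""
    return list(range(len(names)))


def read_leading_columns(stream, names, engine="c"):
    """Read only as many leading columns as there are names (hypothetical helper)."""
    import pandas as pd  # imported lazily, only needed when actually parsing

    return pd.read_csv(
        stream,
        header=None,
        names=names,
        index_col=False,
        engine=engine,
        # Explicitly select the leading columns so extra ones are never parsed.
        usecols=leading_usecols(names),
    )
```

With names=['one', 'two', 'three'], this passes usecols=[0, 1, 2], matching the call shown above.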

@WillAyd added the IO CSV (read_csv, to_csv) and Needs Discussion (requires discussion from core team before further action) labels on Apr 26, 2019
@dargueta
Contributor Author

though I'm not sure dropping the remaining columns is really the correct approach regardless

It isn't, it's a known bug that's been hanging around for months. See #22144.

@WillAyd
Member

WillAyd commented Apr 26, 2019

OK, thanks. The linked issue seems related, though not quite the same. If you'd like to take a look and submit a PR for this or the other issue, that would certainly be appreciated!

@WillAyd removed the Needs Discussion (requires discussion from core team before further action) label on Apr 26, 2019
@dargueta
Contributor Author

I think I found the problem. Do I fork off of master or 0.24.x? Is 0.24 still taking bugfixes or no?

@WillAyd
Member

WillAyd commented Apr 26, 2019

Fork off master

@WillAyd added this to the Contributions Welcome milestone on Apr 29, 2019
@jreback changed the milestone from Contributions Welcome to 1.1 on Mar 21, 2020
@rainwoodman

Hi, we were relying on the pre-1.1 behavior to raise an error when the provided list of names mismatched the data, and the exception is no longer caught:

https://travis-ci.org/github/bccp/nbodykit/jobs/725070683#L3274

What about adding a read_csv() argument to strictly enforce the matching between the provided schema and the data -- or is there a canonical way for the caller to check this?
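
One caller-side option, sketched here as an assumption rather than a canonical pandas feature, is to check row widths with the standard-library csv module before handing the stream to read_csv(). The helper name `find_width_mismatches` is hypothetical, and the csv module's dialect handling (quoting, escaping) may not match pandas exactly for unusual files.

```python
import csv
import io


def find_width_mismatches(stream, names, delimiter=","):
    """Return (row_number, field_count) for every non-empty row whose
    number of fields differs from len(names).

    The stream position is restored afterwards so the same stream can
    still be passed to a parser.
    """
    start = stream.tell()
    try:
        reader = csv.reader(stream, delimiter=delimiter)
        return [
            (row_number, len(row))
            for row_number, row in enumerate(reader, start=1)
            if row and len(row) != len(names)  # blank lines yield [] and are skipped
        ]
    finally:
        stream.seek(start)


# Usage: fail loudly if the provided schema does not match the data.
stream = io.StringIO("foo,bar,baz,bam,blah\n")
mismatches = find_width_mismatches(stream, ["one", "two", "three"])
if mismatches:
    print("rows with unexpected width:", mismatches)
```

This reads the data twice, so it suits small files or validation in tests better than large production inputs.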
