BUG: IndexError when header rows have unequal column counts #43102

MrCreosote · 2021-08-18T21:53:45Z

I have checked that this issue has not already been reported.

I couldn't find it with a search on indexerror read_csv anyway

I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

In [1]: import pandas

In [2]: from io import StringIO

In [3]: data = """
   ...: head1, head2, head3
   ...: sub1, sub2
   ...: """[1:]

In [4]: data
Out[4]: 'head1, head2, head3\nsub1, sub2\n'

In [5]: pandas.read_csv(StringIO(data), sep=",", header=[0, 1], on_bad_lines="skip")
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-5-1bf8982a1449> in <module>
----> 1 pandas.read_csv(StringIO(data), sep=",", header=[0, 1], on_bad_lines="skip")

*snip - I can provide the entire trace on request, but leave it out here for brevity*

~/.local/share/virtualenvs/py3.9playground--W4ZB8cD/lib/python3.9/site-packages/pandas/io/parsers/base_parser.py in <genexpr>(.0)
    343 
    344         def extract(r):
--> 345             return tuple(r[i] for i in range(field_count) if i not in sic)
    346 
    347         columns = list(zip(*(extract(r) for r in header)))

IndexError: list index out of range

Problem description

A generic IndexError is raised, rather than a pandas-specific error or the row being skipped as per on_bad_lines.

Expected Output

I would expect one of two things:

1. a pandas.errors.ParserError describing the problem
2. skipping the header. This seems a little odd to me, so maybe option 1 is the better one.

Maybe there are other reasonable responses as well.
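As a possible stopgap on the caller's side, the header rows could be checked before the data is handed to read_csv. The following is a minimal sketch that uses only the standard library's csv module. The function name find_ragged_header_row is hypothetical and not part of pandas, and the sketch reads the whole input into memory, which is only reasonable for small files.

```python
import csv
from io import StringIO


def find_ragged_header_row(text, header_rows, sep=","):
    """Return the index of the first header row whose field count differs
    from that of the first header row, or None if all header rows agree.

    Hypothetical helper for illustration; not a pandas function.
    """
    rows = list(csv.reader(StringIO(text), delimiter=sep))
    expected = len(rows[header_rows[0]])
    for idx in header_rows[1:]:
        if len(rows[idx]) != expected:
            return idx
    return None


# Same data as in the report: three fields in row 0, two in row 1.
data = "head1, head2, head3\nsub1, sub2\n"
print(find_ragged_header_row(data, [0, 1]))
```

A caller could raise its own error, or fall back to a single header row, whenever the helper returns something other than None.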

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 5f648bf
python : 3.9.6.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-154-generic
Version : #161-Ubuntu SMP Fri Jul 30 13:04:17 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.2
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 20.3.3
setuptools : 51.3.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.26.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
numba : None

MrCreosote · 2021-08-18T21:59:56Z

Here's the entire stacktrace:

<ipython-input-5-1bf8982a1449> in <module>
----> 1 pandas.read_csv(StringIO(data), sep=",", header=[0, 1], on_bad_lines="skip")

~/.local/share/virtualenvs/py3.9playground--W4ZB8cD/lib/python3.9/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

~/.local/share/virtualenvs/py3.9playground--W4ZB8cD/lib/python3.9/site-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    584     kwds.update(kwds_defaults)
    585 
--> 586     return _read(filepath_or_buffer, kwds)
    587 
    588 

~/.local/share/virtualenvs/py3.9playground--W4ZB8cD/lib/python3.9/site-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds)
    480 
    481     # Create the parser.
--> 482     parser = TextFileReader(filepath_or_buffer, **kwds)
    483 
    484     if chunksize or iterator:

~/.local/share/virtualenvs/py3.9playground--W4ZB8cD/lib/python3.9/site-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds)
    809             self.options["has_index_names"] = kwds["has_index_names"]
    810 
--> 811         self._engine = self._make_engine(self.engine)
    812 
    813     def close(self):

~/.local/share/virtualenvs/py3.9playground--W4ZB8cD/lib/python3.9/site-packages/pandas/io/parsers/readers.py in _make_engine(self, engine)
   1038             )
   1039         # error: Too many arguments for "ParserBase"
-> 1040         return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
   1041 
   1042     def _failover_to_python(self):

~/.local/share/virtualenvs/py3.9playground--W4ZB8cD/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py in __init__(self, src, **kwds)
     90                     self.col_names,
     91                     passed_names,
---> 92                 ) = self._extract_multi_indexer_columns(
     93                     self._reader.header,
     94                     self.index_names,  # type: ignore[has-type]

~/.local/share/virtualenvs/py3.9playground--W4ZB8cD/lib/python3.9/site-packages/pandas/io/parsers/base_parser.py in _extract_multi_indexer_columns(self, header, index_names, col_names, passed_names)
    345             return tuple(r[i] for i in range(field_count) if i not in sic)
    346 
--> 347         columns = list(zip(*(extract(r) for r in header)))
    348         names = ic + columns
    349 

~/.local/share/virtualenvs/py3.9playground--W4ZB8cD/lib/python3.9/site-packages/pandas/io/parsers/base_parser.py in <genexpr>(.0)
    345             return tuple(r[i] for i in range(field_count) if i not in sic)
    346 
--> 347         columns = list(zip(*(extract(r) for r in header)))
    348         names = ic + columns
    349 

~/.local/share/virtualenvs/py3.9playground--W4ZB8cD/lib/python3.9/site-packages/pandas/io/parsers/base_parser.py in extract(r)
    343 
    344         def extract(r):
--> 345             return tuple(r[i] for i in range(field_count) if i not in sic)
    346 
    347         columns = list(zip(*(extract(r) for r in header)))

~/.local/share/virtualenvs/py3.9playground--W4ZB8cD/lib/python3.9/site-packages/pandas/io/parsers/base_parser.py in <genexpr>(.0)
    343 
    344         def extract(r):
--> 345             return tuple(r[i] for i in range(field_count) if i not in sic)
    346 
    347         columns = list(zip(*(extract(r) for r in header)))

IndexError: list index out of range

quantumalaviya · 2021-08-19T17:59:02Z

So, just to confirm: an exception needs to be thrown if the length of the row (r) is not equal to the value of field_count inside the extract function, like so:

def extract(r):
    try:
        return tuple(r[i] for i in range(field_count) if i not in sic)
    except IndexError:
        raise Exception("Error") from None

This changes the stack trace to:

  File "<ipython-input-4-1bf8982a1449>", line 1, in <module>
    pandas.read_csv(StringIO(data), sep=",", header=[0, 1], on_bad_lines="skip")

  File "pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)

  File "pandas\io\parsers\readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)

  File "pandas\io\parsers\readers.py", line 482, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)

  File "pandas\io\parsers\readers.py", line 811, in __init__
    self._engine = self._make_engine(self.engine)

  File "pandas\io\parsers\readers.py", line 1040, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]

  File "pandas\io\parsers\c_parser_wrapper.py", line 92, in __init__
    ) = self._extract_multi_indexer_columns(

  File "pandas\io\parsers\base_parser.py", line 350, in _extract_multi_indexer_columns
    columns = list(zip(*(extract(r) for r in header)))

  File "pandas\io\parsers\base_parser.py", line 350, in <genexpr>
    columns = list(zip(*(extract(r) for r in header)))

  File "pandas\io\parsers\base_parser.py", line 348, in extract
    raise Exception("Error") from None

Exception: Error
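The shorter trace above is a result of "from None", which suppresses the display of the original IndexError. A standalone sketch of that mechanism follows. It is a simplified stand-in for the pandas function: field_count is passed as a parameter, the sic filter is left out, and ValueError is used only as a placeholder exception type.

```python
def extract(row, field_count):
    """Simplified stand-in for the nested extract() in the traceback."""
    try:
        return tuple(row[i] for i in range(field_count))
    except IndexError:
        # "from None" hides the IndexError when the traceback is printed.
        raise ValueError(
            f"expected {field_count} fields, got {len(row)}"
        ) from None


try:
    extract(["sub1", "sub2"], 3)
except ValueError as exc:
    caught = exc

print(caught)                             # the replacement message
print(caught.__suppress_context__)        # True because of "from None"
print(type(caught.__context__).__name__)  # original error is still recorded
```

The original exception stays reachable through __context__, so the shorter traceback loses nothing for anyone debugging programmatically.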

If someone can guide me as to which exception I should raise and if there's anything else that needs to be added, I would love to work on this issue.

I also confirmed that the bug exists on the master branch.

MrCreosote · 2021-08-19T19:03:11Z

It'd be good if the exception could provide information regarding the header row on which the error occurred

phofl · 2021-08-19T20:03:05Z

@quantumalaviya
You could do this before the extract, simply validating if all lists in header have the same length.

ParserError would be good here.

quantumalaviya · 2021-08-19T20:50:09Z

@phofl Something like this?

# Check that every header row has the same number of columns
for row in header:
    if len(row) != field_count:
        raise ParserError("Header rows must have equal number of columns")

quantumalaviya · 2021-08-19T20:50:51Z

@MrCreosote How would you suggest dealing with it in cases where there is more than one discrepancy? For example, if 3 rows are chosen to be headers and all have different column counts?

MrCreosote · 2021-08-19T20:57:44Z

Personally I'd probably just throw an error for the first discrepancy found, top to bottom, and ignore the others, but I'm a brand new Pandas user and not a dev, so the weight of my opinion is pretty low.

MrCreosote · 2021-08-19T20:59:54Z

Although I would definitely want the row index of the first offending row in the exception
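The two points above (stop at the first discrepancy, and name the offending row) could be combined in an up-front check along these lines. This is only a sketch under stated assumptions: the local ParserError class is a stand-in so the example runs without pandas, the function name validate_header_widths is hypothetical, and the message wording is illustrative.

```python
class ParserError(ValueError):
    """Stand-in for pandas.errors.ParserError so the sketch runs without pandas."""


def validate_header_widths(header):
    """Raise on the first header row whose length differs from the first row's.

    Hypothetical helper; `header` is a list of rows, each a list of fields.
    """
    field_count = len(header[0])
    for row_idx, row in enumerate(header):
        if len(row) != field_count:
            raise ParserError(
                "Header rows must have equal number of columns. "
                f"Mismatch found at row {row_idx} (0-indexed): "
                f"expected {field_count} fields, found {len(row)}"
            )


# Rows 1 and 2 are both too short; only the first one is reported.
header = [["head1", "head2", "head3"], ["sub1", "sub2"], ["x"]]
try:
    validate_header_widths(header)
except ParserError as exc:
    message = str(exc)

print(message)
```

Stating in the message whether the index is 0-based or 1-based avoids ambiguity for readers who think in file line numbers.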

quantumalaviya · 2021-08-20T09:36:53Z

Have a look at the PR and let me know if it's satisfactory.

The traceback now looks like this:

  File "<ipython-input-4-1bf8982a1449>", line 1, in <module>
    pandas.read_csv(StringIO(data), sep=",", header=[0, 1], on_bad_lines="skip")

  File "pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)

  File "pandas\io\parsers\readers.py", line 586, in read_csv
    return _read(filepath_or_buffer, kwds)

  File "pandas\io\parsers\readers.py", line 482, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)

  File "pandas\io\parsers\readers.py", line 811, in __init__
    self._engine = self._make_engine(self.engine)

  File "pandas\io\parsers\readers.py", line 1040, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]

  File "pandas\io\parsers\c_parser_wrapper.py", line 92, in __init__
    ) = self._extract_multi_indexer_columns(

  File "pandas\io\parsers\base_parser.py", line 347, in _extract_multi_indexer_columns
    raise ParserError("Header rows must have equal number of columns."

ParserError: Header rows must have equal number of columns. Mismatch found at row 1

MrCreosote · 2021-08-20T16:59:46Z

Shouldn't the minimum mismatch number be 2, assuming I'm understanding this correctly? You need at least 2 rows to have a mismatch

quantumalaviya · 2021-08-20T17:14:29Z

I made it 0-indexed.

MrCreosote · 2021-08-20T17:22:26Z

Hmm, that's not what I would expect for rows in a file, but shrug

MrCreosote added the Bug and Needs Triage labels Aug 18, 2021

phofl added the IO CSV label and removed the Needs Triage label Aug 19, 2021

lithomas1 added the Error Reporting label Aug 19, 2021

quantumalaviya mentioned this issue Aug 20, 2021

BUG: Throw a ParserError when header rows have unequal column counts … #43118

Merged

lithomas1 closed this as completed in #43118 Sep 5, 2021
