Detect Parsing errors in read_csv first row with index_col=False #40629

Closed
3 changes: 3 additions & 0 deletions doc/source/whatsnew/v1.3.0.rst
@@ -782,7 +782,10 @@ I/O
- Bug in :func:`read_hdf` returning unexpected records when filtering on categorical string columns using ``where`` parameter (:issue:`39189`)
- Bug in :func:`read_sas` raising ``ValueError`` when ``datetimes`` were null (:issue:`39725`)
- Bug in :func:`read_excel` dropping empty values from single-column spreadsheets (:issue:`39808`)
- Bug in :meth:`DataFrame.to_string` misplacing the truncation column when ``index=False`` (:issue:`40907`)
- Bug in :func:`read_csv` failing to raise ParserError when first row had too many columns and ``index_col=False`` (:issue:`40333`)
- Bug in :func:`read_csv` failing to raise ParserError when ``names is not None`` and ``header=None`` (:issue:`22144`)

Period
^^^^^^
12 changes: 11 additions & 1 deletion pandas/_libs/parsers.pyx
@@ -215,6 +215,13 @@ cdef extern from "parser/tokenizer.h":
int64_t header_start # header row start
uint64_t header_end # header row end

bint allow_leading_cols # Boolean: 1: can infer index col, 0: no index col
bint skip_header_end # Boolean: 1: header=None,
# 0: header is not None.
# This is used because header_end is
# uint64_t so there is no valid NULL

Member

What do you mean by header_end being uint64_t? Based on this diff, it doesn't look like it used to be uint64_t (since it held -1), and I don't see anything in this PR changing the type of header_end.


Author

@njriasan Apr 18, 2021

The type of header_end is still uint64_t (see the struct parser_t definition in parsers.pyx or the definition in tokenizer.h). There was previously a section of code that initialized the value to -1, which is clearly incorrect because the type is unsigned (I believe it is uint64_t to be consistent with lines and handle very large files).

This led to incorrect logic for determining the header, because -1 would be interpreted as the max UINT64 value. I added an extra field here to effectively check whether the value is unset, since we can't check for -1.
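
For readers less familiar with unsigned wrap-around, a small illustrative sketch of the point above (plain Python, not pandas code):

# Illustration only: what happens when -1 is used as a sentinel in an
# unsigned 64-bit slot such as header_end.
import ctypes

sentinel = ctypes.c_uint64(-1).value
print(sentinel)                # 18446744073709551615, i.e. UINT64_MAX
print(sentinel == 2**64 - 1)   # True: "-1" silently becomes the maximum value
print((sentinel + 1) % 2**64)  # 0: arithmetic such as header_end + 1 wraps around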


Member

Thanks for explaining, good to fix that!

# value (i.e. header_end == -1).

void *skipset
PyObject *skipfunc
int64_t skip_first_N_rows
@@ -378,6 +385,7 @@ cdef class TextReader:
        self.encoding_errors = PyBytes_AsString(encoding_errors)

        self.parser = parser_new()
        self.parser.allow_leading_cols = allow_leading_cols
        self.parser.chunksize = tokenize_chunksize

        self.mangle_dupe_cols = mangle_dupe_cols
@@ -517,11 +525,13 @@ cdef class TextReader:
        if header is None:
            # sentinel value
            self.parser.header_start = -1
            self.parser.header_end = -1
            self.parser.skip_header_end = True
            self.parser.header_end = 0
            self.parser.header = -1
            self.parser_start = 0
            prelim_header = []
        else:
            self.parser.skip_header_end = False
            if isinstance(header, list):
                if len(header) > 1:
                    # need to artificially skip the final line
25 changes: 22 additions & 3 deletions pandas/_libs/src/parser/tokenizer.c
@@ -444,9 +444,28 @@ static int end_line(parser_t *self) {
self->line_fields[self->lines] = 0;
return 0;
}

if (!(self->lines <= self->header_end + 1) &&
(self->expected_fields < 0 && fields > ex_fields) && !(self->usecols)) {
// Explanation of each condition:
// Cond1: (self->skip_header_end ||
//         !(self->lines <= (self->header_end + self->allow_leading_cols)))
//        The expected number of fields is not checked within the header
//        lines, nor on a line whose extra leading field may still become
//        an inferred index. self->skip_header_end flags that header=None
//        was specified.
// Cond2: (ex_fields > 0) && (fields > ex_fields)
//        Only raise an error if we know how many fields to expect and
//        have encountered too many.
// Cond3: !(self->usecols)
//        Ignore field parsing errors if we will use a subset of the columns.
// Cond4: !(((fields - 1) == ex_fields)
//         && !self->stream[self->stream_len - 2])
//        Ignore a trailing delimiter (see gh-2442) by checking whether the
//        last field is empty: the next-to-last character in the stream is
//        NUL (the last character is always NUL).
if ((self->skip_header_end
|| !(self->lines <= (self->header_end + self->allow_leading_cols)))
&& (ex_fields > 0 && fields > ex_fields)
&& !(self->usecols)
&& !(((fields - 1) == ex_fields) &&
!self->stream[self->stream_len - 2])) {
// increment file line count
self->file_lines++;

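
Taken together, the conditions documented in the comment above should produce roughly the following user-visible behaviour (a sketch only, assuming a pandas build that includes this fix):

# Sketch of the intended behaviour; assumes a pandas build with this fix.
from io import StringIO

import pandas as pd

bad = "a,b,c\n0,1,2,3\n1,2,3"  # second line has 4 fields, the header has 3

# Cond1/Cond2: with index_col=False the extra field cannot become an inferred
# index, so the tokenizer should now raise instead of parsing silently.
try:
    pd.read_csv(StringIO(bad), index_col=False)
except pd.errors.ParserError as err:
    print(err)  # e.g. "Expected 3 fields in line 2, saw 4"

# Cond3: passing usecols would bypass the field-count check entirely.
# Cond4: a trailing delimiter (gh-2442) is still not counted as an extra field.
print(pd.read_csv(StringIO("a,b,c\n1,2,3,\n"), index_col=False))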
6 changes: 6 additions & 0 deletions pandas/_libs/src/parser/tokenizer.h
@@ -150,6 +150,12 @@ typedef struct parser_t {
int64_t header_start; // header row start
uint64_t header_end; // header row end

int allow_leading_cols; // Boolean: 1: can infer index col, 0: no index col
int skip_header_end; // Boolean: 1: header=None, 0: header is not None
// This is used because header_end
// is uint64_t so there is no valid NULL value
// (i.e. header_end == -1).

void *skipset;
PyObject *skipfunc;
int64_t skip_first_N_rows;
6 changes: 5 additions & 1 deletion pandas/io/parsers/python_parser.py
@@ -883,7 +883,7 @@ def _rows_to_cols(self, content):
        # Check that there are no rows with too many
        # elements in their row (rows with too few
        # elements are padded with NaN).
        if max_len > col_len and self.index_col is not False and self.usecols is None:
        if max_len > col_len and self.usecols is None:

            footers = self.skipfooter if self.skipfooter else 0
            bad_lines = []
@@ -894,6 +894,10 @@ def _rows_to_cols(self, content):

            for (i, l) in iter_content:
                actual_len = len(l)
                # Check for and remove a trailing delimiter (see gh-2442)
                if actual_len == (col_len + 1) and l[-1] == "":
                    l.pop()
                    actual_len -= 1

                if actual_len > col_len:
                    if self.error_bad_lines or self.warn_bad_lines:
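
The trailing-delimiter check added above mirrors the exception the C tokenizer makes for gh-2442; a rough sketch of the intended effect for the python engine (again assuming a build that contains this change):

# Rough sketch; assumes a pandas build containing this change.
from io import StringIO

import pandas as pd

data = "a,b,c\n1,2,3,\n4,5,6,\n"  # every data row ends with a delimiter
df = pd.read_csv(StringIO(data), index_col=False, engine="python")
print(df)  # expected: columns a, b, c and no ParserError for the empty trailing field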
5 changes: 2 additions & 3 deletions pandas/tests/io/parser/common/test_common_basic.py
@@ -667,11 +667,10 @@ def test_blank_lines_between_header_and_data_rows(all_parsers, nrows):
def test_no_header_two_extra_columns(all_parsers):
    # GH 26218
    column_names = ["one", "two", "three"]
    ref = DataFrame([["foo", "bar", "baz"]], columns=column_names)
    stream = StringIO("foo,bar,baz,bam,blah")
    parser = all_parsers
    df = parser.read_csv(stream, header=None, names=column_names, index_col=False)
    tm.assert_frame_equal(df, ref)
    with pytest.raises(ParserError, match="Expected 3 fields in line 1, saw 5"):

Member

In principle I am not against changing this, but doing it only for this case would cause this to fail and

data="""
a,b,c
1,2,3,4
5,6,7,8
"""

df = pd.read_csv(StringIO(data), header=None,
                 index_col=False,
                 engine="python"
                 )

to work. Also not sure if we can do this without deprecating.


Author

Can you elaborate on what you mean by "work"? Testing this example on my PR I get "Expected 3 fields in line 3, saw 4" with all parsers, which I believe should be the intended behavior.

I understand the concern about deprecating. Do you have any advice on how I should modify the code to address that concern? I'm a first-time pandas contributor, so I'm not familiar with that process.
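
For reference, one way to check the behaviour described above on both engines could look like this (a hypothetical session; the quoted message assumes the branch behaves as the author reports):

# Hypothetical reproduction of the example from the review comment above.
from io import StringIO

import pandas as pd

data = """
a,b,c
1,2,3,4
5,6,7,8
"""

for engine in ("c", "python"):
    try:
        pd.read_csv(StringIO(data), header=None, index_col=False, engine=engine)
    except pd.errors.ParserError as err:
        print(engine, "->", err)  # reported: "Expected 3 fields in line 3, saw 4"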


Member

Haven't tested on your branch, sorry. Though we have tests for this which would have been changed too then. @gfyoung Could you help here? Do you think we should deprecate this before changing?


Member

Since we had tests for this behavior (whether intentional or not), I think I would lean towards deprecation.

cc @pandas-dev/pandas-core - this is a bit of an odd case. While the behavior does look buggy, the fact that we have been testing it suggests there could have been something deliberate behind it.

        parser.read_csv(stream, header=None, names=column_names, index_col=False)


def test_read_csv_names_not_accepting_sets(all_parsers):
19 changes: 19 additions & 0 deletions pandas/tests/io/parser/test_index_col.py
@@ -8,6 +8,8 @@
import numpy as np
import pytest

from pandas.errors import ParserError

from pandas import (
    DataFrame,
    Index,
@@ -283,3 +285,20 @@ def test_multiindex_columns_index_col_with_data(all_parsers):
index=Index(["data"]),
)
tm.assert_frame_equal(result, expected)


def test_index_col_false_error(all_parsers):
    # GH#40333
    parser = all_parsers
    with pytest.raises(ParserError, match="Expected 3 fields in line 2, saw 4"):
        parser.read_csv(StringIO("a,b,c\n0,1,2,3\n1,2,3"), index_col=False)


def test_index_col_false_error_ignore(all_parsers):
    # GH#40333
    parser = all_parsers
    result = parser.read_csv(
        StringIO("a,b,c\n0,1,2,3\n1,2,3"), index_col=False, error_bad_lines=False
    )
    expected = DataFrame({"a": [1], "b": [2], "c": [3]})
    tm.assert_frame_equal(result, expected)