Detect Parsing errors in read_csv first row with index_col=False #40629

njriasan · 2021-03-25T06:32:30Z

closes BUG: read_csv not erroring on a bad line with extra columns #40333
closes error_bad_lines is ignored if names argument is used in read_csv function #22144
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

mzeitlin11

Thanks for the pr @njriasan! Can you revert the unrelated changes - it makes it harder to review since it complicates the diff.

pandas/_libs/parsers.pyx

doc/source/whatsnew/v1.3.0.rst

njriasan · 2021-04-06T08:41:22Z

Thanks for the pr @njriasan! Can you revert the unrelated changes - it makes it harder to review since it complicates the diff.

Hi @mzeitlin11. I reverted the unrelated changes. Sorry for the delay.

njriasan · 2021-04-06T08:57:42Z

I think there may be a logical error in my PR based upon knowledge gaps in my Pandas understanding. I'll investigate any failed tests tomorrow and I may ask for further clarification if possible.

njriasan · 2021-04-07T09:07:58Z

@mzeitlin11 I took a look at the tests that were failing and attempted to fix them. I have one question regarding test_no_header_two_extra_columns. Based upon the linked github issue: #26218, it seems like the intended behavior is to throw an exception when the number of columns doesn't match the number of names. My change fails this test, but it now provides that error for the Python parser. I understand the C parser should be consistent (I believe the change should be simple and I just need to initialize the number of expected fields to the number of names), but I wanted to confirm that my understanding was correct before proceeding.

mzeitlin11 · 2021-04-18T22:22:25Z

@mzeitlin11 I took a look at the tests that were failing and attempted to fix them. I have one question regarding test_no_header_two_extra_columns. Based upon the linked github issue: #26218, it seems like the intended behavior is to throw an exception when the number of columns doesn't match the number of names. My change fails this test, but it now provides that error for the Python parser. I understand the C parser should be consistent (I believe the change should be simple and I just need to initialize the number of expected fields to the number of names), but I wanted to confirm that my understanding was correct before proceeding.

Sorry @njriasan, not too familiar with this part of the codebase. Definitely agree on the consistency point, but not sure about throwing an exception. Especially in the case of error_bad_lines=False, if the case wasn't previously raising, we probably don't want to make it start raising without a deprecation cycle. Maybe this pr can just focus on the error_bad_lines=True case, for which I think we'd definitely want to raise.

doc/source/whatsnew/v1.3.0.rst

pandas/_libs/parsers.pyx

mzeitlin11 · 2021-04-18T22:27:01Z

pandas/_libs/parsers.pyx

+        bint skip_header_end        # Boolean: 1: Header=None,
+                                    # 0 Header is not None
+                                    # This is used because header_end is
+                                    # uint64_t so there is no valid NULL


What do you mean by header_end being uint64_t? Based on this diff, it doesn't look like it used to be uint64_t (since it held -1), and I don't see anything in this pr changing the type of header_end

The type of header_end is still uint64_t (see the struct parser_t definition in parsers.pyx or the definition in tokenizer.h). There was previously a section of code that initialized the value to -1, which is clearly incorrect because the type is unsigned (I believe it is uint64_t to be consistent with lines and handle very large files).

This lead to incorrect logic to determine the header file because -1 would be interpreted as the max UINT64 value. I added an extra field here to effectively check if the value is null since we can't check -1.

Thanks for explaining, good to fix that!

pandas/_libs/src/parser/tokenizer.c

pandas/_libs/src/parser/tokenizer.h

njriasan · 2021-04-18T23:05:39Z

@mzeitlin11 I took a look at the tests that were failing and attempted to fix them. I have one question regarding test_no_header_two_extra_columns. Based upon the linked github issue: #26218, it seems like the intended behavior is to throw an exception when the number of columns doesn't match the number of names. My change fails this test, but it now provides that error for the Python parser. I understand the C parser should be consistent (I believe the change should be simple and I just need to initialize the number of expected fields to the number of names), but I wanted to confirm that my understanding was correct before proceeding.

Sorry @njriasan, not too familiar with this part of the codebase. Definitely agree on the consistency point, but not sure about throwing an exception. Especially in the case of error_bad_lines=False, if the case wasn't previously raising, we probably don't want to make it start raising without a deprecation cycle. Maybe this pr can just focus on the error_bad_lines=True case, for which I think we'd definitely want to raise.

So this PR should also fix #22144, because the changes to the Python version fix this PR as well (so I modified the C code to be consistent). This does require reverting the test for #26218 to raise an exception, which based on the comments in the issue and 22144 is the desired behavior.

This should not raise an error in the case of error_bad_lines=False, but I can add an extra test to check this explicitly if you would like. I understand it's probably worse practice to try and fix both issues in 1 PR, but I'm not sure its possible to cleanly decouple these issues in the Python parser.

njriasan · 2021-04-18T23:07:39Z

@mzeitlin11 can you give me some advice on how to determine the build errors I'm getting on the tests? Everything runs locally on my machine and I haven't modified the failing module at all. Perhaps my issues are because I had to fix a merge issue in the whatsnew file?

mzeitlin11 · 2021-04-18T23:24:49Z

@mzeitlin11 can you give me some advice on how to determine the build errors I'm getting on the tests? Everything runs locally on my machine and I haven't modified the failing module at all. Perhaps my issues are because I had to fix a merge issue in the whatsnew file?

Looks like a build failure. For example in https://github.com/pandas-dev/pandas/pull/40629/checks?check_run_id=2376337837 seeing

pandas/_libs/src/parser/tokenizer.c: In function ‘end_line’:
334
pandas/_libs/src/parser/tokenizer.c:451:50: error: comparison of integer expressions of different signedness: ‘uint64_t’ {aka ‘long unsigned int’} and ‘int’ [-Werror=sign-compare]
335
  451 |         !((self->skip_header_end && (self->lines < self->allow_leading_cols))
336
      |                                                  ^
337
cc1: all warnings being treated as errors

You might not see this locally if compile flags are different so that certain warnings don't error locally

njriasan · 2021-04-19T00:31:13Z

@mzeitlin11 can you give me some advice on how to determine the build errors I'm getting on the tests? Everything runs locally on my machine and I haven't modified the failing module at all. Perhaps my issues are because I had to fix a merge issue in the whatsnew file?

Looks like a build failure. For example in https://github.com/pandas-dev/pandas/pull/40629/checks?check_run_id=2376337837 seeing
pandas/_libs/src/parser/tokenizer.c: In function ‘end_line’:
334
pandas/_libs/src/parser/tokenizer.c:451:50: error: comparison of integer expressions of different signedness: ‘uint64_t’ {aka ‘long unsigned int’} and ‘int’ [-Werror=sign-compare]
335
  451 |         !((self->skip_header_end && (self->lines < self->allow_leading_cols))
336
      |                                                  ^
337
cc1: all warnings being treated as errors
You might not see this locally if compile flags are different so that certain warnings don't error locally

Thanks @mzeitlin11 this fixed it! All the tests pass now. Can you let me know what are the next steps to get this ready to merge?

mzeitlin11 · 2021-04-19T00:50:50Z

Thanks @mzeitlin11 this fixed it! All the tests pass now. Can you let me know what are the next steps to get this ready to merge?

More rounds of review from others :)

mzeitlin11 · 2021-04-20T15:43:17Z

cc @phofl just noticed #38587 if you have any thoughts on approach here

phofl · 2021-04-20T16:00:06Z

Will look more closely later, but I don't think we should change the behavior without deprecating first. Based on work experience when working with proprietary systems this is not that unusual unfortunately.

In #38587 we weren't even sure if we would like to deprecate or just warn

phofl · 2021-04-20T20:15:25Z

pandas/tests/io/parser/common/test_common_basic.py

    stream = StringIO("foo,bar,baz,bam,blah")
    parser = all_parsers
-    df = parser.read_csv(stream, header=None, names=column_names, index_col=False)
-    tm.assert_frame_equal(df, ref)
+    with pytest.raises(ParserError, match="Expected 3 fields in line 1, saw 5"):


In principal I am not against changing this, but doing only for this case would cause this to fail and

data=""" a,b,c 1,2,3,4 5,6,7,8 """ df = pd.read_csv(StringIO(data), header=None, index_col=False, engine="python" )

to work. ALso not sure if we can do this without deprecating

Can you elaborate on what you mean by work? Testing this example on my PR I get Expected 3 fields in line 3, saw 4 with all parsers, which I believe should be the intended behavior.

I understand the concern about deprecating. Do you have any advice on how I should modify the code to address that concern? I'm a first time Pandas contributor, so I'm not familiar with that process.

Haven't tested on your branch, sorry. Though we have tests for this which would have been changed too then. @gfyoung Could you help here? Do you think we should deprecate this before changing?

Since we had tests for this behavior (whether intentional or not), I think I would lean towards deprecation.

cc @pandas-dev/pandas-core - this is a bit of an odd case. While the behavior does look buggy, the fact that we have been testing it suggests there could have something deliberate behind it.

njriasan · 2021-04-28T06:08:30Z

Hi I'm just wondering if this feature change needs to go through a deprecation change, what are the next steps for me?

phofl · 2021-04-30T22:11:24Z

Two options here:

If you can fix the bug without breaking the test, you can do this here and open up another pr raising a FutureWarning for the behaviro which is tested but looks wrong.

If you can not fix this without breaking the test, you could raise a FutureWarning in this case and put off the pr to submit for 2.0

github-actions · 2021-05-31T00:17:25Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

jreback · 2021-10-04T00:20:01Z

@njriasan still working on this? if so pls merge master

mroeschke · 2021-11-06T03:03:13Z

Thanks for the PR, but it appears to have gone stale. If you'd like to continue, please merge master, and address the the comments (around the FutureWarning), and we can reopen

njriasan changed the title ~~BUG: Support for checking the first row for errors with index_col=Fal…~~ Detect Parsing errors in read_csv first row with index_col=False Mar 25, 2021

BUG: Support for checking the first row for errors with index_col=Fal…

478bbc7

…se (pandas-dev#40333)

njriasan force-pushed the nick/error_bad_lines_index_col_false branch from a1ed0de to 478bbc7 Compare March 25, 2021 07:10

mzeitlin11 reviewed Mar 25, 2021

View reviewed changes

pandas/_libs/parsers.pyx Outdated Show resolved Hide resolved

mzeitlin11 reviewed Mar 25, 2021

View reviewed changes

doc/source/whatsnew/v1.3.0.rst Outdated Show resolved Hide resolved

mzeitlin11 added Bug Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv labels Mar 25, 2021

Update PR changes

a1b2e5c

fixed trailing delimiter test

806e4f6

Nicholas J Riasanovsky and others added 3 commits April 7, 2021 03:11

fixed trailing delimiter detection

0d55f5d

fixed bug in name detection

1151b93

Merge branch 'master' into nick/error_bad_lines_index_col_false

6756639

njriasan requested a review from mzeitlin11 April 18, 2021 21:35

mzeitlin11 reviewed Apr 18, 2021

View reviewed changes

Nicholas J Riasanovsky added 2 commits April 18, 2021 15:47

fixed index col tests

185f62e

Made the requested changes

a305268

njriasan requested a review from mzeitlin11 April 18, 2021 23:06

fixed compiler warning

7d973e6

njriasan mentioned this pull request Apr 19, 2021

BUG: read_csv not erroring on a bad line with extra columns #40333

Closed

3 tasks

Merge branch 'master' into nick/error_bad_lines_index_col_false

d77ccde

fixed pre-commit bug

546e106

phofl reviewed Apr 20, 2021

View reviewed changes

github-actions bot added the Stale label May 31, 2021

lithomas1 mentioned this pull request Jun 2, 2021

QST: Inconsistent behaviour in checking number of fields per row while read_csv() #41754

Closed

simonjayhawkins added Needs Review and removed Stale labels Jun 16, 2021

mroeschke closed this Nov 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Detect Parsing errors in read_csv first row with index_col=False #40629

Detect Parsing errors in read_csv first row with index_col=False #40629

njriasan commented Mar 25, 2021 •

edited

Loading

mzeitlin11 left a comment

njriasan commented Apr 6, 2021 •

edited

Loading

njriasan commented Apr 6, 2021

njriasan commented Apr 7, 2021

mzeitlin11 commented Apr 18, 2021

mzeitlin11 Apr 18, 2021

njriasan Apr 18, 2021 •

edited

Loading

mzeitlin11 Apr 19, 2021

njriasan commented Apr 18, 2021 •

edited

Loading

njriasan commented Apr 18, 2021

mzeitlin11 commented Apr 18, 2021

njriasan commented Apr 19, 2021

mzeitlin11 commented Apr 19, 2021 •

edited

Loading

mzeitlin11 commented Apr 20, 2021

phofl commented Apr 20, 2021

phofl Apr 20, 2021

njriasan Apr 20, 2021

phofl Apr 20, 2021

gfyoung Apr 20, 2021

njriasan commented Apr 28, 2021

phofl commented Apr 30, 2021

github-actions bot commented May 31, 2021

jreback commented Oct 4, 2021

mroeschke commented Nov 6, 2021

Detect Parsing errors in read_csv first row with index_col=False #40629

Detect Parsing errors in read_csv first row with index_col=False #40629

Conversation

njriasan commented Mar 25, 2021 • edited Loading

mzeitlin11 left a comment

Choose a reason for hiding this comment

njriasan commented Apr 6, 2021 • edited Loading

njriasan commented Apr 6, 2021

njriasan commented Apr 7, 2021

mzeitlin11 commented Apr 18, 2021

mzeitlin11 Apr 18, 2021

Choose a reason for hiding this comment

njriasan Apr 18, 2021 • edited Loading

Choose a reason for hiding this comment

mzeitlin11 Apr 19, 2021

Choose a reason for hiding this comment

njriasan commented Apr 18, 2021 • edited Loading

njriasan commented Apr 18, 2021

mzeitlin11 commented Apr 18, 2021

njriasan commented Apr 19, 2021

mzeitlin11 commented Apr 19, 2021 • edited Loading

mzeitlin11 commented Apr 20, 2021

phofl commented Apr 20, 2021

phofl Apr 20, 2021

Choose a reason for hiding this comment

njriasan Apr 20, 2021

Choose a reason for hiding this comment

phofl Apr 20, 2021

Choose a reason for hiding this comment

gfyoung Apr 20, 2021

Choose a reason for hiding this comment

njriasan commented Apr 28, 2021

phofl commented Apr 30, 2021

github-actions bot commented May 31, 2021

jreback commented Oct 4, 2021

mroeschke commented Nov 6, 2021

njriasan commented Mar 25, 2021 •

edited

Loading

njriasan commented Apr 6, 2021 •

edited

Loading

njriasan Apr 18, 2021 •

edited

Loading

njriasan commented Apr 18, 2021 •

edited

Loading

mzeitlin11 commented Apr 19, 2021 •

edited

Loading