Skip to content

Detect Parsing errors in read_csv first row with index_col=False #40629

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

njriasan
Copy link

@njriasan njriasan commented Mar 25, 2021

@njriasan njriasan changed the title BUG: Support for checking the first row for errors with index_col=Fal… Detect Parsing errors in read_csv first row with index_col=False Mar 25, 2021
@njriasan njriasan force-pushed the nick/error_bad_lines_index_col_false branch from a1ed0de to 478bbc7 Compare March 25, 2021 07:10
Copy link
Member

@mzeitlin11 mzeitlin11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pr @njriasan! Can you revert the unrelated changes - it makes it harder to review since it complicates the diff.

@mzeitlin11 mzeitlin11 added Bug Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv labels Mar 25, 2021
@njriasan
Copy link
Author

njriasan commented Apr 6, 2021

Thanks for the pr @njriasan! Can you revert the unrelated changes - it makes it harder to review since it complicates the diff.

Hi @mzeitlin11. I reverted the unrelated changes. Sorry for the delay.

@njriasan
Copy link
Author

njriasan commented Apr 6, 2021

I think there may be a logical error in my PR based upon knowledge gaps in my Pandas understanding. I'll investigate any failed tests tomorrow and I may ask for further clarification if possible.

@njriasan
Copy link
Author

njriasan commented Apr 7, 2021

@mzeitlin11 I took a look at the tests that were failing and attempted to fix them. I have one question regarding test_no_header_two_extra_columns. Based upon the linked github issue: #26218, it seems like the intended behavior is to throw an exception when the number of columns doesn't match the number of names. My change fails this test, but it now provides that error for the Python parser. I understand the C parser should be consistent (I believe the change should be simple and I just need to initialize the number of expected fields to the number of names), but I wanted to confirm that my understanding was correct before proceeding.

@njriasan njriasan requested a review from mzeitlin11 April 18, 2021 21:35
@mzeitlin11
Copy link
Member

@mzeitlin11 I took a look at the tests that were failing and attempted to fix them. I have one question regarding test_no_header_two_extra_columns. Based upon the linked github issue: #26218, it seems like the intended behavior is to throw an exception when the number of columns doesn't match the number of names. My change fails this test, but it now provides that error for the Python parser. I understand the C parser should be consistent (I believe the change should be simple and I just need to initialize the number of expected fields to the number of names), but I wanted to confirm that my understanding was correct before proceeding.

Sorry @njriasan, not too familiar with this part of the codebase. Definitely agree on the consistency point, but not sure about throwing an exception. Especially in the case of error_bad_lines=False, if the case wasn't previously raising, we probably don't want to make it start raising without a deprecation cycle. Maybe this pr can just focus on the error_bad_lines=True case, for which I think we'd definitely want to raise.

bint skip_header_end # Boolean: 1: Header=None,
# 0 Header is not None
# This is used because header_end is
# uint64_t so there is no valid NULL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by header_end being uint64_t? Based on this diff, it doesn't look like it used to be uint64_t (since it held -1), and I don't see anything in this pr changing the type of header_end

Copy link
Author

@njriasan njriasan Apr 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type of header_end is still uint64_t (see the struct parser_t definition in parsers.pyx or the definition in tokenizer.h). There was previously a section of code that initialized the value to -1, which is clearly incorrect because the type is unsigned (I believe it is uint64_t to be consistent with lines and handle very large files).

This lead to incorrect logic to determine the header file because -1 would be interpreted as the max UINT64 value. I added an extra field here to effectively check if the value is null since we can't check -1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining, good to fix that!

@njriasan
Copy link
Author

njriasan commented Apr 18, 2021

@mzeitlin11 I took a look at the tests that were failing and attempted to fix them. I have one question regarding test_no_header_two_extra_columns. Based upon the linked github issue: #26218, it seems like the intended behavior is to throw an exception when the number of columns doesn't match the number of names. My change fails this test, but it now provides that error for the Python parser. I understand the C parser should be consistent (I believe the change should be simple and I just need to initialize the number of expected fields to the number of names), but I wanted to confirm that my understanding was correct before proceeding.

Sorry @njriasan, not too familiar with this part of the codebase. Definitely agree on the consistency point, but not sure about throwing an exception. Especially in the case of error_bad_lines=False, if the case wasn't previously raising, we probably don't want to make it start raising without a deprecation cycle. Maybe this pr can just focus on the error_bad_lines=True case, for which I think we'd definitely want to raise.

So this PR should also fix #22144, because the changes to the Python version fix this PR as well (so I modified the C code to be consistent). This does require reverting the test for #26218 to raise an exception, which based on the comments in the issue and 22144 is the desired behavior.

This should not raise an error in the case of error_bad_lines=False, but I can add an extra test to check this explicitly if you would like. I understand it's probably worse practice to try and fix both issues in 1 PR, but I'm not sure its possible to cleanly decouple these issues in the Python parser.

@njriasan njriasan requested a review from mzeitlin11 April 18, 2021 23:06
@njriasan
Copy link
Author

@mzeitlin11 can you give me some advice on how to determine the build errors I'm getting on the tests? Everything runs locally on my machine and I haven't modified the failing module at all. Perhaps my issues are because I had to fix a merge issue in the whatsnew file?

@mzeitlin11
Copy link
Member

@mzeitlin11 can you give me some advice on how to determine the build errors I'm getting on the tests? Everything runs locally on my machine and I haven't modified the failing module at all. Perhaps my issues are because I had to fix a merge issue in the whatsnew file?

Looks like a build failure. For example in https://github.com/pandas-dev/pandas/pull/40629/checks?check_run_id=2376337837 seeing

pandas/_libs/src/parser/tokenizer.c: In function ‘end_line’:
334
pandas/_libs/src/parser/tokenizer.c:451:50: error: comparison of integer expressions of different signedness: ‘uint64_t’ {aka ‘long unsigned int’} and ‘int’ [-Werror=sign-compare]
335
  451 |         !((self->skip_header_end && (self->lines < self->allow_leading_cols))
336
      |                                                  ^
337
cc1: all warnings being treated as errors

You might not see this locally if compile flags are different so that certain warnings don't error locally

@njriasan
Copy link
Author

@mzeitlin11 can you give me some advice on how to determine the build errors I'm getting on the tests? Everything runs locally on my machine and I haven't modified the failing module at all. Perhaps my issues are because I had to fix a merge issue in the whatsnew file?

Looks like a build failure. For example in https://github.com/pandas-dev/pandas/pull/40629/checks?check_run_id=2376337837 seeing

pandas/_libs/src/parser/tokenizer.c: In function ‘end_line’:
334
pandas/_libs/src/parser/tokenizer.c:451:50: error: comparison of integer expressions of different signedness: ‘uint64_t’ {aka ‘long unsigned int’} and ‘int’ [-Werror=sign-compare]
335
  451 |         !((self->skip_header_end && (self->lines < self->allow_leading_cols))
336
      |                                                  ^
337
cc1: all warnings being treated as errors

You might not see this locally if compile flags are different so that certain warnings don't error locally

Thanks @mzeitlin11 this fixed it! All the tests pass now. Can you let me know what are the next steps to get this ready to merge?

@mzeitlin11
Copy link
Member

mzeitlin11 commented Apr 19, 2021

Thanks @mzeitlin11 this fixed it! All the tests pass now. Can you let me know what are the next steps to get this ready to merge?

More rounds of review from others :)

@mzeitlin11
Copy link
Member

cc @phofl just noticed #38587 if you have any thoughts on approach here

@phofl
Copy link
Member

phofl commented Apr 20, 2021

Will look more closely later, but I don't think we should change the behavior without deprecating first. Based on work experience when working with proprietary systems this is not that unusual unfortunately.

In #38587 we weren't even sure if we would like to deprecate or just warn

stream = StringIO("foo,bar,baz,bam,blah")
parser = all_parsers
df = parser.read_csv(stream, header=None, names=column_names, index_col=False)
tm.assert_frame_equal(df, ref)
with pytest.raises(ParserError, match="Expected 3 fields in line 1, saw 5"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principal I am not against changing this, but doing only for this case would cause this to fail and

data="""
a,b,c
1,2,3,4
5,6,7,8
"""

df = pd.read_csv(StringIO(data), header=None,
                 index_col=False,
                 engine="python"
                 )

to work. ALso not sure if we can do this without deprecating

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on what you mean by work? Testing this example on my PR I get Expected 3 fields in line 3, saw 4 with all parsers, which I believe should be the intended behavior.

I understand the concern about deprecating. Do you have any advice on how I should modify the code to address that concern? I'm a first time Pandas contributor, so I'm not familiar with that process.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't tested on your branch, sorry. Though we have tests for this which would have been changed too then. @gfyoung Could you help here? Do you think we should deprecate this before changing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we had tests for this behavior (whether intentional or not), I think I would lean towards deprecation.

cc @pandas-dev/pandas-core - this is a bit of an odd case. While the behavior does look buggy, the fact that we have been testing it suggests there could have something deliberate behind it.

@njriasan
Copy link
Author

Hi I'm just wondering if this feature change needs to go through a deprecation change, what are the next steps for me?

@phofl
Copy link
Member

phofl commented Apr 30, 2021

Two options here:

If you can fix the bug without breaking the test, you can do this here and open up another pr raising a FutureWarning for the behaviro which is tested but looks wrong.

If you can not fix this without breaking the test, you could raise a FutureWarning in this case and put off the pr to submit for 2.0

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@jreback
Copy link
Contributor

jreback commented Oct 4, 2021

@njriasan still working on this? if so pls merge master

@mroeschke
Copy link
Member

Thanks for the PR, but it appears to have gone stale. If you'd like to continue, please merge master, and address the the comments (around the FutureWarning), and we can reopen

@mroeschke mroeschke closed this Nov 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Error Reporting Incorrect or improved errors from pandas IO CSV read_csv, to_csv
Projects
None yet
7 participants