-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Detect Parsing errors in read_csv first row with index_col=False #40629
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Detect Parsing errors in read_csv first row with index_col=False #40629
Conversation
njriasan
commented
Mar 25, 2021
•
edited
Loading
edited
- closes BUG: read_csv not erroring on a bad line with extra columns #40333
- closes error_bad_lines is ignored if names argument is used in read_csv function #22144
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
- whatsnew entry
a1ed0de
to
478bbc7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the pr @njriasan! Can you revert the unrelated changes - it makes it harder to review since it complicates the diff.
Hi @mzeitlin11. I reverted the unrelated changes. Sorry for the delay. |
I think there may be a logical error in my PR based upon knowledge gaps in my Pandas understanding. I'll investigate any failed tests tomorrow and I may ask for further clarification if possible. |
@mzeitlin11 I took a look at the tests that were failing and attempted to fix them. I have one question regarding |
Sorry @njriasan, not too familiar with this part of the codebase. Definitely agree on the consistency point, but not sure about throwing an exception. Especially in the case of |
pandas/_libs/parsers.pyx
Outdated
bint skip_header_end # Boolean: 1: Header=None, | ||
# 0 Header is not None | ||
# This is used because header_end is | ||
# uint64_t so there is no valid NULL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by header_end
being uint64_t
? Based on this diff, it doesn't look like it used to be uint64_t
(since it held -1), and I don't see anything in this pr changing the type of header_end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type of header_end
is still uint64_t
(see the struct parser_t definition in parsers.pyx
or the definition in tokenizer.h
). There was previously a section of code that initialized the value to -1, which is clearly incorrect because the type is unsigned (I believe it is uint64_t to be consistent with lines and handle very large files).
This lead to incorrect logic to determine the header file because -1 would be interpreted as the max UINT64 value. I added an extra field here to effectively check if the value is null since we can't check -1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining, good to fix that!
So this PR should also fix #22144, because the changes to the Python version fix this PR as well (so I modified the C code to be consistent). This does require reverting the test for #26218 to raise an exception, which based on the comments in the issue and 22144 is the desired behavior. This should not raise an error in the case of |
@mzeitlin11 can you give me some advice on how to determine the build errors I'm getting on the tests? Everything runs locally on my machine and I haven't modified the failing module at all. Perhaps my issues are because I had to fix a merge issue in the |
Looks like a build failure. For example in https://github.com/pandas-dev/pandas/pull/40629/checks?check_run_id=2376337837 seeing
You might not see this locally if compile flags are different so that certain warnings don't error locally |
Thanks @mzeitlin11 this fixed it! All the tests pass now. Can you let me know what are the next steps to get this ready to merge? |
More rounds of review from others :) |
Will look more closely later, but I don't think we should change the behavior without deprecating first. Based on work experience when working with proprietary systems this is not that unusual unfortunately. In #38587 we weren't even sure if we would like to deprecate or just warn |
stream = StringIO("foo,bar,baz,bam,blah") | ||
parser = all_parsers | ||
df = parser.read_csv(stream, header=None, names=column_names, index_col=False) | ||
tm.assert_frame_equal(df, ref) | ||
with pytest.raises(ParserError, match="Expected 3 fields in line 1, saw 5"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principal I am not against changing this, but doing only for this case would cause this to fail and
data="""
a,b,c
1,2,3,4
5,6,7,8
"""
df = pd.read_csv(StringIO(data), header=None,
index_col=False,
engine="python"
)
to work. ALso not sure if we can do this without deprecating
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate on what you mean by work? Testing this example on my PR I get Expected 3 fields in line 3, saw 4
with all parsers, which I believe should be the intended behavior.
I understand the concern about deprecating. Do you have any advice on how I should modify the code to address that concern? I'm a first time Pandas contributor, so I'm not familiar with that process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't tested on your branch, sorry. Though we have tests for this which would have been changed too then. @gfyoung Could you help here? Do you think we should deprecate this before changing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we had tests for this behavior (whether intentional or not), I think I would lean towards deprecation.
cc @pandas-dev/pandas-core - this is a bit of an odd case. While the behavior does look buggy, the fact that we have been testing it suggests there could have something deliberate behind it.
Hi I'm just wondering if this feature change needs to go through a |
Two options here: If you can fix the bug without breaking the test, you can do this here and open up another pr raising a FutureWarning for the behaviro which is tested but looks wrong. If you can not fix this without breaking the test, you could raise a FutureWarning in this case and put off the pr to submit for 2.0 |
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this. |
@njriasan still working on this? if so pls merge master |
Thanks for the PR, but it appears to have gone stale. If you'd like to continue, please merge master, and address the the comments (around the |