Detect Parsing errors in read_csv first row with index_col=False #40629
What do you mean by `header_end` being `uint64_t`? Based on this diff, it doesn't look like it used to be `uint64_t` (since it held -1), and I don't see anything in this PR changing the type of `header_end`.
The type of `header_end` is still `uint64_t` (see the struct `parser_t` definition in `parsers.pyx` or the definition in `tokenizer.h`). There was previously a section of code that initialized the value to -1, which is clearly incorrect because the type is unsigned (I believe it is `uint64_t` to be consistent with `lines` and to handle very large files). This led to incorrect logic to determine the header file because -1 would be interpreted as the max UINT64 value. I added an extra field here to effectively check if the value is null since we can't check -1.
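To illustrate the wraparound described above, here is a minimal sketch in Python using `ctypes`. The function names, the `header_end_set` flag, and the comparison itself are hypothetical and only stand in for the idea; they are not the actual tokenizer code.

```python
import ctypes


def as_uint64(value):
    """Return the value a C uint64_t would hold after being assigned `value`."""
    # ctypes does no overflow checking, so -1 wraps to 2**64 - 1.
    return ctypes.c_uint64(value).value


def in_header_sentinel(line_no, header_end=-1):
    """Hypothetical check that uses -1 as a 'no header' sentinel.

    Because the field is unsigned, the sentinel becomes the largest
    uint64 value, so every line number compares as inside the header.
    """
    return line_no <= as_uint64(header_end)


def in_header_flagged(line_no, header_end, header_end_set):
    """Hypothetical check that uses a separate flag instead of a sentinel."""
    return bool(header_end_set and line_no <= header_end)


print(as_uint64(-1))                     # 18446744073709551615
print(in_header_sentinel(10))            # True, although no header was set
print(in_header_flagged(10, 0, False))   # False
```

Keeping "is it set?" in its own field avoids reserving any value of the unsigned range as a sentinel, which is the design choice the comment above describes.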
Thanks for explaining, good to fix that!
In principle I am not against changing this, but doing it only for this case would cause this to fail and
to work. Also, I am not sure whether we can do this without deprecating.
Can you elaborate on what you mean by work? Testing this example on my PR I get `Expected 3 fields in line 3, saw 4` with all parsers, which I believe should be the intended behavior.

I understand the concern about deprecating. Do you have any advice on how I should modify the code to address that concern? I'm a first-time pandas contributor, so I'm not familiar with that process.
Haven't tested on your branch, sorry. Though we have tests for this which would have been changed too then. @gfyoung Could you help here? Do you think we should deprecate this before changing?
Since we had tests for this behavior (whether intentional or not), I think I would lean towards deprecation.
cc @pandas-dev/pandas-core - this is a bit of an odd case. While the behavior does look buggy, the fact that we have been testing it suggests there could have been something deliberate behind it.