BUG: Fix segfault in csv tokenizer #32566
Conversation
Great, thanks for the PR @roberthdevries. We don't get a ton of contributions to the C extensions, so much appreciate what you are doing here.
columns=list("ab"), | ||
) | ||
csv = "\nheader\n\na,b\n\n\n1,2\n\n3,4" | ||
for nrows in range(1, 6): |
Instead of doing this, can you just parametrize on the values that matter?
Can you elaborate on which values you would like to see parametrized?
I basically took the example in the ticket to reproduce the issue and checked that, for all sensible values of nrows, I get the expected outcome.
The reasoning is that this covers more edge cases that could cause errors like the one in the original ticket.
@roberthdevries: @WillAyd is asking that you use pytest.mark.parametrize over the nrows parameter.
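For illustration, a minimal sketch of what that parametrized test could look like, reconstructed from the snippet above and issue #28071; the header=3 and skip_blank_lines=False arguments and the expected values are assumptions, not necessarily the code that was merged.

```python
# Hedged sketch: parametrize over nrows instead of looping inside the test
# body. The csv string and columns come from the diff snippet above;
# header=3, skip_blank_lines=False and the expected frame are assumptions.
from io import StringIO

import numpy as np
import pytest

import pandas as pd
import pandas._testing as tm


@pytest.mark.parametrize("nrows", range(1, 6))
def test_blank_lines_between_header_and_data_rows(nrows):
    # GH 28071: blank lines between the header and the data rows used to
    # crash the C tokenizer when combined with nrows.
    ref = pd.DataFrame(
        [[np.nan, np.nan], [np.nan, np.nan], [1, 2], [np.nan, np.nan], [3, 4]],
        columns=list("ab"),
    )
    csv = "\nheader\n\na,b\n\n\n1,2\n\n3,4"
    df = pd.read_csv(
        StringIO(csv), header=3, nrows=nrows, skip_blank_lines=False
    )
    # Each nrows value should return exactly the first nrows expected rows.
    tm.assert_frame_equal(df, ref[:nrows])
```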
fixed
@@ -341,6 +341,17 @@ def test_empty_csv_input(self):
        df = read_csv(StringIO(), chunksize=20, header=None, names=["a", "b", "c"])
        assert isinstance(df, TextFileReader)

    def test_blank_lines_between_header_and_data_rows(self):
this test needs to be in test_common.py and use the all_parsers fixture so it is tested against all parser engines.
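For context, a minimal sketch of the idea behind an all_parsers-style fixture: every test that requests it runs once per parser engine. The real fixture in pandas' parser-test conftest.py wraps more machinery, so the _ParserSketch class and its read_csv wrapper here are purely illustrative.

```python
# Illustrative-only stand-in for the all_parsers fixture: a parametrized
# fixture that makes each test run once with the C engine and once with
# the pure-Python engine.
import pytest

import pandas as pd


class _ParserSketch:
    def __init__(self, engine):
        self.engine = engine

    def read_csv(self, *args, **kwargs):
        # Force the requested engine unless the test overrides it.
        kwargs.setdefault("engine", self.engine)
        return pd.read_csv(*args, **kwargs)


@pytest.fixture(params=["c", "python"])
def all_parsers(request):
    # Tests taking `all_parsers` are collected once per engine.
    return _ParserSketch(request.param)
```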
done
Force-pushed from a9e6294 to 5007c28.
Hello @roberthdevries! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-03-15 07:19:35 UTC
Force-pushed from 1b4cc74 to 07a136f.
@roberthdevries pls merge master; ping on green.
This looks like the most sensible value: since the number of words deleted is 0, the char_count of characters to be skipped is also 0.
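To make the reasoning concrete, a Python illustration of the guard being discussed; the names mirror the conversation (word_starts, word_deletions, char_count), but the actual offset arithmetic lives in the C tokenizer and is more involved, so treat this purely as a sketch.

```python
# Sketch of the zero-deletions case: when no words are trimmed from the
# tokenizer's buffers, the number of characters to skip should also be 0.
# In C, computing the offset from word_starts[word_deletions - 1] when
# word_deletions == 0 would be an out-of-bounds read, which is the kind of
# invalid access this PR guards against.
def chars_to_skip(word_starts, word_deletions):
    if word_deletions == 0:
        return 0
    # Otherwise skip up to the start of the last deleted word (simplified).
    return word_starts[word_deletions - 1]
```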
Force-pushed from 7ba3e24 to c6b64c7.
ping @jreback everything is green
thanks @roberthdevries
Linked issue: read_csv() with blank lines between header and data rows quits Python interpreter #28071
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff