BUG: Fix segfault in csv tokenizer #32566
Conversation
Great, thanks for the PR @roberthdevries. We don't get a ton of contributions to the C extensions, so much appreciate what you are doing here.
columns=list("ab"), | ||
) | ||
csv = "\nheader\n\na,b\n\n\n1,2\n\n3,4" | ||
for nrows in range(1, 6): |
Instead of doing this, can you just parametrize on the values that matter?
Can you elaborate on which values you would like to see parametrized?
I basically took the example in the ticket to reproduce the issue and checked that, for all sensible values of nrows, I get the expected outcome.
The reasoning is that this covers more edge cases that could cause errors like the one in the original ticket.
@roberthdevries: @WillAyd is asking that you use pytest.mark.parametrize over the nrows parameter.
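For illustration, a minimal sketch of what that parametrized test could look like, reconstructed from the snippet above and issue #28071; the header=3 and skip_blank_lines=False arguments and the expected values are assumptions, not necessarily the code that was merged.

```python
# Hedged sketch: parametrize over nrows instead of looping inside the test
# body. The csv string and columns come from the diff snippet above;
# header=3, skip_blank_lines=False and the expected frame are assumptions.
from io import StringIO

import numpy as np
import pytest

import pandas as pd
import pandas._testing as tm


@pytest.mark.parametrize("nrows", range(1, 6))
def test_blank_lines_between_header_and_data_rows(nrows):
    # GH 28071: blank lines between the header and the data rows used to
    # crash the C tokenizer when combined with nrows.
    ref = pd.DataFrame(
        [[np.nan, np.nan], [np.nan, np.nan], [1, 2], [np.nan, np.nan], [3, 4]],
        columns=list("ab"),
    )
    csv = "\nheader\n\na,b\n\n\n1,2\n\n3,4"
    df = pd.read_csv(
        StringIO(csv), header=3, nrows=nrows, skip_blank_lines=False
    )
    # Each nrows value should return exactly the first nrows expected rows.
    tm.assert_frame_equal(df, ref[:nrows])
```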
fixed
@@ -341,6 +341,17 @@ def test_empty_csv_input(self):
        df = read_csv(StringIO(), chunksize=20, header=None, names=["a", "b", "c"])
        assert isinstance(df, TextFileReader)

    def test_blank_lines_between_header_and_data_rows(self):
this test needs to be in test_common.py and use the all_parsers fixture so it is tested against all parser engines.
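For context, a minimal sketch of the idea behind an all_parsers-style fixture: every test that requests it runs once per parser engine. The real fixture in pandas' parser-test conftest.py wraps more machinery, so the _ParserSketch class and its read_csv wrapper here are purely illustrative.

```python
# Illustrative-only stand-in for the all_parsers fixture: a parametrized
# fixture that makes each test run once with the C engine and once with
# the pure-Python engine.
import pytest

import pandas as pd


class _ParserSketch:
    def __init__(self, engine):
        self.engine = engine

    def read_csv(self, *args, **kwargs):
        # Force the requested engine unless the test overrides it.
        kwargs.setdefault("engine", self.engine)
        return pd.read_csv(*args, **kwargs)


@pytest.fixture(params=["c", "python"])
def all_parsers(request):
    # Tests taking `all_parsers` are collected once per engine.
    return _ParserSketch(request.param)
```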
done
Force-pushed from a9e6294 to 5007c28.
Hello @roberthdevries! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-03-15 07:19:35 UTC
Force-pushed from 1b4cc74 to 07a136f.
@roberthdevries pls merge master; ping on green.
This looks like the most sensible value: since the number of words deleted is 0, the char_count of characters to be skipped is also 0.
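To make the reasoning concrete, a Python illustration of the guard being discussed; the names mirror the conversation (word_starts, word_deletions, char_count), but the actual offset arithmetic lives in the C tokenizer and is more involved, so treat this purely as a sketch.

```python
# Sketch of the zero-deletions case: when no words are trimmed from the
# tokenizer's buffers, the number of characters to skip should also be 0.
# In C, computing the offset from word_starts[word_deletions - 1] when
# word_deletions == 0 would be an out-of-bounds read, which is the kind of
# invalid access this PR guards against.
def chars_to_skip(word_starts, word_deletions):
    if word_deletions == 0:
        return 0
    # Otherwise skip up to the start of the last deleted word (simplified).
    return word_starts[word_deletions - 1]
```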
Force-pushed from 7ba3e24 to c6b64c7.
ping @jreback everything is green
thanks @roberthdevries
Linked issue: read_csv() with blank lines between header and data rows quits Python interpreter #28071
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff