Skip to content

BUG: Fix segfault in csv tokenizer #32566

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

roberthdevries
Copy link
Contributor

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great thanks for the PR @roberthdevries . We don't get a ton of contributions to the C extensions so much appreciate what you are doing here

columns=list("ab"),
)
csv = "\nheader\n\na,b\n\n\n1,2\n\n3,4"
for nrows in range(1, 6):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of doing this can you just parametrize on the values that matter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate which values you would like to see parameterized?
I basically took the example in the ticket to reproduce the issue and checked that for all sensible values for nrows I get the expected outcome.
The reason for this is that this covers more cases that could cause errors like the one in the original ticket due to various edge cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@roberthdevries : @WillAyd is asking that you use pytest.mark.parameterize over the nrows parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@WillAyd WillAyd added the IO CSV read_csv, to_csv label Mar 10, 2020
@@ -341,6 +341,17 @@ def test_empty_csv_input(self):
df = read_csv(StringIO(), chunksize=20, header=None, names=["a", "b", "c"])
assert isinstance(df, TextFileReader)

def test_blank_lines_between_header_and_data_rows(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test needs to be in test_common.py and use the all_parsers fixture so it testsed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jreback jreback requested a review from gfyoung March 11, 2020 03:02
@datapythonista datapythonista changed the title Fix segfault in csv tokenizer (issue #28071) BUG: Fix segfault in csv tokenizer Mar 11, 2020
@roberthdevries roberthdevries force-pushed the fix-28071-read-csv-empty-lines-segfault branch from a9e6294 to 5007c28 Compare March 11, 2020 15:32
@pep8speaks
Copy link

pep8speaks commented Mar 11, 2020

Hello @roberthdevries! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-03-15 07:19:35 UTC

@roberthdevries roberthdevries force-pushed the fix-28071-read-csv-empty-lines-segfault branch from 1b4cc74 to 07a136f Compare March 12, 2020 21:40
@jreback jreback added this to the 1.1 milestone Mar 15, 2020
@jreback
Copy link
Contributor

jreback commented Mar 15, 2020

@roberthdevries pls merge master; ping on green.

@roberthdevries roberthdevries force-pushed the fix-28071-read-csv-empty-lines-segfault branch from 7ba3e24 to c6b64c7 Compare March 15, 2020 07:19
@roberthdevries
Copy link
Contributor Author

ping @jreback everything is green

@jreback jreback merged commit 95cd98b into pandas-dev:master Mar 16, 2020
@jreback
Copy link
Contributor

jreback commented Mar 16, 2020

thanks @roberthdevries

@roberthdevries roberthdevries deleted the fix-28071-read-csv-empty-lines-segfault branch March 16, 2020 20:08
SeeminSyed pushed a commit to CSCD01-team01/pandas that referenced this pull request Mar 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Loading CSV files (using read_csv) with blank lines between header and data rows quits Python interpreter
6 participants