Skip to content

BUG: csv parsing with EOF byte on windows, Revert gh-16039 #16663

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 11, 2017

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Jun 11, 2017

#16039 created a bug in which files containing byte-like data could break on Windows, as EOF characters mid-field (despite being quoted) would cause premature line breaks. Given that this PR was a performance patch, this commit can be safely reverted.

Closes #16559 (confirmed with unit test).

cc @dgwynne

@jreback jreback added Bug IO CSV read_csv, to_csv Windows Windows OS labels Jun 11, 2017
if (nbytes > fs->size) {
nbytes = fs->size;
}
*bytes_read = fread((void *)src->buffer, sizeof(char), nbytes, src->fp);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this only read min(bytes left, nbytes)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is correct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, fread() will put up to sizeof(char) * nbytes into the buffer at src->buffer. if nbytes is greater than fs->size, then this will overflow. the rest of pandas may limit nbytes to fs->size before passing it to buffer_file_bytes(), but id argue that makes reading or auditing this code harder.

Copy link
Member Author

@gfyoung gfyoung Jun 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure we'll hit end-of-file first, and it will terminate then, no?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the file or the remainder of the file is shorter than fs->size, yes. that isn't the only situation buffer_file_bytes() is called in though.

say you have a 1MB file, and new_file_source() is called with 1000 as the buffer_size argument, then fs->size will be 1000 bytes long. if buffer_file_bytes() is called with nbytes as 2000, then you will read 2000 bytes into a 1000 byte long buffer because there at least that much data available in the file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, got it. I'll add that to my follow-up PR then.

@jreback
Copy link
Contributor

jreback commented Jun 11, 2017

looks reasonable. 3.6/windows is failing & pls run the csv asv's just to confirm no regression.

@jreback jreback changed the title BUG: Revert gh-16039 BUG: csv parsing with EOF byte on windows, Revert gh-16039 Jun 11, 2017
@codecov
Copy link

codecov bot commented Jun 11, 2017

Codecov Report

Merging #16663 into master will increase coverage by 0.02%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16663      +/-   ##
==========================================
+ Coverage   90.93%   90.96%   +0.02%     
==========================================
  Files         161      161              
  Lines       49269    49266       -3     
==========================================
+ Hits        44802    44813      +11     
+ Misses       4467     4453      -14
Flag Coverage Δ
#multiple 88.72% <ø> (+0.02%) ⬆️
#single 40.22% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/generic.py 92.36% <0%> (+0.09%) ⬆️
pandas/plotting/_converter.py 65.05% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 18a428d...045966a. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 11, 2017

Codecov Report

Merging #16663 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16663      +/-   ##
==========================================
- Coverage   90.93%   90.93%   -0.01%     
==========================================
  Files         161      161              
  Lines       49269    49270       +1     
==========================================
  Hits        44802    44802              
- Misses       4467     4468       +1
Flag Coverage Δ
#multiple 88.69% <100%> (-0.01%) ⬇️
#single 40.22% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/io/parsers.py 95.43% <100%> (ø) ⬆️
pandas/core/indexes/datetimes.py 95.23% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 18a428d...69a4ad0. Read the comment docs.

@gfyoung
Copy link
Member Author

gfyoung commented Jun 11, 2017

3.6/windows is failing

Actually, it's the 2.7 build on Windows (it passes just fine on Python 3.6)

pls run the csv asv's just to confirm no regression.

Didn't see any noticeable regressions on my end.

pandas-devgh-16039 created a bug in which files containing
byte-like data could break, as EOF characters mid-field
(despite being quoted) would cause premature line breaks.

Given that this PR was a performance patch, this
commit can be safely reverted.

Closes pandas-devgh-16559.
@gfyoung
Copy link
Member Author

gfyoung commented Jun 11, 2017

@jreback : All green and ready to go now.

@jreback jreback added this to the 0.20.3 milestone Jun 11, 2017
/*
On-disk FILE, uncompressed
*/

void *new_file_source(char *fname, size_t buffer_size) {
file_source *fs = (file_source *)malloc(sizeof(file_source));
if (fs == NULL) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fs can still be a NULL pointer no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can be on some architectures. We should protect against that.


// Only allocate this heap memory if we are not memory-mapping the file
fs->buffer = (char *)malloc((buffer_size + 1) * sizeof(char));

if (fs->buffer == NULL) {
goto err_free;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't you need to free fs here? (what about closing the fh)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, those would both be good to do here, if possible.

/* XXX Eventually remove this print statement. */
fprintf(stderr, "new_file_buffer: mmap() failed.\n");
goto err_close;
free(mm);
mm = NULL;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't just return NULL here?

Copy link
Member Author

@gfyoung gfyoung Jun 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you can just return NULL here IINM.

@gfyoung
Copy link
Member Author

gfyoung commented Jun 11, 2017

@jreback : I think you raise a couple of good points, but I think I would prefer to merge this (so that the bug is patched) and look into those afterwards as a follow-up, if that works?

@jreback
Copy link
Contributor

jreback commented Jun 11, 2017

I think you raise a couple of good points, but I think I would prefer to merge this (so that the bug is patched) and look into those afterwards as a follow-up, if that works?

that's fine. marking for 0.20.3 as this is a bug fix. if you run into issues and we have to make a bigger change can always just move to 0.21

yes bullet-proofing memory is an issue here (though hard to nail these types of things down).

@jreback jreback merged commit c550372 into pandas-dev:master Jun 11, 2017
@gfyoung gfyoung deleted the revert-16039 branch June 11, 2017 23:51
gfyoung added a commit to forking-repos/pandas that referenced this pull request Jun 12, 2017
gfyoung added a commit to forking-repos/pandas that referenced this pull request Jun 12, 2017
@dgwynne
Copy link
Contributor

dgwynne commented Jun 12, 2017

hi guys, sorry im late to the party.

i am surprised by this, i thought that EOF was supposed to be represented outside the allowed set of characters you can have in a file, but after discussing this with a few people i get the impression that windows is special. as suggested by bug 5500, EOF on windows is 0x1a or 27, which can definitely end up in a file. on unix-like systems EOF tends to map to -1 which is supposed to outside what a char can represent.

one of the people i talked to pointed out that i dropped the binary flag when moving from fopen() ("b" in the mode string) to open(). in unix-like systems, the b flag is meaningless, but not on windows.

from what i can tell, ORing O_BINARY into the flags to open() might be enough to fix the problem. because O_BINARY may not be provided on other platforms, a common pattern for handling that is to put something like the following toward the top of the file:

#ifndef O_BINARY
#define O_BINARY 0
#endif /* O_BINARY */

@gfyoung
Copy link
Member Author

gfyoung commented Jun 12, 2017

@dgwynne :

windows is special

That's one way to put it. 😉

from what i can tell, ORing O_BINARY into the flags to open() might be enough to fix the problem

Hmm...interesting. Feel free to give it a shot. Then we can re-apply your PR with that patch if it works.

@dgwynne
Copy link
Contributor

dgwynne commented Jun 12, 2017

i don't have a windows box with a toolchain i can test with. if i generate a change against head, would you (or someone) be willing to try it?

@gfyoung
Copy link
Member Author

gfyoung commented Jun 12, 2017

We test the codebase against Appveyor, which runs Windows boxes. You can create an account for free, hook it up to your GitHub account, and it will run those tests for you before you submit the PR (or you could just submit immediately and let the tests run).

@dgwynne
Copy link
Contributor

dgwynne commented Jun 12, 2017

is there a test case that will catch this problem now?

@gfyoung
Copy link
Member Author

gfyoung commented Jun 12, 2017

Yes, I added one in this PR.

@dgwynne
Copy link
Contributor

dgwynne commented Jun 12, 2017

#16675 has been created. do i just wait now?

jreback pushed a commit that referenced this pull request Jun 13, 2017
* Revert "BUG: Revert gh-16039 (#16663)"

This reverts commit c550372.

* always treat files as binary to cope with windows and EOF.

on windows, EOF can appear "in band" if the file is considered text. when
moving from fread() to read(), i lost the "b" part of the mode. at the
time i believed this was a nop, since unix doesnt treat files differently
based on that flag.

this adds O_BINARY to the flags to open to restore the behaviour lost
when taking "b" away from fopen. if a platform doesn't provide O_BINARY,
this defines it to 0 so it can still be used without effect later on in
the code.

* dont leak the fd in new_file_source() if buffer allocation fails.

* reapply the test for EOF in the middle of a stream.

part of c550372

* pass rb to _get_handle on python 3, otherwise stick to r.

part of c550372

* replace goto with inline unwinding of state.

requested by @jreback in #16675 feedback.

* describe the fixes to the read_csv() backend and issue numbers.

requested by @jreback in feedback on #16675
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Jul 6, 2017
pandas-devgh-16039 created a bug in which files containing
byte-like data could break, as EOF characters mid-field
(despite being quoted) would cause premature line breaks.

Given that this PR was a performance patch, this
commit can be safely reverted.

Closes pandas-devgh-16559.

(cherry picked from commit c550372)
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Jul 6, 2017
…as-dev#16675)

* Revert "BUG: Revert pandas-devgh-16039 (pandas-dev#16663)"

This reverts commit c550372.

* always treat files as binary to cope with windows and EOF.

on windows, EOF can appear "in band" if the file is considered text. when
moving from fread() to read(), i lost the "b" part of the mode. at the
time i believed this was a nop, since unix doesnt treat files differently
based on that flag.

this adds O_BINARY to the flags to open to restore the behaviour lost
when taking "b" away from fopen. if a platform doesn't provide O_BINARY,
this defines it to 0 so it can still be used without effect later on in
the code.

* dont leak the fd in new_file_source() if buffer allocation fails.

* reapply the test for EOF in the middle of a stream.

part of c550372

* pass rb to _get_handle on python 3, otherwise stick to r.

part of c550372

* replace goto with inline unwinding of state.

requested by @jreback in pandas-dev#16675 feedback.

* describe the fixes to the read_csv() backend and issue numbers.

requested by @jreback in feedback on pandas-dev#16675

(cherry picked from commit d298414)
TomAugspurger pushed a commit that referenced this pull request Jul 7, 2017
gh-16039 created a bug in which files containing
byte-like data could break, as EOF characters mid-field
(despite being quoted) would cause premature line breaks.

Given that this PR was a performance patch, this
commit can be safely reverted.

Closes gh-16559.

(cherry picked from commit c550372)
TomAugspurger pushed a commit that referenced this pull request Jul 7, 2017
* Revert "BUG: Revert gh-16039 (#16663)"

This reverts commit c550372.

* always treat files as binary to cope with windows and EOF.

on windows, EOF can appear "in band" if the file is considered text. when
moving from fread() to read(), i lost the "b" part of the mode. at the
time i believed this was a nop, since unix doesnt treat files differently
based on that flag.

this adds O_BINARY to the flags to open to restore the behaviour lost
when taking "b" away from fopen. if a platform doesn't provide O_BINARY,
this defines it to 0 so it can still be used without effect later on in
the code.

* dont leak the fd in new_file_source() if buffer allocation fails.

* reapply the test for EOF in the middle of a stream.

part of c550372

* pass rb to _get_handle on python 3, otherwise stick to r.

part of c550372

* replace goto with inline unwinding of state.

requested by @jreback in #16675 feedback.

* describe the fixes to the read_csv() backend and issue numbers.

requested by @jreback in feedback on #16675

(cherry picked from commit d298414)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv Windows Windows OS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants