-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: csv parsing with EOF byte on windows, Revert gh-16039 #16663
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
if (nbytes > fs->size) { | ||
nbytes = fs->size; | ||
} | ||
*bytes_read = fread((void *)src->buffer, sizeof(char), nbytes, src->fp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this only read min(bytes left, nbytes)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, fread()
will put up to sizeof(char) * nbytes into the buffer at src->buffer. if nbytes is greater than fs->size, then this will overflow. the rest of pandas may limit nbytes to fs->size before passing it to buffer_file_bytes()
, but id argue that makes reading or auditing this code harder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure we'll hit end-of-file first, and it will terminate then, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the file or the remainder of the file is shorter than fs->size, yes. that isn't the only situation buffer_file_bytes()
is called in though.
say you have a 1MB file, and new_file_source()
is called with 1000 as the buffer_size argument, then fs->size will be 1000 bytes long. if buffer_file_bytes()
is called with nbytes as 2000, then you will read 2000 bytes into a 1000 byte long buffer because there at least that much data available in the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, got it. I'll add that to my follow-up PR then.
looks reasonable. 3.6/windows is failing & pls run the csv asv's just to confirm no regression. |
Codecov Report
@@ Coverage Diff @@
## master #16663 +/- ##
==========================================
+ Coverage 90.93% 90.96% +0.02%
==========================================
Files 161 161
Lines 49269 49266 -3
==========================================
+ Hits 44802 44813 +11
+ Misses 4467 4453 -14
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #16663 +/- ##
==========================================
- Coverage 90.93% 90.93% -0.01%
==========================================
Files 161 161
Lines 49269 49270 +1
==========================================
Hits 44802 44802
- Misses 4467 4468 +1
Continue to review full report at Codecov.
|
Actually, it's the 2.7 build on Windows (it passes just fine on Python 3.6)
Didn't see any noticeable regressions on my end. |
pandas-devgh-16039 created a bug in which files containing byte-like data could break, as EOF characters mid-field (despite being quoted) would cause premature line breaks. Given that this PR was a performance patch, this commit can be safely reverted. Closes pandas-devgh-16559.
@jreback : All green and ready to go now. |
/* | ||
On-disk FILE, uncompressed | ||
*/ | ||
|
||
void *new_file_source(char *fname, size_t buffer_size) { | ||
file_source *fs = (file_source *)malloc(sizeof(file_source)); | ||
if (fs == NULL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fs can still be a NULL pointer no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it can be on some architectures. We should protect against that.
|
||
// Only allocate this heap memory if we are not memory-mapping the file | ||
fs->buffer = (char *)malloc((buffer_size + 1) * sizeof(char)); | ||
|
||
if (fs->buffer == NULL) { | ||
goto err_free; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't you need to free fs here? (what about closing the fh)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, those would both be good to do here, if possible.
/* XXX Eventually remove this print statement. */ | ||
fprintf(stderr, "new_file_buffer: mmap() failed.\n"); | ||
goto err_close; | ||
free(mm); | ||
mm = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't just return NULL here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you can just return NULL
here IINM.
@jreback : I think you raise a couple of good points, but I think I would prefer to merge this (so that the bug is patched) and look into those afterwards as a follow-up, if that works? |
that's fine. marking for 0.20.3 as this is a bug fix. if you run into issues and we have to make a bigger change can always just move to 0.21 yes bullet-proofing memory is an issue here (though hard to nail these types of things down). |
hi guys, sorry im late to the party. i am surprised by this, i thought that EOF was supposed to be represented outside the allowed set of characters you can have in a file, but after discussing this with a few people i get the impression that windows is special. as suggested by bug 5500, EOF on windows is 0x1a or 27, which can definitely end up in a file. on unix-like systems EOF tends to map to -1 which is supposed to outside what a char can represent. one of the people i talked to pointed out that i dropped the binary flag when moving from fopen() ("b" in the mode string) to open(). in unix-like systems, the b flag is meaningless, but not on windows. from what i can tell, ORing O_BINARY into the flags to open() might be enough to fix the problem. because O_BINARY may not be provided on other platforms, a common pattern for handling that is to put something like the following toward the top of the file:
|
@dgwynne :
That's one way to put it. 😉
Hmm...interesting. Feel free to give it a shot. Then we can re-apply your PR with that patch if it works. |
i don't have a windows box with a toolchain i can test with. if i generate a change against head, would you (or someone) be willing to try it? |
We test the codebase against Appveyor, which runs Windows boxes. You can create an account for free, hook it up to your GitHub account, and it will run those tests for you before you submit the PR (or you could just submit immediately and let the tests run). |
is there a test case that will catch this problem now? |
Yes, I added one in this PR. |
#16675 has been created. do i just wait now? |
* Revert "BUG: Revert gh-16039 (#16663)" This reverts commit c550372. * always treat files as binary to cope with windows and EOF. on windows, EOF can appear "in band" if the file is considered text. when moving from fread() to read(), i lost the "b" part of the mode. at the time i believed this was a nop, since unix doesnt treat files differently based on that flag. this adds O_BINARY to the flags to open to restore the behaviour lost when taking "b" away from fopen. if a platform doesn't provide O_BINARY, this defines it to 0 so it can still be used without effect later on in the code. * dont leak the fd in new_file_source() if buffer allocation fails. * reapply the test for EOF in the middle of a stream. part of c550372 * pass rb to _get_handle on python 3, otherwise stick to r. part of c550372 * replace goto with inline unwinding of state. requested by @jreback in #16675 feedback. * describe the fixes to the read_csv() backend and issue numbers. requested by @jreback in feedback on #16675
pandas-devgh-16039 created a bug in which files containing byte-like data could break, as EOF characters mid-field (despite being quoted) would cause premature line breaks. Given that this PR was a performance patch, this commit can be safely reverted. Closes pandas-devgh-16559. (cherry picked from commit c550372)
…as-dev#16675) * Revert "BUG: Revert pandas-devgh-16039 (pandas-dev#16663)" This reverts commit c550372. * always treat files as binary to cope with windows and EOF. on windows, EOF can appear "in band" if the file is considered text. when moving from fread() to read(), i lost the "b" part of the mode. at the time i believed this was a nop, since unix doesnt treat files differently based on that flag. this adds O_BINARY to the flags to open to restore the behaviour lost when taking "b" away from fopen. if a platform doesn't provide O_BINARY, this defines it to 0 so it can still be used without effect later on in the code. * dont leak the fd in new_file_source() if buffer allocation fails. * reapply the test for EOF in the middle of a stream. part of c550372 * pass rb to _get_handle on python 3, otherwise stick to r. part of c550372 * replace goto with inline unwinding of state. requested by @jreback in pandas-dev#16675 feedback. * describe the fixes to the read_csv() backend and issue numbers. requested by @jreback in feedback on pandas-dev#16675 (cherry picked from commit d298414)
* Revert "BUG: Revert gh-16039 (#16663)" This reverts commit c550372. * always treat files as binary to cope with windows and EOF. on windows, EOF can appear "in band" if the file is considered text. when moving from fread() to read(), i lost the "b" part of the mode. at the time i believed this was a nop, since unix doesnt treat files differently based on that flag. this adds O_BINARY to the flags to open to restore the behaviour lost when taking "b" away from fopen. if a platform doesn't provide O_BINARY, this defines it to 0 so it can still be used without effect later on in the code. * dont leak the fd in new_file_source() if buffer allocation fails. * reapply the test for EOF in the middle of a stream. part of c550372 * pass rb to _get_handle on python 3, otherwise stick to r. part of c550372 * replace goto with inline unwinding of state. requested by @jreback in #16675 feedback. * describe the fixes to the read_csv() backend and issue numbers. requested by @jreback in feedback on #16675 (cherry picked from commit d298414)
#16039 created a bug in which files containing byte-like data could break on Windows, as EOF characters mid-field (despite being quoted) would cause premature line breaks. Given that this PR was a performance patch, this commit can be safely reverted.
Closes #16559 (confirmed with unit test).
cc @dgwynne