-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
revert #16663, which was a revert of #16039 #16675
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
on windows, EOF can appear "in band" if the file is considered text. when moving from fread() to read(), i lost the "b" part of the mode. at the time i believed this was a nop, since unix doesnt treat files differently based on that flag. this adds O_BINARY to the flags to open to restore the behaviour lost when taking "b" away from fopen. if a platform doesn't provide O_BINARY, this defines it to 0 so it can still be used without effect later on in the code.
You need to re-add the test that I wrote to ensure that your change is correct. |
i did think it was going too well... |
Codecov Report
@@ Coverage Diff @@
## master #16675 +/- ##
==========================================
- Coverage 90.92% 90.92% -0.01%
==========================================
Files 161 161
Lines 49272 49271 -1
==========================================
- Hits 44803 44802 -1
Misses 4469 4469
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #16675 +/- ##
==========================================
- Coverage 90.92% 90.92% -0.01%
==========================================
Files 161 161
Lines 49272 49271 -1
==========================================
- Hits 44803 44802 -1
Misses 4469 4469
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #16675 +/- ##
==========================================
- Coverage 90.92% 90.92% -0.01%
==========================================
Files 161 161
Lines 49272 49276 +4
==========================================
+ Hits 44803 44804 +1
- Misses 4469 4472 +3
Continue to review full report at Codecov.
|
@dgwynne : LGTM, well done re-patching your patch 😄 |
@gfyoung : thanks. sorry for the breakage though. it never even occurred to me that there'd be encoding issues like this. it's also encouraging to see how quickly such issues are handled. reverting the diff was obviously right at the time. |
@dgwynne : No worries! We can't see everything that breaks if we don't test for it. ;) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approach lgtm. couple of comments.
@@ -54,7 +54,6 @@ Indexing | |||
I/O | |||
^^^ | |||
|
|||
- Bug in ``pd.read_csv()`` in which files containing EOF characters mid-field could fail with the C engine on Windows (:issue:`16039`, :issue:`16559`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re-add this (and add this PR number on as well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok.
pandas/_libs/src/parser/io.c
Outdated
fs->initial_file_pos = ftell(fs->fp); | ||
fs->fd = open(fname, O_RDONLY | O_BINARY); | ||
if (fs->fd == -1) { | ||
goto err_free; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would in-line these instead of goto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im not keen on replacing the goto with duplicate code in all the error paths. in my experience using goto in this specific situation, ie, unwinding the generation of state like this, has more benefits than caveats. inlining the cleanup leads to larger code, and is more error prone to maintain. if you add the allocation of something else in the future, you have to judge and fix the cleanup in multiple places rather than adding to the stack of goto targets.
if you feel strongly about it ill change it. after all, im just visiting :). i just wanted to discuss why i did it like this in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am fine either way
but right now it's about half and half
so if u can remove some calls and use the goto that's fine (or go the other way imho is actually more explicit)
code size makes almost no difference here
correct is the important issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so inline?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
pandas/_libs/src/parser/io.c
Outdated
if (mm->fd == -1) { | ||
fprintf(stderr, "new_file_buffer: open(%s) failed. errno =%d\n", | ||
fname, errno); | ||
goto err_free; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here
pandas/_libs/src/parser/io.c
Outdated
if (fstat(mm->fd, &stat) == -1) { | ||
fprintf(stderr, "new_file_buffer: fstat() failed. errno =%d\n", | ||
errno); | ||
goto err_close; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here
pandas/_libs/src/parser/io.c
Outdated
/* XXX Eventually remove this print statement. */ | ||
fprintf(stderr, "new_file_buffer: mmap() failed.\n"); | ||
free(mm); | ||
mm = NULL; | ||
goto err_close; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here
thanks @dgwynne ! |
…as-dev#16675) * Revert "BUG: Revert pandas-devgh-16039 (pandas-dev#16663)" This reverts commit c550372. * always treat files as binary to cope with windows and EOF. on windows, EOF can appear "in band" if the file is considered text. when moving from fread() to read(), i lost the "b" part of the mode. at the time i believed this was a nop, since unix doesnt treat files differently based on that flag. this adds O_BINARY to the flags to open to restore the behaviour lost when taking "b" away from fopen. if a platform doesn't provide O_BINARY, this defines it to 0 so it can still be used without effect later on in the code. * dont leak the fd in new_file_source() if buffer allocation fails. * reapply the test for EOF in the middle of a stream. part of c550372 * pass rb to _get_handle on python 3, otherwise stick to r. part of c550372 * replace goto with inline unwinding of state. requested by @jreback in pandas-dev#16675 feedback. * describe the fixes to the read_csv() backend and issue numbers. requested by @jreback in feedback on pandas-dev#16675 (cherry picked from commit d298414)
* Revert "BUG: Revert gh-16039 (#16663)" This reverts commit c550372. * always treat files as binary to cope with windows and EOF. on windows, EOF can appear "in band" if the file is considered text. when moving from fread() to read(), i lost the "b" part of the mode. at the time i believed this was a nop, since unix doesnt treat files differently based on that flag. this adds O_BINARY to the flags to open to restore the behaviour lost when taking "b" away from fopen. if a platform doesn't provide O_BINARY, this defines it to 0 so it can still be used without effect later on in the code. * dont leak the fd in new_file_source() if buffer allocation fails. * reapply the test for EOF in the middle of a stream. part of c550372 * pass rb to _get_handle on python 3, otherwise stick to r. part of c550372 * replace goto with inline unwinding of state. requested by @jreback in #16675 feedback. * describe the fixes to the read_csv() backend and issue numbers. requested by @jreback in feedback on #16675 (cherry picked from commit d298414)
i believe the issue with #16039 was the loss of the semantics provided by the "b" part of the mode in
fopen()
. this is a nop on unix-like systems, but appears to be significant on windows.the equivalent functionality on windows in
open()
is theO_BINARY
flag. this adds that to theopen()
call, and provides compat for it on platforms without the flag by defining it to 0.while here, fix an fd leak in
new_file_source()
if the buffer allocation fails.git diff upstream/master --name-only -- '*.py' | flake8 --diff