-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
move io.c from using unbuffered fread()s to read()s. #16039
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
https://travis-ci.org/pandas-dev/pandas/jobs/222988790 Is there any kind of test which can detect this? (or just perf on some systems)? |
As you implied, it isn't a hard failure. The problem is performance on Solaris libc based systems is bad. A test of the bad performance is simply to use |
Codecov Report
@@ Coverage Diff @@
## master #16039 +/- ##
==========================================
+ Coverage 90.99% 91% +<.01%
==========================================
Files 153 153
Lines 50481 50481
==========================================
+ Hits 45937 45938 +1
+ Misses 4544 4543 -1
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #16039 +/- ##
==========================================
- Coverage 90.84% 90.82% -0.03%
==========================================
Files 159 159
Lines 50791 50791
==========================================
- Hits 46143 46132 -11
- Misses 4648 4659 +11
Continue to review full report at Codecov.
|
I'm not sure, but I don't think the new build failures are the result of b36c1ff, which was basically whitespace tweaks. Alternatively fixing lint has made a later check run which did hit a problem. However, I don't think that is the case here. |
can you show running an asv (perf testing) on csv benchmarks (just to assert that nothing has changed). |
ill try and figure out asv. in the meantime, here are some runs before and after the change. the py code i tested with:
this is a zone (container) on a solaris derivative box (https://www.joyent.com/smartos): simple wc (to warm the caches?):
before:
after:
pretty brutal. on a linux box: wc:
before:
after:
im surprised the linux test shows a difference. |
pandas already buffers reads coming from io.c itself, so it previously used setbuf() to disable buffering inside fread(). however, certain implementations of unbuffered stdio reads are sub-optimal. for example, fread() in solaris ends up doing a read() for each individual byte of the underlying filedescriptor, which turns out to be very slow. instead, this code now open()s a file descritor and read()s directly into the buffer that pandas has already allocated. this is effectively what other libcs (eg, glibc) do underneath an unbuffered fread() anyway, but this is more explicit. while here, this tweaks the mmap backend to use open() too, and also properly checks for mmap failure by comparing its result to MAP_FAILED instead of NULL. closes #16039
i ran |
this looks good perf wise (neutral on my tests), so if its better on other platforms +1. |
thanks @dgwynne |
* PERF: move io.c from using unbuffered fread()s to read()s. pandas already buffers reads coming from io.c itself, so it previously used setbuf() to disable buffering inside fread(). however, certain implementations of unbuffered stdio reads are sub-optimal. for example, fread() in solaris ends up doing a read() for each individual byte of the underlying filedescriptor, which turns out to be very slow. instead, this code now open()s a file descritor and read()s directly into the buffer that pandas has already allocated. this is effectively what other libcs (eg, glibc) do underneath an unbuffered fread() anyway, but this is more explicit. while here, this tweaks the mmap backend to use open() too, and also properly checks for mmap failure by comparing its result to MAP_FAILED instead of NULL. closes pandas-dev#16039 * DOC: add whatsnew
pandas-devgh-16039 created a bug in which files containing byte-like data could break, as EOF characters mid-field (despite being quoted) would cause premature line breaks. Given that this PR was a performance patch, this commit can be safely reverted. Closes pandas-devgh-16559.
pandas-devgh-16039 created a bug in which files containing byte-like data could break, as EOF characters mid-field (despite being quoted) would cause premature line breaks. Given that this PR was a performance patch, this commit can be safely reverted. Closes pandas-devgh-16559.
pandas-devgh-16039 created a bug in which files containing byte-like data could break, as EOF characters mid-field (despite being quoted) would cause premature line breaks. Given that this PR was a performance patch, this commit can be safely reverted. Closes pandas-devgh-16559.
pandas-devgh-16039 created a bug in which files containing byte-like data could break, as EOF characters mid-field (despite being quoted) would cause premature line breaks. Given that this PR was a performance patch, this commit can be safely reverted. Closes pandas-devgh-16559.
* Revert "BUG: Revert gh-16039 (#16663)" This reverts commit c550372. * always treat files as binary to cope with windows and EOF. on windows, EOF can appear "in band" if the file is considered text. when moving from fread() to read(), i lost the "b" part of the mode. at the time i believed this was a nop, since unix doesnt treat files differently based on that flag. this adds O_BINARY to the flags to open to restore the behaviour lost when taking "b" away from fopen. if a platform doesn't provide O_BINARY, this defines it to 0 so it can still be used without effect later on in the code. * dont leak the fd in new_file_source() if buffer allocation fails. * reapply the test for EOF in the middle of a stream. part of c550372 * pass rb to _get_handle on python 3, otherwise stick to r. part of c550372 * replace goto with inline unwinding of state. requested by @jreback in #16675 feedback. * describe the fixes to the read_csv() backend and issue numbers. requested by @jreback in feedback on #16675
pandas-devgh-16039 created a bug in which files containing byte-like data could break, as EOF characters mid-field (despite being quoted) would cause premature line breaks. Given that this PR was a performance patch, this commit can be safely reverted. Closes pandas-devgh-16559. (cherry picked from commit c550372)
…as-dev#16675) * Revert "BUG: Revert pandas-devgh-16039 (pandas-dev#16663)" This reverts commit c550372. * always treat files as binary to cope with windows and EOF. on windows, EOF can appear "in band" if the file is considered text. when moving from fread() to read(), i lost the "b" part of the mode. at the time i believed this was a nop, since unix doesnt treat files differently based on that flag. this adds O_BINARY to the flags to open to restore the behaviour lost when taking "b" away from fopen. if a platform doesn't provide O_BINARY, this defines it to 0 so it can still be used without effect later on in the code. * dont leak the fd in new_file_source() if buffer allocation fails. * reapply the test for EOF in the middle of a stream. part of c550372 * pass rb to _get_handle on python 3, otherwise stick to r. part of c550372 * replace goto with inline unwinding of state. requested by @jreback in pandas-dev#16675 feedback. * describe the fixes to the read_csv() backend and issue numbers. requested by @jreback in feedback on pandas-dev#16675 (cherry picked from commit d298414)
* Revert "BUG: Revert gh-16039 (#16663)" This reverts commit c550372. * always treat files as binary to cope with windows and EOF. on windows, EOF can appear "in band" if the file is considered text. when moving from fread() to read(), i lost the "b" part of the mode. at the time i believed this was a nop, since unix doesnt treat files differently based on that flag. this adds O_BINARY to the flags to open to restore the behaviour lost when taking "b" away from fopen. if a platform doesn't provide O_BINARY, this defines it to 0 so it can still be used without effect later on in the code. * dont leak the fd in new_file_source() if buffer allocation fails. * reapply the test for EOF in the middle of a stream. part of c550372 * pass rb to _get_handle on python 3, otherwise stick to r. part of c550372 * replace goto with inline unwinding of state. requested by @jreback in #16675 feedback. * describe the fixes to the read_csv() backend and issue numbers. requested by @jreback in feedback on #16675 (cherry picked from commit d298414)
pandas already buffers reads coming from io.c itself, so it previously
used setbuf() to disable buffering inside fread(). however, certain
implementations of unbuffered stdio reads are sub-optimal. for example,
fread() in solaris ends up doing a read() for each individual byte of
the underlying filedescriptor, which turns out to be very slow.
instead, this code now open()s a file descritor and read()s directly
into the buffer that pandas has already allocated. this is effectively
what other libcs (eg, glibc) do underneath an unbuffered fread() anyway,
but this is more explicit.
while here, this tweaks the mmap backend to use open() too, and also
properly checks for mmap failure by comparing its result to MAP_FAILED
instead of NULL.
git diff upstream/master --name-only -- '*.py' | flake8 --diff