move io.c from using unbuffered fread()s to read()s. #16039

dgwynne · 2017-04-18T01:25:36Z

pandas already buffers reads coming from io.c itself, so it previously
used setbuf() to disable buffering inside fread(). however, certain
implementations of unbuffered stdio reads are sub-optimal. for example,
fread() in solaris ends up doing a read() for each individual byte of
the underlying filedescriptor, which turns out to be very slow.

instead, this code now open()s a file descritor and read()s directly
into the buffer that pandas has already allocated. this is effectively
what other libcs (eg, glibc) do underneath an unbuffered fread() anyway,
but this is more explicit.

while here, this tweaks the mmap backend to use open() too, and also
properly checks for mmap failure by comparing its result to MAP_FAILED
instead of NULL.

closes #xxxx
tests added / passed
passes git diff upstream/master --name-only -- '*.py' | flake8 --diff
whatsnew entry

jreback · 2017-04-18T01:47:16Z

https://travis-ci.org/pandas-dev/pandas/jobs/222988790
couple of cpplint errors.

Is there any kind of test which can detect this? (or just perf on some systems)?

dgwynne · 2017-04-18T02:48:32Z

As you implied, it isn't a hard failure. The problem is performance on Solaris libc based systems is bad. A test of the bad performance is simply to use read_csv() on such a system and compare the runtime before and after the change.

codecov · 2017-04-19T00:52:00Z

Codecov Report

Merging #16039 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #16039      +/-   ##
==========================================
+ Coverage   90.99%      91%   +<.01%     
==========================================
  Files         153      153              
  Lines       50481    50481              
==========================================
+ Hits        45937    45938       +1     
+ Misses       4544     4543       -1

Flag	Coverage Δ
#multiple	`88.76% <ø> (ø)`	⬆️
#single	`40.43% <ø> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/common.py	`90.68% <0%> (-0.35%)`	⬇️
pandas/util/testing.py	`80.73% <0%> (+0.18%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cd1031f...b36c1ff. Read the comment docs.

codecov · 2017-04-19T00:52:27Z

Codecov Report

Merging #16039 into master will decrease coverage by 0.02%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #16039      +/-   ##
==========================================
- Coverage   90.84%   90.82%   -0.03%     
==========================================
  Files         159      159              
  Lines       50791    50791              
==========================================
- Hits        46143    46132      -11     
- Misses       4648     4659      +11

Flag	Coverage Δ
#multiple	`88.6% <ø> (-0.03%)`	⬇️
#single	`40.36% <ø> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/plotting/_converter.py	`63.54% <0%> (-1.82%)`	⬇️
pandas/core/common.py	`91.03% <0%> (+0.34%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b17e286...b29b032. Read the comment docs.

dgwynne · 2017-04-19T01:37:43Z

I'm not sure, but I don't think the new build failures are the result of b36c1ff, which was basically whitespace tweaks. Alternatively fixing lint has made a later check run which did hit a problem. However, I don't think that is the case here.

jreback · 2017-04-19T10:15:32Z

can you show running an asv (perf testing) on csv benchmarks (just to assert that nothing has changed).

dgwynne · 2017-04-20T04:03:56Z

ill try and figure out asv. in the meantime, here are some runs before and after the change.

the py code i tested with:

import pandas as pd

data = pd.read_csv('yellow_tripdata_2016-12.csv')
print('Length: {} rows.'.format(len(data)))

this is a zone (container) on a solaris derivative box (https://www.joyent.com/smartos):

simple wc (to warm the caches?):

[admin@ade240c5-d3bf-40b3-8b24-1940e686775a ~]$ time wc -l yellow_tripdata_2016-12.csv 
10449409 yellow_tripdata_2016-12.csv

real    0m0.973s
user    0m0.600s
sys     0m0.371s

before:

[admin@ade240c5-d3bf-40b3-8b24-1940e686775a ~]$ time python pd.py 
Length: 10449408 rows.

real    35m8.845s
user    3m24.646s
sys     31m42.769s

after:

[admin@ade240c5-d3bf-40b3-8b24-1940e686775a ~]$ time python pd.py 
Length: 10449408 rows.

real    0m36.593s
user    0m30.888s
sys     0m4.038s

pretty brutal.

on a linux box:

wc:

uqdgwynn@mango ~$ time wc -l yellow_tripdata_2016-12.csv 
10449409 yellow_tripdata_2016-12.csv

real    0m0.945s
user    0m0.409s
sys     0m0.517s

before:

uqdgwynn@mango ~$ time python pd.py 
Length: 10449408 rows.

real    1m25.830s
user    0m56.329s
sys     0m26.963s
uqdgwynn@mango ~$ time python pd.py 
Length: 10449408 rows.

real    1m17.969s
user    1m2.792s
sys     0m14.248s

after:

uqdgwynn@mango ~$ time python pd.py 
Length: 10449408 rows.

real    0m47.950s
user    0m37.004s
sys     0m8.825s

im surprised the linux test shows a difference.

pandas already buffers reads coming from io.c itself, so it previously used setbuf() to disable buffering inside fread(). however, certain implementations of unbuffered stdio reads are sub-optimal. for example, fread() in solaris ends up doing a read() for each individual byte of the underlying filedescriptor, which turns out to be very slow. instead, this code now open()s a file descritor and read()s directly into the buffer that pandas has already allocated. this is effectively what other libcs (eg, glibc) do underneath an unbuffered fread() anyway, but this is more explicit. while here, this tweaks the mmap backend to use open() too, and also properly checks for mmap failure by comparing its result to MAP_FAILED instead of NULL. closes #16039

dgwynne · 2017-04-20T10:46:33Z

i ran asv continuous -f 1.1 -E virtualenv origin/parser-no-stdio HEAD in the asv_bench directory, and after each individual benchmark it produced "BENCHMARKS NOT SIGNIFICANTLY CHANGED." in green.

jreback · 2017-04-20T10:46:34Z

this looks good perf wise (neutral on my tests), so if its better on other platforms +1.

jreback · 2017-04-20T22:34:42Z

thanks @dgwynne

* PERF: move io.c from using unbuffered fread()s to read()s. pandas already buffers reads coming from io.c itself, so it previously used setbuf() to disable buffering inside fread(). however, certain implementations of unbuffered stdio reads are sub-optimal. for example, fread() in solaris ends up doing a read() for each individual byte of the underlying filedescriptor, which turns out to be very slow. instead, this code now open()s a file descritor and read()s directly into the buffer that pandas has already allocated. this is effectively what other libcs (eg, glibc) do underneath an unbuffered fread() anyway, but this is more explicit. while here, this tweaks the mmap backend to use open() too, and also properly checks for mmap failure by comparing its result to MAP_FAILED instead of NULL. closes pandas-dev#16039 * DOC: add whatsnew

pandas-devgh-16039 created a bug in which files containing byte-like data could break, as EOF characters mid-field (despite being quoted) would cause premature line breaks. Given that this PR was a performance patch, this commit can be safely reverted. Closes pandas-devgh-16559.

gh-16039 created a bug in which files containing byte-like data could break, as EOF characters mid-field (despite being quoted) would cause premature line breaks. Given that this PR was a performance patch, this commit can be safely reverted. Closes gh-16559.

@jreback

* Revert "BUG: Revert gh-16039 (#16663)" This reverts commit c550372. * always treat files as binary to cope with windows and EOF. on windows, EOF can appear "in band" if the file is considered text. when moving from fread() to read(), i lost the "b" part of the mode. at the time i believed this was a nop, since unix doesnt treat files differently based on that flag. this adds O_BINARY to the flags to open to restore the behaviour lost when taking "b" away from fopen. if a platform doesn't provide O_BINARY, this defines it to 0 so it can still be used without effect later on in the code. * dont leak the fd in new_file_source() if buffer allocation fails. * reapply the test for EOF in the middle of a stream. part of c550372 * pass rb to _get_handle on python 3, otherwise stick to r. part of c550372 * replace goto with inline unwinding of state. requested by @jreback in #16675 feedback. * describe the fixes to the read_csv() backend and issue numbers. requested by @jreback in feedback on #16675

pandas-devgh-16039 created a bug in which files containing byte-like data could break, as EOF characters mid-field (despite being quoted) would cause premature line breaks. Given that this PR was a performance patch, this commit can be safely reverted. Closes pandas-devgh-16559. (cherry picked from commit c550372)

@jreback

…as-dev#16675) * Revert "BUG: Revert pandas-devgh-16039 (pandas-dev#16663)" This reverts commit c550372. * always treat files as binary to cope with windows and EOF. on windows, EOF can appear "in band" if the file is considered text. when moving from fread() to read(), i lost the "b" part of the mode. at the time i believed this was a nop, since unix doesnt treat files differently based on that flag. this adds O_BINARY to the flags to open to restore the behaviour lost when taking "b" away from fopen. if a platform doesn't provide O_BINARY, this defines it to 0 so it can still be used without effect later on in the code. * dont leak the fd in new_file_source() if buffer allocation fails. * reapply the test for EOF in the middle of a stream. part of c550372 * pass rb to _get_handle on python 3, otherwise stick to r. part of c550372 * replace goto with inline unwinding of state. requested by @jreback in pandas-dev#16675 feedback. * describe the fixes to the read_csv() backend and issue numbers. requested by @jreback in feedback on pandas-dev#16675 (cherry picked from commit d298414)

gh-16039 created a bug in which files containing byte-like data could break, as EOF characters mid-field (despite being quoted) would cause premature line breaks. Given that this PR was a performance patch, this commit can be safely reverted. Closes gh-16559. (cherry picked from commit c550372)

@jreback

* Revert "BUG: Revert gh-16039 (#16663)" This reverts commit c550372. * always treat files as binary to cope with windows and EOF. on windows, EOF can appear "in band" if the file is considered text. when moving from fread() to read(), i lost the "b" part of the mode. at the time i believed this was a nop, since unix doesnt treat files differently based on that flag. this adds O_BINARY to the flags to open to restore the behaviour lost when taking "b" away from fopen. if a platform doesn't provide O_BINARY, this defines it to 0 so it can still be used without effect later on in the code. * dont leak the fd in new_file_source() if buffer allocation fails. * reapply the test for EOF in the middle of a stream. part of c550372 * pass rb to _get_handle on python 3, otherwise stick to r. part of c550372 * replace goto with inline unwinding of state. requested by @jreback in #16675 feedback. * describe the fixes to the read_csv() backend and issue numbers. requested by @jreback in feedback on #16675 (cherry picked from commit d298414)

jreback added the IO CSV read_csv, to_csv label Apr 18, 2017

jreback added the Compat pandas objects compatability with Numpy or Python functions label Apr 19, 2017

dgwynne and others added 2 commits April 20, 2017 06:18

DOC: add whatsnew

b29b032

jreback added this to the 0.20.0 milestone Apr 20, 2017

jreback merged commit ff00b55 into pandas-dev:master Apr 20, 2017

dgwynne deleted the parser-no-stdio branch April 20, 2017 23:44

This was referenced Jun 11, 2017

Old read_csv() & EOF character issue back #16559

Closed

BUG: csv parsing with EOF byte on windows, Revert gh-16039 #16663

Merged

dgwynne mentioned this pull request Jun 12, 2017

revert #16663, which was a revert of #16039 #16675

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

move io.c from using unbuffered fread()s to read()s. #16039

move io.c from using unbuffered fread()s to read()s. #16039

Uh oh!

dgwynne commented Apr 18, 2017

Uh oh!

jreback commented Apr 18, 2017

Uh oh!

dgwynne commented Apr 18, 2017

Uh oh!

codecov bot commented Apr 19, 2017

Uh oh!

codecov bot commented Apr 19, 2017 •

edited

Loading

Uh oh!

dgwynne commented Apr 19, 2017

Uh oh!

jreback commented Apr 19, 2017

Uh oh!

dgwynne commented Apr 20, 2017

Uh oh!

dgwynne commented Apr 20, 2017

Uh oh!

jreback commented Apr 20, 2017

Uh oh!

jreback commented Apr 20, 2017

Uh oh!

Uh oh!

Uh oh!

move io.c from using unbuffered fread()s to read()s. #16039

move io.c from using unbuffered fread()s to read()s. #16039

Uh oh!

Conversation

dgwynne commented Apr 18, 2017

Uh oh!

jreback commented Apr 18, 2017

Uh oh!

dgwynne commented Apr 18, 2017

Uh oh!

codecov bot commented Apr 19, 2017

Codecov Report

Uh oh!

codecov bot commented Apr 19, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

dgwynne commented Apr 19, 2017

Uh oh!

jreback commented Apr 19, 2017

Uh oh!

dgwynne commented Apr 20, 2017

Uh oh!

dgwynne commented Apr 20, 2017

Uh oh!

jreback commented Apr 20, 2017

Uh oh!

jreback commented Apr 20, 2017

Uh oh!

Uh oh!

codecov bot commented Apr 19, 2017 •

edited

Loading