Skip to content

move io.c from using unbuffered fread()s to read()s. #16039

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 20, 2017
Merged

move io.c from using unbuffered fread()s to read()s. #16039

merged 2 commits into from
Apr 20, 2017

Conversation

dgwynne
Copy link
Contributor

@dgwynne dgwynne commented Apr 18, 2017

pandas already buffers reads coming from io.c itself, so it previously
used setbuf() to disable buffering inside fread(). however, certain
implementations of unbuffered stdio reads are sub-optimal. for example,
fread() in solaris ends up doing a read() for each individual byte of
the underlying filedescriptor, which turns out to be very slow.

instead, this code now open()s a file descritor and read()s directly
into the buffer that pandas has already allocated. this is effectively
what other libcs (eg, glibc) do underneath an unbuffered fread() anyway,
but this is more explicit.

while here, this tweaks the mmap backend to use open() too, and also
properly checks for mmap failure by comparing its result to MAP_FAILED
instead of NULL.

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master --name-only -- '*.py' | flake8 --diff
  • whatsnew entry

@jreback
Copy link
Contributor

jreback commented Apr 18, 2017

https://travis-ci.org/pandas-dev/pandas/jobs/222988790
couple of cpplint errors.

Is there any kind of test which can detect this? (or just perf on some systems)?

@jreback jreback added the IO CSV read_csv, to_csv label Apr 18, 2017
@dgwynne
Copy link
Contributor Author

dgwynne commented Apr 18, 2017

As you implied, it isn't a hard failure. The problem is performance on Solaris libc based systems is bad. A test of the bad performance is simply to use read_csv() on such a system and compare the runtime before and after the change.

@codecov
Copy link

codecov bot commented Apr 19, 2017

Codecov Report

Merging #16039 into master will increase coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16039      +/-   ##
==========================================
+ Coverage   90.99%      91%   +<.01%     
==========================================
  Files         153      153              
  Lines       50481    50481              
==========================================
+ Hits        45937    45938       +1     
+ Misses       4544     4543       -1
Flag Coverage Δ
#multiple 88.76% <ø> (ø) ⬆️
#single 40.43% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/common.py 90.68% <0%> (-0.35%) ⬇️
pandas/util/testing.py 80.73% <0%> (+0.18%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cd1031f...b36c1ff. Read the comment docs.

@codecov
Copy link

codecov bot commented Apr 19, 2017

Codecov Report

Merging #16039 into master will decrease coverage by 0.02%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16039      +/-   ##
==========================================
- Coverage   90.84%   90.82%   -0.03%     
==========================================
  Files         159      159              
  Lines       50791    50791              
==========================================
- Hits        46143    46132      -11     
- Misses       4648     4659      +11
Flag Coverage Δ
#multiple 88.6% <ø> (-0.03%) ⬇️
#single 40.36% <ø> (ø) ⬆️
Impacted Files Coverage Δ
pandas/plotting/_converter.py 63.54% <0%> (-1.82%) ⬇️
pandas/core/common.py 91.03% <0%> (+0.34%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b17e286...b29b032. Read the comment docs.

@dgwynne
Copy link
Contributor Author

dgwynne commented Apr 19, 2017

I'm not sure, but I don't think the new build failures are the result of b36c1ff, which was basically whitespace tweaks. Alternatively fixing lint has made a later check run which did hit a problem. However, I don't think that is the case here.

@jreback
Copy link
Contributor

jreback commented Apr 19, 2017

can you show running an asv (perf testing) on csv benchmarks (just to assert that nothing has changed).

@jreback jreback added the Compat pandas objects compatability with Numpy or Python functions label Apr 19, 2017
@dgwynne
Copy link
Contributor Author

dgwynne commented Apr 20, 2017

ill try and figure out asv. in the meantime, here are some runs before and after the change.

the py code i tested with:

import pandas as pd

data = pd.read_csv('yellow_tripdata_2016-12.csv')
print('Length: {} rows.'.format(len(data)))

this is a zone (container) on a solaris derivative box (https://www.joyent.com/smartos):

simple wc (to warm the caches?):

[admin@ade240c5-d3bf-40b3-8b24-1940e686775a ~]$ time wc -l yellow_tripdata_2016-12.csv 
10449409 yellow_tripdata_2016-12.csv

real    0m0.973s
user    0m0.600s
sys     0m0.371s

before:

[admin@ade240c5-d3bf-40b3-8b24-1940e686775a ~]$ time python pd.py 
Length: 10449408 rows.

real    35m8.845s
user    3m24.646s
sys     31m42.769s

after:

[admin@ade240c5-d3bf-40b3-8b24-1940e686775a ~]$ time python pd.py 
Length: 10449408 rows.

real    0m36.593s
user    0m30.888s
sys     0m4.038s

pretty brutal.

on a linux box:

wc:

uqdgwynn@mango ~$ time wc -l yellow_tripdata_2016-12.csv 
10449409 yellow_tripdata_2016-12.csv

real    0m0.945s
user    0m0.409s
sys     0m0.517s

before:

uqdgwynn@mango ~$ time python pd.py 
Length: 10449408 rows.

real    1m25.830s
user    0m56.329s
sys     0m26.963s
uqdgwynn@mango ~$ time python pd.py 
Length: 10449408 rows.

real    1m17.969s
user    1m2.792s
sys     0m14.248s

after:

uqdgwynn@mango ~$ time python pd.py 
Length: 10449408 rows.

real    0m47.950s
user    0m37.004s
sys     0m8.825s

im surprised the linux test shows a difference.

dgwynne and others added 2 commits April 20, 2017 06:18
pandas already buffers reads coming from io.c itself, so it previously
used setbuf() to disable buffering inside fread(). however, certain
implementations of unbuffered stdio reads are sub-optimal. for example,
fread() in solaris ends up doing a read() for each individual byte of
the underlying filedescriptor, which turns out to be very slow.

instead, this code now open()s a file descritor and read()s directly
into the buffer that pandas has already allocated. this is effectively
what other libcs (eg, glibc) do underneath an unbuffered fread() anyway,
but this is more explicit.

while here, this tweaks the mmap backend to use open() too, and also
properly checks for mmap failure by comparing its result to MAP_FAILED
instead of NULL.

closes #16039
@dgwynne
Copy link
Contributor Author

dgwynne commented Apr 20, 2017

i ran asv continuous -f 1.1 -E virtualenv origin/parser-no-stdio HEAD in the asv_bench directory, and after each individual benchmark it produced "BENCHMARKS NOT SIGNIFICANTLY CHANGED." in green.

@jreback
Copy link
Contributor

jreback commented Apr 20, 2017

this looks good perf wise (neutral on my tests), so if its better on other platforms +1.

@jreback jreback added this to the 0.20.0 milestone Apr 20, 2017
@jreback jreback merged commit ff00b55 into pandas-dev:master Apr 20, 2017
@jreback
Copy link
Contributor

jreback commented Apr 20, 2017

thanks @dgwynne

@dgwynne dgwynne deleted the parser-no-stdio branch April 20, 2017 23:44
pcluo pushed a commit to pcluo/pandas that referenced this pull request May 22, 2017
* PERF: move io.c from using unbuffered fread()s to read()s.

pandas already buffers reads coming from io.c itself, so it previously
used setbuf() to disable buffering inside fread(). however, certain
implementations of unbuffered stdio reads are sub-optimal. for example,
fread() in solaris ends up doing a read() for each individual byte of
the underlying filedescriptor, which turns out to be very slow.

instead, this code now open()s a file descritor and read()s directly
into the buffer that pandas has already allocated. this is effectively
what other libcs (eg, glibc) do underneath an unbuffered fread() anyway,
but this is more explicit.

while here, this tweaks the mmap backend to use open() too, and also
properly checks for mmap failure by comparing its result to MAP_FAILED
instead of NULL.

closes pandas-dev#16039

* DOC: add whatsnew
gfyoung added a commit to forking-repos/pandas that referenced this pull request Jun 11, 2017
pandas-devgh-16039 created a bug in which files containing
byte-like data could break, as EOF characters mid-field
(despite being quoted) would cause premature line breaks.

Given that this PR was a performance patch, this
commit can be safely reverted.

Closes pandas-devgh-16559.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Jun 11, 2017
pandas-devgh-16039 created a bug in which files containing
byte-like data could break, as EOF characters mid-field
(despite being quoted) would cause premature line breaks.

Given that this PR was a performance patch, this
commit can be safely reverted.

Closes pandas-devgh-16559.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Jun 11, 2017
pandas-devgh-16039 created a bug in which files containing
byte-like data could break, as EOF characters mid-field
(despite being quoted) would cause premature line breaks.

Given that this PR was a performance patch, this
commit can be safely reverted.

Closes pandas-devgh-16559.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Jun 11, 2017
pandas-devgh-16039 created a bug in which files containing
byte-like data could break, as EOF characters mid-field
(despite being quoted) would cause premature line breaks.

Given that this PR was a performance patch, this
commit can be safely reverted.

Closes pandas-devgh-16559.
jreback pushed a commit that referenced this pull request Jun 11, 2017
gh-16039 created a bug in which files containing
byte-like data could break, as EOF characters mid-field
(despite being quoted) would cause premature line breaks.

Given that this PR was a performance patch, this
commit can be safely reverted.

Closes gh-16559.
jreback pushed a commit that referenced this pull request Jun 13, 2017
* Revert "BUG: Revert gh-16039 (#16663)"

This reverts commit c550372.

* always treat files as binary to cope with windows and EOF.

on windows, EOF can appear "in band" if the file is considered text. when
moving from fread() to read(), i lost the "b" part of the mode. at the
time i believed this was a nop, since unix doesnt treat files differently
based on that flag.

this adds O_BINARY to the flags to open to restore the behaviour lost
when taking "b" away from fopen. if a platform doesn't provide O_BINARY,
this defines it to 0 so it can still be used without effect later on in
the code.

* dont leak the fd in new_file_source() if buffer allocation fails.

* reapply the test for EOF in the middle of a stream.

part of c550372

* pass rb to _get_handle on python 3, otherwise stick to r.

part of c550372

* replace goto with inline unwinding of state.

requested by @jreback in #16675 feedback.

* describe the fixes to the read_csv() backend and issue numbers.

requested by @jreback in feedback on #16675
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Jul 6, 2017
pandas-devgh-16039 created a bug in which files containing
byte-like data could break, as EOF characters mid-field
(despite being quoted) would cause premature line breaks.

Given that this PR was a performance patch, this
commit can be safely reverted.

Closes pandas-devgh-16559.

(cherry picked from commit c550372)
TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Jul 6, 2017
…as-dev#16675)

* Revert "BUG: Revert pandas-devgh-16039 (pandas-dev#16663)"

This reverts commit c550372.

* always treat files as binary to cope with windows and EOF.

on windows, EOF can appear "in band" if the file is considered text. when
moving from fread() to read(), i lost the "b" part of the mode. at the
time i believed this was a nop, since unix doesnt treat files differently
based on that flag.

this adds O_BINARY to the flags to open to restore the behaviour lost
when taking "b" away from fopen. if a platform doesn't provide O_BINARY,
this defines it to 0 so it can still be used without effect later on in
the code.

* dont leak the fd in new_file_source() if buffer allocation fails.

* reapply the test for EOF in the middle of a stream.

part of c550372

* pass rb to _get_handle on python 3, otherwise stick to r.

part of c550372

* replace goto with inline unwinding of state.

requested by @jreback in pandas-dev#16675 feedback.

* describe the fixes to the read_csv() backend and issue numbers.

requested by @jreback in feedback on pandas-dev#16675

(cherry picked from commit d298414)
TomAugspurger pushed a commit that referenced this pull request Jul 7, 2017
gh-16039 created a bug in which files containing
byte-like data could break, as EOF characters mid-field
(despite being quoted) would cause premature line breaks.

Given that this PR was a performance patch, this
commit can be safely reverted.

Closes gh-16559.

(cherry picked from commit c550372)
TomAugspurger pushed a commit that referenced this pull request Jul 7, 2017
* Revert "BUG: Revert gh-16039 (#16663)"

This reverts commit c550372.

* always treat files as binary to cope with windows and EOF.

on windows, EOF can appear "in band" if the file is considered text. when
moving from fread() to read(), i lost the "b" part of the mode. at the
time i believed this was a nop, since unix doesnt treat files differently
based on that flag.

this adds O_BINARY to the flags to open to restore the behaviour lost
when taking "b" away from fopen. if a platform doesn't provide O_BINARY,
this defines it to 0 so it can still be used without effect later on in
the code.

* dont leak the fd in new_file_source() if buffer allocation fails.

* reapply the test for EOF in the middle of a stream.

part of c550372

* pass rb to _get_handle on python 3, otherwise stick to r.

part of c550372

* replace goto with inline unwinding of state.

requested by @jreback in #16675 feedback.

* describe the fixes to the read_csv() backend and issue numbers.

requested by @jreback in feedback on #16675

(cherry picked from commit d298414)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants