PERF: Explore even faster path for df.to_csv #3186


Closed
ghost opened this issue Mar 27, 2013 · 16 comments
Labels: IO CSV, Output-Formatting, Performance

Comments


ghost commented Mar 27, 2013

iotop and a simple-minded C program indicate we're nowhere
near IO-bound in df.to_csv; we're off by roughly 10-15x.

It might be possible to speed things up considerably with a fast path
for special cases (numerical only) that don't need fancy quoting and other
bells and whistles provided by the underlying Python csv module.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int i;
    FILE *f;
    char fmt[] = "%f,%f,%f,%f,%f\n";

    /* rewrite a 1M-row, 5-column float file forever to measure sustained throughput */
    while (1) {
        f = fopen("out.csv", "wb");
        for (i = 0; i < 1000000; i++) {
            fprintf(f, fmt, 1.0, 2.0, 3.0, 4.0, 5.0);
        }
        fclose(f);
    }
}

sustains about 30MB/s on my machine (without even batching writes)
vs ~2-3MB/s for the new (0.11.0) cython df.to_csv().

We need to check whether it's the stringifying, the quoting logic, the memory layout, or something
else that accounts for the difference.

This should also yield insights for any future binary serialization format we implement.
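
A quick way to split the formatting cost from the raw write cost is to time the stringification into an in-memory buffer separately from a plain write of the already-formatted text; a minimal sketch (the sizes and file names here are illustrative only):

import io
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000000, 5))

# formatting cost only: render the CSV into memory, no disk IO involved
buf = io.StringIO()
df.to_csv(buf)
text = buf.getvalue()

# raw write cost only: dump the pre-formatted text straight to disk
with open("out.csv", "w") as f:
    f.write(text)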


cpcloud commented Jul 9, 2013

what if you were to build up buffers of some specified chunk size using iovec structs and make a call into writev/readv, i.e. scatter/gather? is that what you mean by batched writes?
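
For reference, a rough sketch of what a scatter/gather write looks like from Python via os.writev (POSIX-only; the buffer contents and chunk size here are made up), in case that is the direction meant:

import os

# pre-formatted rows, gathered into a list of byte buffers
rows = [("%d,%d,%d\n" % (i, 2 * i, 3 * i)).encode() for i in range(1000)]

fd = os.open("out.csv", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
try:
    # one writev syscall hands all the buffers to the kernel at once
    os.writev(fd, rows)
finally:
    os.close(fd)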


ghost commented Jul 10, 2013

By batched writes I meant that the C code does not buffer pending write data
before issuing a write. Presumably the OS does this for you rather than
going out to disk every time, so it's just function call overhead.

The iovec idea sounds interesting, but how do you know the perf difference
has anything to do with the issue that it addresses?


cpcloud commented Jul 10, 2013

what was the code you used to benchmark to_csv?


ghost commented Jul 10, 2013

I used iotop.


wesm commented Jul 10, 2013

probably makes sense to write a C to-csv routine for the simplest of to_csv outputs (maybe not supporting custom formatter functions to start) at some point. the IO cost will probably outweigh the irregular memory access patterns.
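
For a sense of what such a numeric-only fast path could look like (a sketch only, at the Python level rather than C, and assuming an all-float frame with no index column, no quoting, and no custom formatters):

import numpy as np
import pandas as pd

def to_csv_numeric_fast(df, path, float_format="%.6g"):
    # single homogeneous block of floats, so one fixed format string suffices
    values = df.values
    header = ",".join(str(c) for c in df.columns)
    # np.savetxt writes row by row with no quoting machinery at all
    np.savetxt(path, values, fmt=float_format, delimiter=",",
               header=header, comments="")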

michaelaye commented

I'm interested in making this happen. Here's what I found so far; most likely this is obvious to you, but I would need a hint on how to improve it:

import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(10000, 30))

# this is the slow part; I guess those row-wise loops are dodgy...
def df_to_string(df):
    s = '\n'.join([','.join(df.irow(i).astype('string')) for i in xrange(len(df))])
    return s

This is the fast part, and I'm a Cython noob, so be gentle:

cimport cython
from libc.stdio cimport fopen, FILE, fclose, fprintf

def c_write_to_file(filename, content):
    filename_byte_string = filename.encode("UTF-8")
    cdef char* fname = filename_byte_string
    cdef char* line = content

    cdef FILE* cfile
    cfile = fopen(fname, "w")
    if cfile == NULL:
        return

    # pass the content as an argument, not as the format string itself
    fprintf(cfile, "%s", line)
    fclose(cfile)
    return []

Here are some benchmarks I took:

def df_to_csv_cython(df):
    content = df_to_string(df)
    c_write_to_file('test_out_c.txt', content)

%timeit df_to_csv_cython(df)
1 loops, best of 3: 1.67 s per loop

%timeit df.to_csv('test_csv_out_pandas.csv')
1 loops, best of 3: 416 ms per loop

So what needs to be improved is the DataFrame-to-string conversion, but I guess you guys knew that already; I just had to dig down to find the actual bottleneck.
Can you provide any hints on how I should proceed?

michaelaye commented

I forgot the benchmark of the Cython write; it's blazing fast once the content string is available:

%timeit c_write_to_file('test_out_cython.txt', content)
100 loops, best of 3: 12.2 ms per loop


jreback commented May 10, 2014

no

you just need to change lib.write_csv_rows to a new version (it's a bit trickier because you have to decide a bit higher up in the formatter to use the fast path so you don't create the csv writer at all - but for proof of concept that didn't matter)

all the conversions and such already happen by then

take the same data that is passed to write_csv_rows

and just write a new version that takes that data and actually writes it to the file handle
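
As a rough illustration of that idea, here is a pure-Python sketch (the name and signature are made up, and the real write_csv_rows also handles index levels and quoting) that takes the per-column string data already produced by to_native_types and writes rows straight to the file handle:

def write_csv_rows_fast(data, index_values, file_handle, sep=","):
    # data: list of per-column arrays of already-stringified values
    ncols = len(data)
    for j in range(len(index_values)):
        row = [str(index_values[j])]
        for i in range(ncols):
            row.append(data[i][j])
        file_handle.write(sep.join(row) + "\n")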


jreback commented May 10, 2014

yep

in fact you can almost reuse write_to_csv entirely

it's the call to the write that is slow

(because I think it does a lot of conversions and such that don't matter for a plain vanilla csv)

michaelaye commented

Which 'write_to_csv' do you mean here? I think I understand now that I have to reimplement lib.pyx's write_csv_rows and add a write_csv_rows_fast there. But from the line profile below, which I took of CSVFormatter._save_chunk(), I guess I have to change _save_chunk() as well:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1279                                               def _save_chunk(self, start_i, end_i):
  1280                                           
  1281         4           11      2.8      0.0          data_index = self.data_index
  1282                                           
  1283                                                   # create the data for a chunk
  1284         4            8      2.0      0.0          slicer = slice(start_i, end_i)
  1285         8           20      2.5      0.0          for i in range(len(self.blocks)):
  1286         4            4      1.0      0.0              b = self.blocks[i]
  1287         4            5      1.2      0.0              d = b.to_native_types(slicer=slicer, na_rep=self.na_rep,
  1288         4            4      1.0      0.0                                    float_format=self.float_format,
  1289         4        59994  14998.5     13.6                                    date_format=self.date_format)
  1290                                           
  1291       124          296      2.4      0.1              for i, item in enumerate(b.items):
  1292                                           
  1293                                                           # self.data is a preallocated list
  1294       120         3337     27.8      0.8                  self.data[self.column_map[b][i]] = d[i]
  1295                                           
  1296         4            9      2.2      0.0          ix = data_index.to_native_types(slicer=slicer, na_rep=self.na_rep,
  1297         4            4      1.0      0.0                                          float_format=self.float_format,
  1298         4         1010    252.5      0.2                                          date_format=self.date_format)
  1299                                           
  1300         4       377245  94311.2     85.4          lib.write_csv_rows(self.data, ix, self.nlevels, self.cols, self.writer)

At least the b.to_native_types() call takes quite some time.


jreback commented May 13, 2014

no, most of the slowness is in write_csv_rows

since _save_chunk calls it, that time is included in _save_chunk's total as well (a caller's time is its own time plus the sum of its callees')

michaelaye commented

Well, even setting absolute time aside, isn't it correct that _save_chunk spends about 13% of its time in the b.to_native_types() call? That is almost a second-order effect (85/13 ≈ 6.5), but not completely negligible. Okay, so I assume that in the comment above you meant I can reuse lib.write_csv_rows() and kick out what I don't need for a fast, simple write-out?

Funnily enough, I found in profiling that a plain Python write of a long string is actually faster than the Cython write-out of the same string, I guess due to Cython call overhead:

%timeit c_write_to_file('test_out_cython.txt', content)
1 loops, best of 3: 144 ms per loop
%timeit python_write_to_file('test_out_python.txt', content)
10 loops, best of 3: 67 ms per loop
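
(python_write_to_file is not shown above; presumably it is just a plain open/write along these lines:)

def python_write_to_file(filename, content):
    # ordinary buffered Python write, for comparison against c_write_to_file
    with open(filename, "w") as f:
        f.write(content)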

PS: mentioning myself so I can find this issue more easily: @michaelaye


jreback commented May 13, 2014

I would simply copy lib.write_csv_rows and make a fast version

converting to native types is necessary for proper dtype handling

worry about that later

always optimize the biggest time sink first


wesm commented Sep 29, 2016

We will need to tackle this in the course of working on libpandas. I suggest we create a new set of issues around writing to CSV more optimally once we are ready to do that.

wesm closed this as completed Sep 29, 2016
jorisvandenbossche modified the milestones: No action, Next Major Release Sep 29, 2016

swt2c commented Aug 6, 2019

I know this is closed, but I would still like to work toward improving to_csv in Pandas 0.x.

I have been doing some profiling and so far I've found that the biggest (by far) CPU bottleneck in write_csv_rows() is this list assignment:

                row[1 + i] = data[i][j]

Per individual call it doesn't take long, but since this gets executed for every cell, it adds up. It's not immediately obvious how to improve it, though, as pandas stores its data in columns, whereas we need to get the data out in rows.
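
One column-friendly way around the per-cell assignment (a sketch, not existing pandas code) is to let zip walk the columns in lockstep and build each line in one go:

def rows_from_columns(data, index_values, sep=","):
    # data: list of per-column arrays of already-stringified values;
    # zip(*data) yields one row tuple at a time, built by zip itself rather
    # than by row[1 + i] = data[i][j] assignments in a Python-level loop
    for idx, row in zip(index_values, zip(*data)):
        yield str(idx) + sep + sep.join(map(str, row)) + "\n"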


swt2c commented Aug 6, 2019

Also, to_csv is much slower when writing a sparse data frame. In that case, the performance is dominated by _to_native_types().
