PERF: Explore even faster path for df.to_csv #3186


Closed
ghost opened this issue Mar 27, 2013 · 16 comments
Labels: IO CSV, Output-Formatting, Performance

Comments


ghost commented Mar 27, 2013

iotop and a simple-minded C program indicate we're nowhere
near IO-bound in df.to_csv; we're off by roughly 10-15x.

It might be possible to speed things up considerably with a fast path
for special cases (numerical only) that don't need fancy quoting and other
bells and whistles provided by the underlying Python csv module.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int i;
    FILE *f;
    char fmt[] = "%f,%f,%f,%f,%f\n";

    /* rewrite a 1M-row, 5-column float file forever to measure sustained throughput */
    while (1) {
        f = fopen("out.csv", "wb");
        for (i = 0; i < 1000000; i++) {
            fprintf(f, fmt, 1.0, 2.0, 3.0, 4.0, 5.0);
        }
        fclose(f);
    }
}

sustains about 30MB/s on my machine (without even batching writes)
vs ~2-3MB/s for the new (0.11.0) cython df.to_csv().

We need to check whether it's the stringifying, the quoting logic, the memory layout, or something
else that accounts for the difference.

This should also yield insights for any future binary serialization format we implement.
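
A quick way to split the formatting cost from the raw write cost is to time the stringification into an in-memory buffer separately from a plain write of the already-formatted text; a minimal sketch (the sizes and file names here are illustrative only):

import io
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000000, 5))

# formatting cost only: render the CSV into memory, no disk IO involved
buf = io.StringIO()
df.to_csv(buf)
text = buf.getvalue()

# raw write cost only: dump the pre-formatted text straight to disk
with open("out.csv", "w") as f:
    f.write(text)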


cpcloud commented Jul 9, 2013

what if you were to build up buffers of some specified chunk size using iovec structs and make a call into writev/readv, i.e. scatter/gather? is that what you mean by batched writes?
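
For reference, a rough sketch of what a scatter/gather write looks like from Python via os.writev (POSIX-only; the buffer contents and chunk size here are made up), in case that is the direction meant:

import os

# pre-formatted rows, gathered into a list of byte buffers
rows = [("%d,%d,%d\n" % (i, 2 * i, 3 * i)).encode() for i in range(1000)]

fd = os.open("out.csv", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
try:
    # one writev syscall hands all the buffers to the kernel at once
    os.writev(fd, rows)
finally:
    os.close(fd)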


ghost commented Jul 10, 2013

By batched writes I meant that the C code does not buffer pending write data
before issuing a write. Presumably the OS does this for you rather than
going out to disk every time, so it's just function call overhead.

The iovec idea sounds interesting, but how do you know the perf difference
has anything to do with the issue that it addresses?


cpcloud commented Jul 10, 2013

what was the code you used to benchmark to_csv?


ghost commented Jul 10, 2013

I used iotop.


wesm commented Jul 10, 2013

probably makes sense to write a C to-csv routine for the simplest of to_csv outputs (maybe not supporting custom formatter functions to start) at some point. the IO cost will probably outweigh the irregular memory access patterns.
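
For a sense of what such a numeric-only fast path could look like (a sketch only, at the Python level rather than C, and assuming an all-float frame with no index column, no quoting, and no custom formatters):

import numpy as np
import pandas as pd

def to_csv_numeric_fast(df, path, float_format="%.6g"):
    # single homogeneous block of floats, so one fixed format string suffices
    values = df.values
    header = ",".join(str(c) for c in df.columns)
    # np.savetxt writes row by row with no quoting machinery at all
    np.savetxt(path, values, fmt=float_format, delimiter=",",
               header=header, comments="")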

michaelaye commented

I'm interested in making this happen. Here's what I found so far; most likely this is obvious to you, but I would need a hint on how to improve it:

import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(10000, 30))

# this is the slow part; I guess those row-wise loops are dodgy...
def df_to_string(df):
    s = '\n'.join([','.join(df.irow(i).astype('string')) for i in xrange(len(df))])
    return s

This is the fast part, and I'm a Cython noob, so be gentle:

cimport cython
from libc.stdio cimport fopen, FILE, fclose, fprintf

def c_write_to_file(filename, content):
    filename_byte_string = filename.encode("UTF-8")
    cdef char* fname = filename_byte_string
    cdef char* line = content

    cdef FILE* cfile
    cfile = fopen(fname, "w")
    if cfile == NULL:
        return

    # pass the content as an argument, not as the format string itself
    fprintf(cfile, "%s", line)
    fclose(cfile)
    return []

Here are some benchmarks I took:

def df_to_csv_cython(df):
    content = df_to_string(df)
    c_write_to_file('test_out_c.txt', content)

%timeit df_to_csv_cython(df)
1 loops, best of 3: 1.67 s per loop

%timeit df.to_csv('test_csv_out_pandas.csv')
1 loops, best of 3: 416 ms per loop

So what needs to be improved is the DataFrame-to-string conversion, but I guess you guys knew that already; I just had to dig down to find the actual bottleneck.
Can you provide any hints on how I should proceed?

michaelaye commented

I forgot the benchmark of the Cython write; it's blazing fast once the content string is available:

%timeit c_write_to_file('test_out_cython.txt', content)
100 loops, best of 3: 12.2 ms per loop


jreback commented May 10, 2014

no

you just need to change lib.write_csv_rows to a new version (it's a bit trickier because you have to decide a bit higher up in the formatter to use the fast path so you don't create the csv writer at all - but for proof of concept that didn't matter)

all the conversions and such already happen by then

take the same data that is passed to write_csv_rows

and just write a new version that takes that data and actually writes it to the file handle
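
As a rough illustration of that idea, here is a pure-Python sketch (the name and signature are made up, and the real write_csv_rows also handles index levels and quoting) that takes the per-column string data already produced by to_native_types and writes rows straight to the file handle:

def write_csv_rows_fast(data, index_values, file_handle, sep=","):
    # data: list of per-column arrays of already-stringified values
    ncols = len(data)
    for j in range(len(index_values)):
        row = [str(index_values[j])]
        for i in range(ncols):
            row.append(data[i][j])
        file_handle.write(sep.join(row) + "\n")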


jreback commented May 10, 2014

yep

in fact you can almost reuse write_to_csv entirely

it's the call to the write that is slow

(because I think it does a lot of conversions and such that don't matter for a plain vanilla csv)

michaelaye commented

Which 'write_to_csv' do you mean here? I think I understand now that I have to reimplement lib.pyx's write_csv_rows and add a write_csv_rows_fast there. But from the line profile below, which I took of CSVFormatter._save_chunk(), I guess I have to change _save_chunk() as well:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1279                                               def _save_chunk(self, start_i, end_i):
  1280                                           
  1281         4           11      2.8      0.0          data_index = self.data_index
  1282                                           
  1283                                                   # create the data for a chunk
  1284         4            8      2.0      0.0          slicer = slice(start_i, end_i)
  1285         8           20      2.5      0.0          for i in range(len(self.blocks)):
  1286         4            4      1.0      0.0              b = self.blocks[i]
  1287         4            5      1.2      0.0              d = b.to_native_types(slicer=slicer, na_rep=self.na_rep,
  1288         4            4      1.0      0.0                                    float_format=self.float_format,
  1289         4        59994  14998.5     13.6                                    date_format=self.date_format)
  1290                                           
  1291       124          296      2.4      0.1              for i, item in enumerate(b.items):
  1292                                           
  1293                                                           # self.data is a preallocated list
  1294       120         3337     27.8      0.8                  self.data[self.column_map[b][i]] = d[i]
  1295                                           
  1296         4            9      2.2      0.0          ix = data_index.to_native_types(slicer=slicer, na_rep=self.na_rep,
  1297         4            4      1.0      0.0                                          float_format=self.float_format,
  1298         4         1010    252.5      0.2                                          date_format=self.date_format)
  1299                                           
  1300         4       377245  94311.2     85.4          lib.write_csv_rows(self.data, ix, self.nlevels, self.cols, self.writer)

At least the b.to_native_types() call takes quite some time.


jreback commented May 13, 2014

no, most of the slowness is in write_csv_rows

since _save_chunk calls it, that time is included in _save_chunk's total as well (a caller's time is its own time plus the sum of its callees')

michaelaye commented

Well, even setting absolute time aside, isn't it correct that _save_chunk spends about 13% of its time in the b.to_native_types() call? That is almost a second-order effect (85/13 ≈ 6.5), but not completely negligible. Okay, so I assume that in the comment above you meant I can reuse lib.write_csv_rows() and kick out what I don't need for a fast, simple write-out?

Funnily enough, I found in profiling that a plain Python write of a long string is actually faster than the Cython write-out of the same string, I guess due to Cython call overhead:

%timeit c_write_to_file('test_out_cython.txt', content)
1 loops, best of 3: 144 ms per loop
%timeit python_write_to_file('test_out_python.txt', content)
10 loops, best of 3: 67 ms per loop
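
(python_write_to_file is not shown above; presumably it is just a plain open/write along these lines:)

def python_write_to_file(filename, content):
    # ordinary buffered Python write, for comparison against c_write_to_file
    with open(filename, "w") as f:
        f.write(content)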

PS: mentioning myself so I can find this issue more easily: @michaelaye


jreback commented May 13, 2014

I would simply copy lib.write_csv_rows and make a fast version

converting to native types is necessary for proper dtype handling

worry about that later

always optimize the biggest time sink first


wesm commented Sep 29, 2016

We will need to tackle this in the course of working on libpandas. I suggest we create a new set of issues around writing to CSV more optimally once we are ready to do that.

wesm closed this as completed Sep 29, 2016
jorisvandenbossche modified the milestones: No action, Next Major Release Sep 29, 2016

swt2c commented Aug 6, 2019

I know this is closed, but I would still like to work toward improving to_csv in Pandas 0.x.

I have been doing some profiling and so far I've found that the biggest (by far) CPU bottleneck in write_csv_rows() is this list assignment:

                row[1 + i] = data[i][j]

Per individual call it doesn't take long, but since this gets executed for every cell, it adds up. It's not immediately obvious how to improve it, though, as pandas stores its data in columns, whereas we need to get the data out in rows.
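
One column-friendly way around the per-cell assignment (a sketch, not existing pandas code) is to let zip walk the columns in lockstep and build each line in one go:

def rows_from_columns(data, index_values, sep=","):
    # data: list of per-column arrays of already-stringified values;
    # zip(*data) yields one row tuple at a time, built by zip itself rather
    # than by row[1 + i] = data[i][j] assignments in a Python-level loop
    for idx, row in zip(index_values, zip(*data)):
        yield str(idx) + sep + sep.join(map(str, row)) + "\n"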


swt2c commented Aug 6, 2019

Also, to_csv is much slower when writing a sparse data frame. In that case, the performance is dominated by _to_native_types().
