PERF: Explore even faster path for df.to_csv #3186
Comments
What if you were to build up buffers of some specified chunk size using
By batched writes I meant that the C code does not buffer pending write data. The iovec idea sounds interesting, but how do you know the perf difference?
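
For concreteness, here is a minimal sketch of the buffering/iovec idea being discussed (my own illustration, not code from this thread; POSIX-only, Python 3, function and parameter names invented; partial writes and IOV_MAX limits are ignored): rows are accumulated into chunk-sized buffers and handed to the kernel in a single writev() call.

import os

# illustrative sketch, not from the thread: batch many small row buffers
# into one writev() syscall per chunk
def write_rows_batched(path, rows, chunk_rows=512):
    # rows: iterable of already-formatted, newline-terminated CSV lines (str)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        buffers = []
        for row in rows:
            buffers.append(row.encode('utf-8'))
            if len(buffers) >= chunk_rows:
                os.writev(fd, buffers)  # hand the whole chunk to the kernel at once
                buffers = []
        if buffers:
            os.writev(fd, buffers)
    finally:
        os.close(fd)
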
What was the code you used to benchmark?
I used iotop.
Probably makes sense to write a C to-csv routine for the simplest of to_csv outputs (maybe not supporting custom formatter functions to start) at some point. The IO cost will probably outweigh the irregular memory access patterns.
I'm interested in making this happen. Here's what I found so far; most likely this is obvious to you, but I would need a hint to improve this:

df = pd.DataFrame(randn(10000, 30))

# this is the slow part, guess those loops are dodgy...
def df_to_string(df):
    s = '\n'.join([','.join(df.irow(i).astype('string')) for i in xrange(len(df))])
    return s

This is the fast part, and I'm a Cython noob, so be gentle:

cimport cython
from libc.stdio cimport fopen, FILE, fclose, fprintf

def c_write_to_file(filename, content):
    filename_byte_string = filename.encode("UTF-8")
    cdef char* fname = filename_byte_string
    cdef char* line = content
    cdef FILE* cfile
    cfile = fopen(fname, "w")
    if cfile == NULL:
        return
    fprintf(cfile, line)
    fclose(cfile)
    return []

Here are some benchmarks I took:

def df_to_csv_cython(df):
    content = df_to_string(df)
    c_write_to_file('test_out_c.txt', content)

%timeit df_to_csv_cython(df)
1 loops, best of 3: 1.67 s per loop

%timeit df.to_csv('test_csv_out_pandas.csv')
1 loops, best of 3: 416 ms per loop

So what needs to be improved is the DataFrame-to-string conversion, but I guess you guys knew that already; I just had to dig down to what the actual bottleneck is.
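
As a side note (my own sketch, not something from this thread, and not benchmarked here): the per-row df.irow(i) calls and per-row astype in df_to_string are likely where that time goes. One way to avoid them is to stringify each column once and only zip the pre-converted columns back into rows at the end:

# illustrative alternative, not from the thread: convert column-wise, join row-wise
def df_to_string_columnwise(df):
    # one astype per column instead of one per row
    str_cols = [df[col].astype(str).values for col in df.columns]
    # zip the string columns back together into rows
    return '\n'.join(','.join(row) for row in zip(*str_cols))
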
Forgot the benchmarking of the Cython write; it's blazing fast once the content string is already built:

%timeit c_write_to_file('test_out_cython.txt', content)
100 loops, best of 3: 12.2 ms per loop
No, you just need to change lib.write_csv_rows to a new version. (It's a bit trickier because you have to decide a bit higher up in the formatter to use the fast path, so that you don't create the csv writer at all, but for a proof of concept that didn't matter.) All the conversions and such already happen by then; take the same data that is passed to write_csv_rows and just write a new version that takes that data and actually writes it to the file handle.
Yep, in fact you can reuse write_to_csv almost entirely; it's the call to the writer that is slow (because I think it does a lot of conversions and such that don't matter for a plain-vanilla csv).
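
To make that suggestion concrete, here is a rough sketch (my own, not pandas code) of what such a fast path could look like: it takes the already-stringified columns and index values that write_csv_rows receives and writes joined rows straight to the file handle, skipping the csv module's quoting and escaping entirely, which is exactly why it only applies to the plain-vanilla case.

# illustrative sketch, not the pandas implementation
def write_csv_rows_fast(data, ix, f, sep=','):
    # data: list of columns, each already converted to native-type strings
    # ix: stringified index values, one per row
    # f: open file handle
    for i, idx_val in enumerate(ix):
        f.write(idx_val + sep + sep.join(col[i] for col in data) + '\n')
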
Which 'write_to_csv' do you mean here? I think I understand now that I have to reimplement lib.pyx's write_csv_rows. Here is a line profile of _save_chunk:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1279                                           def _save_chunk(self, start_i, end_i):
  1280
  1281         4           11      2.8      0.0      data_index = self.data_index
  1282
  1283                                               # create the data for a chunk
  1284         4            8      2.0      0.0      slicer = slice(start_i, end_i)
  1285         8           20      2.5      0.0      for i in range(len(self.blocks)):
  1286         4            4      1.0      0.0          b = self.blocks[i]
  1287         4            5      1.2      0.0          d = b.to_native_types(slicer=slicer, na_rep=self.na_rep,
  1288         4            4      1.0      0.0                                float_format=self.float_format,
  1289         4        59994  14998.5     13.6                                date_format=self.date_format)
  1290
  1291       124          296      2.4      0.1          for i, item in enumerate(b.items):
  1292
  1293                                                       # self.data is a preallocated list
  1294       120         3337     27.8      0.8              self.data[self.column_map[b][i]] = d[i]
  1295
  1296         4            9      2.2      0.0      ix = data_index.to_native_types(slicer=slicer, na_rep=self.na_rep,
  1297         4            4      1.0      0.0                                      float_format=self.float_format,
  1298         4         1010    252.5      0.2                                      date_format=self.date_format)
  1299
  1300         4       377245  94311.2     85.4      lib.write_csv_rows(self.data, ix, self.nlevels, self.cols, self.writer)

At least the b.to_native_types() call takes quite some time.
No, most of the slowness is with write_csv_rows; since _save_chunk calls this, it is included in its time as well (a caller has the time of itself plus the sum of its callees).
Well, not even talking about absolute time, but isn't it correct that _save_chunk spends 13% of its time on the b.to_native_types() call? It is almost a second-order effect (85/13 = 6.5) but not completely negligible. Okay, so I assume in the above comment you meant that I can reuse

Funnily enough, I profiled that a Python write of a long string is actually faster than a Cython write-out of a long string, I guess due to Cython overhead:

%timeit c_write_to_file('test_out_cython.txt', content)
1 loops, best of 3: 144 ms per loop

%timeit python_write_to_file('test_out_python.txt', content)
10 loops, best of 3: 67 ms per loop

PS: mentioning myself to find this issue easier: @michaelaye
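
For reference, python_write_to_file is not defined anywhere in the thread; presumably it is just a plain Python file write along these lines (my guess at the helper, not the commenter's actual code):

# assumed definition of the helper benchmarked above
def python_write_to_file(filename, content):
    with open(filename, 'w') as f:
        f.write(content)
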
I would simply copy lib.write_csv_rows and make a fast version. The convert-to-native-types step is necessary for proper dtype handling; worry about that later. Always optimize the biggest time sink first.
We will need to tackle this in the course of working on libpandas. I suggest we create a new set of issues around more optimally writing to CSV once we are ready to do that.
I know this is closed, but I would still like to work toward improving this. I have been doing some profiling, and so far I've found that the biggest (by far) CPU bottleneck is a per-cell operation: per individual call it doesn't take long, but since it gets executed for every cell it adds up. It's not immediately obvious how to improve this, though, as Pandas seems to store its data in columns, whereas we need to get data out in rows.
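
To illustrate the column-store versus row-output mismatch (a sketch of my own, not pandas code; df is any DataFrame): grabbing whole columns once and zipping them yields rows without paying a per-cell DataFrame indexing call, so the remaining per-cell cost is just the string formatting itself.

# illustrative sketch: rows out of a column store without per-cell indexing
cols = [df[c].values for c in df.columns]   # one cheap array grab per column
for row in zip(*cols):                      # row tuples, no per-cell .iloc calls
    line = ','.join(str(v) for v in row)    # per-cell work that remains: stringification
    # ... write `line` to the output file here
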
iotop and a simple-minded C program indicate we're nowhere near IO-bound in df.to_csv; we're off by roughly 10-15x. It might be possible to speed things up considerably with a fast path for special cases (numerical only) that don't need the fancy quoting and other bells and whistles provided by the underlying csv python module.

The C program sustains about 30 MB/s on my machine (without even batching writes) vs ~2-3 MB/s for the new (0.11.0) Cython df.to_csv(). Need to check whether it's the stringifying, the quoting logic, the memory layout, or something else that constitutes the difference.

This should also yield insights for any future binary serialization format we implement.
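
One quick way to probe the headroom described here (my own sketch, not part of the issue; file names and sizes are arbitrary) is to compare df.to_csv against a deliberately dumb numeric-only writer with a fixed float format and no quoting, for example numpy's savetxt:

import numpy as np
import pandas as pd

# illustrative comparison, not a benchmark result from this issue
df = pd.DataFrame(np.random.randn(100000, 30))

# plain-vanilla numeric path: fixed float format, no quoting or escaping
np.savetxt('fast_path.csv', df.values, fmt='%.18g', delimiter=',')

# full-featured writer for comparison
df.to_csv('pandas_path.csv', index=False, header=False)

Timing both (e.g. with %timeit) gives a rough sense of how much of the gap is attributable to the csv module's quoting machinery versus the formatting itself, under these assumptions.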