ENH #6416: performance improvements on write #6420


Merged (2 commits, Feb 20, 2014)

Conversation

mangecoeur (Contributor)

  • Trade off higher memory use for faster writes. This replaces the earlier PR, where the history was a mess!

data = dict((k, self.maybe_asscalar(v))
            for k, v in t[1].iteritems())
data_list.append(data)
#self.pd_sql.execute(ins, **data)
Member (inline review comment):

can be removed?

jreback (Contributor) commented Feb 20, 2014

@mangecoeur instead of using iterrows, just do the loop directly; it will be much faster since you are not creating a Series (which you then decompose) for each row

e.g. just do:

    columns = self.columns
    for k, v in zip(self.index, self.values):
        # work with k, v here

this munges dtypes, btw: everything gets put into a single dtype. I don't think this matters here, though?

you might want to consider doing this by dtype, e.g. using df.as_blocks (from which you then select the correct columns out of each block); the advantage is that the blocks are already separated by dtype
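The direct-loop idea suggested above can be sketched in a small, self-contained example (an illustration, not the PR's actual code). It shows both the pattern (iterating zip(index, values) skips the per-row Series that iterrows builds) and the dtype-munging caveat just mentioned: DataFrame.values upcasts a mixed int/str frame to a single object dtype.

```python
import pandas as pd

# Hypothetical toy frame, standing in for the frame being written to SQL.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

rows = []
for k, v in zip(df.index, df.values):
    # k is the index label, v is a plain numpy row; no Series is created here
    rows.append((k, list(v)))

# The int column was munged together with the str column into one dtype.
print(df.values.dtype)  # object
```

Whether the object upcast matters depends on what the consumer does with the values; for a SQL driver that accepts plain Python scalars it is mostly harmless, which is the question raised above.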

mangecoeur (Contributor, Author)

@jreback ok, I see what you mean; will look into it.

jreback (Contributor) commented Feb 20, 2014

@mangecoeur this latter approach is actually used in to_csv, see here: https://github.com/pydata/pandas/blob/master/pandas/core/format.py#L1285
(as well as chunking the writes to use a constant amount of memory; see CSVFormatter._save). You can basically do this almost exactly, I think. It also has substitution routines for things like NaT (though to_native_types may need slight modifications so it does not stringify, or you might want to do that part inline since you want to return SQLAlchemy types). But you could do something like this in sql.py:

class Block(object):

    def __init__(self, block):
        self.block = block

    def format(self, slicer):
        # return a 'formatted' block suitable for direct insertion to sql
        ...


class DatetimeBlock(Block):

    def format(self, slicer):
        ...

etc.

mangecoeur (Contributor, Author)

@jreback I think we can keep it simple: since the column types are defined on the SQLAlchemy side by the DB table, SQLAlchemy already handles converting Python values to SQL types; all we need to supply is a list of row dicts. We can optimize this further later.
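The "list of row dicts" pattern described above can be sketched with the stdlib sqlite3 driver instead of SQLAlchemy (table name and data here are made up for illustration); with SQLAlchemy the equivalent is passing the same list of dicts to a single execute of table.insert(), which the engine turns into a batched executemany.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")

# One dict per row, keyed by column name; the driver converts the
# Python values to SQL types, so no per-value formatting is needed.
data_list = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
conn.executemany("INSERT INTO t (a, b) VALUES (:a, :b)", data_list)

print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 2
```

Building the full data_list up front is exactly the memory-for-speed trade-off this PR makes: one batched insert instead of one round trip per row.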

jreback (Contributor) commented Feb 20, 2014

Sure, just some thoughts. Always profile, of course!

jorisvandenbossche (Member)

@mangecoeur Maybe you can add a vbench to see the effect of this PR?

mangecoeur (Contributor, Author)

@jorisvandenbossche haven't got vbench working yet; it was just from daily use that I found it was much faster this way :P Will get round to it soonish.

mangecoeur (Contributor, Author)

@jreback had a look at that method; it turns out that if you iterate over values you get problems with datetime conversion, which using iterrows appears to solve. Would need to test on pure numeric data for a perf comparison to see if it's worth trying to fix the datetime issue.
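The datetime behaviour being discussed can be seen in a small sketch (an illustration under assumed data, not this PR's code): with mixed dtypes, DataFrame.values upcasts to an object array whose datetime entries come out as pandas Timestamp objects rather than numpy datetime64 values, which is the kind of conversion quirk a SQL driver can trip over.

```python
import pandas as pd

# Hypothetical mixed-dtype frame: one int column, one datetime column.
df = pd.DataFrame({"a": [1], "ts": [pd.Timestamp("2014-02-20")]})

vals = df.values
print(vals.dtype)                 # object
print(type(vals[0, 1]).__name__)  # Timestamp
```

iterrows goes through a per-row Series instead, which is what appears to smooth over the conversion here, at the cost of the per-row Series construction discussed earlier.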

jreback (Contributor) commented Feb 20, 2014

@mangecoeur yeah, as I said, the self.values call in and of itself basically munges everything together. We can merge this and then open a follow-up issue to investigate the perf question.

so ready to merge then?

mangecoeur (Contributor, Author)

@jreback ok, just cleaned up based on the comments; should be good to go once Travis has done its thing.

jreback added a commit that referenced this pull request Feb 20, 2014
ENH #6416: performance improvements on write
jreback merged commit cfc90d7 into pandas-dev:master on Feb 20, 2014
jreback (Contributor) commented Feb 20, 2014

@mangecoeur thanks!
