ENH #6416: performance improvements on write #6420


Merged (2 commits, Feb 20, 2014)

Conversation

mangecoeur (Contributor)

  • Trade off higher memory use for faster writes. This replaces the earlier PR, where the history was a mess!

data = dict((k, self.maybe_asscalar(v))
            for k, v in t[1].iteritems())
data_list.append(data)
#self.pd_sql.execute(ins, **data)
Member (inline review comment):

can be removed?

jreback (Contributor) commented Feb 20, 2014

@mangecoeur instead of using iterrows, just do the loop directly; it will be much faster since you are not creating a Series (which you then decompose) for each row

e.g. just do:

    columns = self.columns
    for k, v in zip(self.index, self.values):
        # work with k, v here

this munges dtypes, btw: everything gets put into a single dtype. I don't think this matters here, though?

you might want to consider doing this by dtype, e.g. using df.as_blocks (from which you then select the correct columns out of each block); the advantage is that the blocks are already separated by dtype
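The direct-loop idea suggested above can be sketched in a small, self-contained example (an illustration, not the PR's actual code). It shows both the pattern (iterating zip(index, values) skips the per-row Series that iterrows builds) and the dtype-munging caveat just mentioned: DataFrame.values upcasts a mixed int/str frame to a single object dtype.

```python
import pandas as pd

# Hypothetical toy frame, standing in for the frame being written to SQL.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

rows = []
for k, v in zip(df.index, df.values):
    # k is the index label, v is a plain numpy row; no Series is created here
    rows.append((k, list(v)))

# The int column was munged together with the str column into one dtype.
print(df.values.dtype)  # object
```

Whether the object upcast matters depends on what the consumer does with the values; for a SQL driver that accepts plain Python scalars it is mostly harmless, which is the question raised above.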

mangecoeur (Contributor, Author)

@jreback ok, I see what you mean; will look into it.

jreback (Contributor) commented Feb 20, 2014

@mangecoeur this latter approach is actually used in to_csv, see here: https://github.com/pydata/pandas/blob/master/pandas/core/format.py#L1285
(as well as chunking the writes to use a constant amount of memory; see CSVFormatter._save). You can basically do this almost exactly, I think. It also has substitution routines for things like NaT (though to_native_types may need slight modifications so it does not stringify, or you might want to do that part inline since you want to return SQLAlchemy types). But you could do something like this in sql.py:

class Block(object):

    def __init__(self, block):
        self.block = block

    def format(self, slicer):
        # return a 'formatted' block suitable for direct insertion to sql
        ...


class DatetimeBlock(Block):

    def format(self, slicer):
        ...

etc.

mangecoeur (Contributor, Author)

@jreback I think we can keep it simple: since the column types are defined on the SQLAlchemy side by the DB table, SQLAlchemy already handles converting Python values to SQL types; all we need to supply is a list of row dicts. We can optimize this further later.
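The "list of row dicts" pattern described above can be sketched with the stdlib sqlite3 driver instead of SQLAlchemy (table name and data here are made up for illustration); with SQLAlchemy the equivalent is passing the same list of dicts to a single execute of table.insert(), which the engine turns into a batched executemany.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")

# One dict per row, keyed by column name; the driver converts the
# Python values to SQL types, so no per-value formatting is needed.
data_list = [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]
conn.executemany("INSERT INTO t (a, b) VALUES (:a, :b)", data_list)

print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 2
```

Building the full data_list up front is exactly the memory-for-speed trade-off this PR makes: one batched insert instead of one round trip per row.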

jreback (Contributor) commented Feb 20, 2014

Sure, just some thoughts. Always profile, of course!

jorisvandenbossche (Member)

@mangecoeur Maybe you can add a vbench to see the effect of this PR?

mangecoeur (Contributor, Author)

@jorisvandenbossche haven't got vbench working yet; it was just from daily use that I found it was much faster this way :P Will get round to it soonish.

mangecoeur (Contributor, Author)

@jreback had a look at that method; it turns out that if you iterate over values you get problems with datetime conversion, which using iterrows appears to solve. Would need to test on pure numeric data for a perf comparison to see if it's worth trying to fix the datetime issue.
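The datetime behaviour being discussed can be seen in a small sketch (an illustration under assumed data, not this PR's code): with mixed dtypes, DataFrame.values upcasts to an object array whose datetime entries come out as pandas Timestamp objects rather than numpy datetime64 values, which is the kind of conversion quirk a SQL driver can trip over.

```python
import pandas as pd

# Hypothetical mixed-dtype frame: one int column, one datetime column.
df = pd.DataFrame({"a": [1], "ts": [pd.Timestamp("2014-02-20")]})

vals = df.values
print(vals.dtype)                 # object
print(type(vals[0, 1]).__name__)  # Timestamp
```

iterrows goes through a per-row Series instead, which is what appears to smooth over the conversion here, at the cost of the per-row Series construction discussed earlier.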

jreback (Contributor) commented Feb 20, 2014

@mangecoeur yeah, as I said, the self.values call in and of itself basically munges everything together. We can merge this and then open a follow-up issue to investigate the perf question.

so ready to merge then?

mangecoeur (Contributor, Author)

@jreback ok, just cleaned up based on the comments; should be good to go once Travis has done its thing.

jreback added a commit that referenced this pull request Feb 20, 2014
ENH #6416: performance improvements on write
jreback merged commit cfc90d7 into pandas-dev:master on Feb 20, 2014
jreback (Contributor) commented Feb 20, 2014

@mangecoeur thanks!
