
Faster SQL implementation in d6tstack library #28817


Closed

JulianWgs opened this issue Oct 7, 2019 · 7 comments
Labels: IO SQL, Performance

Comments

@JulianWgs

Code Sample

https://github.com/d6t/d6tstack/blob/master/examples-sql.ipynb

Problem description

In the linked example, pandas currently takes 28s to upload the data to an SQL database. The d6tstack implementation takes only 4.7s with postgres and 7.1s with mysql, which is 4 to 6 times faster than standard pandas.

Expected Output

Improved speed when uploading data to an SQL database, right out of the box.

@TomAugspurger
Contributor

What's the actual next step here? You'll need to flesh out your proposal before we can decide whether it's a good idea / worth doing.

@jorisvandenbossche
Member

The important piece of information is also what they do differently to achieve this better performance, so we know whether it is something we can do in pandas.

@rockg
Contributor

rockg commented Oct 8, 2019

From looking at the code, it uses the bulk insert operations for the respective database types. So it writes a CSV file and then loads that into the database using LOAD DATA, BULK INSERT, etc. See here. I know because I created something similar, as loading through INSERT statements is so slow.
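For illustration, a minimal sketch of that CSV round-trip for MySQL's LOAD DATA LOCAL INFILE (the function name is invented; it assumes local_infile is enabled on both client and server, and a real implementation would need to handle path quoting and column order):

import csv
import os
import tempfile

import sqlalchemy

def bulk_load_mysql(df, table_name, engine):
    # Dump the frame to a temporary CSV file first...
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".csv", newline="", delete=False
    ) as f:
        df.to_csv(f, index=False, header=False, quoting=csv.QUOTE_MINIMAL)
        path = f.name
    try:
        # ...then hand the whole file to the server in one bulk statement.
        with engine.begin() as conn:
            conn.execute(sqlalchemy.text(
                "LOAD DATA LOCAL INFILE '{}' INTO TABLE {} "
                "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'"
                .format(path, table_name)
            ))
    finally:
        os.remove(path)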

@WillAyd
Member

WillAyd commented Oct 8, 2019

I think this issue is too open-ended, so we should refine the ask or close it out. You can already write your own bulk load command as shown in the documentation:

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#insertion-method

Though that approach has type-preservation downsides.
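For reference, the example at that link is a callable passed via the method= parameter; it streams the rows through psycopg2's COPY machinery, roughly like this (paraphrased from the pandas user guide, so it only works with a postgres/psycopg2 connection):

import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    # Get the raw DBAPI connection behind the SQLAlchemy connection.
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        buf = StringIO()
        csv.writer(buf).writerows(data_iter)
        buf.seek(0)
        columns = ", ".join('"{}"'.format(k) for k in keys)
        table_name = (
            "{}.{}".format(table.schema, table.name)
            if table.schema else table.name
        )
        # COPY ... FROM STDIN is postgres' bulk-load path.
        cur.copy_expert(
            "COPY {} ({}) FROM STDIN WITH CSV".format(table_name, columns), buf
        )

# usage: df.to_sql("my_table", engine, method=psql_insert_copy)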

@JulianWgs
Author

JulianWgs commented Oct 8, 2019

There is already a parameter called method for switching between normal and multi-row inserts. I'm assuming that "multi" is not an implementation of bulk insert? In that case I would propose a new value, "bulk", that opts into a bulk-load path.
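For context, the existing values are passed straight through DataFrame.to_sql; a minimal usage sketch (the connection string and table name are placeholders):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@localhost/db")  # placeholder DSN
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# method=None issues one parameterized INSERT per chunk of rows;
# "multi" packs many rows into a single multi-values INSERT statement.
df.to_sql("my_table", engine, if_exists="replace", index=False, method="multi")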

The if-statement for the different methods can be found here:

pandas/pandas/io/sql.py

Lines 724 to 734 in bee17d5

def insert(self, chunksize=None, method=None):
    # set insert method
    if method is None:
        exec_insert = self._execute_insert
    elif method == "multi":
        exec_insert = self._execute_insert_multi
    elif callable(method):
        exec_insert = partial(method, self)
    else:
        raise ValueError("Invalid parameter `method`: {}".format(method))
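
A hypothetical extension of that dispatch for the proposed value might look like this (`_execute_insert_bulk` is an invented name, not part of pandas):

    elif method == "bulk":
        # hypothetical: dispatch to a dialect-aware bulk loader
        exec_insert = self._execute_insert_bulk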

This should be implemented within pandas itself: d6tstack is a heavy dependency, and the implementation shouldn't add too many duplicated lines of code.

The implementation of the different methods can be found here:

pandas/pandas/io/sql.py

Lines 660 to 682 in bee17d5

def _execute_insert(self, conn, keys, data_iter):
    """Execute SQL statement inserting data

    Parameters
    ----------
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : generator of list
        Each item contains a list of values to be inserted
    """
    data = [dict(zip(keys, row)) for row in data_iter]
    conn.execute(self.table.insert(), data)

def _execute_insert_multi(self, conn, keys, data_iter):
    """Alternative to _execute_insert for DBs that support multivalue INSERT.

    Note: multi-value insert is usually faster for analytics DBs
    and tables containing a few columns
    but performance degrades quickly with increase of columns.
    """
    data = [dict(zip(keys, row)) for row in data_iter]
    conn.execute(self.table.insert(data))

@jbrockmendel added the IO SQL and Performance labels Oct 16, 2019
@JulianWgs
Author

d6tstack implements database-specific functions (Link):

  • the postgres implementation uses raw_connection (a minimal sketch follows the list)
  • mysql and mssql write a local file and bulk-load it
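
A minimal sketch of the raw_connection route, assuming psycopg2 (the function name is illustrative; it bypasses SQLAlchemy's statement layer entirely):

from io import StringIO

def copy_via_raw_connection(df, table_name, engine):
    buf = StringIO()
    df.to_csv(buf, index=False, header=False)
    buf.seek(0)
    # engine.raw_connection() returns the underlying DBAPI connection.
    raw = engine.raw_connection()
    try:
        with raw.cursor() as cur:
            cur.copy_expert(
                "COPY {} FROM STDIN WITH CSV".format(table_name), buf
            )
        raw.commit()
    finally:
        raw.close()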

I don't think it is worth implementing these for the performance gain.

The issue can be closed if the pandas maintainers agree.

@mroeschke
Member

Looks like the project is fairly inactive and there is not much interest from the core team. Closing, as it seems not worth pursuing.
