
Faster SQL implementation in d6tstack library #28817


Closed

JulianWgs opened this issue Oct 7, 2019 · 7 comments
Labels: IO SQL, Performance

Comments

@JulianWgs

Code Sample

https://github.com/d6t/d6tstack/blob/master/examples-sql.ipynb

Problem description

In the linked example, pandas currently takes 28s to upload the data to an SQL database. The d6tstack implementation takes only 4.7s with postgres and 7.1s with mysql, which is 4 to 6 times faster than standard pandas.

Expected Output

Improved speed when uploading data to an SQL database, right out of the box.

@TomAugspurger
Contributor

What's the actual next step here? You'll need to flesh out your proposal before we can decide whether it's a good idea / worth doing.

@jorisvandenbossche
Member

The important piece of information is also what they do differently to achieve this better performance, so we know whether it is something we can do in pandas.

@rockg
Contributor

rockg commented Oct 8, 2019

From looking at the code, it uses the bulk insert operations for the respective database types. So it writes a CSV file and then loads that into the database using LOAD DATA, BULK INSERT, etc. See here. I know because I created something similar, as loading through INSERT statements is so slow.
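For illustration, a minimal sketch of that CSV round-trip for MySQL's LOAD DATA LOCAL INFILE (the function name is invented; it assumes local_infile is enabled on both client and server, and a real implementation would need to handle path quoting and column order):

import csv
import os
import tempfile

import sqlalchemy

def bulk_load_mysql(df, table_name, engine):
    # Dump the frame to a temporary CSV file first...
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".csv", newline="", delete=False
    ) as f:
        df.to_csv(f, index=False, header=False, quoting=csv.QUOTE_MINIMAL)
        path = f.name
    try:
        # ...then hand the whole file to the server in one bulk statement.
        with engine.begin() as conn:
            conn.execute(sqlalchemy.text(
                "LOAD DATA LOCAL INFILE '{}' INTO TABLE {} "
                "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'"
                .format(path, table_name)
            ))
    finally:
        os.remove(path)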

@WillAyd
Member

WillAyd commented Oct 8, 2019

I think this issue is too open-ended, so we should refine the ask or close it out. You can already write your own bulk load command as shown in the documentation:

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#insertion-method

Though that approach has type-preservation downsides.
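For reference, the example at that link is a callable passed via the method= parameter; it streams the rows through psycopg2's COPY machinery, roughly like this (paraphrased from the pandas user guide, so it only works with a postgres/psycopg2 connection):

import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    # Get the raw DBAPI connection behind the SQLAlchemy connection.
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        buf = StringIO()
        csv.writer(buf).writerows(data_iter)
        buf.seek(0)
        columns = ", ".join('"{}"'.format(k) for k in keys)
        table_name = (
            "{}.{}".format(table.schema, table.name)
            if table.schema else table.name
        )
        # COPY ... FROM STDIN is postgres' bulk-load path.
        cur.copy_expert(
            "COPY {} ({}) FROM STDIN WITH CSV".format(table_name, columns), buf
        )

# usage: df.to_sql("my_table", engine, method=psql_insert_copy)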

@JulianWgs
Author

JulianWgs commented Oct 8, 2019

There is already a parameter called method for switching between normal and multi-row inserts. I'm assuming that "multi" is not an implementation of bulk insert? In that case I would propose a new value, "bulk", that opts into a bulk-load path.
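For context, the existing values are passed straight through DataFrame.to_sql; a minimal usage sketch (the connection string and table name are placeholders):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@localhost/db")  # placeholder DSN
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# method=None issues one parameterized INSERT per chunk of rows;
# "multi" packs many rows into a single multi-values INSERT statement.
df.to_sql("my_table", engine, if_exists="replace", index=False, method="multi")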

The if-statement for the different methods can be found here:

pandas/pandas/io/sql.py

Lines 724 to 734 in bee17d5

def insert(self, chunksize=None, method=None):
    # set insert method
    if method is None:
        exec_insert = self._execute_insert
    elif method == "multi":
        exec_insert = self._execute_insert_multi
    elif callable(method):
        exec_insert = partial(method, self)
    else:
        raise ValueError("Invalid parameter `method`: {}".format(method))
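
A hypothetical extension of that dispatch for the proposed value might look like this (`_execute_insert_bulk` is an invented name, not part of pandas):

    elif method == "bulk":
        # hypothetical: dispatch to a dialect-aware bulk loader
        exec_insert = self._execute_insert_bulk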

This should be implemented within pandas itself: d6tstack is a heavy dependency, and the implementation shouldn't add too many duplicated lines of code.

The implementation of the different methods can be found here:

pandas/pandas/io/sql.py

Lines 660 to 682 in bee17d5

def _execute_insert(self, conn, keys, data_iter):
    """Execute SQL statement inserting data

    Parameters
    ----------
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : generator of list
        Each item contains a list of values to be inserted
    """
    data = [dict(zip(keys, row)) for row in data_iter]
    conn.execute(self.table.insert(), data)

def _execute_insert_multi(self, conn, keys, data_iter):
    """Alternative to _execute_insert for DBs that support multivalue INSERT.

    Note: multi-value insert is usually faster for analytics DBs
    and tables containing a few columns
    but performance degrades quickly with increase of columns.
    """
    data = [dict(zip(keys, row)) for row in data_iter]
    conn.execute(self.table.insert(data))

@jbrockmendel added the IO SQL and Performance labels Oct 16, 2019
@JulianWgs
Author

d6tstack implements database-specific functions (Link):

  • the postgres implementation uses raw_connection (a minimal sketch follows the list)
  • mysql and mssql write a local file and bulk-load it
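
A minimal sketch of the raw_connection route, assuming psycopg2 (the function name is illustrative; it bypasses SQLAlchemy's statement layer entirely):

from io import StringIO

def copy_via_raw_connection(df, table_name, engine):
    buf = StringIO()
    df.to_csv(buf, index=False, header=False)
    buf.seek(0)
    # engine.raw_connection() returns the underlying DBAPI connection.
    raw = engine.raw_connection()
    try:
        with raw.cursor() as cur:
            cur.copy_expert(
                "COPY {} FROM STDIN WITH CSV".format(table_name), buf
            )
        raw.commit()
    finally:
        raw.close()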

I don't think it is worth implementing these for the performance gain.

The issue can be closed if the pandas maintainers agree.

@mroeschke
Member

Looks like the project is fairly inactive and there is not much interest from the core team. Closing, as it seems not worth pursuing.
