
ENH: Pluggable SQL performance #36893

Open
xhochy opened this issue Oct 5, 2020 · 25 comments
Labels
API Design Enhancement IO SQL to_sql, read_sql, read_sql_query Performance Memory or execution speed performance

Comments

@xhochy
Contributor

xhochy commented Oct 5, 2020

Currently the pandas SQL logic uses SQLAlchemy, with results returned as Python objects before being converted to a DataFrame. While the API is simple, it doesn't have good performance characteristics because of the intermediate Python objects. There are currently some faster alternatives, but they have inconsistent and more complicated APIs.

In addition to lacking a uniform API, these implementations are only concerned with fast result encoding/decoding. Functionality like automatic table creation, as we have in pandas.DataFrame.to_sql, doesn't exist there.

Thus it would be nice to have a way to use these connector implementations behind the standard pandas API.

Faster alternatives

  • bcpandas: Use BCP to insert data into MS SQL Server
  • turbodbc: Fast access to databases which have an ODBC driver via Apache Arrow (fetchallarrow().to_pandas()), e.g. MS SQL or Exasol.
  • snowflake-connector-python: Uses native Apache Arrow for fast result fetching via fetch_pandas_all()
  • pyarrow.jvm / JDBC: Use pyarrow's JVM module to get faster access to JDBC results via Arrow
  • postgres-copy-arrow-decode: Not yet open-sourced (shame on me): a Cython-based encoder/decoder for Postgres' COPY BINARY command that decodes Postgres' binary protocol from/to Arrow. Works together with psycopg2 and gives roughly a 2x speedup and type stability compared to the COPY CSV method in the pandas docs.
  • PostgresAdapter: NumPy support for Postgres connections
  • d6tstack: Fast insert into Postgres/MySQL/MSSQL via CSV files

General implementation idea

  • pandas users should only deal with read_sql and to_sql in their current form.
  • There shouldn't be any new hard dependencies in pandas.
  • The SQLAlchemy engine is a nice uniform interface to specify a database connection, keep this.
  • We only need a limited set of operations implemented by the performance backend, basically to_sql(DataFrame) and read_sql(query) -> DataFrame. Table creation, index adjustment and further convenience functionality can still be handled by the high-level SQLAlchemy layer.

Implementation idea (1) – Dispatch on type(engine.raw_connection().connection)

SQLAlchemy exposes the underlying connection of the database driver via engine.raw_connection(). This is a useful way to detect how we connect to the database. We could provide a registry where each backend implementation provides a function supports_connection(engine.raw_connection().connection) -> bool to determine whether it can be used.
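
As a rough illustration, such a registry could look like the sketch below (register_sql_backend, supports_connection and the registry itself are hypothetical names, not existing pandas API):

# Hypothetical registry of performance backends; nothing here is existing pandas API.
_SQL_BACKENDS = []

def register_sql_backend(backend) -> None:
    """Register a backend that implements supports_connection/read_sql/to_sql."""
    _SQL_BACKENDS.append(backend)

def _find_backend(engine):
    # Dispatch on the DBAPI connection that the SQLAlchemy engine wraps.
    raw_connection = engine.raw_connection().connection
    for backend in _SQL_BACKENDS:
        if backend.supports_connection(raw_connection):
            return backend
    return None  # fall back to the plain SQLAlchemy code path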

Pro:

  • Users don't need to change their code. If the backend is loaded, they will automatically get the speedup.

Con:

  • Users need to take care that the backend is loaded; otherwise queries will work but stay slow.
  • Only one implementation per database connection class is possible.

Implementation idea (2) – Extend the method= param

pandas.DataFrame.to_sql already has a method parameter where the user can supply a callable that is used to insert the data into the database. Currently the callable gets a row iterator rather than a DataFrame, so this interface is already hard-wired to convert the intermediate result into Python objects. Instead of providing a row iterator, we could pass the original DataFrame to this method.
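
For illustration, such a callable might look like the sketch below; today's method= callable actually receives (table, conn, keys, data_iter), so passing the DataFrame instead is the change being proposed (fast_to_sql and its body are hypothetical):

import pandas as pd

def fast_to_sql(table, conn, keys, frame: pd.DataFrame) -> None:
    # Hypothetical signature: the last argument is the original DataFrame
    # rather than a row iterator, so a backend could hand it directly to a
    # bulk loader (COPY, BCP, Arrow, ...) without intermediate Python objects.
    ...

# Usage would stay the same as with today's method= parameter:
# df.to_sql("my_table", engine, method=fast_to_sql)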

Pro:

  • Clear control over which method is used
  • Backend implementations could be used via method=turbodbc.pandas.to_sql

Con:

  • Potentially a breaking change to the method parameter, or a second parameter that does nearly the same thing would need to be introduced.
  • Needs explicit usage for the speedup.

Implementation idea (3) - Introduce engine= param

As with the Parquet and CSV IO implementations, we could also go for providing an engine parameter where users can easily switch based on the name of an implementation. A prototype implementation could look like:

import pandas as pd

class DatabaseEngine:

    name = "fastengine"

    @staticmethod
    def supports_connection(connection) -> bool:  # for engine="auto"
        return isinstance(connection, FastConnection)

    @staticmethod
    def to_sql(engine, df: pd.DataFrame, table: str):
        …

    @staticmethod
    def from_sql(engine, query: str) -> pd.DataFrame:
        …

pd.register_sql_backend(DatabaseEngine.name, DatabaseEngine)
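
Usage could then mirror the Parquet/CSV engine keyword (again hypothetical, since neither register_sql_backend nor an engine keyword on read_sql/to_sql exists yet):

# Explicitly pick the registered backend, or let "auto" probe the connection:
# df.to_sql("my_table", con=engine, engine="fastengine")
# result = pd.read_sql("SELECT * FROM my_table", con=engine, engine="auto")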

Pro:

  • In contrast to (1), here you would get an error if the backend is not loaded.
  • Clear control over which method is used.
  • Users don't need to provide the exact function, only the name of the engine.
  • We could provide an engine="auto" setting that, on explicit usage, tries to find a matching backend and otherwise falls back to the plain SQLAlchemy implementation.
  • We can provide some of these engines as part of pandas, others can come from third-party libraries.

Con:

  • Needs explicit usage for the speedup.

Personally, I would prefer this approach.

Related issues

@xhochy xhochy added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 5, 2020
@jreback jreback added API Design IO SQL to_sql, read_sql, read_sql_query Performance Memory or execution speed performance and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 6, 2020
@jreback
Contributor

jreback commented Oct 6, 2020

+1 on the engine= idea (3) (though the 'auto' would need some thought / hints). This is in line with other ways of selecting backends and performance.

@TomAugspurger
Contributor

Happy to have improvements here, and I trust your judgement on the best API for users. The engine proposal sounds reasonable.

@yehoshuadimarsky
Contributor

+1 on the idea, but my only question is: wouldn't this in effect be the pandas project taking a stance on which "side" or extension projects it likes and which it doesn't? With so many alternatives out there (as @xhochy specified in the original post), how do we pick what gets included? As opposed to read_csv, for example, where the only options for engine are Python or C, and we don't have to take a stance on the worthiness of other projects.

@yehoshuadimarsky
Contributor

I'd be happy to work on this PR though; I have a fair amount of experience with the pandas <> SQL interface and backend code. But I would want more feedback from the core maintainers first, on whether you think this is worth the time and effort and whether it will get merged.

Full disclosure - I'm the author of one of the libraries mentioned (bcpandas).

@jreback
Contributor

jreback commented Nov 16, 2020

+1 on the idea, but my only question is: wouldn't this in effect be the pandas project taking a stance on which "side" or extension projects it likes and which it doesn't? With so many alternatives out there (as @xhochy specified in the original post), how do we pick what gets included? As opposed to read_csv, for example, where the only options for engine are Python or C, and we don't have to take a stance on the worthiness of other projects.

well we would have to start somewhere - you can be the first!

we don't need to take a stance per se - would likely accept any compatible and well tested engines

we did this for excel reading for example and now have a number of community supported engines

@yehoshuadimarsky
Contributor

Ok, good point.

As far as I see it, we will go with implementing option 3 - specifying an engine option.

Regarding implementation, the prototype by @xhochy is great, and I see we already have a base class for this in the SQL module that currently has only two subclasses, SQLAlchemy and SQLite. So all we would have to do is create more subclasses of this base class. Do you agree?

pandas/pandas/io/sql.py

Lines 1105 to 1120 in 613f098

class PandasSQL(PandasObject):
    """
    Subclasses Should define read_sql and to_sql.
    """

    def read_sql(self, *args, **kwargs):
        raise ValueError(
            "PandasSQL must be created with an SQLAlchemy "
            "connectable or sqlite connection"
        )

    def to_sql(self, *args, **kwargs):
        raise ValueError(
            "PandasSQL must be created with an SQLAlchemy "
            "connectable or sqlite connection"
        )

Also, I would heavily copy/paste/borrow from the Parquet module, including the tests.

@yehoshuadimarsky
Contributor

take

@xhochy
Contributor Author

xhochy commented Nov 16, 2020

I think most of the implementations would subclass from the SQLAlchemy engine again. We would like to reuse the table (re)creation routines and similar convenience patterns from it and only overload the actual "data retrieval" / "data push" part.
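
For example, a backend could reuse pandas' existing SQLAlchemy-based SQLDatabase class (in pandas/io/sql.py) and only replace the data transfer, roughly like this sketch (BCPandasDatabase and its body are hypothetical):

from pandas.io.sql import SQLDatabase

class BCPandasDatabase(SQLDatabase):
    # Hypothetical engine: table (re)creation, if_exists handling etc. stay in
    # the SQLAlchemy layer; only the actual data push is replaced.
    def to_sql(self, frame, name, if_exists="fail", index=True, **kwargs):
        # Let the SQLAlchemy layer create or validate the target table, then
        # hand the DataFrame to a fast bulk loader (e.g. BCP) instead of
        # executing row-wise INSERTs.
        ...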

@yehoshuadimarsky
Contributor

On second thought, not sure I'm equipped to tackle this. Have never used any of the engines proposed other than bcpandas, and that's not even its own engine.

@AbhayGoyal

Hey, I have also never used the engines but would be happy to startup. Would really need your help here.

@yehoshuadimarsky
Contributor

Hey, I have also never used the engines but would be happy to startup. Would really need your help here.

Open to working together on this if you want

@AbhayGoyal

Open to working together on this if you want

I guess we should start with SQLAlchemy right?

@erfannariman
Member

Happy to help as well if you guys need more hands.

@xhochy
Contributor Author

xhochy commented Jan 22, 2021

Open to working together on this if you want

I guess we should start with SQLAlchemy right?

You could refactor some things out of the current SQLAlchemy code so that there are places where an engine can easily hook in. For bcpandas, for example, you would only want to override the "to_sql" hook, so that would be a nice starting point for an engine.

@yehoshuadimarsky
Contributor

yehoshuadimarsky commented Jan 24, 2021

@xhochy will this code in BCPandas mess things up with circular imports?

https://github.com/yehoshuadimarsky/bcpandas/blob/481267404bdb1508a98205a506c3390f9ac5de64/bcpandas/main.py#L14-L15

(not sure why it's not rendering the preview snippet inline)

@xhochy
Contributor Author

xhochy commented Jan 27, 2021

No that shouldn't be a problem. In the final implementation, I would not expect that pandas would depend on bcpandas or that bcpandas needs to be imported to use it as an engine.

Personally, I would like to see Python's entry-point mechanism used as a way to declare possible engines. https://amir.rachum.com/blog/2017/07/28/python-entry-points/ is a good introduction to the topic and how it could be used. With that, you could declare possible engines in the package metadata, and pandas could detect them without the need for circular imports.
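
A sketch of how that could look (the entry-point group name "pandas_sql_engines" and the discovery helper are hypothetical):

# In the third-party package's setup.cfg (hypothetical group name):
#
# [options.entry_points]
# pandas_sql_engines =
#     bcpandas = bcpandas.engine:BCPandasEngine

# pandas could then discover engines lazily, so the third-party package is
# only imported when its engine is actually requested:
from importlib.metadata import entry_points

def load_sql_engine(name: str):
    # entry_points(group=...) needs Python 3.10+ or the importlib_metadata backport.
    for ep in entry_points(group="pandas_sql_engines"):
        if ep.name == name:
            return ep.load()
    raise ValueError(f"No SQL engine named {name!r} is installed")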

@proinsias
Contributor

@yehoshuadimarsky & @AbhayGoyal – do you have what you need to make progress with this?

@yehoshuadimarsky
Contributor

@yehoshuadimarsky & @AbhayGoyal – do you have what you need to make progress with this?

Yes - not sure where to start.

  • I can try implementing a bcpandas engine option like we do in the Parquet part, but it would be tricky without ripping out some of the old SQLAlchemy stuff. If the tests all pass, is that OK?
  • Also, I'm not clear on @xhochy's response about how bcpandas won't cause a circular import.
  • Finally, I don't have any exposure to the other engines, like turbodbc.

@yehoshuadimarsky
Contributor

Started work on this here https://github.com/yehoshuadimarsky/pandas/tree/sql-engine.

So far, mostly just refactored the SQLAlchemy parts to make an entry point for other engines, and got the existing test suite to pass on my machine.

@jreback
Contributor

jreback commented Mar 21, 2021

smaller / refactoring PRs are good to push separately

@yehoshuadimarsky
Contributor

smaller / refactoring PRs are good to push separately

Good idea - just pushed a PR as a first step to refactor the existing code, before adding new engines. Will add bcpandas in a subsequent PR once this is approved.

@yehoshuadimarsky
Contributor

Almost done with the first part, just stuck on a CI testing failure - anyone able to help? #40556 (comment)

@xhochy
Contributor Author

xhochy commented Apr 16, 2021

@yehoshuadimarsky I'll take a look in the next days!

@yehoshuadimarsky
Contributor

@yehoshuadimarsky I'll take a look in the next days!

Any luck @xhochy?

@xhochy
Contributor Author

xhochy commented Apr 27, 2021

@yehoshuadimarsky I'll take a look in the next days!

Any luck @xhochy?

Sorry, done now!
