ENH: Pluggable SQL performance #36893
Comments
+1 on the `engine=` idea.
Happy to have improvements here, and I trust your judgement on the best API for users.
+1 on the idea, but my only question is: wouldn't this in effect be the Pandas project taking a stance on which "side" or extension projects they like and which they don't? With so many alternatives out there (as @xhochy specified in the original post), how do we pick what gets included? As opposed to the current `method=` approach, where users supply whatever callable they want.
I'd be happy to work on this PR though, I have a fair amount of experience with the Pandas <> SQL interface and backend code. But would want more feedback from the core maintainers first if you think this is worth the time and effort, and that it will get merged. Full disclosure - I'm the author of one of the libraries mentioned (bcpandas).
Well, we would have to start somewhere - you can be the first! We don't need to take a stance per se - we would likely accept any compatible and well-tested engines. We did this for Excel reading, for example, and now have a number of community-supported engines.
Ok, good point. As far as I see it, we will go with implementing option 3 - specifying an `engine=` param.

Regarding implementation, the prototype by @xhochy is great, and I see we already have a base class for this in the SQL module that currently only supports 2 subclasses - SQLAlchemy and SQLite. So all we would have to do is create more subclasses implementing this base class (see the sketch below). Do you agree?

Lines 1105 to 1120 in 613f098
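To make that concrete, here is a hedged sketch rather than actual pandas code: only the `PandasSQL` base class exists in `pandas/io/sql.py`; the `TurbodbcDatabase` subclass, its constructor, and the exact method signatures are invented for illustration.

```python
# Hypothetical additional subclass of pandas' internal PandasSQL base
# class; everything except PandasSQL itself is invented to show the
# shape of a pluggable engine.
from pandas.io.sql import PandasSQL


class TurbodbcDatabase(PandasSQL):
    """Engine using turbodbc's Arrow support instead of row iteration."""

    def __init__(self, connection):
        self.connection = connection  # a raw turbodbc connection

    def read_sql(self, sql, **kwargs):
        cursor = self.connection.cursor()
        cursor.execute(sql)
        # turbodbc can hand back an Arrow table, skipping per-row
        # Python objects entirely.
        return cursor.fetchallarrow().to_pandas()

    def to_sql(self, frame, name, **kwargs):
        # Table creation etc. would still be delegated to the SQLAlchemy
        # layer; only the bulk insert would be engine-specific.
        raise NotImplementedError("sketch only")
```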
I think most of the implementations would subclass from the SQLAlchemy engine again. We would like to reuse the table (re)creation routines and similar convenience patterns from it and only override the actual "data retrieval" / "data push" parts.
On second thought, not sure I'm equipped to tackle this. Have never used any of the engines proposed other than bcpandas.
Hey, I have also never used the engines but would be happy to start. Would really need your help here.
Open to working together on this if you want |
I guess we should start with SQLAlchemy, right?
Happy to help as well if you guys need more hands. |
You could refactor some things out of the current SQLAlchemy code so that you have places where an engine could easily hook in - for example, the step in `to_sql` that actually inserts the data. A sketch of such a seam is below.
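A minimal sketch of that seam, with every name hypothetical (`insert_records`, `_ENGINES`, `table.insert_rows`); the real refactor would live inside pandas' SQLAlchemy table logic:

```python
# Hypothetical seam: pandas keeps the generic table logic, while the
# low-level bulk insert is dispatched to a pluggable engine.

_ENGINES = {}  # hypothetical registry: engine name -> engine object


def insert_records(table, frame, engine_name="sqlalchemy"):
    """Insert a DataFrame into an existing table via the chosen engine."""
    if engine_name == "sqlalchemy":
        # Current behaviour: rows become Python tuples, then executemany().
        rows = list(frame.itertuples(index=False, name=None))
        table.insert_rows(rows)  # stand-in for the existing insert path
    else:
        # A fast engine receives the whole DataFrame and can use a
        # bulk/Arrow path instead of per-row Python objects.
        _ENGINES[engine_name].insert(table, frame)
```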
@xhochy will this code in BCPandas mess things up with circular imports? (not sure why it's not rendering the preview snippet inline) |
No, that shouldn't be a problem. In the final implementation, I would not expect pandas to import the engine packages directly.

Personally, I would like to see the use of Python's entrypoint mechanism as a way to declare possible engines. https://amir.rachum.com/blog/2017/07/28/python-entry-points/ is a good introduction to that topic and how it could be used. With that, you could define in the package metadata possible engines that pandas could then discover and load on demand.
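For illustration, a hedged sketch of how that discovery could look; the entrypoint group name `pandas.sql_engines` and the `discover_sql_engines` helper are invented, not pandas API:

```python
# Hypothetical sketch of engine discovery via package metadata. An engine
# package would declare itself in its packaging config, e.g.:
#
#   [project.entry-points."pandas.sql_engines"]
#   bcpandas = "bcpandas:SQLEngine"
#
# pandas could then enumerate installed engines without importing any of
# them up front, which also sidesteps circular imports.
from importlib.metadata import entry_points


def discover_sql_engines():
    """Map engine name -> lazy loader for every installed engine."""
    # entry_points(group=...) is the Python 3.10+ API; older versions
    # index the result of entry_points() by group name instead.
    return {ep.name: ep.load for ep in entry_points(group="pandas.sql_engines")}
```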
@yehoshuadimarsky & @AbhayGoyal – do you have what you need to make progress with this? |
Yes - not sure where to start.
Started work on this here https://github.com/yehoshuadimarsky/pandas/tree/sql-engine. So far, mostly just refactored the SQLAlchemy parts to make an entry point for other engines, and got the existing test suite to pass on my machine. |
smaller / refactoring PRs are good to push separately |
Good idea - just pushed a PR as a first step to refactor the existing code, before adding new engines. Will add the new engines in follow-up PRs.
Almost done with first part, just stuck on a CI testing failure - anyone able to help? #40556 (comment) |
@yehoshuadimarsky I'll take a look in the next few days!
Any luck @xhochy? |
Sorry, done now! |
Original post by @xhochy:

Currently the pandas SQL logic uses SQLAlchemy, with results being returned as Python objects before being converted to a DataFrame. While the API is simple, it doesn't have good performance characteristics due to the intermediate Python objects. There currently exist some faster alternatives with inconsistent and more complicated APIs.

In addition to not having a uniform API, these implementations are only concerned with fast result en-/decoding. Functionality like automatic table creation, as we have in `pandas.DataFrame.to_sql`, doesn't exist there. Thus it would be nice to have a way to use these connector implementations behind the standard pandas API.
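For context, a minimal illustration of the conventional DBAPI path, where every cell becomes a Python object before the DataFrame is built (stdlib `sqlite3`, so it runs anywhere):

```python
# Conventional path: cursor.fetchall() materializes every value as a
# Python object, and pandas then builds the DataFrame from those objects.
import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(1, "x"), (2, "y")])

cur = con.execute("SELECT a, b FROM t")
rows = cur.fetchall()                        # list of Python tuples
df = pd.DataFrame(rows, columns=["a", "b"])  # built from Python objects
```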
Faster alternatives
- turbodbc's Arrow support (`fetchallarrow().to_pandas()`) for databases reachable via an ODBC driver, e.g. MS SQL or Exasol (see the usage sketch after this list).
- Connector-native DataFrame fetching such as `fetch_all_pandas()`.
- `pyarrow`'s JVM module to get faster access to JDBC results via Arrow.
- `postgres-copy-arrow-decode`: Not yet opensourced (shame on me): a Cython-based encoder/decoder for Postgres' `COPY BINARY` command that decodes Postgres' binary protocol from/to Arrow. Works together with `psycopg2` and gives roughly a 2x speedup and type stability over the `COPY CSV` method in `pandas`' docs.
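As a usage sketch of the first alternative (`"my_dsn"` is a placeholder; this assumes turbodbc and a configured ODBC driver are installed):

```python
# Sketch of turbodbc's Arrow fast path.
import turbodbc

con = turbodbc.connect(dsn="my_dsn")  # placeholder ODBC data source
cur = con.cursor()
cur.execute("SELECT * FROM some_table")
# Arrow table -> DataFrame, with no intermediate Python row objects.
df = cur.fetchallarrow().to_pandas()
```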
General implementation idea

`pandas` users should only deal with `read_sql` and `to_sql` in their current fashion. The fast connector implementations then only need to provide `to_sql(DataFrame)` and `read_sql(query) -> DataFrame`. Table creation, index adjustment and further convenience functionality can still be handled by the high-level SQLAlchemy layer.

Implementation idea (1) – Dispatch on `type(engine.raw_connection().connection)`
SQLAlchemy exposes the underlying connection of the database driver via `engine.raw_connection()`. This is a useful way to detect how we connect to the database. We could provide a registry where each backend implementation provides a function `supports_connection(engine.raw_connection().connection) -> bool` to determine whether it can be used; a sketch of such a registry follows.
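A minimal sketch of the dispatch registry; every name here is hypothetical:

```python
# Hypothetical registry for implementation idea (1): backends claim the
# driver-level connection via supports_connection().

_BACKENDS = []


def register_backend(backend):
    """Backends register themselves, e.g. on import or via an entrypoint."""
    _BACKENDS.append(backend)


def find_backend(engine):
    """Return the first backend that claims the driver-level connection."""
    raw = engine.raw_connection().connection  # the underlying DBAPI connection
    for backend in _BACKENDS:
        if backend.supports_connection(raw):
            return backend
    return None  # caller falls back to the plain SQLAlchemy implementation
```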
Pro:

Con:
Implementation idea (2) – Extend the `method=` param

`pandas.DataFrame.to_sql` already has a `method` parameter where the user can supply a callable that is used to insert the data into the database. Currently the callable gets a row iterator and not a DataFrame, so this interface is already hard-wired to convert the intermediate result into Python objects. Instead of providing a row iterator, we could pass the original DataFrame to this method (the current callable shape is sketched below).
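For reference, a hedged sketch of the current interface: the `(table, conn, keys, data_iter)` signature matches the one documented for `DataFrame.to_sql`, while the body is illustrative:

```python
# The current method= callable already receives rows converted to Python
# tuples (data_iter) - the very overhead this proposal wants to avoid.
def insert_rows(table, conn, keys, data_iter):
    data = [dict(zip(keys, row)) for row in data_iter]
    # table.table is the underlying SQLAlchemy Table object.
    conn.execute(table.table.insert(), data)

# Usage: df.to_sql("my_table", engine, method=insert_rows)
```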
Pro:

- Users could plug a fast writer straight into the existing interface, e.g. `method=turbodbc.pandas.to_sql`.
Con:

- Either the behaviour of the existing `method` parameter changes, or a second parameter that does nearly the same thing is introduced.

Implementation idea (3) - Introduce `engine=` param

As we have with the Parquet and CSV IO implementations, we could also go for providing an `engine` parameter where users could easily switch based on the name of an implementation. A prototype implementation could look like the sketch below.
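A hedged sketch of that dispatch; `get_sql_engine`, `_ENGINES`, and `SQLAlchemyEngine` are hypothetical names, not pandas API:

```python
# Hypothetical engine= dispatch for read_sql/to_sql.

_ENGINES = {}  # registry: engine name -> engine class


class SQLAlchemyEngine:
    """Stand-in for the existing SQLAlchemy-based implementation."""

    def read_sql(self, query, con, **kwargs): ...
    def to_sql(self, frame, name, con, **kwargs): ...


def get_sql_engine(name="auto"):
    """Resolve an engine name to an engine instance."""
    if name == "auto":
        # Prefer any registered fast backend, otherwise fall back to the
        # plain SQLAlchemy implementation.
        for engine_cls in _ENGINES.values():
            return engine_cls()
        return SQLAlchemyEngine()
    return _ENGINES[name]()


def read_sql(query, con, engine="auto", **kwargs):
    return get_sql_engine(engine).read_sql(query, con, **kwargs)
```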
engine="auto"
setting which on explicit usage tries to find a matching backend and will otherwise fallback to the plain SQLAlchemy implementation.Con:
Personally, I would prefer this approach.
Related issues