Add SQLAlchemy Dialect #57
Conversation
Thanks for the feature. 😊 Is there an ETA for when this will be merged?
Within the next couple of weeks @BMiHe 👍
I added a bunch of comments, but mostly due to ignorance. I will try to run the code and see what happens.
Are there any plans to support SQLAlchemy 2.x, since the 1.x series is now discontinued?
@oke-aditya Yes. The dialect already supports usage with SQLAlchemy 2.0's API. We'll actually bump the dependency version in a future release.
Looks good to me.
Hi guys, with the latest 2.4.0 can we do multi-table transactions? Rollback?
@ahsankhawaja The version of this connector is not relevant to your question. Databricks doesn't use transactions, so the connector doesn't support them either.
Databricks does transactions (https://learn.microsoft.com/en-us/azure/databricks/lakehouse/acid#--how-are-transactions-scoped-on-azure-databricks), or is there something else that you were referring to? Also, the Python connector underneath uses PEP 249 – Python Database API Specification v2.0, which has a connection object with commit / rollback methods (I think these are not implemented yet): https://peps.python.org/pep-0249/#connection-methods. Thanks for taking the time to answer.
@ahsankhawaja Good questions! But no, Databricks does not support SQL transactions.
The documentation you linked references ACID transactions, which are a feature of the storage layer called Delta Lake. These are different than SQL transactions. It's the same word but refers to a different concept.
PEP-249 only requires commit support when the database back-end supports it. Quoting from the doc you linked: "Database modules that do not support transactions should implement this method with void functionality." Which is exactly what this connector does.
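Purely as an illustration of what "void functionality" means under PEP 249 (this is not the connector's actual source), a conforming no-op commit can be as simple as:

```python
# Hypothetical illustration only, not this connector's implementation.
class NoTransactionConnection:
    def commit(self) -> None:
        # The back-end has no SQL transactions, so there is nothing to commit.
        pass
```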
So this means we can't use this connector to do full CRUD operations on delta tables?
No, it doesn't mean that at all! You can absolutely do CRUD on delta tables.
@ahsankhawaja's question was about SQL transaction support, i.e. writing a query that includes explicit transaction statements.
As @susodapop said, @oke-aditya, yes you can. I have built an API on top of it that does select and CRUD ops so I can expose my Lakehouse to any language / application. I was more interested in using the Lakehouse as a backend for web apps, so we can have the same platform doing all things. I saw Databricks released REST API support the other day as well (https://www.databricks.com/blog/2023/03/07/databricks-sql-statement-execution-api-announcing-public-preview.html), but that was just basic select, no CRUD there. This connector is awesome; I hope transaction support can be added at some point. Awesome work Jess
Please communicate this with your contact at Databricks! I know there's interest in multi-statement transaction support. The best way to increase its priority is to ask for it concretely. The more customers ask for it, the more traction it receives. That needs to happen through your Databricks contact rather than this open source forum, so that it can be routed to the correct places internally :)
Description
This pull request implements a first-party SQLAlchemy dialect compatible with Databricks SQL. It aims to be a drop-in replacement for `sqlalchemy-databricks` that implements more of the Databricks API, particularly around table reflection, Alembic usage, and data ingestion with pandas.

Adding a dialect for SQLAlchemy is not a well-documented process, so this work was guided by the included e2e tests. I implemented only those methods of the dialect needed to pass our tests.
What's already supported
Most of the functionality is demonstrated in the e2e tests included in this pull request. The list below was derived from those test method names:

- `pandas.to_sql` (note that this does not work for DataFrames with indexes; see the sketch at the end of this section)

This connector also aims to support Alembic for programmatic delta table schema maintenance. This behaviour is not yet backed by integration tests, which will follow in a subsequent PR as we learn more about customer use cases there. That said, the following behaviours have been tested manually:

- `alembic revision --autogenerate`
- `alembic upgrade <revision hash>` and `alembic downgrade <revision hash>`
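To make the `pandas.to_sql` item above concrete, here is a minimal sketch. It assumes an `engine` built with the `databricks://` connection string described under Basic usage below; the table and column names are illustrative only, not taken from the PR's e2e tests.

```python
# Minimal pandas.to_sql sketch (illustrative, not the PR's e2e test code).
# `engine` is assumed to be a SQLAlchemy engine using the databricks:// URL.
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["Ada", "Grace"]})

# index=False because writing DataFrame indexes is not supported.
df.to_sql("example_table", engine, if_exists="append", index=False)
```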
What's not supported
- `MAP`, `ARRAY`, and `STRUCT` types: this dialect can read these types out as strings, but you cannot define a SQLAlchemy model with `databricks.sqlalchemy.dialect.types.DatabricksMap` (e.g.) because we haven't implemented the logic necessary to layer these. This is a priority for development.
- With `information_schema` in Unity Catalog, Databricks SQL supports foreign key and primary key constraints. This dialect can write these constraints, but the ability for alembic to reflect and modify them programmatically has not been tested.

Basic usage
IMPORTANT ⚠️ The connection string format has changed since the earliest commits. The prefix is now `databricks://` and not `databricks+thrift://`.
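As a minimal sketch of building a connection string with the new prefix, something like the following should work. The environment variable names and values are placeholders from your own workspace, and the exact query parameters accepted may vary by connector version.

```python
# Hedged sketch: create a SQLAlchemy engine with the databricks:// prefix.
import os

from sqlalchemy import create_engine

host = os.getenv("DATABRICKS_SERVER_HOSTNAME")
http_path = os.getenv("DATABRICKS_HTTP_PATH")
access_token = os.getenv("DATABRICKS_TOKEN")

engine = create_engine(
    f"databricks://token:{access_token}@{host}?http_path={http_path}"
)
```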
Use ORM to create a table and insert records
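The following is a hedged sketch of ORM usage with this dialect, reusing the `engine` created above; the model and column names are illustrative and not taken from the PR's e2e tests.

```python
# Illustrative ORM sketch: declare a model, create its table, insert a record.
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Employee(Base):
    __tablename__ = "employees"

    id = Column(Integer, primary_key=True)
    name = Column(String(255))


# Emit CREATE TABLE statements for all models bound to this metadata.
Base.metadata.create_all(engine)

# Insert a single record through an ORM session.
with Session(engine) as session:
    session.add(Employee(id=1, name="Ada"))
    session.commit()
```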
Bulk insert data
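A hedged sketch of a bulk insert using a Core `insert()` with a list of parameter dictionaries (an executemany-style call); the `Employee` model comes from the ORM sketch above.

```python
# Illustrative bulk-insert sketch: one executemany-style INSERT for many rows.
from sqlalchemy import insert

rows = [
    {"id": 2, "name": "Grace"},
    {"id": 3, "name": "Katherine"},
]

with engine.begin() as connection:
    connection.execute(insert(Employee), rows)
```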
Basic alembic workflow
After you have installed a version of `databricks-sql-connector` that includes the dialect, you can run `alembic init` to generate an `env.py` and an `alembic.ini` file. You should not need to modify `alembic.ini`, but you need to modify `env.py` to do the following:

- Import the `MetaData` object against which you declared your models and set `target_metadata` equal to it.
- Modify `run_migrations_offline` to import your SQLAlchemy connection string and set `url` equal to it.
- Modify `run_migrations_online` to use a connectable engine.

Here is an example `env.py` where the needed information is available in a file called `main.py` at the same directory level as `env.py`:
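As a hedged sketch (assuming `main.py` exposes a declarative `Base` and a `sqlalchemy_connection_string` variable; both names are illustrative rather than prescribed by the dialect), an `env.py` along these lines should work:

```python
# Hypothetical env.py sketch. `Base` and `sqlalchemy_connection_string` are
# assumed to be defined in main.py, which sits next to this file.
from alembic import context
from sqlalchemy import create_engine

from main import Base, sqlalchemy_connection_string

# Alembic compares this metadata against the live schema when autogenerating.
target_metadata = Base.metadata


def run_migrations_offline() -> None:
    """Emit migration SQL without connecting to Databricks."""
    context.configure(
        url=sqlalchemy_connection_string,
        target_metadata=target_metadata,
        literal_binds=True,
    )
    with context.begin_transaction():
        context.run_migrations()


def run_migrations_online() -> None:
    """Run migrations against Databricks using a connectable engine."""
    connectable = create_engine(sqlalchemy_connection_string)
    with connectable.connect() as connection:
        context.configure(
            connection=connection, target_metadata=target_metadata
        )
        with context.begin_transaction():
            context.run_migrations()


if context.is_offline_mode():
    run_migrations_offline()
else:
    run_migrations_online()
```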
You can make your initial migration by running `alembic revision --autogenerate -m "Initial"`. This will generate a fresh revision in the `versions` directory, and you should see your model described. To generate the resulting table(s) in Databricks you should run `alembic upgrade head`. The Alembic tutorial is a good place to learn about creating subsequent revisions, downgrading, etc.