Add SQLAlchemy Dialect #57

Merged: susodapop merged 3 commits into main from PECO-231 on Feb 17, 2023
Conversation

@susodapop (Contributor) commented Oct 14, 2022

Description

This pull request implements a first-party SQLAlchemy dialect compatible with Databricks SQL. It aims to be a drop-in replacement for sqlalchemy-databricks that implements more of the Databricks API, particularly around table reflection, Alembic usage, and data ingestion with pandas.

Adding a dialect for SQLAlchemy is not a well-documented process, so this work was guided by the included e2e tests. I implemented only those methods of the dialect needed to pass our tests.

What's already supported

Most of the functionality is demonstrated in the e2e tests included in this pull request. The list below was derived from those test method names:

  • Create and drop tables with SQLAlchemy Core
  • Create and drop tables with SQLAlchemy ORM
  • Read created tables via reflection
  • Modify column nullability
  • Insert records manually
  • Insert records with pandas.to_sql (note that this does not work for DataFrames with indexes; see the sketch after this list)
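
For illustration, here is a minimal sketch of the pandas path. The connection values are placeholders and the table name pysql_pandas_demo is hypothetical; note index=False, since DataFrames with indexes are not supported:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "databricks://token:<access_token>@<host>?http_path=<http_path>&catalog=<catalog>&schema=<schema>"
)

df = pd.DataFrame({"name": ["Bim", "Miki"], "episodes": [6, 12]})

# index=False is required because DataFrames with indexes are not supported
df.to_sql("pysql_pandas_demo", con=engine, if_exists="append", index=False)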

This connector also aims to support Alembic for programmatic delta table schema maintenance. This behaviour is not yet backed by integration tests, which will follow in a subsequent PR as we learn more about customer use cases there. That said, the following behaviours have been tested manually:

  • Autogenerate revisions with alembic revision --autogenerate
  • Upgrade and downgrade between revisions with alembic upgrade <revision hash> and alembic downgrade <revision hash>

What's not supported

  • MAP, ARRAY, and STRUCT types: this dialect can read these types out as strings, but you cannot define a SQLAlchemy model with databricks.sqlalchemy.dialect.types.DatabricksMap (for example) because we haven't implemented the logic necessary to handle these types. This is a priority for development.
  • Constraints: with the addition of support for information_schema in Unity Catalog, Databricks SQL supports foreign key and primary key constraints. This dialect can write these constraints, but the ability for Alembic to reflect and modify them programmatically has not been tested (a sketch of declaring such constraints follows this list).
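
For illustration, here is a minimal sketch of declaring such constraints with SQLAlchemy Core. The table and column names are hypothetical; creating the tables should emit the constraints, but reflecting and modifying them with Alembic is untested, per the note above:

from sqlalchemy import Column, ForeignKey, Integer, MetaData, String, Table

metadata_obj = MetaData()

users = Table(
    "users",
    metadata_obj,
    Column("id", Integer, primary_key=True),
    Column("name", String(255)),
)

orders = Table(
    "orders",
    metadata_obj,
    Column("id", Integer, primary_key=True),
    Column("user_id", Integer, ForeignKey("users.id")),
)

# metadata_obj.create_all(engine) should emit CREATE TABLE statements that
# include the PRIMARY KEY and FOREIGN KEY constraints.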

Basic usage

IMPORTANT ⚠️ The connection string format has changed since the earliest commits. The prefix is now databricks:// and not databricks+thrift://

from sqlalchemy import create_engine

engine = create_engine("databricks://token:dapi*****@*****.cloud.databricks.com/?http_path=******&catalog=****&schema=***")
engine.execute("select something")

Use ORM to create a table and insert records

host="****"
http_path="***"
access_token="***"
catalog="***"
schema="***"


import datetime
from sqlalchemy.orm import declarative_base, Session
from sqlalchemy import Column, String, Integer, BOOLEAN, create_engine, select

engine = create_engine(f"databricks://token:{access_token}@{host}?http_path={http_path}&catalog={catalog}&schema={schema}")
session = Session(bind=engine)
base = declarative_base(bind=engine)


class SampleObject(base):

    __tablename__ = "PySQLTest_{}".format(datetime.datetime.utcnow().strftime("%s"))

    name = Column(String(255), primary_key=True)
    episodes = Column(Integer)
    some_bool = Column(BOOLEAN)

base.metadata.create_all()

sample_object_1 = SampleObject(name="Bim Adewunmi", episodes=6, some_bool=True)
sample_object_2 = SampleObject(name="Miki Meek", episodes=12, some_bool=False)
session.add(sample_object_1)
session.add(sample_object_2)
session.commit()

stmt = select(SampleObject).where(SampleObject.name.in_(["Bim Adewunmi", "Miki Meek"]))

output = [i for i in session.scalars(stmt)]
assert len(output) == 2

base.metadata.drop_all()

Bulk insert data

import os, datetime, random
from sqlalchemy import create_engine, select, insert, Column, MetaData, Table
from sqlalchemy.types import Integer, String

HOST = os.environ.get("host")
HTTP_PATH = os.environ.get("http_path")
ACCESS_TOKEN = os.environ.get("access_token")
CATALOG = os.environ.get("catalog")
SCHEMA = os.environ.get("schema")

db_engine = create_engine(f"databricks://token:{ACCESS_TOKEN}@{HOST}?http_path={HTTP_PATH}&catalog={CATALOG}&schema={SCHEMA}")
metadata_obj = MetaData(bind=db_engine)

table_name = "PySQLTest_{}".format(datetime.datetime.utcnow().strftime("%s"))
names = ["Bim", "Miki", "Sarah", "Ira"]
rows = [{"name": names[i % len(names)], "number": random.choice(range(10000))} for i in range(10000)]

SampleTable = Table(
        table_name,
        metadata_obj,
        Column("name", String(255)),
        Column("number", Integer)
)

# Create SampleTable ~5 seconds
metadata_obj.create_all()

# Insert 10k rows takes < 3 seconds
db_engine.execute(insert(SampleTable).values(rows))

results = db_engine.execute(select(SampleTable)).all()

assert len(results) == 10_000

# Drop the SampleTable
metadata_obj.drop_all()

Basic alembic workflow

After you have installed a version of databricks-sql-connector that includes the dialect, you can run alembic init to generate an env.py and alembic.ini file. You should not need to modify alembic.ini, but you need to modify env.py to do the following:

  • Import the SQLAlchemy MetaData object against which you declared your models and set target_metadata equal to it.
  • Update run_migrations_offline to import your SQLAlchemy connection string and set url equal to it.
  • Update run_migrations_online to use a connectable engine.

Here is an example env.py where the needed information is available in a file called main.py at the same directory level as env.py:

from logging.config import fileConfig

from sqlalchemy import engine_from_config
from sqlalchemy import pool

from alembic import context

# this is the Alembic Config object, which provides
# access to the values within the .ini file in use.
config = context.config

# Interpret the config file for Python logging.
# This line sets up loggers basically.
if config.config_file_name is not None:
    fileConfig(config.config_file_name)

# add your model's MetaData object here
# for 'autogenerate' support
# from myapp import mymodel
# target_metadata = mymodel.Base.metadata
from main import base
target_metadata = base.metadata

# other values from the config, defined by the needs of env.py,
# can be acquired:
# my_important_option = config.get_main_option("my_important_option")
# ... etc.


def run_migrations_offline() -> None:
    """Run migrations in 'offline' mode.

    This configures the context with just a URL
    and not an Engine, though an Engine is acceptable
    here as well.  By skipping the Engine creation
    we don't even need a DBAPI to be available.

    Calls to context.execute() here emit the given string to the
    script output.

    """
    from main import sqla_uri
    url = sqla_uri
    context.configure(
        url=url,
        target_metadata=target_metadata,
        literal_binds=True,
        dialect_opts={"paramstyle": "named"},
    )

    with context.begin_transaction():
        context.run_migrations()


def run_migrations_online() -> None:
    """Run migrations in 'online' mode.

    In this scenario we need to create an Engine
    and associate a connection with the context.

    """
    from main import engine
    connectable = engine

    with connectable.connect() as connection:
        context.configure(
            connection=connection, target_metadata=target_metadata
        )

        with context.begin_transaction():
            context.run_migrations()


if context.is_offline_mode():
    run_migrations_offline()
else:
    run_migrations_online()
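
For reference, a hedged sketch of what main.py might contain so the imports above (base, engine, and sqla_uri) resolve. The model and connection values are hypothetical placeholders:

# main.py: illustrative companion module for the env.py above
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

# Placeholder connection string; substitute real workspace values
sqla_uri = "databricks://token:<access_token>@<host>?http_path=<http_path>&catalog=<catalog>&schema=<schema>"
engine = create_engine(sqla_uri)
base = declarative_base()


class Episode(base):
    __tablename__ = "episodes"

    id = Column(Integer, primary_key=True)
    name = Column(String(255))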

You can make your initial migration by running alembic revision --autogenerate -m "Initial". This will generate a fresh revision in the versions directory, and you should see your model described there. To generate the resulting table(s) in Databricks, run alembic upgrade head. The Alembic tutorial is a good place to learn about creating subsequent revisions, downgrading, etc.

@MHzl commented Jan 26, 2023

Thanks for the feature. 😊 Is there an ETA for when this will be merged?

@susodapop (Contributor Author)

Within the next couple weeks @BMiHe 👍

@susodapop marked this pull request as ready for review on January 31, 2023 22:44
@andrefurlan-db (Contributor) left a comment

I added a bunch of comments, but mostly due to ignorance. I will try to run the code and see what happens.

@oke-aditya

Any plans to support SQLAlchemy 2.x as 1.x series is now discontinued.

@susodapop (Contributor Author)

Any plans to support SQLAlchemy 2.x as 1.x series is now discontinued.

@oke-aditya Yes. The dialect already supports usage with SQLAlchemy 2.0's API. We'll actually bump the dependency version in a future release.

@andrefurlan-db (Contributor) left a comment

Looks good to me.

Jesse Whitehouse added 2 commits on February 17, 2023 16:37
@susodapop merged commit 3eaaac9 into main on Feb 17, 2023
@susodapop deleted the PECO-231 branch on February 17, 2023 23:07
@ahsankhawaja

Hi guys, with the latest 2.4.0 can we do multi-table transactions? Rollback?
Many thanks

@susodapop (Contributor Author)

@ahsankhawaja The version of this connector is not relevant to your question. Databricks doesn't use transactions, so the connector doesn't support them either.

@ahsankhawaja commented Mar 7, 2023

Databricks does transactions https://learn.microsoft.com/en-us/azure/databricks/lakehouse/acid#--how-are-transactions-scoped-on-azure-databricks

or is there something else that you were referring to?

Also, the fact that the Python connector underneath uses PEP 249 – Python Database API Specification v2.0, which has a connection object with commit/rollback methods (I think these are not implemented yet): https://peps.python.org/pep-0249/#connection-methods

Thanks for taking the time to answer.

@susodapop (Contributor Author)

@ahsankhawaja Good questions! But no, Databricks does not support SQL transactions.

Databricks does transactions https://learn.microsoft.com/en-us/azure/databricks/lakehouse/acid#--how-are-transactions-scoped-on-azure-databricks

The documentation you linked references ACID transactions, which are a feature of the storage layer called Delta Lake. These are different than SQL transactions. It's the same word but refers to a different concept.

Also, the fact that the Python connector underneath uses PEP 249

PEP-249 only requires commit support when the database back-end supports it. Quoting from the doc you linked: "Database modules that do not support transactions should implement this method with void functionality." That's exactly what this connector does.

@oke-aditya

So this means we can't use this connector to do full CRUD operations on delta tables?
Only Select is possible??

@susodapop (Contributor Author)

So this means we can't use this connector to do full CRUD operations on delta tables?

No, it doesn't mean that at all! You can absolutely do CRUD on delta tables.

databricks-sql-connector is just a way to write SQL statements and send them to a Databricks cluster. Any valid SQL will work. That includes SELECT, INSERT, DELETE, GRANT, SET and dozens of other keywords (the whole language spec is here). For context, if you've ever run a query in Databricks SQL through your browser it used databricks-sql-connector :)

@ahsankhawaja's question was about SQL transaction support i.e. writing a query that includes BEGIN TRANSACTION and COMMIT TRANSACTION statements. Spark SQL / Databricks SQL don't have this syntax so they won't work with databricks-sql-connector.

@ahsankhawaja

As @susodapop said, @oke-aditya, yes you can. I have built an API on top of it that does select and CRUD ops so I can expose my Lakehouse to any language / application. I was more interested in using the Lakehouse as a backend for web apps, so we can have the same platform doing all things. I saw Databricks released REST API support the other day as well (https://www.databricks.com/blog/2023/03/07/databricks-sql-statement-execution-api-announcing-public-preview.html), but that was just basic select, no CRUD there. This connector is awesome. I wish at some point we can add transactions support in there.

Awesome work Jess

@susodapop (Contributor Author)

I wish at some point we can add transactions support in there.

Please communicate this to your contact at Databricks! I know there's interest in multi-statement transaction support. The best way to increase its priority is to ask for it concretely; the more customers ask for it, the more traction it receives. That needs to happen through your Databricks contact rather than this open source forum, so it can be routed to the correct places internally :)
