
Add SQLAlchemy Dialect #57

Merged · 3 commits · Feb 17, 2023

23 changes: 17 additions & 6 deletions README.md
@@ -3,7 +3,7 @@
[![PyPI](https://img.shields.io/pypi/v/databricks-sql-connector?style=flat-square)](https://pypi.org/project/databricks-sql-connector/)
[![Downloads](https://pepy.tech/badge/databricks-sql-connector)](https://pepy.tech/project/databricks-sql-connector)

The Databricks SQL Connector for Python allows you to develop Python applications that connect to Databricks clusters and SQL warehouses. It is a Thrift-based client with no dependencies on ODBC or JDBC. It conforms to the [Python DB API 2.0 specification](https://www.python.org/dev/peps/pep-0249/) and exposes a [SQLAlchemy](https://www.sqlalchemy.org/) dialect for use with tools like `pandas` and `alembic` which use SQLAlchemy to execute DDL.

This connector uses Arrow as the data-exchange format, and supports APIs to directly fetch Arrow tables. Arrow tables are wrapped in the `ArrowQueue` class to provide a natural API to get several rows at a time.

@@ -24,16 +24,27 @@ For the latest documentation, see

Install the library with `pip install databricks-sql-connector`

Note: Don't hard-code authentication secrets into your Python code. Use environment variables instead:

```bash
export DATABRICKS_HOST=********.databricks.com
export DATABRICKS_HTTP_PATH=/sql/1.0/endpoints/****************
export DATABRICKS_TOKEN=dapi********************************
```

Example usage:
```python
import os
from databricks import sql

host = os.getenv("DATABRICKS_HOST")
http_path = os.getenv("DATABRICKS_HTTP_PATH")
access_token = os.getenv("DATABRICKS_TOKEN")

connection = sql.connect(
    server_hostname=host,
    http_path=http_path,
    access_token=access_token)

cursor = connection.cursor()
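# Run a query and fetch the results. `SELECT 1` is a stand-in here; any
# Databricks SQL statement works.
cursor.execute("SELECT 1")
print(cursor.fetchall())

cursor.close()
connection.close()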

```
3 changes: 2 additions & 1 deletion examples/README.md
@@ -36,4 +36,5 @@ To run all of these examples you can clone the entire repository to your disk.
- **`persistent_oauth.py`** shows a more advanced example of authenticating by OAuth while Bring Your Own IDP is in public preview. In this case, it shows how to use a subclass of `OAuthPersistence` to reuse an OAuth token across script executions.
- **`set_user_agent.py`** shows how to customize the user agent header used for Thrift commands. In
this example the string `ExamplePartnerTag` will be added to the user agent on every request.
- **`staging_ingestion.py`** shows how the connector handles Databricks' experimental staging ingestion commands `GET`, `PUT`, and `REMOVE`.
- **`sqlalchemy.py`** shows a basic example of connecting to Databricks with [SQLAlchemy](https://www.sqlalchemy.org/).
92 changes: 92 additions & 0 deletions examples/sqlalchemy.py
@@ -0,0 +1,92 @@
"""
databricks-sql-connector includes a SQLAlchemy dialect compatible with Databricks SQL.
It aims to be a drop-in replacement for the crflynn/sqlalchemy-databricks project, implementing
more of the Databricks API, particularly around table reflection, Alembic usage, and data
ingestion with pandas.

Because of the extent of SQLAlchemy's capabilities it isn't feasible to provide examples of every
usage in a single script, so we only provide a basic one here. More examples are found in our test
suite at tests/e2e/sqlalchemy/test_basic.py and in the PR that implements this change:

https://github.com/databricks/databricks-sql-python/pull/57

# What's already supported

Most of the functionality is demonstrated in the e2e tests mentioned above. The list below is
derived from those test method names:

- Create and drop tables with SQLAlchemy Core
- Create and drop tables with SQLAlchemy ORM
- Read created tables via reflection
- Modify column nullability
- Insert records manually
- Insert records with pandas.to_sql (note that this does not work for DataFrames with indexes;
see the pandas sketch at the end of this script)

This connector also aims to support Alembic for programmatic Delta table schema maintenance. This
behaviour is not yet backed by integration tests, which will follow in a subsequent PR as we learn
more about customer use cases. That said, the following behaviours have been tested manually (see
the Alembic wiring sketch after this docstring):

- Autogenerate revisions with alembic revision --autogenerate
- Upgrade and downgrade between revisions with `alembic upgrade <revision hash>` and
`alembic downgrade <revision hash>`

# Known Gaps
- MAP, ARRAY, and STRUCT types: this dialect can read these types out as strings, but you cannot
define a SQLAlchemy model with, e.g., databricks.sqlalchemy.dialect.types.DatabricksMap because
these types are not implemented yet.
- Constraints: with the addition of information_schema to Unity Catalog, Databricks SQL supports
foreign key and primary key constraints. This dialect can write these constraints, but the ability
for Alembic to reflect and modify them programmatically has not been tested.
"""

import os
from sqlalchemy.orm import declarative_base, Session
from sqlalchemy import Column, String, Integer, BOOLEAN, create_engine, select

host = os.getenv("DATABRICKS_SERVER_HOSTNAME")
http_path = os.getenv("DATABRICKS_HTTP_PATH")
access_token = os.getenv("DATABRICKS_TOKEN")
catalog = os.getenv("DATABRICKS_CATALOG")
schema = os.getenv("DATABRICKS_SCHEMA")


# Extra arguments are passed untouched to the driver
# See thrift_backend.py for the complete list
extra_connect_args = {
"_tls_verify_hostname": True,
"_user_agent_entry": "PySQL Example Script",
}

engine = create_engine(
f"databricks://token:{access_token}@{host}?http_path={http_path}&catalog={catalog}&schema={schema}",
connect_args=extra_connect_args,
)
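
# Optional sanity check (a sketch, not part of the original example): confirm
# the engine can reach Databricks before doing any ORM work.
from sqlalchemy import text

with engine.connect() as conn:
    assert conn.execute(text("SELECT 1")).scalar() == 1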
session = Session(bind=engine)
base = declarative_base(bind=engine)


class SampleObject(base):

__tablename__ = "mySampleTable"

name = Column(String(255), primary_key=True)
episodes = Column(Integer)
some_bool = Column(BOOLEAN)


base.metadata.create_all()

sample_object_1 = SampleObject(name="Bim Adewunmi", episodes=6, some_bool=True)
sample_object_2 = SampleObject(name="Miki Meek", episodes=12, some_bool=False)

session.add(sample_object_1)
session.add(sample_object_2)

session.commit()

stmt = select(SampleObject).where(SampleObject.name.in_(["Bim Adewunmi", "Miki Meek"]))

output = list(session.scalars(stmt))
assert len(output) == 2

base.metadata.drop_all()
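
# A minimal sketch of the pandas ingestion path mentioned in the docstring
# (assumes pandas is installed). to_sql does not work for DataFrames with
# indexes, so pass index=False. The table name here is hypothetical.
import pandas as pd

df = pd.DataFrame({"name": ["Ira Glass"], "episodes": [750], "some_bool": [True]})
df.to_sql("pysql_example_pandas_table", con=engine, if_exists="replace", index=False)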