
Commit 258b961

Author: Jesse

Merge staging ingestion into main (#78)

Follow-up to #67 and #64:

* Regenerate TCLIService using latest TCLIService.thrift from DBR (#64)
* SI: Implement GET, PUT, and REMOVE (#67)
* Re-lock dependencies after merging `main`

Signed-off-by: Jesse Whitehouse <[email protected]>

1 parent 3a4d175 · commit 258b961

File tree

14 files changed: +22767 −1177 lines changed

CONTRIBUTING.md (+7)

````diff
@@ -112,6 +112,7 @@ export access_token=""
 There are several e2e test suites available:
 - `PySQLCoreTestSuite`
 - `PySQLLargeQueriesSuite`
+- `PySQLStagingIngestionTestSuite`
 - `PySQLRetryTestSuite.HTTP503Suite` **[not documented]**
 - `PySQLRetryTestSuite.HTTP429Suite` **[not documented]**
 - `PySQLUnityCatalogTestSuite` **[not documented]**
@@ -122,6 +123,12 @@ To execute the core test suite:
 poetry run python -m pytest tests/e2e/driver_tests.py::PySQLCoreTestSuite
 ```
 
+The `PySQLCoreTestSuite` namespace contains tests for all of the connector's basic features and behaviours. This is the default namespace where tests should be written unless they require specially configured clusters or take an especially long time to execute by design.
+
+The `PySQLLargeQueriesSuite` namespace contains long-running query tests and is kept separate. In general, if the `PySQLCoreTestSuite` passes then these tests will as well.
+
+The `PySQLStagingIngestionTestSuite` namespace requires a cluster running DBR version > 12.x, which supports staging ingestion commands.
+
 The suites marked `[not documented]` require additional configuration which will be documented at a later time.
 
 ### Code formatting
````
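By analogy with the core-suite command documented in the diff above, the new staging ingestion suite can presumably be invoked the same way. Note the exact pytest node ID below is inferred from that pattern, not stated anywhere in this diff:

```shell
# Assumed invocation, mirroring the documented PySQLCoreTestSuite command;
# this presumes PySQLStagingIngestionTestSuite also lives in driver_tests.py.
# Per the note above, it requires a cluster running DBR version > 12.x.
poetry run python -m pytest tests/e2e/driver_tests.py::PySQLStagingIngestionTestSuite
```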

examples/README.md (+2 −1)

```diff
@@ -35,4 +35,5 @@ To run all of these examples you can clone the entire repository to your disk. O
 - **`interactive_oauth.py`** shows the simplest example of authenticating by OAuth (no need for a PAT generated in the DBSQL UI) while Bring Your Own IDP is in public preview. When you run the script it will open a browser window so you can authenticate. Afterward, the script fetches some sample data from Databricks and prints it to the screen. For this script, the OAuth token is not persisted, which means you need to authenticate every time you run the script.
 - **`persistent_oauth.py`** shows a more advanced example of authenticating by OAuth while Bring Your Own IDP is in public preview. In this case, it shows how to use a subclass of `OAuthPersistence` to reuse an OAuth token across script executions.
-- **`set_user_agent.py`** shows how to customize the user agent header used for Thrift commands. In this example the string `ExamplePartnerTag` will be added to the user agent on every request.
+- **`set_user_agent.py`** shows how to customize the user agent header used for Thrift commands. In this example the string `ExamplePartnerTag` will be added to the user agent on every request.
+- **`staging_ingestion.py`** shows how the connector handles Databricks' experimental staging ingestion commands `GET`, `PUT`, and `REMOVE`.
```

examples/staging_ingestion.py (+87)

```python
from databricks import sql
import os

"""
Databricks experimentally supports data ingestion of local files via a cloud staging location.
Ingestion commands work on DBR > 12, and you must include a staging_allowed_local_path kwarg when
calling sql.connect().

Use databricks-sql-connector to PUT files into the staging location where Databricks can access them:

    PUT '/path/to/local/data.csv' INTO 'stage://tmp/[email protected]/salesdata/september.csv' OVERWRITE

Files in a staging location can also be retrieved with a GET command:

    GET 'stage://tmp/[email protected]/salesdata/september.csv' TO 'data.csv'

and deleted with a REMOVE command:

    REMOVE 'stage://tmp/[email protected]/salesdata/september.csv'

Ingestion queries are passed to cursor.execute() like any other query. For GET and PUT commands, a local file
will be read or written. For security, this local file must be contained within, or descended from, a
staging_allowed_local_path of the connection.

Additionally, the connection can only manipulate files within the cloud storage location of the
authenticated user.

To run this script:

1. Set the INGESTION_USER constant to the account email address of the authenticated user
2. Set the FILEPATH constant to the path of a file that will be uploaded (this example assumes it's a CSV file)
3. Run this file

Note: staging_allowed_local_path can be either a Pathlike object or a list of Pathlike objects.
"""

INGESTION_USER = "[email protected]"
FILEPATH = "example.csv"

# FILEPATH can be relative to the current directory.
# Resolve it into an absolute path.
_complete_path = os.path.realpath(FILEPATH)

if not os.path.exists(_complete_path):
    # It's easiest to save a file in the same directory as this script. But any path to a file will work.
    raise Exception(
        "You need to set FILEPATH in this script to a file that actually exists."
    )

# Set staging_allowed_local_path equal to the directory that contains FILEPATH
staging_allowed_local_path = os.path.split(_complete_path)[0]

with sql.connect(
    server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
    http_path=os.getenv("DATABRICKS_HTTP_PATH"),
    access_token=os.getenv("DATABRICKS_TOKEN"),
    staging_allowed_local_path=staging_allowed_local_path,
) as connection:

    with connection.cursor() as cursor:

        # Ingestion commands are executed like any other SQL.
        # Here's a sample PUT query. You can remove OVERWRITE at the end to avoid silently overwriting data.
        query = f"PUT '{_complete_path}' INTO 'stage://tmp/{INGESTION_USER}/pysql_examples/demo.csv' OVERWRITE"

        print(f"Uploading {FILEPATH} to staging location")
        cursor.execute(query)
        print("Upload was successful")

        temp_fp = os.path.realpath("temp.csv")

        # Here's a sample GET query. Note that `temp_fp` must also be contained within, or descended from,
        # the staging_allowed_local_path.
        query = (
            f"GET 'stage://tmp/{INGESTION_USER}/pysql_examples/demo.csv' TO '{temp_fp}'"
        )

        print("Fetching from staging location into new file called temp.csv")
        cursor.execute(query)
        print("Download was successful")

        # Here's a sample REMOVE query. It cleans up the demo.csv created in our first query.
        query = f"REMOVE 'stage://tmp/{INGESTION_USER}/pysql_examples/demo.csv'"

        print("Removing demo.csv from staging location")
        cursor.execute(query)
        print("Remove was successful")
```
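The containment rule from the script's docstring (a local file must be contained within, or descended from, a `staging_allowed_local_path` of the connection) can be illustrated with a small standalone check. This is a sketch only, not the connector's actual validation code; `is_allowed` is a hypothetical helper showing one way to express the rule with `os.path`:

```python
import os

def is_allowed(local_file: str, allowed_paths) -> bool:
    """Return True if local_file lives within (or below) any allowed path.

    Illustrative only: mirrors the containment rule described in
    staging_ingestion.py's docstring, not the connector's real logic.
    """
    # Per the docstring, staging_allowed_local_path may be a single
    # Pathlike object or a list of them; normalize to a list.
    if isinstance(allowed_paths, (str, os.PathLike)):
        allowed_paths = [allowed_paths]
    target = os.path.realpath(local_file)
    for allowed in allowed_paths:
        root = os.path.realpath(allowed)
        # If the common path of root and target equals root, the target
        # is contained within (or descended from) that allowed path.
        if os.path.commonpath([root, target]) == root:
            return True
    return False

print(is_allowed("/tmp/stage/data.csv", "/tmp/stage"))  # True
print(is_allowed("/etc/passwd", "/tmp/stage"))          # False
```

This explains why both `_complete_path` (for PUT) and `temp_fp` (for GET) in the example above resolve to locations under the directory passed as `staging_allowed_local_path`.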
