SI: Implement put operations #67

susodapop · 2022-11-15T23:13:02Z

Description

This pull request implements basic PUT operations.

PUT is supported with tests
GET is supported with tests
REMOVE is supported with tests

Forthcoming: Based on our internal design doc we need to restrict which local files may be PUT using uploadsBasePath. This will come in a follow-up PR.

The scope of PECO-397 was to implement PUT only, but I think we should also implement GET in this PR so the e2e test can verify that the file we attempted to PUT was actually captured on the server. Right now we only check for a 200 status code, which is better than nothing but only means that the server said everything is okay. But maybe the server is wrong...

Update: I've now implemented GET in this pull request so we can verify that a PUT operation succeeded.
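The round-trip idea can be sketched as follows. This is only an illustration under stated assumptions: `InMemoryStage` and `put_and_verify` are hypothetical names, and the in-memory class stands in for the real staging server, which the actual e2e test talks to over HTTP.

```python
import hashlib


class InMemoryStage:
    """Hypothetical stand-in for the staging server (not part of the library)."""

    def __init__(self):
        self._files = {}

    def put(self, remote_path: str, data: bytes) -> int:
        # Store the bytes and report success the way an HTTP server would.
        self._files[remote_path] = data
        return 200

    def get(self, remote_path: str) -> bytes:
        return self._files[remote_path]


def put_and_verify(stage: InMemoryStage, remote_path: str, data: bytes) -> bool:
    """PUT the data, then GET it back and compare digests.

    A 200 alone only says the server accepted the request; fetching the
    file again shows that the stored bytes match what was sent.
    """
    if stage.put(remote_path, data) != 200:
        return False
    fetched = stage.get(remote_path)
    return hashlib.sha256(fetched).digest() == hashlib.sha256(data).digest()
```

Comparing digests rather than trusting the status code is the point here; the same check applies whatever transport the real test uses.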

Update: I've applied all PR feedback minus adding a usage example. To make the example I need to know a bit more about how we want to address multi-end-user cases (do we need more than one uploads_base_path, should we rename uploads_base_path, and should we illustrate how to perform retries of the staging operations or simply retry them automatically?)

src/databricks/sql/client.py

susodapop · 2022-11-16T23:17:26Z

src/databricks/sql/client.py

@@ -331,6 +379,10 @@ def execute(
            self.buffer_size_bytes,
            self.arraysize,
        )
+
+        if execute_response.is_staging_operation:


Question for reviewers: is there any specifically desired end-state for the cursor after a staging operation? Maybe we return a new NamedTuple StagingOperationResult with properties of .successful:boolean and perhaps a copy of the operation and localFile that were used?

I don't quite get this question, but the cursor for now will return just one row and we should have reached the end of this cursor.

@susodapop could you please explain, with sample code, how this will provide a different experience to the end user?
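For reference, the NamedTuple floated in the question above might look roughly like this. It is a hypothetical sketch only: `StagingOperationResult` and its fields are taken from the wording of the question, not from code in this PR, and the cursor does not return it today.

```python
from typing import NamedTuple, Optional


class StagingOperationResult(NamedTuple):
    """Hypothetical end-state a cursor could expose after a staging operation."""

    successful: bool                  # whether the PUT/GET/REMOVE completed
    operation: str                    # the operation that was run, e.g. "PUT"
    local_file: Optional[str] = None  # local path used, if any (REMOVE has none)


# Example of what a caller would see under this assumption:
result = StagingOperationResult(successful=True, operation="PUT", local_file="/tmp/data.csv")
if result.successful:
    print(f"{result.operation} of {result.local_file} succeeded")
```

With the current behaviour described in the reply above, the caller instead gets a cursor holding a single row and nothing further to fetch.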

HSCandYH · 2022-11-17T23:15:11Z

src/databricks/sql/client.py

+        req_func = operation_map[operation]
+
+        if local_file:
+            raw_data = open(local_file, "rb")


local_file is super interesting here; I bet customers will try different schemes:
dbfs://bla, https://bla, files://bla

Each of those schemes has a related "open" function, and the client side should try to understand the request and either support or decline it.

I would also like to know how many of those schemes we plan to support in the near term and how this function would grow.

Preliminary spec is to only support upload of local files. Nothing in dbfs or from an arbitrary URL. That restriction isn't implemented here because it's part of a separate ticket. The basic idea is that uploads will only be possible when a user configures an uploadsBasePath pointing to a mounted volume.

I agree that we need a way to hook this behaviour for other file origins, however. I'm going to noodle how we can make this sufficiently generic for now.

cc: @moderakh

Ingestion only supports uploading from the local file system, not from anywhere else. Anything else must fail.
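One way to decline non-local origins up front is to look at the URL scheme before opening anything. This is a minimal sketch and not the check implemented in this PR; `is_local_path` is a hypothetical helper, and treating a single-letter scheme as a Windows drive letter is an assumption of the sketch.

```python
from urllib.parse import urlparse


def is_local_path(local_file: str) -> bool:
    """Return True only for plain file system paths.

    Anything with a URL scheme (dbfs://, https://, files://, ...) is
    rejected. A single-letter "scheme" is assumed to be a Windows drive
    letter such as C: and is treated as local.
    """
    scheme = urlparse(local_file).scheme
    return scheme == "" or len(scheme) == 1


for candidate in ["/tmp/data.csv", "dbfs://bla", "https://bla", "files://bla"]:
    verdict = "local" if is_local_path(candidate) else "rejected"
    print(f"{candidate}: {verdict}")
```

Note that this sketch also rejects file:// URLs; whether those should count as local is a separate decision.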

susodapop · 2022-11-23T20:52:52Z

I've pushed in the uploadsBasePath restriction and associated tests. I have a few questions for the group though:

Should we allow uploads_base_path to be a list of allowed paths?
Do we need to require uploads_base_path for REMOVE operations?
Should we re-name uploads_base_path to allowed_local_dirs since it's technically restricting not only what data we PUT but what data we GET?
Right now we require that uploads_base_path be provided and be a path-like string. Do we need any other explicit constraints? For example, we could prevent users from specifying the root directory '/' as their uploads_base_path. Is that desirable?

cc @moderakh @HSCandYH

xiaonanyang-db

LGTM, except for some minor comments and a request for tests.

moderakh

Thanks @susodapop, at a high level this looks good.

Could you please verify what happens if uploads_base_path = /Users/user1 and row.localFile = /Users/user1/../user2? Maybe add a test.

Also, please add sample code for cursor usage after a staging operation.

Also, once a comment is addressed, could you please resolve it? It's not clear to me which comments have been addressed in the PR and which have not.

moderakh · 2022-12-14T14:17:26Z

src/databricks/sql/client.py

+                "You must provide an uploads_base_path when initialising a connection to perform ingestion commands"
+            )
+
+        row = self.active_result_set.fetchone()


I know self.active_result_set is introduced in this PR.
So this is merely a generic question rather than one specific to staging.

if we are using a field member self.active_result_set for keeping a state that means we won't be able to support multi threading in an application which concurrently uses pysql. is this understanding correct?

I'm confused by the first part of your question:

I know self.active_result_set is introduced in this PR.

I don't believe this is correct. active_result_set has been present since the first version of this library. It's present on main right now.

if we are using a field member self.active_result_set for keeping a state that means we won't be able to support multi threading in an application which concurrently uses pysql

You're pulling on a valid thread. But I disagree with this assessment. In general pysql works fine with multi-threading. In fact, multi-threading is required if you want to cancel a running query (which is reflected in PySQLCoreTestSuite.test_cancel_during_execute).

The specific scenario where active_result_set state would affect multi-threaded applications is if multiple threads are working with the same cursor. Is that a desirable usage pattern? I think there is usually one cursor per thread, in which case there's no issue with shared state.
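The one-cursor-per-thread pattern can be illustrated without a server. This is a sketch under stated assumptions: `FakeCursor` is a hypothetical stand-in that keeps per-cursor state in `active_result_set`, much as the real cursor does, and it does not reproduce the library's actual execute logic.

```python
import threading


class FakeCursor:
    """Hypothetical stand-in for a cursor that stores state on itself."""

    def __init__(self):
        self.active_result_set = None

    def execute(self, query: str) -> "FakeCursor":
        # State lives on this cursor instance, so it is only shared
        # between threads if the cursor itself is shared.
        self.active_result_set = [f"result of {query}"]
        return self

    def fetchone(self) -> str:
        return self.active_result_set[0]


def run_queries(queries):
    """Run each query in its own thread, each with its own cursor."""
    results = {}

    def worker(query: str) -> None:
        cursor = FakeCursor()  # one cursor per thread: no shared state
        results[query] = cursor.execute(query).fetchone()

    threads = [threading.Thread(target=worker, args=(q,)) for q in queries]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

If two threads called execute on the same cursor instead, the second call would overwrite active_result_set before the first thread fetched, which is the shared-cursor scenario described above.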

moderakh · 2022-12-14T14:20:26Z

src/databricks/sql/client.py

+        row = self.active_result_set.fetchone()
+
+        if getattr(row, "localFile", None):
+            if os.path.commonpath([row.localFile, uploads_base_path]) != uploads_base_path:


What happens if uploads_base_path = /Users/user1 and row.localFile = /Users/user1/../user2? What does this method return?

Could you please add some tests for that?

Good catch.

tl;dr I updated the code in 34a0362 so that it resolves any relative paths before checking for their common path. I added a test to prove this.

Before

/Users/user1 and /Users/user1/../user2 show a common path of /Users/user1 which is wrong.

After

/Users/user1 and /Users/user1/../user2 show a common path of /Users which is correct.

@moderakh Are there other cases we should consider?
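The before/after behaviour can be reproduced with a small sketch. It assumes POSIX-style paths and uses `os.path.realpath` to do the resolving; `is_within_base` is a hypothetical helper name, and the code in 34a0362 may resolve paths differently.

```python
import os


def is_within_base(local_file: str, uploads_base_path: str) -> bool:
    """Return True if local_file resolves to a location under uploads_base_path.

    os.path.commonpath compares path components textually and does not
    collapse "..", so both paths are resolved first.
    """
    resolved_file = os.path.realpath(local_file)
    resolved_base = os.path.realpath(uploads_base_path)
    return os.path.commonpath([resolved_file, resolved_base]) == resolved_base


# Allowed: the file really is below the base path.
print(is_within_base("/Users/user1/data.csv", "/Users/user1"))
# Rejected: "/Users/user1/../user2" resolves to "/Users/user2".
print(is_within_base("/Users/user1/../user2/data.csv", "/Users/user1"))
```

On Windows, `os.path.commonpath` raises ValueError when the paths are on different drives, so a real implementation would need to handle that case.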

moderakh · 2022-12-14T14:24:39Z

src/databricks/sql/client.py

@@ -331,6 +379,10 @@ def execute(
            self.buffer_size_bytes,
            self.arraysize,
        )
+
+        if execute_response.is_staging_operation:


@susodapop could you please explain, with sample code, how this will provide a different experience to the end user?

moderakh · 2022-12-20T23:23:55Z

Thanks @susodapop, my answers are inline:

I've pushed in the uploadsBasePath restriction and associated tests. I have a few questions for the group though:

Should we allow uploads_base_path to be a list of allowed paths?

yes

Do we need to require uploads_base_path for REMOVE operations?

yes

Should we re-name uploads_base_path to allowed_local_dirs since it's technically restricting not only what data we PUT but what data we GET?

I think we should make this plural, but it's not clear from allowed_local_dirs which feature we are referring to.
Maybe uploads_base_paths?

Right now we require that uploads_base_path be provided and be a path-like string. Do we need any other explicit constraints? For example, we could prevent users from specifying the root directory '/' as their uploads_base_path. Is that desirable?

I don't think we can do that: it requires knowledge of the specific OS filesystem structure and the user's home path. We rely on the end user to configure that.

cc @moderakh @HSCandYH
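Accepting either a single path or a list of paths could be sketched like this. The parameter name `staging_allowed_local_path` follows the later rename in commit 0261b7a; the helper names `normalize_allowed_paths` and `is_allowed` are hypothetical, and the merged code may be structured differently.

```python
import os
from typing import List, Union


def normalize_allowed_paths(staging_allowed_local_path: Union[str, List[str]]) -> List[str]:
    """Accept one path or a list of paths and always return a list."""
    if isinstance(staging_allowed_local_path, str):
        return [staging_allowed_local_path]
    return list(staging_allowed_local_path)


def is_allowed(local_file: str, staging_allowed_local_path: Union[str, List[str]]) -> bool:
    """Return True if local_file resolves to a location under any allowed path."""
    resolved_file = os.path.realpath(local_file)
    for base in normalize_allowed_paths(staging_allowed_local_path):
        resolved_base = os.path.realpath(base)
        # Resolve both sides first so ".." segments cannot escape the base.
        if os.path.commonpath([resolved_file, resolved_base]) == resolved_base:
            return True
    return False
```

The same check would then cover PUT, GET, and REMOVE, since the answers above require an allowed path for all three.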

Follow up to #67 and #64

* Regenerate TCLIService using latest TCLIService.thrift from DBR (#64)
* SI: Implement GET, PUT, and REMOVE (#67)
* Re-lock dependencies after merging `main`

Signed-off-by: Jesse Whitehouse <[email protected]>

Jesse Whitehouse added 7 commits November 15, 2022 11:03

Basic PUT operation. Currently this never executes because the server doesn't set `isStagingOperation` flag to True...researching

28c3a59

Signed-off-by: Jesse Whitehouse <[email protected]>

Bump Spark CLI service protocol version being used.

1b245b1

This upgrade now captures the isStagingOperation flag. Staging Ops still don't work because the flag shows false. Researching... Signed-off-by: Jesse Whitehouse <[email protected]>

Log when attempting a staging operation

1239def

Signed-off-by: Jesse Whitehouse <[email protected]>

Fix failing unit tests since function signature for ExecuteResponse changed

b605cce

Signed-off-by: Jesse Whitehouse <[email protected]>

Add e2e test for put.

3ed84d8

Stub out delete and get tests so it's clear what has and has not been done. Signed-off-by: Jesse Whitehouse <[email protected]>

Bail on tests if staging_ingestion_user is not set

57b8a34

Signed-off-by: Jesse Whitehouse <[email protected]>

Black client.py

7812278

Signed-off-by: Jesse Whitehouse <[email protected]>

susodapop marked this pull request as ready for review November 16, 2022 23:13

susodapop requested review from arikfr, moderakh and yunbodeng-db as code owners November 16, 2022 23:13

susodapop commented Nov 16, 2022

src/databricks/sql/client.py (outdated)

susodapop commented Nov 16, 2022

src/databricks/sql/client.py (outdated)

susodapop commented Nov 16, 2022

tests/e2e/driver_tests.py (outdated)

Jesse Whitehouse added 3 commits November 17, 2022 09:45

Add unit test that sanity checks _handle_staging_operation is called

6b76439

when the server indicates isStagingOperation=True in the ExecuteResponse Signed-off-by: Jesse Whitehouse <[email protected]>

Fix imports so that this module can be run independently:

3df7c89

poetry run python -m pytest tests/unit/tests.py This command failed with only relative inputs. However poetry run python -m pytest tests/unit would succeed Signed-off-by: Jesse Whitehouse <[email protected]>

Implement GET operation

8f0a02e

Signed-off-by: Jesse Whitehouse <[email protected]>

moderakh requested review from HSCandYH and xiaonanyang-db November 17, 2022 16:16

moderakh reviewed Nov 17, 2022

xiaonanyang-db reviewed Nov 17, 2022

CONTRIBUTING.md (outdated)

xiaonanyang-db reviewed Nov 17, 2022

src/databricks/sql/client.py (outdated)

xiaonanyang-db reviewed Nov 17, 2022

src/databricks/sql/client.py (outdated)

HSCandYH reviewed Nov 17, 2022

Jesse Whitehouse added 5 commits November 23, 2022 13:22

Refactor client.py into distinct methods for each ingestion command type

55525cb

Signed-off-by: Jesse Whitehouse <[email protected]>

Update pypoetry so I can develop on Python 3.10

157ac3d

Signed-off-by: Jesse Whitehouse <[email protected]>

Applied PR feedback around explicit response codes.

0739ccc

Signed-off-by: Jesse Whitehouse <[email protected]>

Applying PR feedback

d3a3651

Signed-off-by: Jesse Whitehouse <[email protected]>

PR feedback

72f917e

Signed-off-by: Jesse Whitehouse <[email protected]>

Jesse Whitehouse added 2 commits November 23, 2022 14:31

Only allow ingestion commands when base_uploads_path is specified

36885a4

Signed-off-by: Jesse Whitehouse <[email protected]>

Restrict local file operations to descendents of uploads_base_path

c0c09d4

Fix lifecycle e2e test so it honours these requirements Signed-off-by: Jesse Whitehouse <[email protected]>

susodapop requested review from HSCandYH, moderakh and xiaonanyang-db November 23, 2022 20:50

susodapop commented Nov 23, 2022

src/databricks/sql/client.py (outdated)

xiaonanyang-db approved these changes Dec 2, 2022

src/databricks/sql/client.py (outdated)

src/databricks/sql/client.py (outdated)

tests/e2e/driver_tests.py

moderakh approved these changes Dec 14, 2022

Jesse Whitehouse added 4 commits December 20, 2022 13:42

Remove per PR feedback

f612795

Signed-off-by: Jesse Whitehouse <[email protected]>

Add check for null local_file per PR feedback

e609ef3

Signed-off-by: Jesse Whitehouse <[email protected]>

Open output stream _after_ successful HTTP request

cdbe2d6

Changed per PR feedback Signed-off-by: Jesse Whitehouse <[email protected]>

Resolve relative paths before comparing row.localFile to uploads_base_path

34a0362

Applies feedback from PR review Signed-off-by: Jesse Whitehouse <[email protected]>

susodapop force-pushed the PECO-397 branch from 891749f to 5daea8d Compare December 20, 2022 22:01

Jesse Whitehouse added 2 commits December 20, 2022 16:03

Add test that PUT fails if file exists in staging location and OVERWRITE not set

c8a64c7

Added following PR feedback Signed-off-by: Jesse Whitehouse <[email protected]>

Add tests: operations fail to modify another user's staging location

d48d3f3

Added following PR feedback Signed-off-by: Jesse Whitehouse <[email protected]>

susodapop force-pushed the PECO-397 branch from 5daea8d to d48d3f3 Compare December 20, 2022 22:06

Jesse Whitehouse added 2 commits December 20, 2022 16:09

Add test that ingestion command fails if local file is blank

e0037e0

Added after PR feedback Signed-off-by: Jesse Whitehouse <[email protected]>

Add test that invalid staging path will fail at server

3fa5d84

Added after PR feedback Signed-off-by: Jesse Whitehouse <[email protected]>

susodapop requested a review from moderakh December 20, 2022 22:50

Jesse Whitehouse added 6 commits December 22, 2022 11:18

Basic usage example (needs tweaking)

4824b68

Signed-off-by: Jesse Whitehouse <[email protected]>

Add samples of GET and REMOVE

469f35f

Signed-off-by: Jesse Whitehouse <[email protected]>

Refactor to allow uploads_base_path to be either a single string object

bdb948a

or a list of strings. Signed-off-by: Jesse Whitehouse <[email protected]>

Refactor uploads_base_path to staging_allowed_local_path

0261b7a

Signed-off-by: Jesse Whitehouse <[email protected]>

Fix mypy static type failures

00d8a49

Signed-off-by: Jesse Whitehouse <[email protected]>

Black src files

7a602e6

Signed-off-by: Jesse Whitehouse <[email protected]>

susodapop merged commit c41d724 into si Dec 30, 2022

susodapop mentioned this pull request Jan 9, 2023

Merge staging ingestion into main #78

Merged
