-
Notifications
You must be signed in to change notification settings - Fork 102
SI: Implement put operations #67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
doesn't set `isStagingOperation` flag to True...researching Signed-off-by: Jesse Whitehouse <[email protected]>
This upgrade now captures the isStagingOperation flag. Staging Ops still don't work because the flag shows false. Researching... Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
…hanged Signed-off-by: Jesse Whitehouse <[email protected]>
Stub out delete and get tests so it's clear what has and has not been done. Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
@@ -331,6 +379,10 @@ def execute( | |||
self.buffer_size_bytes, | |||
self.arraysize, | |||
) | |||
|
|||
if execute_response.is_staging_operation: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question for reviewers: is there any specifically desired end-state for the cursor
after a staging operation? Maybe we return a new NamedTuple StagingOperationResult
with properties of .successful:boolean
and perhaps a copy of the operation
and localFile
that were used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite get this question, but the cursor for now will return just one row and we should have reached the end of this cursor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@susodapop could you please with an a sample code explain how this will provide different experience to the end user?
when the server indicates isStagingOperation=True in the ExecuteResponse Signed-off-by: Jesse Whitehouse <[email protected]>
poetry run python -m pytest tests/unit/tests.py This command failed with only relative inputs. However poetry run python -m pytest tests/unit would succeed Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
src/databricks/sql/client.py
Outdated
req_func = operation_map[operation] | ||
|
||
if local_file: | ||
raw_data = open(local_file, "rb") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
local_file is super interesting here, customer might try different schemes I bet.
dbfs://bla, https://bla, files://bla
Each of those schemes has a related "open" function and client side shall try to understand their ask and support/decline their request.
I also would like to know how much of those schemes we plan to support in near term and how this function would grow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preliminary spec is to only support upload of local files. Nothing in dbfs or from an arbitrary URL. That restriction isn't implemented here because it's part of a separate ticket. The basic idea is that uploads will only be possible when a user configures an uploadsBasePath
pointing to a mounted volume.
I agree that we need a way to hook this behaviour for other file origins, however. I'm going to noodle how we can make this sufficiently generic for now.
cc: @moderakh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ingestion only supports uploading from local file system not elsewhere. anything else must fail.
Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
Fix lifecycle e2e test so it honours these requirements Signed-off-by: Jesse Whitehouse <[email protected]>
I've pushed in the
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except some minor comments and tests request
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @susodapop high level looks good.
could you please verify what happens if uploads_base_path = /Users/user1
and row.localFile = /Users/user1/../user2
maybe add a test.
also please add a sample code for the cursor usage after staging operation.
also could you please once a comment is addressed, resolve the comment. not clear to me which comments are addressed and which one not addressed in the PR.
"You must provide an uploads_base_path when initialising a connection to perform ingestion commands" | ||
) | ||
|
||
row = self.active_result_set.fetchone() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know self.active_result_set
is introduced in this PR.
so this is merely a generic question rather that specific to staging.
if we are using a field member self.active_result_set
for keeping a state that means we won't be able to support multi threading in an application which concurrently uses pysql. is this understanding correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused by the first part of your question:
I know self.active_result_set is introduced in this PR.
I don't believe this is correct. active_result_set
has been present since the first version of this library. It's present on main
right now.
if we are using a field member self.active_result_set for keeping a state that means we won't be able to support multi threading in an application which concurrently uses pysql
You're pulling on a valid thread. But I disagree with this assessment. In general pysql works fine with multi-threading. In fact, multi-threading is required if you want to cancel a running query (which is reflected in PySQLCoreTestSuite.test_cancel_during_execute
).
The specific scenario where active_result_set
state would affect multi-threaded applications is if multiple threads are working with the same cursor. Is that a desirable usage pattern? I think there is usually one cursor per thread, in which case there's no issue with shared state.
src/databricks/sql/client.py
Outdated
row = self.active_result_set.fetchone() | ||
|
||
if getattr(row, "localFile", None): | ||
if os.path.commonpath([row.localFile, uploads_base_path]) != uploads_base_path: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if uploads_base_path = /Users/user1
and row.localFile = /Users/user1/../user2
what does this method return?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you please add some tests for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch.
tl;dr I updated the code in 34a0362 so that it resolves any relative paths before checking for their common_path. I added a test to prove this.
Before
/Users/user1
and /Users/user1/../user2
show a common path of /Users/user1
which is wrong.
After
/Users/user1
and /Users/user1/../user2
show a common path of /Users
which is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@moderakh Are there other cases we should consider?
@@ -331,6 +379,10 @@ def execute( | |||
self.buffer_size_bytes, | |||
self.arraysize, | |||
) | |||
|
|||
if execute_response.is_staging_operation: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@susodapop could you please with an a sample code explain how this will provide different experience to the end user?
Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
Changed per PR feedback Signed-off-by: Jesse Whitehouse <[email protected]>
…_path Applies feedback from PR review Signed-off-by: Jesse Whitehouse <[email protected]>
…ITE not set Added following PR feedback Signed-off-by: Jesse Whitehouse <[email protected]>
Added following PR feedback Signed-off-by: Jesse Whitehouse <[email protected]>
Added after PR feedback Signed-off-by: Jesse Whitehouse <[email protected]>
Added after PR feedback Signed-off-by: Jesse Whitehouse <[email protected]>
thanks @susodapop my answers inline:
yes
yes
I thinks we should make this plural but
I don't think we can do that that requires knowledge of specific OS filesystem structure and home user path. We rely on the end user to configure that. |
Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
or a list of strings. Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
Follow up to #67 and #64 * Regenerate TCLIService using latest TCLIService.thrift from DBR (#64) * SI: Implement GET, PUT, and REMOVE (#67) * Re-lock dependencies after merging `main` Signed-off-by: Jesse Whitehouse <[email protected]>
Description
This pull request implements basic
PUT
operations.Forthcoming: Based on our internal design doc we need to restrict which local files may be PUT using
uploadsBasePath
. This will come in a follow-up PR.The scope of PECO-397 was to implement PUT only. But I think we should include implementing GET in this PR so we can verify in the e2e test that the file we attempted to PUT was successfully captured on the server. Right now we're just checking for a 200 status code. Which is better than nothing but only means that the server said everything is okay. But maybe the server is wrong...
Update: I've now implemented GET in this pull request so we can verify that a PUT operation succeeded.
Update: I've applied all PR feedback minus adding a usage example. To make the example I need to know a bit more about how we want to address multi-end-user cases (do we need more than one
uploads_base_path
, should we renameuploads_base_path
, and should we illustrate how to perform retries of the staging operations or simply retry them automatically?)