Skip to content

SI: Implement put operations #67

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 41 commits into from
Dec 30, 2022
Merged

SI: Implement put operations #67

merged 41 commits into from
Dec 30, 2022

Conversation

susodapop
Copy link
Contributor

@susodapop susodapop commented Nov 15, 2022

Description

This pull request implements basic PUT operations.

  • PUT is supported with tests
  • GET is supported with tests
  • REMOVE is supported with tests

Forthcoming: Based on our internal design doc we need to restrict which local files may be PUT using uploadsBasePath. This will come in a follow-up PR.

The scope of PECO-397 was to implement PUT only. But I think we should include implementing GET in this PR so we can verify in the e2e test that the file we attempted to PUT was successfully captured on the server. Right now we're just checking for a 200 status code. Which is better than nothing but only means that the server said everything is okay. But maybe the server is wrong...

Update: I've now implemented GET in this pull request so we can verify that a PUT operation succeeded.

Update: I've applied all PR feedback minus adding a usage example. To make the example I need to know a bit more about how we want to address multi-end-user cases (do we need more than one uploads_base_path, should we rename uploads_base_path, and should we illustrate how to perform retries of the staging operations or simply retry them automatically?)

Jesse Whitehouse added 7 commits November 15, 2022 11:03
doesn't set `isStagingOperation` flag to True...researching

Signed-off-by: Jesse Whitehouse <[email protected]>
This upgrade now captures the isStagingOperation flag.

Staging Ops still don't work because the flag shows false. Researching...

Signed-off-by: Jesse Whitehouse <[email protected]>
Stub out delete and get tests so it's clear what has and has not been done.

Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
@susodapop susodapop marked this pull request as ready for review November 16, 2022 23:13
@@ -331,6 +379,10 @@ def execute(
self.buffer_size_bytes,
self.arraysize,
)

if execute_response.is_staging_operation:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question for reviewers: is there any specifically desired end-state for the cursor after a staging operation? Maybe we return a new NamedTuple StagingOperationResult with properties of .successful:boolean and perhaps a copy of the operation and localFile that were used?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite get this question, but the cursor for now will return just one row and we should have reached the end of this cursor.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@susodapop could you please with an a sample code explain how this will provide different experience to the end user?

Jesse Whitehouse added 3 commits November 17, 2022 09:45
when the server indicates isStagingOperation=True in the ExecuteResponse

Signed-off-by: Jesse Whitehouse <[email protected]>
poetry run python -m pytest tests/unit/tests.py

This command failed with only relative inputs.

However

poetry run python -m pytest tests/unit

would succeed

Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
req_func = operation_map[operation]

if local_file:
raw_data = open(local_file, "rb")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

local_file is super interesting here, customer might try different schemes I bet.
dbfs://bla, https://bla, files://bla

Each of those schemes has a related "open" function and client side shall try to understand their ask and support/decline their request.

I also would like to know how much of those schemes we plan to support in near term and how this function would grow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preliminary spec is to only support upload of local files. Nothing in dbfs or from an arbitrary URL. That restriction isn't implemented here because it's part of a separate ticket. The basic idea is that uploads will only be possible when a user configures an uploadsBasePath pointing to a mounted volume.

I agree that we need a way to hook this behaviour for other file origins, however. I'm going to noodle how we can make this sufficiently generic for now.

cc: @moderakh

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ingestion only supports uploading from local file system not elsewhere. anything else must fail.

Jesse Whitehouse added 5 commits November 23, 2022 13:22
Jesse Whitehouse added 2 commits November 23, 2022 14:31
Fix lifecycle e2e test so it honours these requirements

Signed-off-by: Jesse Whitehouse <[email protected]>
@susodapop
Copy link
Contributor Author

I've pushed in the uploadsBasePath restriction and associated tests. I have a few questions for the group though:

  1. Should we allow uploads_base_path to be a list of allowed paths?
  2. Do we need to require uploads_base_path for REMOVE operations?
  3. Should we re-name uploads_base_path to allowed_local_dirs since it's technically restricting not only what data we PUT but what data we GET?
  4. Right now we requireuploads_base_path be provide and be a path-like string. Do we need any other explicit constraints? For example we could prevent users from specifying the root directory '/' as their uploads_base_path. Is that desireable?

cc @moderakh @HSCandYH

Copy link

@xiaonanyang-db xiaonanyang-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except some minor comments and tests request

Copy link
Collaborator

@moderakh moderakh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @susodapop high level looks good.

could you please verify what happens if uploads_base_path = /Users/user1 and row.localFile = /Users/user1/../user2 maybe add a test.

also please add a sample code for the cursor usage after staging operation.

also could you please once a comment is addressed, resolve the comment. not clear to me which comments are addressed and which one not addressed in the PR.

"You must provide an uploads_base_path when initialising a connection to perform ingestion commands"
)

row = self.active_result_set.fetchone()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know self.active_result_set is introduced in this PR.
so this is merely a generic question rather that specific to staging.

if we are using a field member self.active_result_set for keeping a state that means we won't be able to support multi threading in an application which concurrently uses pysql. is this understanding correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused by the first part of your question:

I know self.active_result_set is introduced in this PR.

I don't believe this is correct. active_result_set has been present since the first version of this library. It's present on main right now.

if we are using a field member self.active_result_set for keeping a state that means we won't be able to support multi threading in an application which concurrently uses pysql

You're pulling on a valid thread. But I disagree with this assessment. In general pysql works fine with multi-threading. In fact, multi-threading is required if you want to cancel a running query (which is reflected in PySQLCoreTestSuite.test_cancel_during_execute).

The specific scenario where active_result_set state would affect multi-threaded applications is if multiple threads are working with the same cursor. Is that a desirable usage pattern? I think there is usually one cursor per thread, in which case there's no issue with shared state.

row = self.active_result_set.fetchone()

if getattr(row, "localFile", None):
if os.path.commonpath([row.localFile, uploads_base_path]) != uploads_base_path:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if uploads_base_path = /Users/user1 and row.localFile = /Users/user1/../user2 what does this method return?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please add some tests for that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

tl;dr I updated the code in 34a0362 so that it resolves any relative paths before checking for their common_path. I added a test to prove this.

Before

/Users/user1 and /Users/user1/../user2 show a common path of /Users/user1 which is wrong.

After

/Users/user1 and /Users/user1/../user2 show a common path of /Users which is correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@moderakh Are there other cases we should consider?

@@ -331,6 +379,10 @@ def execute(
self.buffer_size_bytes,
self.arraysize,
)

if execute_response.is_staging_operation:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@susodapop could you please with an a sample code explain how this will provide different experience to the end user?

Jesse Whitehouse added 4 commits December 20, 2022 13:42
Signed-off-by: Jesse Whitehouse <[email protected]>
Changed per PR feedback

Signed-off-by: Jesse Whitehouse <[email protected]>
…_path

Applies feedback from PR review

Signed-off-by: Jesse Whitehouse <[email protected]>
Jesse Whitehouse added 2 commits December 20, 2022 16:03
…ITE not set

Added following PR feedback

Signed-off-by: Jesse Whitehouse <[email protected]>
Added following PR feedback

Signed-off-by: Jesse Whitehouse <[email protected]>
Jesse Whitehouse added 2 commits December 20, 2022 16:09
Added after PR feedback

Signed-off-by: Jesse Whitehouse <[email protected]>
Added after PR feedback

Signed-off-by: Jesse Whitehouse <[email protected]>
@susodapop susodapop requested a review from moderakh December 20, 2022 22:50
@moderakh
Copy link
Collaborator

thanks @susodapop my answers inline:

I've pushed in the uploadsBasePath restriction and associated tests. I have a few questions for the group though:

  1. Should we allow uploads_base_path to be a list of allowed paths?

yes

  1. Do we need to require uploads_base_path for REMOVE operations?

yes

  1. Should we re-name uploads_base_path to allowed_local_dirs since it's technically restricting not only what data we PUT but what data we GET?

I thinks we should make this plural but allowed_local_dirs is not clear to what feature we are referring.
maybe uploads_base_paths ?

  1. Right now we requireuploads_base_path be provide and be a path-like string. Do we need any other explicit constraints? For example we could prevent users from specifying the root directory '/' as their uploads_base_path. Is that desireable?

I don't think we can do that that requires knowledge of specific OS filesystem structure and home user path. We rely on the end user to configure that.

cc @moderakh @HSCandYH

Jesse Whitehouse added 6 commits December 22, 2022 11:18
Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
or a list of strings.

Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
Signed-off-by: Jesse Whitehouse <[email protected]>
@susodapop susodapop merged commit c41d724 into si Dec 30, 2022
susodapop pushed a commit that referenced this pull request Jan 10, 2023
Follow up to #67 and #64 

* Regenerate TCLIService using latest TCLIService.thrift from DBR (#64)
* SI: Implement GET, PUT, and REMOVE (#67)
* Re-lock dependencies after merging `main`

Signed-off-by: Jesse Whitehouse <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants