Skip to content

repackModel could expose an output_path parameter #2803

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
tuliocasagrande opened this issue Dec 17, 2021 · 1 comment
Closed

repackModel could expose an output_path parameter #2803

tuliocasagrande opened this issue Dec 17, 2021 · 1 comment
Labels
component: pipelines Relates to the SageMaker Pipeline Platform type: feature request

Comments

@tuliocasagrande
Copy link
Contributor

Describe the feature you'd like
Right now, repackModel steps write to root of the default bucket. Here's one line from test code that we can verify that:

"OutputDataConfig": {"S3OutputPath": f"s3://{BUCKET}/"},

It would be great if we could expose an output_path and perhaps also a code_location parameters to help us organize our S3 buckets.

How would this feature be used? Please describe.
The feature would be used to help us organize S3 buckets into namespaces/hierachy. For example, right now I organize my prefixes as:

s3://<myprojectbucket>/<base_job_prefix>/preprocess/...
s3://<myprojectbucket>/<base_job_prefix>/train/...
s3://<myprojectbucket>/<base_job_prefix>/evaluate/...

But since repackModel doesn't expose parameters to tune S3 outputs, I also end up some objects in the root, example:

s3://<myprojectbucket>/pipelines-<jobid-1>-sklearnRepackModel.../output/model.tar.gz
s3://<myprojectbucket>/pipelines-<jobid-2>-sklearnRepackModel.../output/model.tar.gz
s3://<myprojectbucket>/pipelines-<jobid-N>-sklearnRepackModel.../output/model.tar.gz
s3://<myprojectbucket>/sagemaker-scikit-learn-<timestamp-1>/source/sourcedir.tar.gz
s3://<myprojectbucket>/sagemaker-scikit-learn-<timestamp-2>/source/sourcedir.tar.gz
s3://<myprojectbucket>/sagemaker-scikit-learn-<timestamp-N>/source/sourcedir.tar.gz

Describe alternatives you've considered
I'd need some guidance to find all the correct places, but I can submit a PR with these changes:

  1. step_collections.RegisterModel and step_collections.EstimatorTransformer, add a new parameter called repack_output_path
  2. On _RepackModelStep() calls (L139, L183, L324), handover the parameter repack_output_path=repack_output_path.
  3. Inside of _RepackModelStep, on the SKLearn() call, add output_path=repack_output_path and code_location=repack_output_path.
  4. Maybe another place I didn't identify.

I'd also need help to understand which unit tests would be needed, on RegisterModel, EstimatorTransformer, _RepackModelStep, or all of them.

Thanks for your review!

tuliocasagrande added a commit to tuliocasagrande/sagemaker-python-sdk that referenced this issue Dec 17, 2021
tuliocasagrande added a commit to tuliocasagrande/sagemaker-python-sdk that referenced this issue Dec 17, 2021
tuliocasagrande added a commit to tuliocasagrande/sagemaker-python-sdk that referenced this issue Dec 17, 2021
@EthanShouhanCheng EthanShouhanCheng added the component: pipelines Relates to the SageMaker Pipeline Platform label Jan 6, 2022
@jerrypeng7773
Copy link
Contributor

@tuliocasagrande looks like your already unblocked yourself =)

but for future reference, RegisterModel takes kwargs, and you will be able to pass outpout_path as a kwarg to it, and the repack model training job will use it as its OutputDataConfig.S3Outputpath.

Closing this issue, feel free to re-open if you have any other concern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component: pipelines Relates to the SageMaker Pipeline Platform type: feature request
Projects
None yet
Development

No branches or pull requests

3 participants