Sagemaker Processors base_job_name argument not working when used with sagemaker pipelines #3971
Comments
Hi @j99ca, sorry about the confusion around local artifact S3 storage. Last year, we released improvements for Pipeline caching, and part of those improvements was an update to the S3 path structure for local artifact uploads: the previous job_name / timestamp path structure was replaced with path components such as the pipeline name, step name, and a hash of the actual code files. This organizes artifacts under a single pipeline and ensures that only a change to the actual artifact will cause a cache miss and re-execute the pipeline. The docs for the path structures are here: https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html#s3-artifact-folder-structure. It seems there is still some documentation that discusses the old job_name-based behavior.
Hi @brockwade633, thanks for the response. I looked at the source code for the processor step and how it generates the name, and it appears it does not use the pipeline name: it uses the step name followed by a hash of the code that the step uses. This is unfortunate because it's causing unexpected collisions. I have several pipelines that use the same source code but different ProcessingInput files, and they collide each time I build the pipelines. If the pipeline name were included in the generated path I would not have this issue, but the sagemaker SDK's path generation uses only the step name and a code hash.
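A rough, hypothetical sketch of that construction (the function name and hashing details below are illustrative, not the actual SDK source):

```python
import hashlib


def generated_code_prefix(bucket: str, step_name: str, code_path: str) -> str:
    """Illustrative sketch of the described behavior, not the actual SDK code."""
    with open(code_path, "rb") as f:
        code_hash = hashlib.sha256(f.read()).hexdigest()
    # The prefix is built from the step name and a hash of the code only;
    # because the pipeline name is absent, two pipelines that share the same
    # step name and code end up writing to the same S3 location.
    return f"s3://{bucket}/{step_name}-{code_hash}/input"
```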
This path, which is calculated for the code, is reused for the other ProcessingInputs instead of generating a separate one for each input.
Hi @j99ca, would you mind sharing your python pipeline code so I can take a closer look? In the context of a PipelineSession, the job name created via base_job_name is handled differently.
@brockwade633 I am currently using the older method of passing processor=my_processor and job_arguments=[...], rather than the newer method of calling the script processor's run function within a PipelineSession and passing step_args. Most of this code was written a couple of years ago. If I update the deprecated usage (which is on my list of tasks), are you saying that would use the pipeline name in constructing the path? As a temporary workaround I added a line to override the step.name value, which seems to work for now and gets our pipelines functioning properly.
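A minimal sketch of that workaround, assuming a legacy-style ProcessingStep; my_processor, my_inputs, and the names used are illustrative placeholders:

```python
from sagemaker.workflow.steps import ProcessingStep

step = ProcessingStep(
    name="preprocess",
    processor=my_processor,            # processor created with base_job_name
    code="preprocess.py",
    inputs=my_inputs,
    job_arguments=["--mode", "train"],
)

# Workaround: override the step name so the generated upload path is unique
# per pipeline and no longer collides across pipelines sharing the same code.
step.name = "my-pipeline-preprocess"
```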
@j99ca that's correct - with a PipelineSession and the newer step_args pattern, the pipeline name is used when constructing the upload path for local artifacts.
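A minimal sketch of that pattern, assuming a ScriptProcessor; image_uri, role, and the file names are illustrative placeholders:

```python
from sagemaker.processing import ProcessingInput, ScriptProcessor
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import ProcessingStep

session = PipelineSession()

processor = ScriptProcessor(
    image_uri=image_uri,               # placeholder
    command=["python3"],
    role=role,                         # placeholder
    instance_type="ml.m5.xlarge",
    instance_count=1,
    sagemaker_session=session,
)

# Under a PipelineSession, run() does not start a job; it captures the
# arguments so they can be attached to the step via step_args.
step_args = processor.run(
    code="preprocess.py",
    inputs=[ProcessingInput(source="data/",
                            destination="/opt/ml/processing/input")],
)

step = ProcessingStep(name="preprocess", step_args=step_args)
```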
Then the step is passed into the pipeline later. Once the pipeline is created, upserted, or updated, the local artifacts are uploaded based on the new path structures here. Please let me know if you're able to verify this for your use case, or if you run into any other blockers.
It's likely code comments like these that are causing confusion. I'll additionally open a PR to adjust those parameter descriptions.
Closing the issue as it's not a customer blocker anymore. @brockwade633 will work on a PR to adjust the doc strings.
While I can understand that, because of caching, the base name should not contain a timestamp or the like, it would still be nice if I could specify a constant base name; otherwise, when I look at a processing job name, I can't tell which pipeline it is part of. So I vote to re-open this issue.
Also reported here: aws/amazon-sagemaker-examples#3106 |
@brockwade633 Sorry for the very late reply; I meant to dig back into the documentation to find where it pointed me in that direction, but other work sidetracked me! Yes, it was in the code comments. I think you found it!
Actually, support for this has been implemented recently; just use the newly added option shown below.
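If this refers to the pipeline definition config flag for custom job-name prefixes (an assumption on my part, since the option isn't named above), usage looks roughly like this:

```python
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_definition_config import PipelineDefinitionConfig

# Assumed option: opt in to using custom job-name prefixes (e.g. the
# processor's base_job_name) for pipeline steps instead of auto-generated names.
definition_config = PipelineDefinitionConfig(use_custom_job_prefix=True)

pipeline = Pipeline(
    name="my-pipeline",
    steps=[step],                      # step defined elsewhere
    pipeline_definition_config=definition_config,
)
```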
Describe the bug
The base_job_name is not being respected. The docs for ProcessingInputs state that the job name is used when storing local input files that are uploaded to S3. However, uploaded local files are not using the base_job_name.
To reproduce
Create a script processor object with a base_job_name. Put it into a SageMaker pipeline and run it. Look at the resulting "ProcessingInputs.0.S3Input.S3Uri" in SageMaker Studio and see that it is not using the base_job_name as the docs say it should.
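A minimal repro sketch under assumed placeholders; image_uri, role, and the file names are illustrative:

```python
from sagemaker.processing import ProcessingInput, ScriptProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

processor = ScriptProcessor(
    base_job_name="my-base-job-name",  # expected to appear in the upload path
    image_uri=image_uri,               # placeholder
    command=["python3"],
    role=role,                         # placeholder
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

step = ProcessingStep(
    name="preprocess",
    processor=processor,
    code="preprocess.py",
    inputs=[ProcessingInput(source="data/input.csv",
                            destination="/opt/ml/processing/input")],
)

pipeline = Pipeline(name="repro-pipeline", steps=[step])
pipeline.upsert(role_arn=role)         # local artifacts are uploaded here

# Inspect ProcessingInputs.0.S3Input.S3Uri for the resulting job in SageMaker
# Studio: the uploaded local file path does not contain "my-base-job-name".
```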
Expected behavior
ProcessingInputs from local files should use the processor object's base_job_name when building the S3 path for upload
System information
A description of your system. Please provide: