SageMaker Processors base_job_name argument not working when used with SageMaker Pipelines #3971


Closed
j99ca opened this issue Jun 30, 2023 · 11 comments
Labels
component: pipelines (Relates to the SageMaker Pipeline Platform), type: question

Comments

@j99ca

j99ca commented Jun 30, 2023

Describe the bug
The base_job_name is not being respected. The docs for ProcessingInput state that the job name is used to build the S3 storage path for local input files that get uploaded to S3. However, the uploaded local files do not use the base_job_name.

To reproduce

Create a script processor object with a base_job_name, put it into a SageMaker pipeline, and run it. Look at the resulting "ProcessingInputs.0.S3Input.S3Uri" in SageMaker Studio and note that it does not use the base_job_name as the docs say it should.
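
A minimal sketch of this setup (the image URI, role, file names, and step name are placeholders, not from the original report):

from sagemaker.processing import ProcessingInput, ScriptProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

processor = ScriptProcessor(
    image_uri="<custom-image-uri>",
    command=["python3"],
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    base_job_name="my-base-job-name",  # expected to show up in the generated S3 paths
)

step = ProcessingStep(
    name="MyProcessingStep",
    processor=processor,
    code="scripts/process.py",
    inputs=[
        ProcessingInput(source="local_data/input.csv", destination="/opt/ml/processing/input"),
    ],
)

pipeline = Pipeline(name="MyPipeline", steps=[step])
pipeline.upsert(role_arn="<execution-role-arn>")

# Inspecting ProcessingInputs.0.S3Input.S3Uri in the resulting definition (or in
# SageMaker Studio) shows an upload path that does not contain "my-base-job-name".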

Expected behavior

ProcessingInputs from local files should use the processor object's base_job_name when building the S3 path for upload

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 2.161.0
  • Python version: 3.9
  • Custom Docker image (Y/N): Y
@aoguo64 aoguo64 added component: pipelines Relates to the SageMaker Pipeline Platform and removed type: bug labels Jul 3, 2023
@brockwade633
Contributor

Hi @j99ca, sorry about the confusion around local artifact S3 storage. Last year we released improvements to Pipeline caching, and part of that work was updating the S3 path structure for local artifact uploads.

The reason for the change is that previously the base_job_name was combined with a timestamp in the S3 path. Every time the Python code was executed, a new timestamp and a new S3 path were generated for the artifact. This effectively broke caching, because Pipelines includes the S3 path when generating cache keys, so pipeline executions that should have been cache hits were cache misses.

The job_name / timestamp path structure was replaced with path components such as the pipeline name, step name, and a hash of the actual code files. This organizes artifacts under a single pipeline and ensures that only a change to the actual artifact will cause a cache miss and re-execute the pipeline. The docs for the path structures are here: https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html#s3-artifact-folder-structure.
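
To make the caching point concrete, here is a minimal sketch (the file path is illustrative) using the SDK's hash_file helper that the path generation relies on:

from sagemaker.workflow.utilities import hash_file

# Unchanged file contents -> identical hash -> identical S3 prefix across
# pipeline builds, so the cache key stays stable (unlike a timestamp, which
# changed on every build). Editing the script changes the hash, which changes
# the prefix and correctly invalidates the cache for that step.
first_build = hash_file("code/preprocessing.py")
second_build = hash_file("code/preprocessing.py")
assert first_build == second_build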

It seems there is still some documentation that describes the base_job_name as part of the S3 path. Would you share where that is? We may need to update other parts of our docs to prevent confusion. Thanks, and let us know if there's anything else we can do to help.

@j99ca
Author

j99ca commented Jul 5, 2023

Hi @brockwade633, thanks for the response. I looked at the source code for the processor step and how it generates the name, and it appears it does not use the pipeline name; it uses the step name followed by a hash of the code that the step uses. This is unfortunate because it's causing unexpected collisions: I have several pipelines that use the same source code but different ProcessingInput files, and they collide each time I build the pipelines. If the pipeline name were included in the generated path I would not have this issue, but alas the SageMaker SDK code for generating the path is:

    def _generate_code_upload_path(self) -> str:
        """Generate an upload path for local processing scripts based on its contents."""
        from sagemaker.workflow.utilities import hash_file

        code_hash = hash_file(self.code)
        return f"{self.name}-{code_hash}"[:1024]

This path calculated for the code is reused for the other ProcessingInputs instead of generating a separate one for each ProcessingInput.
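
To illustrate the collision (step and file names are made up): two pipelines that share a step name and script end up with the same generated prefix, so their locally uploaded ProcessingInput files land in the same place.

from sagemaker.workflow.utilities import hash_file

code_hash = hash_file("scripts/process.py")

# Same step name and same script in two different pipelines yields the same
# prefix, because nothing pipeline-specific is included in the path.
prefix_pipeline_a = f"MyProcessingStep-{code_hash}"[:1024]
prefix_pipeline_b = f"MyProcessingStep-{code_hash}"[:1024]
assert prefix_pipeline_a == prefix_pipeline_b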

@brockwade633
Contributor

Hi @j99ca, would you mind sharing your Python pipeline code so I can take a closer look? In the context of a PipelineSession, the job name created via _generate_code_upload_path should be ignored and the pipeline name included in the path.

@j99ca
Author

j99ca commented Jul 6, 2023

@brockwade633 I am currently using the older method of passing processor=my_processor and job_arguments=[...] rather than using step_args and calling the script processor's run function within a PipelineSession, which I believe is the newer method. Most of this code was written a couple of years ago. If I update the deprecated calls (which is on my list of tasks), are you saying that would use the pipeline name when constructing the path? As a temporary workaround I added a line to change the step.name = ... value, which seems to get our pipelines functioning properly for now.
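
For reference, a sketch of that workaround (assuming a step object like the one in the reproduction sketch above; the prefix is illustrative):

# Give each pipeline's step a pipeline-specific name so the generated
# "<step-name>-<code-hash>" upload prefix no longer collides across pipelines.
step.name = f"my-pipeline-a-{step.name}"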

@brockwade633
Contributor

@j99ca that's correct - with a PipelineSession, the script processor's .run() method generates the step arguments but doesn't fully execute until the Pipeline is created. That way, the pipeline name can be leveraged when uploading local artifacts to S3. For example, from the abalone example notebook, the step args are created by using .run() and then the step is created:



from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

processor_args = sklearn_processor.run(
    inputs=[
        ProcessingInput(source=input_data, destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code="code/preprocessing.py",
)

step_process = ProcessingStep(name="AbaloneProcess", step_args=processor_args)

Then the step is passed into the pipeline later. Once the pipeline is created, upserted, or updated, the local artifacts are uploaded based on the new path structures here. Please let me know if you're able to verify this for your use case, or if you run into any other blockers.
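
For completeness, a sketch of how sklearn_processor and input_data in the snippet above might be set up (the framework version, role, and bucket are placeholders; the abalone example notebook is the authoritative reference):

from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline_context import PipelineSession

pipeline_session = PipelineSession()

# Because the processor is created with a PipelineSession, .run() only returns
# step arguments instead of starting a processing job immediately.
sklearn_processor = SKLearnProcessor(
    framework_version="1.2-1",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    base_job_name="sklearn-abalone-process",
    role="<execution-role-arn>",
    sagemaker_session=pipeline_session,
)

input_data = "s3://<bucket>/abalone/abalone-dataset.csv"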

@brockwade633
Contributor

brockwade633 commented Jul 9, 2023

It's likely code comments like these that are causing confusion. I'll additionally open a PR to adjust those parameter descriptions.

@qidewenwhen
Member

Closing the issue as it's not a customer blocker anymore. @brockwade633 will work on a PR to adjust the doc strings.
Feel free to reopen if you have any further questions. Thanks!

@lorenzwalthert

While I understand that, because of caching, the base name should not contain a timestamp or the like, it would still be nice if I could specify a constant base name; otherwise, when I look at a processing job name, I can't tell which pipeline it is part of. So I vote to re-open this issue.

@lorenzwalthert

lorenzwalthert commented Aug 23, 2023

Also reported here: aws/amazon-sagemaker-examples#3106

@j99ca
Author

j99ca commented Oct 25, 2023

@brockwade633 Sorry for the very late reply; I meant to dig back into the documentation to find where it pointed me in that direction, but other work sidetracked me. Yes, it was in the code comments. I think you found it!

@lorenzwalthert

lorenzwalthert commented Apr 8, 2024

Actually, support for this has been implemented recently; just use PipelineDefinitionConfig (see also StackOverflow).
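
A minimal sketch of that approach (pipeline and step names are placeholders; PipelineDefinitionConfig is only available in newer SDK releases, so check your version):

from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_definition_config import PipelineDefinitionConfig

# Opt in to custom job prefixes so the processor's base_job_name is respected
# in the generated job names instead of the default pipeline-generated naming.
definition_config = PipelineDefinitionConfig(use_custom_job_prefix=True)

pipeline = Pipeline(
    name="MyPipeline",
    steps=[step_process],
    pipeline_definition_config=definition_config,
)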
