SageMaker Processors base_job_name argument not working when used with SageMaker Pipelines #3971


Closed
j99ca opened this issue Jun 30, 2023 · 11 comments
Labels
component: pipelines (Relates to the SageMaker Pipeline Platform), type: question

Comments

@j99ca

j99ca commented Jun 30, 2023

Describe the bug
The base_job_name is not being respected. The docs for ProcessingInput state that the job name is used to build the S3 storage path for local input files that get uploaded to S3. However, the uploaded local files do not use the base_job_name.

To reproduce

Create a script processor object with a base_job_name, put it into a SageMaker pipeline, and run it. Look at the resulting "ProcessingInputs.0.S3Input.S3Uri" in SageMaker Studio and note that it does not use the base_job_name as the docs say it should.
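
A minimal sketch of this setup (the image URI, role, file names, and step name are placeholders, not from the original report):

from sagemaker.processing import ProcessingInput, ScriptProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

processor = ScriptProcessor(
    image_uri="<custom-image-uri>",
    command=["python3"],
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    base_job_name="my-base-job-name",  # expected to show up in the generated S3 paths
)

step = ProcessingStep(
    name="MyProcessingStep",
    processor=processor,
    code="scripts/process.py",
    inputs=[
        ProcessingInput(source="local_data/input.csv", destination="/opt/ml/processing/input"),
    ],
)

pipeline = Pipeline(name="MyPipeline", steps=[step])
pipeline.upsert(role_arn="<execution-role-arn>")

# Inspecting ProcessingInputs.0.S3Input.S3Uri in the resulting definition (or in
# SageMaker Studio) shows an upload path that does not contain "my-base-job-name".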

Expected behavior

ProcessingInputs from local files should use the processor object's base_job_name when building the S3 path for upload

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 2.161.0
  • Python version: 3.9
  • Custom Docker image (Y/N): Y
@aoguo64 aoguo64 added component: pipelines Relates to the SageMaker Pipeline Platform and removed type: bug labels Jul 3, 2023
@brockwade633
Contributor

Hi @j99ca, sorry about the confusion around local artifact S3 storage. Last year we released improvements to Pipeline caching, and part of that work was updating the S3 path structure for local artifact uploads.

The reason for the change is that previously the base_job_name was combined with a timestamp in the S3 path. Every time the Python code was executed, a new timestamp and a new S3 path were generated for the artifact. This effectively broke caching, because Pipelines includes the S3 path when generating cache keys, so pipeline executions that should have been cache hits were cache misses.

The job_name / timestamp path structure was replaced with path components such as the pipeline name, step name, and a hash of the actual code files. This organizes artifacts under a single pipeline and ensures that only a change to the actual artifact will cause a cache miss and re-execute the pipeline. The docs for the path structures are here: https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html#s3-artifact-folder-structure.
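
To make the caching point concrete, here is a minimal sketch (the file path is illustrative) using the SDK's hash_file helper that the path generation relies on:

from sagemaker.workflow.utilities import hash_file

# Unchanged file contents -> identical hash -> identical S3 prefix across
# pipeline builds, so the cache key stays stable (unlike a timestamp, which
# changed on every build). Editing the script changes the hash, which changes
# the prefix and correctly invalidates the cache for that step.
first_build = hash_file("code/preprocessing.py")
second_build = hash_file("code/preprocessing.py")
assert first_build == second_build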

It seems there is still some documentation that describes the base_job_name as part of the S3 path. Would you share where that is? We may need to update other parts of our docs to prevent confusion. Thanks, and let us know if there's anything else we can do to help.

@j99ca
Author

j99ca commented Jul 5, 2023

Hi @brockwade633, thanks for the response. I looked at the source code for the processor step and how it generates the name, and it appears it does not use the pipeline name; it uses the step name followed by a hash of the code that the step uses. This is unfortunate because it's causing unexpected collisions: I have several pipelines that use the same source code but different ProcessingInput files, and they collide each time I build the pipelines. If the pipeline name were included in the generated path I would not have this issue, but alas the SageMaker SDK code for generating the path is:

    def _generate_code_upload_path(self) -> str:
        """Generate an upload path for local processing scripts based on its contents."""
        from sagemaker.workflow.utilities import hash_file

        code_hash = hash_file(self.code)
        return f"{self.name}-{code_hash}"[:1024]

This path calculated for the code is reused for the other ProcessingInputs instead of generating a separate one for each ProcessingInput.
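
To illustrate the collision (step and file names are made up): two pipelines that share a step name and script end up with the same generated prefix, so their locally uploaded ProcessingInput files land in the same place.

from sagemaker.workflow.utilities import hash_file

code_hash = hash_file("scripts/process.py")

# Same step name and same script in two different pipelines yields the same
# prefix, because nothing pipeline-specific is included in the path.
prefix_pipeline_a = f"MyProcessingStep-{code_hash}"[:1024]
prefix_pipeline_b = f"MyProcessingStep-{code_hash}"[:1024]
assert prefix_pipeline_a == prefix_pipeline_b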

@brockwade633
Contributor

Hi @j99ca, would you mind sharing your Python pipeline code so I can take a closer look? In the context of a PipelineSession, the job name created via _generate_code_upload_path should be ignored and the pipeline name included in the path.

@j99ca
Author

j99ca commented Jul 6, 2023

@brockwade633 I am currently using the older method of passing processor=my_processor and job_arguments=[...] rather than using step_args and calling the script processor's run function within a PipelineSession, which I believe is the newer method. Most of this code was written a couple of years ago. If I update the deprecated calls (which is on my list of tasks), are you saying that would use the pipeline name when constructing the path? As a temporary workaround I added a line to change the step.name = ... value, which seems to get our pipelines functioning properly for now.
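
For reference, a sketch of that workaround (assuming a step object like the one in the reproduction sketch above; the prefix is illustrative):

# Give each pipeline's step a pipeline-specific name so the generated
# "<step-name>-<code-hash>" upload prefix no longer collides across pipelines.
step.name = f"my-pipeline-a-{step.name}"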

@brockwade633
Contributor

@j99ca that's correct - with a PipelineSession, the script processor's .run() method generates the step arguments but doesn't fully execute until the Pipeline is created. That way, the pipeline name can be leveraged when uploading local artifacts to S3. For example, from the abalone example notebook, the step args are created by using .run() and then the step is created:



from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

processor_args = sklearn_processor.run(
    inputs=[
        ProcessingInput(source=input_data, destination="/opt/ml/processing/input"),
    ],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    code="code/preprocessing.py",
)

step_process = ProcessingStep(name="AbaloneProcess", step_args=processor_args)

Then the step is passed into the pipeline later. Once the pipeline is created, upserted, or updated, the local artifacts are uploaded based on the new path structures here. Please let me know if you're able to verify this for your use case, or if you run into any other blockers.
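
For completeness, a sketch of how sklearn_processor and input_data in the snippet above might be set up (the framework version, role, and bucket are placeholders; the abalone example notebook is the authoritative reference):

from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline_context import PipelineSession

pipeline_session = PipelineSession()

# Because the processor is created with a PipelineSession, .run() only returns
# step arguments instead of starting a processing job immediately.
sklearn_processor = SKLearnProcessor(
    framework_version="1.2-1",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    base_job_name="sklearn-abalone-process",
    role="<execution-role-arn>",
    sagemaker_session=pipeline_session,
)

input_data = "s3://<bucket>/abalone/abalone-dataset.csv"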

@brockwade633
Contributor

brockwade633 commented Jul 9, 2023

It's likely code comments like these that are causing confusion. I'll additionally open a PR to adjust those parameter descriptions.

@qidewenwhen
Member

Closing the issue as it's not a customer blocker anymore. @brockwade633 will work on a PR to adjust the doc strings.
Feel free to reopen if you have any further questions. Thanks!

@lorenzwalthert

While I understand that, because of caching, the base name should not contain a timestamp or the like, it would still be nice if I could specify a constant base name; otherwise, when I look at a processing job name, I can't tell which pipeline it is part of. So I vote to re-open this issue.

@lorenzwalthert

lorenzwalthert commented Aug 23, 2023

Also reported here: aws/amazon-sagemaker-examples#3106

@j99ca
Author

j99ca commented Oct 25, 2023

@brockwade633 Sorry for the very late reply; I meant to dig back into the documentation to find where it pointed me in that direction, but other work sidetracked me. Yes, it was in the code comments. I think you found it!

@lorenzwalthert

lorenzwalthert commented Apr 8, 2024

Actually, support for this has been implemented recently; just use PipelineDefinitionConfig (see also StackOverflow).
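
A minimal sketch of that approach (pipeline and step names are placeholders; PipelineDefinitionConfig is only available in newer SDK releases, so check your version):

from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_definition_config import PipelineDefinitionConfig

# Opt in to custom job prefixes so the processor's base_job_name is respected
# in the generated job names instead of the default pipeline-generated naming.
definition_config = PipelineDefinitionConfig(use_custom_job_prefix=True)

pipeline = Pipeline(
    name="MyPipeline",
    steps=[step_process],
    pipeline_definition_config=definition_config,
)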
