_repack_script_launcher.sh is overwriting runproc.sh on subsequent pipeline steps making pipeline to fail #3467

RicardoHS · 2022-11-10T18:31:12Z

Describe the bug
I'm not sure whether this was introduced in commit d5261f2. My pipelines fail when using v2.116.0 instead of v2.112.0. Comparing the pipeline.describe() output from upserting the pipeline on my local machine (using v2.112.0) with the output from my CI/CD (which uses the latest available version) shows that, when steps use a custom entrypoint, the same S3 path is used for all of them. Because runproc.sh is stored there, only the runproc.sh of the last step with a custom entrypoint survives.

In v2.112.0 pipeline.describe() output:
My first step:

[...]
{
  "InputName": "entrypoint",
  "AppManaged": false,
  "S3Input": {
      "S3Uri": "s3://[...]/tensorflow-2022-11-10-15-11-30-827/source/runproc.sh",
      "LocalPath": "/opt/ml/processing/input/entrypoint",
      "S3DataType": "S3Prefix",
      "S3InputMode": "File",
      "S3DataDistributionType": "FullyReplicated",
      "S3CompressionType": "None"
  }
}

My second step

[...]
  {
      "InputName": "entrypoint",
      "AppManaged": false,
      "S3Input": {
          "S3Uri": "s3://[...]/tensorflow-2022-11-10-15-11-31-496/source/runproc.sh",
          "LocalPath": "/opt/ml/processing/input/entrypoint",
          "S3DataType": "S3Prefix",
          "S3InputMode": "File",
          "S3DataDistributionType": "FullyReplicated",
          "S3CompressionType": "None"
      }
  }

In v2.116.0 pipeline.describe() output:
My first step:

[...]
  {
      "InputName": "entrypoint",
      "AppManaged": false,
      "S3Input": {
          "S3Uri": "s3://[...]/MLPlat-weight-size-prediction/code/11596271372a0f364465cbc82d312815/runproc.sh",
          "LocalPath": "/opt/ml/processing/input/entrypoint",
          "S3DataType": "S3Prefix",
          "S3InputMode": "File",
          "S3DataDistributionType": "FullyReplicated",
          "S3CompressionType": "None"
      }
  }

My second step

[...]
  {
      "InputName": "entrypoint",
      "AppManaged": false,
      "S3Input": {
          "S3Uri": "s3://[...]/MLPlat-weight-size-prediction/code/11596271372a0f364465cbc82d312815/runproc.sh",
          "LocalPath": "/opt/ml/processing/input/entrypoint",
          "S3DataType": "S3Prefix",
          "S3InputMode": "File",
          "S3DataDistributionType": "FullyReplicated",
          "S3CompressionType": "None"
      }
  }
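
The comparison above can also be automated. The following is a minimal sketch, assuming the pipeline definition has already been loaded into a Python dict with the same shape as the JSON shown in this issue (a top-level "Steps" list whose processing steps carry "ProcessingInputs"). The function name find_shared_entrypoints, the bucket name, and the sample data are hypothetical, and steps nested inside other steps (for example inside a ConditionStep) are not traversed.

```python
from collections import defaultdict


def find_shared_entrypoints(definition):
    """Map each entrypoint S3 URI used by more than one step to those step names."""
    steps_by_uri = defaultdict(list)
    for step in definition.get("Steps", []):
        inputs = step.get("Arguments", {}).get("ProcessingInputs", [])
        for proc_input in inputs:
            if proc_input.get("InputName") != "entrypoint":
                continue
            uri = proc_input.get("S3Input", {}).get("S3Uri")
            # Skip property references such as {"Get": "Steps...."}.
            if isinstance(uri, str):
                steps_by_uri[uri].append(step["Name"])
    return {uri: names for uri, names in steps_by_uri.items() if len(names) > 1}


def _step(name, uri):
    """Build a trimmed-down, hypothetical processing step for the example."""
    return {
        "Name": name,
        "Type": "Processing",
        "Arguments": {
            "ProcessingInputs": [
                {"InputName": "entrypoint", "S3Input": {"S3Uri": uri}}
            ]
        },
    }


# Hypothetical definition: two steps share one launcher URI, a third does not.
SHARED_URI = "s3://example-bucket/my-pipeline/code/abc123/runproc.sh"
sample = {
    "Steps": [
        _step("Step-A", SHARED_URI),
        _step("Step-B", SHARED_URI),
        _step("Step-C", "s3://example-bucket/my-pipeline/code/def456/runproc.sh"),
    ]
}

shared = find_shared_entrypoints(sample)
for uri, names in shared.items():
    print(f"{uri} is shared by: {', '.join(names)}")
```

An empty result would mean every step has its own runproc.sh location, which is what the v2.112.0 output shows.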

To reproduce
Use two steps with a custom entrypoint. In 2.112.0 it works; in 2.116.0 all steps use the same entrypoint code (the one from the last step defined).

Expected behavior
Custom entrypoint code is not overwritten

Screenshots or logs

System information
A description of your system. Please provide:

SageMaker Python SDK version: 2.112.0 vs 2.116.0
Framework name (eg. PyTorch) or algorithm (eg. KMeans): Tensorflow
Framework version: 2.4.1
Python version: py37
CPU or GPU: CPU
Custom Docker image (Y/N):

Additional context

qidewenwhen · 2022-11-14T19:20:27Z

Hi @RicardoHS thanks for using AWS SageMaker!
To help us understand and reproduce the issue, could you please provide us the following information?

What failure reason did you see?
Could you explain more on the sentence below? Since no pipeline code snippet was presented, I'm not able to tell if this behavior is expected or not.

if a step uses a custom entrypoint, the S3 path used is the same for all steps. Because runproc.sh is stored there, only the runproc.sh of last step with custom entrypoint will survive.

What kind of steps are you using, regarding all steps in your pipeline? E.g. TrainingStep, ProcessingStep or ModelStep? The _repack_script_launcher.sh would be introduced by the ModelStep/RegisterModel only
If possible, could you please provide us the code snippet for your pipeline definition? It would help us a lot to accelerate the issue reproducing process

RicardoHS · 2022-11-15T16:07:42Z

Hi @qidewenwhen, see the answers below

What failure reason did you see?

My first ProcessingStep fails due to a strange error (one that makes no sense for the script). After inspecting the logs, I found that this first step should call the custom script /opt/ml/processing/input/code/preprocess.py, but instead it's calling the custom script of my last ProcessingStep, /opt/ml/processing/input/code/evaluate.py. If I go to the S3 path where the runproc.sh for that first step is stored and cat the file, I see this:

#!/bin/bash

cd /opt/ml/processing/input/code/
tar -xzf sourcedir.tar.gz

# Exit on any error. SageMaker uses error code to mark failed job.
set -e

if [[ -f 'requirements.txt' ]]; then
    # Some py3 containers has typing, which may breaks pip install
    pip uninstall --yes typing

    pip install -r requirements.txt
fi

python evaluate.py "$@"

Note the final line: this is the run file for the first step, so that line should be python preprocess.py "$@"
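
This check can be scripted. The following is a minimal sketch, assuming the content of runproc.sh has already been downloaded into a string; the helper name launched_script is hypothetical and the parsing only covers launchers shaped like the one above.

```python
import shlex

# Content of the runproc.sh shown above.
RUNPROC = """#!/bin/bash

cd /opt/ml/processing/input/code/
tar -xzf sourcedir.tar.gz

# Exit on any error. SageMaker uses error code to mark failed job.
set -e

if [[ -f 'requirements.txt' ]]; then
    # Some py3 containers has typing, which may breaks pip install
    pip uninstall --yes typing

    pip install -r requirements.txt
fi

python evaluate.py "$@"
"""


def launched_script(runproc_text):
    """Return the script from the last `python <script> ...` line, or None."""
    for line in reversed(runproc_text.splitlines()):
        # comments=True drops shell comments and the shebang line.
        tokens = shlex.split(line, comments=True)
        if len(tokens) >= 2 and tokens[0].startswith("python"):
            return tokens[1]
    return None


print(launched_script(RUNPROC))  # prints: evaluate.py
```

For the first step the expected result would be preprocess.py, so a mismatch like this one indicates that the launcher was replaced.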

When I looked up the S3 path for the last ProcessingStep, I realized it's the same S3 path as the first step's (both steps use the same path). This only happens with v2.116.0; with v2.112.0 the S3 path is different for each step:

for v2.116.0
s3://<default_sagemaker_bucket>/MLPlat-weight-size-prediction/code/11596271372a0f364465cbc82d312815/sourcedir.tar.gz

for v2.112.0
s3://<default_sagemaker_bucket>/tensorflow-2022-11-11-08-41-07-205/source/sourcedir.tar.gz

MLPlat-weight-size-prediction is the name used on the pipeline creation.
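
The effect of sharing one path can be illustrated with a toy model. This is a sketch only: a plain dict stands in for the S3 bucket, and render_runproc, upload, and the key layouts are hypothetical stand-ins, not the SDK's actual implementation.

```python
def render_runproc(script):
    """Return a simplified launcher body that runs the given script."""
    return (
        "#!/bin/bash\n"
        "cd /opt/ml/processing/input/code/\n"
        f'python {script} "$@"\n'
    )


def upload(bucket, key, body):
    """Mimic an object-store PUT: writing to an existing key replaces its content."""
    bucket[key] = body


steps = [("Prepare-Data", "preprocess.py"), ("Evaluate-Model", "evaluate.py")]

# Distinct prefix per step (hypothetical layout): each launcher keeps its own key.
per_step = {}
for name, script in steps:
    upload(per_step, f"{name}/source/runproc.sh", render_runproc(script))

# One shared prefix for every step (hypothetical layout): the last upload wins.
SHARED_KEY = "my-pipeline/code/example-hash/runproc.sh"
shared = {}
for name, script in steps:
    upload(shared, SHARED_KEY, render_runproc(script))

print(len(per_step), "launchers with per-step prefixes")
print(len(shared), "launcher with a shared prefix")
print(shared[SHARED_KEY].splitlines()[-1])
```

With the shared key, the only surviving launcher runs evaluate.py, which matches what the first step executes in the failing pipeline.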

Could you explain more on the sentence below? Since no pipeline code snippet was presented, I'm not able to tell if this behavior is expected or not.

I hope it's clearer with the explanation above.

What kind of steps are you using, regarding all steps in your pipeline? E.g. TrainingStep, ProcessingStep or ModelStep? The _repack_script_launcher.sh would be introduced by the ModelStep/RegisterModel only

I'm using ProcessingStep, ConditionStep, LambdaStep, ModelStep and TrainingStep

If possible, could you please provide us the code snippet for your pipeline definition? It would help us a lot to accelerate the issue reproducing process

I'm sorry, but the code is split across multiple files, and merging it all together would be a lot of work. I can provide the pipeline.describe() output if that's useful.

Finally, in case it's not clear: reverting to version 2.112.0 makes everything work again.

qidewenwhen · 2022-11-15T18:06:07Z

Thanks @RicardoHS! Based on all this info, the issue seems to be related to our recently released Cache changes. To help us confirm this and take the next step toward a fix, could you please provide the following details:

The code snippet showing how you define the two ProcessingSteps, their Processors, and the processor.run call (if you use the new step interface)
The json returned by pipeline.describe()

RicardoHS · 2022-11-16T14:20:26Z

@qidewenwhen here it is. Some information lives inside the self._config object (as you can see, that information is not relevant to the problem here). The self._session object is a PipelineSession object.


self.query_processor = ScriptProcessor(
    command=["python3"],
    image_uri=self._config.query_image_uri,
    role=self._config.role,
    instance_count=1,
    instance_type=self._config.instance_type,
    network_config=NetworkConfig(
        enable_network_isolation=False,
        # VPC-Prod
        subnets=["subnet-something"],
        security_group_ids=["sg-something"],
    ),
    sagemaker_session=self._session,
)

self.data_processor = FrameworkProcessor(
    role=self._config.role,
    instance_type=self._config.instance_type,
    instance_count=self._config.instance_count,
    estimator_cls=TensorFlow,
    framework_version="2.9",
    py_version="py39",
    sagemaker_session=self._session,
)

query_step = ProcessingStep(
    name="Query-Data",
    step_args=self.query_processor.run(
        code=self._config.source_dir + "query_data.py",
        arguments=[
            "--output-path",
            query_output_path,
            "--region",
            self._config.region,
        ],
        outputs=[
            ProcessingOutput(
                output_name="task_query_output",
                source=query_output_path,
                destination=self._config.prepare_s3_path + "input/",
            ),
        ],
    ),
    cache_config=self._cache_config,
)

input_path = "/opt/ml/processing/input"
output_path = "/opt/ml/processing/output"

prepare_step = ProcessingStep(
    name="Prepare-Data",
    step_args=self.data_processor.run(
        code="preprocess.py",
        source_dir=self._config.source_dir + "data_preprocess/",
        inputs=[
            ProcessingInput(
                input_name="task_preprocess_input",
                source=query_step.properties.ProcessingOutputConfig.Outputs[
                    "task_query_output"
                ].S3Output.S3Uri,
                destination=input_path,
            )
        ],
        outputs=[
            ProcessingOutput(
                output_name="task_preprocess_output",
                source=output_path,
                destination=self._config.prepare_s3_path + "output/",
            ),
        ],
        arguments=[
            "--input-path",
            input_path,
            "--output-path",
            output_path,
            "--region",
            self._config.region,
        ],
    ),
    cache_config=self._cache_config,
)

split_step = ProcessingStep(
    name="Split-Data",
    step_args=self.data_processor.run(
        code="train_test_split.py",
        source_dir=self._config.source_dir + "data_split/",
        inputs=[
            ProcessingInput(
                source=prepare_step.properties.ProcessingOutputConfig.Outputs[
                    "task_preprocess_output"
                ].S3Output.S3Uri,
                destination=input_path,
            ),
        ],
        outputs=[
            ProcessingOutput(
                output_name="task_split_output",
                source=output_path,
                destination=self._config.split_s3_path,
            ),
        ],
        arguments=["--input-path", input_path, "--output-path", output_path],
    ),
    cache_config=self._cache_config,
)

[...]
sk_processor = FrameworkProcessor(
    framework_version="1.0-1",
    instance_type=config.instance_type,
    instance_count=config.instance_count,
    base_job_name=config.base_job_name,
    role=config.role,
    estimator_cls=SKLearn,
    sagemaker_session=self._session,
)

evaluate_step = ProcessingStep(
    name="Evaluate-Model",
    step_args=sk_processor.run(
        code="evaluate.py",
        source_dir=config.source_dir,
        inputs=[
            ProcessingInput(
                source=train_step.properties.ModelArtifacts.S3ModelArtifacts,
                destination="/opt/ml/processing/model",
            ),
        ],
        outputs=[
            ProcessingOutput(
                output_name="evaluation",
                source="/opt/ml/processing/evaluation",
                destination=config.output_s3_destination,
            ),
        ],
    ),
    property_files=[
        self.evaluation_file,
    ],
    cache_config=self._cache_config,
)

pipe.describe() from v2.116.0

{
    "Version": "2020-12-01",
    "Metadata": {},
    "Parameters": [
        {
            "Name": "TrainingEpochs",
            "Type": "String",
            "DefaultValue": "10"
        }
    ],
    "PipelineExperimentConfig": {
        "ExperimentName": {
            "Get": "Execution.PipelineName"
        },
        "TrialName": {
            "Get": "Execution.PipelineExecutionId"
        }
    },
    "Steps": [
        {
            "Name": "Query-Data",
            "Type": "Processing",
            "Arguments": {
                "ProcessingResources": {
                    "ClusterConfig": {
                        "InstanceType": "ml.m5.2xlarge",
                        "InstanceCount": 1,
                        "VolumeSizeInGB": 30
                    }
                },
                "AppSpecification": {
                    "ImageUri": "something.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-db-handler",
                    "ContainerArguments": [
                        "--output-path",
                        "/opt/ml/processing/output/data/",
                        "--region",
                        "eu-west-1"
                    ],
                    "ContainerEntrypoint": [
                        "python3",
                        "/opt/ml/processing/input/code/query_data.py"
                    ]
                },
                "RoleArn": "arn:aws:iam::something:role/IAMRoleSageMakerBI",
                "ProcessingInputs": [
                    {
                        "InputName": "code",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": "s3://sagemaker-eu-west-1-something/MLPlat-weight-size-prediction/code/097a2d06096d5b4342bf5e0777381c38/query_data.py",
                            "LocalPath": "/opt/ml/processing/input/code",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    }
                ],
                "ProcessingOutputConfig": {
                    "Outputs": [
                        {
                            "OutputName": "task_query_output",
                            "AppManaged": false,
                            "S3Output": {
                                "S3Uri": "s3://wllp-ds/projects/mlplatform-weight-size-prediction/prepare/input/",
                                "LocalPath": "/opt/ml/processing/output/data/",
                                "S3UploadMode": "EndOfJob"
                            }
                        }
                    ]
                },
                "NetworkConfig": {
                    "EnableNetworkIsolation": false,
                    "VpcConfig": {
                        "SecurityGroupIds": [
                            "sg-something"
                        ],
                        "Subnets": [
                            "subnet-something"
                        ]
                    }
                }
            },
            "CacheConfig": {
                "Enabled": false,
                "ExpireAfter": "15d"
            }
        },
        {
            "Name": "Prepare-Data",
            "Type": "Processing",
            "Arguments": {
                "ProcessingResources": {
                    "ClusterConfig": {
                        "InstanceType": "ml.m5.2xlarge",
                        "InstanceCount": 1,
                        "VolumeSizeInGB": 30
                    }
                },
                "AppSpecification": {
                    "ImageUri": "something.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-training:2.9-cpu-py39",
                    "ContainerArguments": [
                        "--input-path",
                        "/opt/ml/processing/input",
                        "--output-path",
                        "/opt/ml/processing/output",
                        "--region",
                        "eu-west-1"
                    ],
                    "ContainerEntrypoint": [
                        "/bin/bash",
                        "/opt/ml/processing/input/entrypoint/runproc.sh"
                    ]
                },
                "RoleArn": "arn:aws:iam::something:role/IAMRoleSageMakerBI",
                "ProcessingInputs": [
                    {
                        "InputName": "task_preprocess_input",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": {
                                "Get": "Steps.Query-Data.ProcessingOutputConfig.Outputs['task_query_output'].S3Output.S3Uri"
                            },
                            "LocalPath": "/opt/ml/processing/input",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    },
                    {
                        "InputName": "code",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": "s3://sagemaker-eu-west-1-something/MLPlat-weight-size-prediction/code/1188c5996b8665c088b520ecf8a51666/sourcedir.tar.gz",
                            "LocalPath": "/opt/ml/processing/input/code/",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    },
                    {
                        "InputName": "entrypoint",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": "s3://sagemaker-eu-west-1-something/MLPlat-weight-size-prediction/code/1188c5996b8665c088b520ecf8a51666/runproc.sh",
                            "LocalPath": "/opt/ml/processing/input/entrypoint",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    }
                ],
                "ProcessingOutputConfig": {
                    "Outputs": [
                        {
                            "OutputName": "task_preprocess_output",
                            "AppManaged": false,
                            "S3Output": {
                                "S3Uri": "s3://wllp-ds/projects/mlplatform-weight-size-prediction/prepare/output/",
                                "LocalPath": "/opt/ml/processing/output",
                                "S3UploadMode": "EndOfJob"
                            }
                        }
                    ]
                }
            },
            "CacheConfig": {
                "Enabled": false,
                "ExpireAfter": "15d"
            }
        },
        {
            "Name": "Split-Data",
            "Type": "Processing",
            "Arguments": {
                "ProcessingResources": {
                    "ClusterConfig": {
                        "InstanceType": "ml.m5.2xlarge",
                        "InstanceCount": 1,
                        "VolumeSizeInGB": 30
                    }
                },
                "AppSpecification": {
                    "ImageUri": "something.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-training:2.9-cpu-py39",
                    "ContainerArguments": [
                        "--input-path",
                        "/opt/ml/processing/input",
                        "--output-path",
                        "/opt/ml/processing/output"
                    ],
                    "ContainerEntrypoint": [
                        "/bin/bash",
                        "/opt/ml/processing/input/entrypoint/runproc.sh"
                    ]
                },
                "RoleArn": "arn:aws:iam::something:role/IAMRoleSageMakerBI",
                "ProcessingInputs": [
                    {
                        "InputName": "input-1",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": {
                                "Get": "Steps.Prepare-Data.ProcessingOutputConfig.Outputs['task_preprocess_output'].S3Output.S3Uri"
                            },
                            "LocalPath": "/opt/ml/processing/input",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    },
                    {
                        "InputName": "code",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": "s3://sagemaker-eu-west-1-something/MLPlat-weight-size-prediction/code/1188c5996b8665c088b520ecf8a51666/sourcedir.tar.gz",
                            "LocalPath": "/opt/ml/processing/input/code/",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    },
                    {
                        "InputName": "entrypoint",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": "s3://sagemaker-eu-west-1-something/MLPlat-weight-size-prediction/code/1188c5996b8665c088b520ecf8a51666/runproc.sh",
                            "LocalPath": "/opt/ml/processing/input/entrypoint",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    }
                ],
                "ProcessingOutputConfig": {
                    "Outputs": [
                        {
                            "OutputName": "task_split_output",
                            "AppManaged": false,
                            "S3Output": {
                                "S3Uri": "s3://wllp-ds/projects/mlplatform-weight-size-prediction/split_data/",
                                "LocalPath": "/opt/ml/processing/output",
                                "S3UploadMode": "EndOfJob"
                            }
                        }
                    ]
                }
            },
            "CacheConfig": {
                "Enabled": false,
                "ExpireAfter": "15d"
            }
        },
        {
            "Name": "Train-Model",
            "Type": "Training",
            "Arguments": {
                "AlgorithmSpecification": {
                    "TrainingInputMode": "File",
                    "TrainingImage": "something.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-training:2.4.1-cpu-py37",
                    "EnableSageMakerMetricsTimeSeries": true
                },
                "OutputDataConfig": {
                    "S3OutputPath": "s3://wllp-ds/projects/mlplatform-weight-size-prediction/train_task/model/"
                },
                "StoppingCondition": {
                    "MaxRuntimeInSeconds": 86400
                },
                "ResourceConfig": {
                    "VolumeSizeInGB": 30,
                    "InstanceCount": 1,
                    "InstanceType": "ml.m5.2xlarge"
                },
                "RoleArn": "arn:aws:iam::something:role/IAMRoleSageMakerBI",
                "InputDataConfig": [
                    {
                        "DataSource": {
                            "S3DataSource": {
                                "S3DataType": "S3Prefix",
                                "S3Uri": {
                                    "Get": "Steps.Split-Data.ProcessingOutputConfig.Outputs['task_split_output'].S3Output.S3Uri"
                                },
                                "S3DataDistributionType": "FullyReplicated"
                            }
                        },
                        "ContentType": "text/csv",
                        "ChannelName": "train"
                    },
                    {
                        "DataSource": {
                            "S3DataSource": {
                                "S3DataType": "S3Prefix",
                                "S3Uri": {
                                    "Get": "Steps.Prepare-Data.ProcessingOutputConfig.Outputs['task_preprocess_output'].S3Output.S3Uri"
                                },
                                "S3DataDistributionType": "FullyReplicated"
                            }
                        },
                        "ContentType": "text/csv",
                        "ChannelName": "model_artifacts"
                    }
                ],
                "HyperParameters": {
                    "epochs": "\"10\"",
                    "batch-size": "2048",
                    "sagemaker_submit_directory": "\"s3://wllp-ds/Train-Model-1188c5996b8665c088b520ecf8a51666/source/sourcedir.tar.gz\"",
                    "sagemaker_program": "\"train.py\"",
                    "sagemaker_container_log_level": "20",
                    "sagemaker_region": "\"eu-west-1\"",
                    "model_dir": "\"s3://wllp-ds/projects/mlplatform-weight-size-prediction/train_task/model/checkpoints/\""
                },
                "DebugHookConfig": {
                    "S3OutputPath": "s3://wllp-ds/projects/mlplatform-weight-size-prediction/train_task/model/",
                    "CollectionConfigurations": []
                },
                "ProfilerRuleConfigurations": [
                    {
                        "RuleConfigurationName": "ProfilerReport-something",
                        "RuleEvaluatorImage": "something.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-debugger-rules:latest",
                        "RuleParameters": {
                            "rule_to_invoke": "ProfilerReport"
                        }
                    }
                ],
                "ProfilerConfig": {
                    "S3OutputPath": "s3://wllp-ds/projects/mlplatform-weight-size-prediction/train_task/model/"
                }
            },
            "CacheConfig": {
                "Enabled": false,
                "ExpireAfter": "15d"
            }
        },
        {
            "Name": "Evaluate-Model",
            "Type": "Processing",
            "Arguments": {
                "ProcessingResources": {
                    "ClusterConfig": {
                        "InstanceType": "ml.m5.large",
                        "InstanceCount": 1,
                        "VolumeSizeInGB": 30
                    }
                },
                "AppSpecification": {
                    "ImageUri": "something.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-scikit-learn:1.0-1-cpu-py3",
                    "ContainerEntrypoint": [
                        "/bin/bash",
                        "/opt/ml/processing/input/entrypoint/runproc.sh"
                    ]
                },
                "RoleArn": "arn:aws:iam::something:role/IAMRoleSageMakerBI",
                "ProcessingInputs": [
                    {
                        "InputName": "input-1",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": {
                                "Get": "Steps.Train-Model.ModelArtifacts.S3ModelArtifacts"
                            },
                            "LocalPath": "/opt/ml/processing/model",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    },
                    {
                        "InputName": "code",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": "s3://sagemaker-eu-west-1-something/MLPlat-weight-size-prediction/code/1188c5996b8665c088b520ecf8a51666/sourcedir.tar.gz",
                            "LocalPath": "/opt/ml/processing/input/code/",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    },
                    {
                        "InputName": "entrypoint",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": "s3://sagemaker-eu-west-1-something/MLPlat-weight-size-prediction/code/1188c5996b8665c088b520ecf8a51666/runproc.sh",
                            "LocalPath": "/opt/ml/processing/input/entrypoint",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    }
                ],
                "ProcessingOutputConfig": {
                    "Outputs": [
                        {
                            "OutputName": "evaluation",
                            "AppManaged": false,
                            "S3Output": {
                                "S3Uri": "s3://wllp-ds/projects/mlplatform-weight-size-prediction/evaluation/",
                                "LocalPath": "/opt/ml/processing/evaluation",
                                "S3UploadMode": "EndOfJob"
                            }
                        }
                    ]
                }
            },
            "CacheConfig": {
                "Enabled": false,
                "ExpireAfter": "15d"
            },
            "PropertyFiles": [
                {
                    "PropertyFileName": "EvaluationReport",
                    "OutputName": "evaluation",
                    "FilePath": "evaluation.json"
                }
            ]
        },
        {
            "Name": "Metric-Threshold-Condition",
            "Type": "Condition",
            "Arguments": {
                "Conditions": [
                    {
                        "Type": "LessThanOrEqualTo",
                        "LeftValue": {
                            "Std:JsonGet": {
                                "PropertyFile": {
                                    "Get": "Steps.Evaluate-Model.PropertyFiles.EvaluationReport"
                                },
                                "Path": "metrics.mape.value"
                            }
                        },
                        "RightValue": 230.0
                    }
                ],
                "IfSteps": [
                    {
                        "Name": "Approved-RepackModel-0",
                        "Type": "Training",
                        "Arguments": {
                            "AlgorithmSpecification": {
                                "TrainingInputMode": "File",
                                "TrainingImage": "something.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3"
                            },
                            "OutputDataConfig": {
                                "S3OutputPath": "s3://sagemaker-eu-west-1-something/tensorflow-inference-2022-11-10-17-22-45-637"
                            },
                            "StoppingCondition": {
                                "MaxRuntimeInSeconds": 86400
                            },
                            "ResourceConfig": {
                                "VolumeSizeInGB": 30,
                                "InstanceCount": 1,
                                "InstanceType": "ml.m5.large"
                            },
                            "RoleArn": "arn:aws:iam::something:role/IAMRoleSageMakerBI",
                            "InputDataConfig": [
                                {
                                    "DataSource": {
                                        "S3DataSource": {
                                            "S3DataType": "S3Prefix",
                                            "S3Uri": {
                                                "Get": "Steps.Train-Model.ModelArtifacts.S3ModelArtifacts"
                                            },
                                            "S3DataDistributionType": "FullyReplicated"
                                        }
                                    },
                                    "ChannelName": "training"
                                }
                            ],
                            "HyperParameters": {
                                "inference_script": "\"inference.py\"",
                                "model_archive": {
                                    "Std:Join": {
                                        "On": "",
                                        "Values": [
                                            {
                                                "Get": "Steps.Train-Model.ModelArtifacts.S3ModelArtifacts"
                                            }
                                        ]
                                    }
                                },
                                "dependencies": "null",
                                "source_dir": "\"/Users/ricardohortelano/git/mlplatform-weight-size-prediction/src/weight_size_prediction/drivers/pipeline/remote/code/\"",
                                "sagemaker_submit_directory": "\"s3://sagemaker-eu-west-1-something/Approved-RepackModel-0-1188c5996b8665c088b520ecf8a51666/source/sourcedir.tar.gz\"",
                                "sagemaker_program": "\"_repack_script_launcher.sh\"",
                                "sagemaker_container_log_level": "20",
                                "sagemaker_region": "\"eu-west-1\""
                            },
                            "DebugHookConfig": {
                                "S3OutputPath": "s3://sagemaker-eu-west-1-something/tensorflow-inference-2022-11-10-17-22-45-637",
                                "CollectionConfigurations": []
                            }
                        },
                        "Description": "Used to repack a model with customer scripts for a register/create model step"
                    },
                    {
                        "Name": "Approved-RegisterModel",
                        "Type": "RegisterModel",
                        "Arguments": {
                            "ModelPackageGroupName": "mlplatform-weight-size-prediction",
                            "ModelMetrics": {
                                "ModelQuality": {
                                    "Statistics": {
                                        "ContentType": "application/json",
                                        "S3Uri": "s3://wllp-ds/projects/mlplatform-weight-size-prediction/evaluation/evaluation.json"
                                    }
                                },
                                "Bias": {},
                                "Explainability": {}
                            },
                            "InferenceSpecification": {
                                "Containers": [
                                    {
                                        "Image": "something.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:2.4.1-cpu",
                                        "Environment": {
                                            "SAGEMAKER_PROGRAM": "inference.py",
                                            "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code",
                                            "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
                                            "SAGEMAKER_REGION": "eu-west-1"
                                        },
                                        "ModelDataUrl": {
                                            "Get": "Steps.Approved-RepackModel-0.ModelArtifacts.S3ModelArtifacts"
                                        }
                                    }
                                ],
                                "SupportedContentTypes": [
                                    "text/csv"
                                ],
                                "SupportedResponseMIMETypes": [
                                    "text/csv"
                                ],
                                "SupportedRealtimeInferenceInstanceTypes": [
                                    "ml.m5.2xlarge"
                                ],
                                "SupportedTransformInstanceTypes": [
                                    "ml.m5.2xlarge"
                                ]
                            },
                            "ModelApprovalStatus": "Approved"
                        }
                    },
                    {
                        "Name": "mlplatform-weight-size-prediction-something-RepackModel-0",
                        "Type": "Training",
                        "Arguments": {
                            "AlgorithmSpecification": {
                                "TrainingInputMode": "File",
                                "TrainingImage": "something.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3"
                            },
                            "OutputDataConfig": {
                                "S3OutputPath": "s3://sagemaker-eu-west-1-something/tensorflow-inference-2022-11-10-17-22-45-646"
                            },
                            "StoppingCondition": {
                                "MaxRuntimeInSeconds": 86400
                            },
                            "ResourceConfig": {
                                "VolumeSizeInGB": 30,
                                "InstanceCount": 1,
                                "InstanceType": "ml.m5.large"
                            },
                            "RoleArn": "arn:aws:iam::something:role/IAMRoleSageMakerBI",
                            "InputDataConfig": [
                                {
                                    "DataSource": {
                                        "S3DataSource": {
                                            "S3DataType": "S3Prefix",
                                            "S3Uri": {
                                                "Get": "Steps.Train-Model.ModelArtifacts.S3ModelArtifacts"
                                            },
                                            "S3DataDistributionType": "FullyReplicated"
                                        }
                                    },
                                    "ChannelName": "training"
                                }
                            ],
                            "HyperParameters": {
                                "inference_script": "\"inference.py\"",
                                "model_archive": {
                                    "Std:Join": {
                                        "On": "",
                                        "Values": [
                                            {
                                                "Get": "Steps.Train-Model.ModelArtifacts.S3ModelArtifacts"
                                            }
                                        ]
                                    }
                                },
                                "dependencies": "null",
                                "source_dir": "\"/Users/ricardohortelano/git/mlplatform-weight-size-prediction/src/weight_size_prediction/drivers/pipeline/remote/code/\"",
                                "sagemaker_submit_directory": "\"s3://sagemaker-eu-west-1-something/mlplatform-weight-size-prediction-5fad3689f867-RepackModel-0-1188c5996b8665c088b520ecf8a51666/source/sourcedir.tar.gz\"",
                                "sagemaker_program": "\"_repack_script_launcher.sh\"",
                                "sagemaker_container_log_level": "20",
                                "sagemaker_region": "\"eu-west-1\""
                            },
                            "DebugHookConfig": {
                                "S3OutputPath": "s3://sagemaker-eu-west-1-something/tensorflow-inference-2022-11-10-17-22-45-646",
                                "CollectionConfigurations": []
                            }
                        },
                        "Description": "Used to repack a model with customer scripts for a register/create model step"
                    },
                    {
                        "Name": "mlplatform-weight-size-prediction-something-CreateModel",
                        "Type": "Model",
                        "Arguments": {
                            "ExecutionRoleArn": "arn:aws:iam::something:role/IAMRoleSageMakerBI",
                            "PrimaryContainer": {
                                "Image": "something.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:2.4.1-cpu",
                                "Environment": {
                                    "SAGEMAKER_PROGRAM": "inference.py",
                                    "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code",
                                    "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
                                    "SAGEMAKER_REGION": "eu-west-1"
                                },
                                "ModelDataUrl": {
                                    "Get": "Steps.mlplatform-weight-size-prediction-something-RepackModel-0.ModelArtifacts.S3ModelArtifacts"
                                }
                            },
                            "Tags": [
                                ...
                            ]
                        },
                        "Description": "Model created automatically from script."
                    },
                    {
                        "Name": "DeployEndpoint",
                        "Type": "Lambda",
                        "Arguments": {
                            "model_name": {
                                "Get": "Steps.mlplatform-weight-size-prediction-5fad3689f867-CreateModel.ModelName"
                            },
                            "endpoint_config_name": "mlplatform-weight-size-prediction-prod-config-2022-11-10-18-22",
                            "endpoint_name": "mlplatform-weight-size-prediction-prod",
                            "instance_type": "ml.m5.large",
                            "initial_variant_weight": 1,
                            "initial_instance_count": 1,
                            "data_capture_enable": true,
                            "data_capture_destination_s3": "s3://wllp-ds/projects/mlplatform-weight-size-prediction/endpoint_data_capture/"
                        },
                        "FunctionArn": "arn:aws:lambda:eu-west-1:something:function:mlplatform-weight-size-prediction-prod-sagemaker-deploy",
                        "OutputParameters": [
                            {
                                "OutputName": "statusCode",
                                "OutputType": "String"
                            },
                            {
                                "OutputName": "body",
                                "OutputType": "String"
                            }
                        ]
                    }
                ],
                "ElseSteps": [
                    {
                        "Name": "Rejected-RepackModel-0",
                        "Type": "Training",
                        "Arguments": {
                            "AlgorithmSpecification": {
                                "TrainingInputMode": "File",
                                "TrainingImage": "something.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3"
                            },
                            "OutputDataConfig": {
                                "S3OutputPath": "s3://sagemaker-eu-west-1-something/tensorflow-inference-2022-11-10-17-22-44-929"
                            },
                            "StoppingCondition": {
                                "MaxRuntimeInSeconds": 86400
                            },
                            "ResourceConfig": {
                                "VolumeSizeInGB": 30,
                                "InstanceCount": 1,
                                "InstanceType": "ml.m5.large"
                            },
                            "RoleArn": "arn:aws:iam::something:role/IAMRoleSageMakerBI",
                            "InputDataConfig": [
                                {
                                    "DataSource": {
                                        "S3DataSource": {
                                            "S3DataType": "S3Prefix",
                                            "S3Uri": {
                                                "Get": "Steps.Train-Model.ModelArtifacts.S3ModelArtifacts"
                                            },
                                            "S3DataDistributionType": "FullyReplicated"
                                        }
                                    },
                                    "ChannelName": "training"
                                }
                            ],
                            "HyperParameters": {
                                "inference_script": "\"inference.py\"",
                                "model_archive": {
                                    "Std:Join": {
                                        "On": "",
                                        "Values": [
                                            {
                                                "Get": "Steps.Train-Model.ModelArtifacts.S3ModelArtifacts"
                                            }
                                        ]
                                    }
                                },
                                "dependencies": "null",
                                "source_dir": "\"/Users/ricardohortelano/git/mlplatform-weight-size-prediction/src/weight_size_prediction/drivers/pipeline/remote/code/\"",
                                "sagemaker_submit_directory": "\"s3://sagemaker-eu-west-1-something/Rejected-RepackModel-0-1188c5996b8665c088b520ecf8a51666/source/sourcedir.tar.gz\"",
                                "sagemaker_program": "\"_repack_script_launcher.sh\"",
                                "sagemaker_container_log_level": "20",
                                "sagemaker_region": "\"eu-west-1\""
                            },
                            "DebugHookConfig": {
                                "S3OutputPath": "s3://sagemaker-eu-west-1-something/tensorflow-inference-2022-11-10-17-22-44-929",
                                "CollectionConfigurations": []
                            }
                        },
                        "Description": "Used to repack a model with customer scripts for a register/create model step"
                    },
                    {
                        "Name": "Rejected-RegisterModel",
                        "Type": "RegisterModel",
                        "Arguments": {
                            "ModelPackageGroupName": "mlplatform-weight-size-prediction",
                            "ModelMetrics": {
                                "ModelQuality": {
                                    "Statistics": {
                                        "ContentType": "application/json",
                                        "S3Uri": "s3://wllp-ds/projects/mlplatform-weight-size-prediction/evaluation/evaluation.json"
                                    }
                                },
                                "Bias": {},
                                "Explainability": {}
                            },
                            "InferenceSpecification": {
                                "Containers": [
                                    {
                                        "Image": "something.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:2.4.1-cpu",
                                        "Environment": {
                                            "SAGEMAKER_PROGRAM": "inference.py",
                                            "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code",
                                            "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
                                            "SAGEMAKER_REGION": "eu-west-1"
                                        },
                                        "ModelDataUrl": {
                                            "Get": "Steps.Rejected-RepackModel-0.ModelArtifacts.S3ModelArtifacts"
                                        }
                                    }
                                ],
                                "SupportedContentTypes": [
                                    "text/csv"
                                ],
                                "SupportedResponseMIMETypes": [
                                    "text/csv"
                                ],
                                "SupportedRealtimeInferenceInstanceTypes": [
                                    "ml.m5.2xlarge"
                                ],
                                "SupportedTransformInstanceTypes": [
                                    "ml.m5.2xlarge"
                                ]
                            },
                            "ModelApprovalStatus": "Rejected"
                        }
                    }
                ]
            }
        }
    ]
}

brockwade633 · 2022-11-22T20:02:44Z

Hi @RicardoHS, thanks a lot for providing all the information. One clarifying question to confirm the issue: when you first encountered this behavior, were all the code files located in the same local directory, such as self._config.source_dir, rather than split into subdirectories like data_preprocess?

RicardoHS · 2022-11-23T10:04:22Z

Hi @brockwade633. That's exactly the situation the code was in when the behaviour was encountered.
Sorry, development continued while this issue was open, and I did not realize the code paths had changed since that day.

brockwade633 · 2022-11-28T16:48:28Z

Hi @RicardoHS, thanks for your reply. The behavior that you describe does represent a bug: the runproc.sh artifact created by FrameworkProcessors is uploaded to the same directory as the other code artifacts when it should live in a separate directory. As a result, the runproc.sh file is overwritten when a pipeline has multiple processing steps built with a FrameworkProcessor. We will open a PR to fix the issue, although we will need to wait until after re:Invent week to release the fix.
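
One way to spot this collision before running a pipeline is to scan the definition for entrypoint inputs that share an S3 URI. The sketch below is illustrative only. It assumes the definition is the JSON string returned by pipeline.definition(), and that each processing step lists its entrypoint as a ProcessingInputs entry with InputName "entrypoint" and a plain string S3Uri, as in the describe() output quoted in this issue. The helper names iter_steps and find_shared_entrypoints are hypothetical and not part of the SageMaker SDK.

```python
import json
from collections import defaultdict


def iter_steps(steps):
    """Yield every step, descending into Condition steps' IfSteps/ElseSteps."""
    for step in steps:
        yield step
        args = step.get("Arguments", {})
        for branch in ("IfSteps", "ElseSteps"):
            yield from iter_steps(args.get(branch, []))


def find_shared_entrypoints(definition_json):
    """Map each entrypoint S3 URI used by more than one step to the step names."""
    definition = json.loads(definition_json)
    users = defaultdict(list)
    for step in iter_steps(definition.get("Steps", [])):
        if step.get("Type") != "Processing":
            continue
        for proc_input in step.get("Arguments", {}).get("ProcessingInputs", []):
            uri = proc_input.get("S3Input", {}).get("S3Uri")
            # Only plain string URIs can be compared; skip {"Get": ...} references.
            if proc_input.get("InputName") == "entrypoint" and isinstance(uri, str):
                users[uri].append(step["Name"])
    return {uri: names for uri, names in users.items() if len(names) > 1}
```

Calling find_shared_entrypoints(pipeline.definition()) should return an empty dict when every step has its own runproc.sh location; any entry in the result names the steps that would overwrite each other's script.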

Workaround

In the meantime, I was able to produce a valid pipeline with distinct upload locations by using the path structure from your last code snippet, that is, with additional subdirectories such as data_preprocess or data_split:

split_step = ProcessingStep(
    name="Split-Data",
    step_args=self.data_processor.run(
        code="train_test_split.py",
        # Per-step subdirectory, so this step's code is uploaded to its own S3 folder.
        source_dir=self._config.source_dir + "data_split/",
        inputs=[
            ProcessingInput(
                # Consume the output of the preceding preprocessing step.
                source=prepare_step.properties.ProcessingOutputConfig.Outputs[
                    "task_preprocess_output"
                ].S3Output.S3Uri,
                destination=input_path,
            ),
        ],
        outputs=[
            ProcessingOutput(
                output_name="task_split_output",
                source=output_path,
                destination=self._config.split_s3_path,
            ),
        ],
        arguments=["--input-path", input_path, "--output-path", output_path],
    ),
    cache_config=self._cache_config,
)

As a temporary workaround, the folder separation produces a different S3 upload folder for each step and prevents runproc.sh from being overwritten. Please let us know whether you are able to reproduce this successfully. I will keep this thread updated with the latest info, thank you.

RicardoHS · 2022-11-29T16:05:33Z

Hi @brockwade633, I have tried putting every script in a separate folder, and with v2.116.0 the pipeline throws this new error:

ClientError: Failed to invoke sagemaker:CreateProcessingJob. Error Details: Input and Output names must be globally unique: [code]

brockwade633 · 2022-11-29T20:33:49Z

@RicardoHS, thanks for letting us know. We have experienced a couple of issues specifically around FrameworkProcessors and SparkProcessors recently, and this looks like an example of an idempotency problem that's being tracked here: #3451. In your code, are you calling pipeline.definition(), pipeline.create(), pipeline.update(), pipeline.upsert(), or step.arguments multiple times?
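
To see which names would trigger the "must be globally unique" error, one can count the input and output names of each processing step in the generated definition. This is a rough sketch that only shows the symptom, not the SDK's internals. It assumes a step dictionary shaped like the pipeline definition JSON (ProcessingInputs entries with InputName, and ProcessingOutputConfig.Outputs entries with OutputName); the helper name duplicate_io_names is hypothetical.

```python
from collections import Counter


def duplicate_io_names(step):
    """Return names that occur more than once among a step's inputs and outputs."""
    args = step.get("Arguments", {})
    names = [i.get("InputName") for i in args.get("ProcessingInputs", [])]
    outputs = args.get("ProcessingOutputConfig", {}).get("Outputs", [])
    names += [o.get("OutputName") for o in outputs]
    counts = Counter(n for n in names if n is not None)
    return sorted(n for n, c in counts.items() if c > 1)


# Hand-written example step in which the "code" input appears twice.
example_step = {
    "Name": "Split-Data",
    "Type": "Processing",
    "Arguments": {
        "ProcessingInputs": [
            {"InputName": "code"},
            {"InputName": "entrypoint"},
            {"InputName": "code"},
        ],
        "ProcessingOutputConfig": {"Outputs": [{"OutputName": "task_split_output"}]},
    },
}
print(duplicate_io_names(example_step))
```

Running the check on each step of json.loads(pipeline.definition())["Steps"] before and after a second call to pipeline.definition() is one way to tell whether repeated calls are adding the duplicate.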

brockwade633 · 2022-12-09T20:15:52Z

Hi @RicardoHS, a new sagemaker package version is available (2.121.1) that includes fixes for both the runproc.sh file issue and the idempotency problems. Are you able to confirm that you are no longer seeing the runproc.sh overwriting behavior or the duplicate input client error? Thanks.
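
Since the local machine and the CI/CD environment in this report ran different SDK versions, it may help to check the installed version in each environment. A minimal sketch, assuming plain numeric versions of the form X.Y.Z (pre-release suffixes would need a proper version parser); the helper name has_fix is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

FIXED_IN = (2, 121, 1)  # release named in this thread as containing the fixes


def has_fix(version_string):
    """Return True if a plain 'X.Y.Z' version is at least FIXED_IN."""
    parts = tuple(int(p) for p in version_string.split(".")[:3])
    return parts >= FIXED_IN


if __name__ == "__main__":
    try:
        installed = version("sagemaker")
    except PackageNotFoundError:
        print("sagemaker is not installed in this environment")
    else:
        status = "includes the fix" if has_fix(installed) else "predates the fix"
        print(installed, status)
```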

RicardoHS · 2022-12-13T11:20:52Z

Hi @brockwade633, after the update to 2.121.1 everything seems to work. Thank you very much.

brockwade633 · 2022-12-13T16:47:58Z

Okay, great! You're welcome. Closing out the issue.

RicardoHS added the type: bug label Nov 10, 2022

rohangujarathi added the component: pipelines Relates to the SageMaker Pipeline Platform label Nov 12, 2022

qidewenwhen self-assigned this Nov 14, 2022

qidewenwhen assigned brockwade633 and unassigned qidewenwhen Nov 29, 2022

brockwade633 mentioned this issue Nov 30, 2022

fix: FrameworkProcessor S3 uploads #3493

Merged

brockwade633 closed this as completed Dec 13, 2022
