_repack_script_launcher.sh is overwriting runproc.sh on subsequent pipeline steps making pipeline to fail #3467

Closed
RicardoHS opened this issue Nov 10, 2022 · 12 comments
Assignees
Labels
component: pipelines Relates to the SageMaker Pipeline Platform type: bug

Comments

@RicardoHS

Describe the bug
I'm not sure whether this was introduced in commit d5261f2. My pipelines fail when using v2.116.0 instead of v2.112.0. Comparing the pipeline.describe() output from upserting the pipeline on my local machine (using v2.112.0) with the output from my CI/CD (which uses the latest available version) shows that, if a step uses a custom entrypoint, the same S3 path is used for all steps. Because runproc.sh is stored there, only the runproc.sh of the last step with a custom entrypoint survives.

In v2.112.0 pipeline.describe() output:
My first step:

[...]
{
  "InputName": "entrypoint",
  "AppManaged": false,
  "S3Input": {
      "S3Uri": "s3://[...]/tensorflow-2022-11-10-15-11-30-827/source/runproc.sh",
      "LocalPath": "/opt/ml/processing/input/entrypoint",
      "S3DataType": "S3Prefix",
      "S3InputMode": "File",
      "S3DataDistributionType": "FullyReplicated",
      "S3CompressionType": "None"
  }
}

My second step

[...]
  {
      "InputName": "entrypoint",
      "AppManaged": false,
      "S3Input": {
          "S3Uri": "s3://[...]/tensorflow-2022-11-10-15-11-31-496/source/runproc.sh",
          "LocalPath": "/opt/ml/processing/input/entrypoint",
          "S3DataType": "S3Prefix",
          "S3InputMode": "File",
          "S3DataDistributionType": "FullyReplicated",
          "S3CompressionType": "None"
      }
  }

In v2.116.0 pipeline.describe() output:
My first step:

[...]
  {
      "InputName": "entrypoint",
      "AppManaged": false,
      "S3Input": {
          "S3Uri": "s3://[...]/MLPlat-weight-size-prediction/code/11596271372a0f364465cbc82d312815/runproc.sh",
          "LocalPath": "/opt/ml/processing/input/entrypoint",
          "S3DataType": "S3Prefix",
          "S3InputMode": "File",
          "S3DataDistributionType": "FullyReplicated",
          "S3CompressionType": "None"
      }
  }

My second step

[...]
  {
      "InputName": "entrypoint",
      "AppManaged": false,
      "S3Input": {
          "S3Uri": "s3://[...]/MLPlat-weight-size-prediction/code/11596271372a0f364465cbc82d312815/runproc.sh",
          "LocalPath": "/opt/ml/processing/input/entrypoint",
          "S3DataType": "S3Prefix",
          "S3InputMode": "File",
          "S3DataDistributionType": "FullyReplicated",
          "S3CompressionType": "None"
      }
  }

To reproduce
Use two steps with a custom entrypoint. In 2.112.0 it works; in 2.116.0 all steps use the same entrypoint code (the one from the last step defined).

Expected behavior
Custom entrypoint code is not overwritten

Screenshots or logs

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 2.112.0 vs 2.116.0
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): Tensorflow
  • Framework version: 2.4.1
  • Python version: py37
  • CPU or GPU: CPU
  • Custom Docker image (Y/N):

Additional context

@rohangujarathi rohangujarathi added the component: pipelines Relates to the SageMaker Pipeline Platform label Nov 12, 2022
@qidewenwhen qidewenwhen self-assigned this Nov 14, 2022
@qidewenwhen
Member

qidewenwhen commented Nov 14, 2022

Hi @RicardoHS thanks for using AWS SageMaker!
To help us understand and reproduce the issue, could you please provide us the following information?

  • What failure reason did you see?
  • Could you explain more on the sentence below? Since no pipeline code snippet was presented, I'm not able to tell if this behavior is expected or not.

if a step uses a custom entrypoint, the S3 path used is the same for all steps. Because runproc.sh is stored there, only the runproc.sh of last step with custom entrypoint will survive.

  • What kind of steps are you using, regarding all steps in your pipeline? E.g. TrainingStep, ProcessingStep or ModelStep? The _repack_script_launcher.sh would be introduced by the ModelStep/RegisterModel only
  • If possible, could you please provide us the code snippet for your pipeline definition? It would help us a lot to accelerate the issue reproducing process

@RicardoHS
Author

RicardoHS commented Nov 15, 2022

Hi @qidewenwhen, see the answers below

What failure reason did you see?

My first ProcessingStep fails with a strange error (one that makes no sense for that script). After inspecting the logs, I found that this first step should call the custom script /opt/ml/processing/input/code/preprocess.py, but instead it calls the custom script of my last ProcessingStep, /opt/ml/processing/input/code/evaluate.py. If I go to the S3 path where the runproc.sh for that first step is stored and cat the file, I see this:

#!/bin/bash

cd /opt/ml/processing/input/code/
tar -xzf sourcedir.tar.gz

# Exit on any error. SageMaker uses error code to mark failed job.
set -e

if [[ -f 'requirements.txt' ]]; then
    # Some py3 containers has typing, which may breaks pip install
    pip uninstall --yes typing

    pip install -r requirements.txt
fi

python evaluate.py "$@"

Note the final line: this is the run file for the first step, so that line should be python preprocess.py "$@"
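The same check can be scripted once the runproc.sh contents have been downloaded. This is a small sketch, assuming the launcher ends with a `python <script> ...` line as in the file shown above; the function name `launched_script` is hypothetical.

```python
import shlex

# Contents of the runproc.sh shown above.
RUNPROC = """#!/bin/bash

cd /opt/ml/processing/input/code/
tar -xzf sourcedir.tar.gz

# Exit on any error. SageMaker uses error code to mark failed job.
set -e

if [[ -f 'requirements.txt' ]]; then
    # Some py3 containers has typing, which may breaks pip install
    pip uninstall --yes typing

    pip install -r requirements.txt
fi

python evaluate.py "$@"
"""


def launched_script(runproc_text):
    """Return the script named on the last `python <script> ...` line, or None."""
    for line in reversed(runproc_text.splitlines()):
        tokens = shlex.split(line, comments=True)  # comment lines become []
        if len(tokens) >= 2 and tokens[0].startswith("python"):
            return tokens[1]
    return None


expected = "preprocess.py"  # the script the first step is supposed to run
actual = launched_script(RUNPROC)
print(actual == expected)  # False: the launcher points at evaluate.py
```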

When I looked up the S3 path for the last ProcessingStep, I realized it is the same S3 path as for the first step (both steps use the same path). This only happens with v2.116.0; with v2.112.0 the S3 path is different for each step:

for v2.116.0
s3://<default_sagemaker_bucket>/MLPlat-weight-size-prediction/code/11596271372a0f364465cbc82d312815/sourcedir.tar.gz
for v2.112.0
s3://<default_sagemaker_bucket>/tensorflow-2022-11-11-08-41-07-205/source/sourcedir.tar.gz

MLPlat-weight-size-prediction is the name used on the pipeline creation.
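To illustrate why a shared path leads to the wrong script running, here is a toy model, not the SDK's actual upload code. It assumes only that writing to an existing S3 key replaces the previous object, and it uses a plain dict as a stand-in for the bucket. The helpers `put_object` and `upload_runproc` are hypothetical; the prefixes are copied from the describe() excerpts in this issue.

```python
def put_object(bucket, key, body):
    """Stand-in for an S3 PUT: writing to an existing key replaces the object."""
    bucket[key] = body


def upload_runproc(bucket, prefix, script):
    """Store a one-line launcher for `script` under `<prefix>/runproc.sh`."""
    put_object(bucket, f"{prefix}/runproc.sh", f'python {script} "$@"')


# Layout seen with v2.116.0: both steps resolve to the same prefix.
shared = {}
prefix = "MLPlat-weight-size-prediction/code/11596271372a0f364465cbc82d312815"
upload_runproc(shared, prefix, "preprocess.py")
upload_runproc(shared, prefix, "evaluate.py")  # replaces the first launcher

# Layout seen with v2.112.0: one timestamped prefix per step.
separate = {}
upload_runproc(separate, "tensorflow-2022-11-10-15-11-30-827/source", "preprocess.py")
upload_runproc(separate, "tensorflow-2022-11-10-15-11-31-496/source", "evaluate.py")

print(shared)    # a single runproc.sh that launches evaluate.py
print(separate)  # two runproc.sh objects, one per step
```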

Could you explain more on the sentence below? Since no pipeline code snippet was presented, I'm not able to tell if this behavior is expected or not.

Hope it's more clear with the previous explanation

What kind of steps are you using, regarding all steps in your pipeline? E.g. TrainingStep, ProcessingStep or ModelStep? The _repack_script_launcher.sh would be introduced by the ModelStep/RegisterModel only

I'm using ProcessingStep, ConditionStep, LambdaStep, ModelStep and TrainingStep

If possible, could you please provide us the code snippet for your pipeline definition? It would help us a lot to accelerate the issue reproducing process

I'm sorry, but the code is split across multiple files and merging it all together would be a lot of work. I can provide the pipeline.describe() output if that's useful.

Finally, in case it's not clear: when I revert to version 2.112.0, everything works again.

@qidewenwhen
Member

qidewenwhen commented Nov 15, 2022

Thanks @RicardoHS! Based on all this info, the issue seems to be related to our recently released Cache changes. To help us confirm this and take the next step toward a fix, could you please provide the following details:

  • The code snippet showing how you define the two ProcessingSteps and their Processors, and how you call processor.run (if you use the new step interface)
  • The json returned by pipeline.describe()
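For anyone collecting the same information, the definition can be pulled out of the describe() response with a few lines. This is a sketch, assuming the response follows the DescribePipeline API shape, where `PipelineDefinition` holds the definition as a JSON string. The function name `definition_from_describe` and the sample response are hypothetical.

```python
import json


def definition_from_describe(response):
    """Return the pipeline definition as a dict from a describe() response."""
    raw = response["PipelineDefinition"]
    # The API returns the definition as a JSON string; accept a dict as well.
    return json.loads(raw) if isinstance(raw, str) else raw


# Hand-written stand-in for a real response (values are illustrative only).
sample_response = {
    "PipelineName": "MLPlat-weight-size-prediction",
    "PipelineDefinition": json.dumps({"Version": "2020-12-01", "Steps": []}),
}

definition = definition_from_describe(sample_response)
print(json.dumps(definition, indent=4))
```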

@RicardoHS
Author

@qidewenwhen here it is. Some information lives inside the self._config object (as you can see, it isn't important for the problem here). The self._session object is a PipelineSession object.


self.query_processor = ScriptProcessor(
    command=["python3"],
    image_uri=self._config.query_image_uri,
    role=self._config.role,
    instance_count=1,
    instance_type=self._config.instance_type,
    network_config=NetworkConfig(
        enable_network_isolation=False,
        # VPC-Prod
        subnets=["subnet-something"],
        security_group_ids=["sg-something"],
    ),
    sagemaker_session=self._session,
)

self.data_processor = FrameworkProcessor(
    role=self._config.role,
    instance_type=self._config.instance_type,
    instance_count=self._config.instance_count,
    estimator_cls=TensorFlow,
    framework_version="2.9",
    py_version="py39",
    sagemaker_session=self._session,
)

query_step = ProcessingStep(
    name="Query-Data",
    step_args=self.query_processor.run(
        code=self._config.source_dir + "query_data.py",
        arguments=[
            "--output-path",
            query_output_path,
            "--region",
            self._config.region,
        ],
        outputs=[
            ProcessingOutput(
                output_name="task_query_output",
                source=query_output_path,
                destination=self._config.prepare_s3_path + "input/",
            ),
        ],
    ),
    cache_config=self._cache_config,
)

input_path = "/opt/ml/processing/input"
output_path = "/opt/ml/processing/output"

prepare_step = ProcessingStep(
    name="Prepare-Data",
    step_args=self.data_processor.run(
        code="preprocess.py",
        source_dir=self._config.source_dir + "data_preprocess/",
        inputs=[
            ProcessingInput(
                input_name="task_preprocess_input",
                source=query_step.properties.ProcessingOutputConfig.Outputs["task_query_output"].S3Output.S3Uri,
                destination=input_path,
            )
        ],
        outputs=[
            ProcessingOutput(
                output_name="task_preprocess_output",
                source=output_path,
                destination=self._config.prepare_s3_path + "output/",
            ),
        ],
        arguments=[
            "--input-path",
            input_path,
            "--output-path",
            output_path,
            "--region",
            self._config.region,
        ],
    ),
    cache_config=self._cache_config,
)

split_step = ProcessingStep(
    name="Split-Data",
    step_args=self.data_processor.run(
        code="train_test_split.py",
        source_dir=self._config.source_dir + "data_split/",
        inputs=[
            ProcessingInput(
                source=prepare_step.properties.ProcessingOutputConfig.Outputs[
                    "task_preprocess_output"
                ].S3Output.S3Uri,
                destination=input_path,
            ),
        ],
        outputs=[
            ProcessingOutput(
                output_name="task_split_output",
                source=output_path,
                destination=self._config.split_s3_path,
            ),
        ],
        arguments=["--input-path", input_path, "--output-path", output_path],
    ),
    cache_config=self._cache_config,
)

[...]
sk_processor = FrameworkProcessor(
    framework_version="1.0-1",
    instance_type=config.instance_type,
    instance_count=config.instance_count,
    base_job_name=config.base_job_name,
    role=config.role,
    estimator_cls=SKLearn,
    sagemaker_session=self._session,
)

evaluate_step = ProcessingStep(
    name="Evaluate-Model",
    step_args=sk_processor.run(
        code="evaluate.py",
        source_dir=config.source_dir,
        inputs=[
            ProcessingInput(
                source=train_step.properties.ModelArtifacts.S3ModelArtifacts,
                destination="/opt/ml/processing/model",
            ),
        ],
        outputs=[
            ProcessingOutput(
                output_name="evaluation",
                source="/opt/ml/processing/evaluation",
                destination=config.output_s3_destination,
            ),
        ],
    ),
    property_files=[
        self.evaluation_file,
    ],
    cache_config=self._cache_config,
)

pipe.describe() from v2.116.0

{
    "Version": "2020-12-01",
    "Metadata": {},
    "Parameters": [
        {
            "Name": "TrainingEpochs",
            "Type": "String",
            "DefaultValue": "10"
        }
    ],
    "PipelineExperimentConfig": {
        "ExperimentName": {
            "Get": "Execution.PipelineName"
        },
        "TrialName": {
            "Get": "Execution.PipelineExecutionId"
        }
    },
    "Steps": [
        {
            "Name": "Query-Data",
            "Type": "Processing",
            "Arguments": {
                "ProcessingResources": {
                    "ClusterConfig": {
                        "InstanceType": "ml.m5.2xlarge",
                        "InstanceCount": 1,
                        "VolumeSizeInGB": 30
                    }
                },
                "AppSpecification": {
                    "ImageUri": "something.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-db-handler",
                    "ContainerArguments": [
                        "--output-path",
                        "/opt/ml/processing/output/data/",
                        "--region",
                        "eu-west-1"
                    ],
                    "ContainerEntrypoint": [
                        "python3",
                        "/opt/ml/processing/input/code/query_data.py"
                    ]
                },
                "RoleArn": "arn:aws:iam::something:role/IAMRoleSageMakerBI",
                "ProcessingInputs": [
                    {
                        "InputName": "code",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": "s3://sagemaker-eu-west-1-something/MLPlat-weight-size-prediction/code/097a2d06096d5b4342bf5e0777381c38/query_data.py",
                            "LocalPath": "/opt/ml/processing/input/code",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    }
                ],
                "ProcessingOutputConfig": {
                    "Outputs": [
                        {
                            "OutputName": "task_query_output",
                            "AppManaged": false,
                            "S3Output": {
                                "S3Uri": "s3://wllp-ds/projects/mlplatform-weight-size-prediction/prepare/input/",
                                "LocalPath": "/opt/ml/processing/output/data/",
                                "S3UploadMode": "EndOfJob"
                            }
                        }
                    ]
                },
                "NetworkConfig": {
                    "EnableNetworkIsolation": false,
                    "VpcConfig": {
                        "SecurityGroupIds": [
                            "sg-something"
                        ],
                        "Subnets": [
                            "subnet-something"
                        ]
                    }
                }
            },
            "CacheConfig": {
                "Enabled": false,
                "ExpireAfter": "15d"
            }
        },
        {
            "Name": "Prepare-Data",
            "Type": "Processing",
            "Arguments": {
                "ProcessingResources": {
                    "ClusterConfig": {
                        "InstanceType": "ml.m5.2xlarge",
                        "InstanceCount": 1,
                        "VolumeSizeInGB": 30
                    }
                },
                "AppSpecification": {
                    "ImageUri": "something.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-training:2.9-cpu-py39",
                    "ContainerArguments": [
                        "--input-path",
                        "/opt/ml/processing/input",
                        "--output-path",
                        "/opt/ml/processing/output",
                        "--region",
                        "eu-west-1"
                    ],
                    "ContainerEntrypoint": [
                        "/bin/bash",
                        "/opt/ml/processing/input/entrypoint/runproc.sh"
                    ]
                },
                "RoleArn": "arn:aws:iam::something:role/IAMRoleSageMakerBI",
                "ProcessingInputs": [
                    {
                        "InputName": "task_preprocess_input",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": {
                                "Get": "Steps.Query-Data.ProcessingOutputConfig.Outputs['task_query_output'].S3Output.S3Uri"
                            },
                            "LocalPath": "/opt/ml/processing/input",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    },
                    {
                        "InputName": "code",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": "s3://sagemaker-eu-west-1-something/MLPlat-weight-size-prediction/code/1188c5996b8665c088b520ecf8a51666/sourcedir.tar.gz",
                            "LocalPath": "/opt/ml/processing/input/code/",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    },
                    {
                        "InputName": "entrypoint",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": "s3://sagemaker-eu-west-1-something/MLPlat-weight-size-prediction/code/1188c5996b8665c088b520ecf8a51666/runproc.sh",
                            "LocalPath": "/opt/ml/processing/input/entrypoint",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    }
                ],
                "ProcessingOutputConfig": {
                    "Outputs": [
                        {
                            "OutputName": "task_preprocess_output",
                            "AppManaged": false,
                            "S3Output": {
                                "S3Uri": "s3://wllp-ds/projects/mlplatform-weight-size-prediction/prepare/output/",
                                "LocalPath": "/opt/ml/processing/output",
                                "S3UploadMode": "EndOfJob"
                            }
                        }
                    ]
                }
            },
            "CacheConfig": {
                "Enabled": false,
                "ExpireAfter": "15d"
            }
        },
        {
            "Name": "Split-Data",
            "Type": "Processing",
            "Arguments": {
                "ProcessingResources": {
                    "ClusterConfig": {
                        "InstanceType": "ml.m5.2xlarge",
                        "InstanceCount": 1,
                        "VolumeSizeInGB": 30
                    }
                },
                "AppSpecification": {
                    "ImageUri": "something.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-training:2.9-cpu-py39",
                    "ContainerArguments": [
                        "--input-path",
                        "/opt/ml/processing/input",
                        "--output-path",
                        "/opt/ml/processing/output"
                    ],
                    "ContainerEntrypoint": [
                        "/bin/bash",
                        "/opt/ml/processing/input/entrypoint/runproc.sh"
                    ]
                },
                "RoleArn": "arn:aws:iam::something:role/IAMRoleSageMakerBI",
                "ProcessingInputs": [
                    {
                        "InputName": "input-1",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": {
                                "Get": "Steps.Prepare-Data.ProcessingOutputConfig.Outputs['task_preprocess_output'].S3Output.S3Uri"
                            },
                            "LocalPath": "/opt/ml/processing/input",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    },
                    {
                        "InputName": "code",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": "s3://sagemaker-eu-west-1-something/MLPlat-weight-size-prediction/code/1188c5996b8665c088b520ecf8a51666/sourcedir.tar.gz",
                            "LocalPath": "/opt/ml/processing/input/code/",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    },
                    {
                        "InputName": "entrypoint",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": "s3://sagemaker-eu-west-1-something/MLPlat-weight-size-prediction/code/1188c5996b8665c088b520ecf8a51666/runproc.sh",
                            "LocalPath": "/opt/ml/processing/input/entrypoint",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    }
                ],
                "ProcessingOutputConfig": {
                    "Outputs": [
                        {
                            "OutputName": "task_split_output",
                            "AppManaged": false,
                            "S3Output": {
                                "S3Uri": "s3://wllp-ds/projects/mlplatform-weight-size-prediction/split_data/",
                                "LocalPath": "/opt/ml/processing/output",
                                "S3UploadMode": "EndOfJob"
                            }
                        }
                    ]
                }
            },
            "CacheConfig": {
                "Enabled": false,
                "ExpireAfter": "15d"
            }
        },
        {
            "Name": "Train-Model",
            "Type": "Training",
            "Arguments": {
                "AlgorithmSpecification": {
                    "TrainingInputMode": "File",
                    "TrainingImage": "something.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-training:2.4.1-cpu-py37",
                    "EnableSageMakerMetricsTimeSeries": true
                },
                "OutputDataConfig": {
                    "S3OutputPath": "s3://wllp-ds/projects/mlplatform-weight-size-prediction/train_task/model/"
                },
                "StoppingCondition": {
                    "MaxRuntimeInSeconds": 86400
                },
                "ResourceConfig": {
                    "VolumeSizeInGB": 30,
                    "InstanceCount": 1,
                    "InstanceType": "ml.m5.2xlarge"
                },
                "RoleArn": "arn:aws:iam::something:role/IAMRoleSageMakerBI",
                "InputDataConfig": [
                    {
                        "DataSource": {
                            "S3DataSource": {
                                "S3DataType": "S3Prefix",
                                "S3Uri": {
                                    "Get": "Steps.Split-Data.ProcessingOutputConfig.Outputs['task_split_output'].S3Output.S3Uri"
                                },
                                "S3DataDistributionType": "FullyReplicated"
                            }
                        },
                        "ContentType": "text/csv",
                        "ChannelName": "train"
                    },
                    {
                        "DataSource": {
                            "S3DataSource": {
                                "S3DataType": "S3Prefix",
                                "S3Uri": {
                                    "Get": "Steps.Prepare-Data.ProcessingOutputConfig.Outputs['task_preprocess_output'].S3Output.S3Uri"
                                },
                                "S3DataDistributionType": "FullyReplicated"
                            }
                        },
                        "ContentType": "text/csv",
                        "ChannelName": "model_artifacts"
                    }
                ],
                "HyperParameters": {
                    "epochs": "\"10\"",
                    "batch-size": "2048",
                    "sagemaker_submit_directory": "\"s3://wllp-ds/Train-Model-1188c5996b8665c088b520ecf8a51666/source/sourcedir.tar.gz\"",
                    "sagemaker_program": "\"train.py\"",
                    "sagemaker_container_log_level": "20",
                    "sagemaker_region": "\"eu-west-1\"",
                    "model_dir": "\"s3://wllp-ds/projects/mlplatform-weight-size-prediction/train_task/model/checkpoints/\""
                },
                "DebugHookConfig": {
                    "S3OutputPath": "s3://wllp-ds/projects/mlplatform-weight-size-prediction/train_task/model/",
                    "CollectionConfigurations": []
                },
                "ProfilerRuleConfigurations": [
                    {
                        "RuleConfigurationName": "ProfilerReport-something",
                        "RuleEvaluatorImage": "something.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-debugger-rules:latest",
                        "RuleParameters": {
                            "rule_to_invoke": "ProfilerReport"
                        }
                    }
                ],
                "ProfilerConfig": {
                    "S3OutputPath": "s3://wllp-ds/projects/mlplatform-weight-size-prediction/train_task/model/"
                }
            },
            "CacheConfig": {
                "Enabled": false,
                "ExpireAfter": "15d"
            }
        },
        {
            "Name": "Evaluate-Model",
            "Type": "Processing",
            "Arguments": {
                "ProcessingResources": {
                    "ClusterConfig": {
                        "InstanceType": "ml.m5.large",
                        "InstanceCount": 1,
                        "VolumeSizeInGB": 30
                    }
                },
                "AppSpecification": {
                    "ImageUri": "something.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-scikit-learn:1.0-1-cpu-py3",
                    "ContainerEntrypoint": [
                        "/bin/bash",
                        "/opt/ml/processing/input/entrypoint/runproc.sh"
                    ]
                },
                "RoleArn": "arn:aws:iam::something:role/IAMRoleSageMakerBI",
                "ProcessingInputs": [
                    {
                        "InputName": "input-1",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": {
                                "Get": "Steps.Train-Model.ModelArtifacts.S3ModelArtifacts"
                            },
                            "LocalPath": "/opt/ml/processing/model",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    },
                    {
                        "InputName": "code",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": "s3://sagemaker-eu-west-1-something/MLPlat-weight-size-prediction/code/1188c5996b8665c088b520ecf8a51666/sourcedir.tar.gz",
                            "LocalPath": "/opt/ml/processing/input/code/",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    },
                    {
                        "InputName": "entrypoint",
                        "AppManaged": false,
                        "S3Input": {
                            "S3Uri": "s3://sagemaker-eu-west-1-something/MLPlat-weight-size-prediction/code/1188c5996b8665c088b520ecf8a51666/runproc.sh",
                            "LocalPath": "/opt/ml/processing/input/entrypoint",
                            "S3DataType": "S3Prefix",
                            "S3InputMode": "File",
                            "S3DataDistributionType": "FullyReplicated",
                            "S3CompressionType": "None"
                        }
                    }
                ],
                "ProcessingOutputConfig": {
                    "Outputs": [
                        {
                            "OutputName": "evaluation",
                            "AppManaged": false,
                            "S3Output": {
                                "S3Uri": "s3://wllp-ds/projects/mlplatform-weight-size-prediction/evaluation/",
                                "LocalPath": "/opt/ml/processing/evaluation",
                                "S3UploadMode": "EndOfJob"
                            }
                        }
                    ]
                }
            },
            "CacheConfig": {
                "Enabled": false,
                "ExpireAfter": "15d"
            },
            "PropertyFiles": [
                {
                    "PropertyFileName": "EvaluationReport",
                    "OutputName": "evaluation",
                    "FilePath": "evaluation.json"
                }
            ]
        },
        {
            "Name": "Metric-Threshold-Condition",
            "Type": "Condition",
            "Arguments": {
                "Conditions": [
                    {
                        "Type": "LessThanOrEqualTo",
                        "LeftValue": {
                            "Std:JsonGet": {
                                "PropertyFile": {
                                    "Get": "Steps.Evaluate-Model.PropertyFiles.EvaluationReport"
                                },
                                "Path": "metrics.mape.value"
                            }
                        },
                        "RightValue": 230.0
                    }
                ],
                "IfSteps": [
                    {
                        "Name": "Approved-RepackModel-0",
                        "Type": "Training",
                        "Arguments": {
                            "AlgorithmSpecification": {
                                "TrainingInputMode": "File",
                                "TrainingImage": "something.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3"
                            },
                            "OutputDataConfig": {
                                "S3OutputPath": "s3://sagemaker-eu-west-1-something/tensorflow-inference-2022-11-10-17-22-45-637"
                            },
                            "StoppingCondition": {
                                "MaxRuntimeInSeconds": 86400
                            },
                            "ResourceConfig": {
                                "VolumeSizeInGB": 30,
                                "InstanceCount": 1,
                                "InstanceType": "ml.m5.large"
                            },
                            "RoleArn": "arn:aws:iam::something:role/IAMRoleSageMakerBI",
                            "InputDataConfig": [
                                {
                                    "DataSource": {
                                        "S3DataSource": {
                                            "S3DataType": "S3Prefix",
                                            "S3Uri": {
                                                "Get": "Steps.Train-Model.ModelArtifacts.S3ModelArtifacts"
                                            },
                                            "S3DataDistributionType": "FullyReplicated"
                                        }
                                    },
                                    "ChannelName": "training"
                                }
                            ],
                            "HyperParameters": {
                                "inference_script": "\"inference.py\"",
                                "model_archive": {
                                    "Std:Join": {
                                        "On": "",
                                        "Values": [
                                            {
                                                "Get": "Steps.Train-Model.ModelArtifacts.S3ModelArtifacts"
                                            }
                                        ]
                                    }
                                },
                                "dependencies": "null",
                                "source_dir": "\"/Users/ricardohortelano/git/mlplatform-weight-size-prediction/src/weight_size_prediction/drivers/pipeline/remote/code/\"",
                                "sagemaker_submit_directory": "\"s3://sagemaker-eu-west-1-something/Approved-RepackModel-0-1188c5996b8665c088b520ecf8a51666/source/sourcedir.tar.gz\"",
                                "sagemaker_program": "\"_repack_script_launcher.sh\"",
                                "sagemaker_container_log_level": "20",
                                "sagemaker_region": "\"eu-west-1\""
                            },
                            "DebugHookConfig": {
                                "S3OutputPath": "s3://sagemaker-eu-west-1-something/tensorflow-inference-2022-11-10-17-22-45-637",
                                "CollectionConfigurations": []
                            }
                        },
                        "Description": "Used to repack a model with customer scripts for a register/create model step"
                    },
                    {
                        "Name": "Approved-RegisterModel",
                        "Type": "RegisterModel",
                        "Arguments": {
                            "ModelPackageGroupName": "mlplatform-weight-size-prediction",
                            "ModelMetrics": {
                                "ModelQuality": {
                                    "Statistics": {
                                        "ContentType": "application/json",
                                        "S3Uri": "s3://wllp-ds/projects/mlplatform-weight-size-prediction/evaluation/evaluation.json"
                                    }
                                },
                                "Bias": {},
                                "Explainability": {}
                            },
                            "InferenceSpecification": {
                                "Containers": [
                                    {
                                        "Image": "something.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:2.4.1-cpu",
                                        "Environment": {
                                            "SAGEMAKER_PROGRAM": "inference.py",
                                            "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code",
                                            "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
                                            "SAGEMAKER_REGION": "eu-west-1"
                                        },
                                        "ModelDataUrl": {
                                            "Get": "Steps.Approved-RepackModel-0.ModelArtifacts.S3ModelArtifacts"
                                        }
                                    }
                                ],
                                "SupportedContentTypes": [
                                    "text/csv"
                                ],
                                "SupportedResponseMIMETypes": [
                                    "text/csv"
                                ],
                                "SupportedRealtimeInferenceInstanceTypes": [
                                    "ml.m5.2xlarge"
                                ],
                                "SupportedTransformInstanceTypes": [
                                    "ml.m5.2xlarge"
                                ]
                            },
                            "ModelApprovalStatus": "Approved"
                        }
                    },
                    {
                        "Name": "mlplatform-weight-size-prediction-something-RepackModel-0",
                        "Type": "Training",
                        "Arguments": {
                            "AlgorithmSpecification": {
                                "TrainingInputMode": "File",
                                "TrainingImage": "something.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3"
                            },
                            "OutputDataConfig": {
                                "S3OutputPath": "s3://sagemaker-eu-west-1-something/tensorflow-inference-2022-11-10-17-22-45-646"
                            },
                            "StoppingCondition": {
                                "MaxRuntimeInSeconds": 86400
                            },
                            "ResourceConfig": {
                                "VolumeSizeInGB": 30,
                                "InstanceCount": 1,
                                "InstanceType": "ml.m5.large"
                            },
                            "RoleArn": "arn:aws:iam::something:role/IAMRoleSageMakerBI",
                            "InputDataConfig": [
                                {
                                    "DataSource": {
                                        "S3DataSource": {
                                            "S3DataType": "S3Prefix",
                                            "S3Uri": {
                                                "Get": "Steps.Train-Model.ModelArtifacts.S3ModelArtifacts"
                                            },
                                            "S3DataDistributionType": "FullyReplicated"
                                        }
                                    },
                                    "ChannelName": "training"
                                }
                            ],
                            "HyperParameters": {
                                "inference_script": "\"inference.py\"",
                                "model_archive": {
                                    "Std:Join": {
                                        "On": "",
                                        "Values": [
                                            {
                                                "Get": "Steps.Train-Model.ModelArtifacts.S3ModelArtifacts"
                                            }
                                        ]
                                    }
                                },
                                "dependencies": "null",
                                "source_dir": "\"/Users/ricardohortelano/git/mlplatform-weight-size-prediction/src/weight_size_prediction/drivers/pipeline/remote/code/\"",
                                "sagemaker_submit_directory": "\"s3://sagemaker-eu-west-1-something/mlplatform-weight-size-prediction-5fad3689f867-RepackModel-0-1188c5996b8665c088b520ecf8a51666/source/sourcedir.tar.gz\"",
                                "sagemaker_program": "\"_repack_script_launcher.sh\"",
                                "sagemaker_container_log_level": "20",
                                "sagemaker_region": "\"eu-west-1\""
                            },
                            "DebugHookConfig": {
                                "S3OutputPath": "s3://sagemaker-eu-west-1-something/tensorflow-inference-2022-11-10-17-22-45-646",
                                "CollectionConfigurations": []
                            }
                        },
                        "Description": "Used to repack a model with customer scripts for a register/create model step"
                    },
                    {
                        "Name": "mlplatform-weight-size-prediction-something-CreateModel",
                        "Type": "Model",
                        "Arguments": {
                            "ExecutionRoleArn": "arn:aws:iam::something:role/IAMRoleSageMakerBI",
                            "PrimaryContainer": {
                                "Image": "something.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:2.4.1-cpu",
                                "Environment": {
                                    "SAGEMAKER_PROGRAM": "inference.py",
                                    "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code",
                                    "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
                                    "SAGEMAKER_REGION": "eu-west-1"
                                },
                                "ModelDataUrl": {
                                    "Get": "Steps.mlplatform-weight-size-prediction-something-RepackModel-0.ModelArtifacts.S3ModelArtifacts"
                                }
                            },
                            "Tags": [
                                ...
                            ]
                        },
                        "Description": "Model created automatically from script."
                    },
                    {
                        "Name": "DeployEndpoint",
                        "Type": "Lambda",
                        "Arguments": {
                            "model_name": {
                                "Get": "Steps.mlplatform-weight-size-prediction-5fad3689f867-CreateModel.ModelName"
                            },
                            "endpoint_config_name": "mlplatform-weight-size-prediction-prod-config-2022-11-10-18-22",
                            "endpoint_name": "mlplatform-weight-size-prediction-prod",
                            "instance_type": "ml.m5.large",
                            "initial_variant_weight": 1,
                            "initial_instance_count": 1,
                            "data_capture_enable": true,
                            "data_capture_destination_s3": "s3://wllp-ds/projects/mlplatform-weight-size-prediction/endpoint_data_capture/"
                        },
                        "FunctionArn": "arn:aws:lambda:eu-west-1:something:function:mlplatform-weight-size-prediction-prod-sagemaker-deploy",
                        "OutputParameters": [
                            {
                                "OutputName": "statusCode",
                                "OutputType": "String"
                            },
                            {
                                "OutputName": "body",
                                "OutputType": "String"
                            }
                        ]
                    }
                ],
                "ElseSteps": [
                    {
                        "Name": "Rejected-RepackModel-0",
                        "Type": "Training",
                        "Arguments": {
                            "AlgorithmSpecification": {
                                "TrainingInputMode": "File",
                                "TrainingImage": "something.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3"
                            },
                            "OutputDataConfig": {
                                "S3OutputPath": "s3://sagemaker-eu-west-1-something/tensorflow-inference-2022-11-10-17-22-44-929"
                            },
                            "StoppingCondition": {
                                "MaxRuntimeInSeconds": 86400
                            },
                            "ResourceConfig": {
                                "VolumeSizeInGB": 30,
                                "InstanceCount": 1,
                                "InstanceType": "ml.m5.large"
                            },
                            "RoleArn": "arn:aws:iam::something:role/IAMRoleSageMakerBI",
                            "InputDataConfig": [
                                {
                                    "DataSource": {
                                        "S3DataSource": {
                                            "S3DataType": "S3Prefix",
                                            "S3Uri": {
                                                "Get": "Steps.Train-Model.ModelArtifacts.S3ModelArtifacts"
                                            },
                                            "S3DataDistributionType": "FullyReplicated"
                                        }
                                    },
                                    "ChannelName": "training"
                                }
                            ],
                            "HyperParameters": {
                                "inference_script": "\"inference.py\"",
                                "model_archive": {
                                    "Std:Join": {
                                        "On": "",
                                        "Values": [
                                            {
                                                "Get": "Steps.Train-Model.ModelArtifacts.S3ModelArtifacts"
                                            }
                                        ]
                                    }
                                },
                                "dependencies": "null",
                                "source_dir": "\"/Users/ricardohortelano/git/mlplatform-weight-size-prediction/src/weight_size_prediction/drivers/pipeline/remote/code/\"",
                                "sagemaker_submit_directory": "\"s3://sagemaker-eu-west-1-something/Rejected-RepackModel-0-1188c5996b8665c088b520ecf8a51666/source/sourcedir.tar.gz\"",
                                "sagemaker_program": "\"_repack_script_launcher.sh\"",
                                "sagemaker_container_log_level": "20",
                                "sagemaker_region": "\"eu-west-1\""
                            },
                            "DebugHookConfig": {
                                "S3OutputPath": "s3://sagemaker-eu-west-1-something/tensorflow-inference-2022-11-10-17-22-44-929",
                                "CollectionConfigurations": []
                            }
                        },
                        "Description": "Used to repack a model with customer scripts for a register/create model step"
                    },
                    {
                        "Name": "Rejected-RegisterModel",
                        "Type": "RegisterModel",
                        "Arguments": {
                            "ModelPackageGroupName": "mlplatform-weight-size-prediction",
                            "ModelMetrics": {
                                "ModelQuality": {
                                    "Statistics": {
                                        "ContentType": "application/json",
                                        "S3Uri": "s3://wllp-ds/projects/mlplatform-weight-size-prediction/evaluation/evaluation.json"
                                    }
                                },
                                "Bias": {},
                                "Explainability": {}
                            },
                            "InferenceSpecification": {
                                "Containers": [
                                    {
                                        "Image": "something.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:2.4.1-cpu",
                                        "Environment": {
                                            "SAGEMAKER_PROGRAM": "inference.py",
                                            "SAGEMAKER_SUBMIT_DIRECTORY": "/opt/ml/model/code",
                                            "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
                                            "SAGEMAKER_REGION": "eu-west-1"
                                        },
                                        "ModelDataUrl": {
                                            "Get": "Steps.Rejected-RepackModel-0.ModelArtifacts.S3ModelArtifacts"
                                        }
                                    }
                                ],
                                "SupportedContentTypes": [
                                    "text/csv"
                                ],
                                "SupportedResponseMIMETypes": [
                                    "text/csv"
                                ],
                                "SupportedRealtimeInferenceInstanceTypes": [
                                    "ml.m5.2xlarge"
                                ],
                                "SupportedTransformInstanceTypes": [
                                    "ml.m5.2xlarge"
                                ]
                            },
                            "ModelApprovalStatus": "Rejected"
                        }
                    }
                ]
            }
        }
    ]
}

@brockwade633
Contributor

Hi @RicardoHS, thanks a lot for providing all the information. One other clarifying question to confirm the issue: when you first encountered this behavior, were all the code files located in the same local directory, such as self._config.source_dir, rather than split across subdirectories like data_preprocess?

@RicardoHS
Author

Hi @brockwade633. That's exactly how the code was laid out when the behaviour was encountered.
Sorry, development has continued while this issue was open, and I did not realize the code paths had changed since that day.

@brockwade633
Contributor

brockwade633 commented Nov 28, 2022

Hi @RicardoHS, thanks for your reply. The behavior you describe is a bug: the runproc.sh artifact created by a FrameworkProcessor is uploaded to the same directory as the other code artifacts, when it should live in a separate directory. As a result, runproc.sh is overwritten whenever multiple processing steps are built with a FrameworkProcessor. We will open a PR to fix the issue, although the fix cannot be released until after re:Invent week.
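
To make the failure mode concrete, here is a minimal sketch that models S3 as a plain dictionary. The function `upload_step_code`, the `code/<hash>/` prefix layout, the MD5 over the directory path, and the script name `preprocess.py` are illustrative assumptions, not the SDK's actual implementation. The only point is that two steps sharing one prefix leave a single runproc.sh behind.

```python
import hashlib

# Hypothetical stand-in for an S3 bucket: key -> object body.
fake_bucket = {}


def upload_step_code(source_dir, entry_script):
    """Simulate uploading a step's launcher under a prefix derived from source_dir.

    Assumption for illustration: the prefix depends only on source_dir,
    so steps that share a directory also share a prefix.
    """
    prefix = hashlib.md5(source_dir.encode("utf-8")).hexdigest()
    key = f"code/{prefix}/runproc.sh"
    # Each launcher invokes its own entry script; a later upload replaces it.
    fake_bucket[key] = f"python {entry_script}"
    return key


first_key = upload_step_code("remote/code/", "preprocess.py")
second_key = upload_step_code("remote/code/", "train_test_split.py")

# Both steps resolve to the same key, and only the last launcher survives.
collided = first_key == second_key
surviving_launcher = fake_bucket[first_key]
```

In this toy model the first step would end up running the second step's script, which matches the symptom reported above.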

Workaround

In the meantime, I was able to produce a valid pipeline with distinct upload locations by using the path structures you included in the last code snippet, using additional subdirectories such as data_preprocess, or data_split:

# Imports assumed by this fragment; it runs inside a method, hence `self`.
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

split_step = ProcessingStep(
    name="Split-Data",
    step_args=self.data_processor.run(
        code="train_test_split.py",
        # A per-step subdirectory gives this step its own S3 upload folder.
        source_dir=self._config.source_dir + "data_split/",
        inputs=[
            ProcessingInput(
                source=prepare_step.properties.ProcessingOutputConfig.Outputs[
                    "task_preprocess_output"
                ].S3Output.S3Uri,
                destination=input_path,
            ),
        ],
        outputs=[
            ProcessingOutput(
                output_name="task_split_output",
                source=output_path,
                destination=self._config.split_s3_path,
            ),
        ],
        arguments=["--input-path", input_path, "--output-path", output_path],
    ),
    cache_config=self._cache_config,
)

As a temporary workaround, separating the folders produces a different S3 upload folder for each step and prevents runproc.sh from being overwritten. Please let us know whether you can reproduce this successfully. I will keep this thread updated with the latest info, thank you.
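
One way to check whether the workaround took effect is to scan the pipeline definition for steps that share a runproc.sh location. The helper below is a sketch, not part of the SDK: `find_shared_launchers` is a hypothetical name, and it relies only on the layout visible in the definition pasted earlier in this thread (steps carrying a `Name` and a `Type`, with `S3Uri` strings nested inside them).

```python
import json
from collections import defaultdict


def find_shared_launchers(definition_json):
    """Return {s3_uri: [step names]} for runproc.sh URIs used by more than one step."""
    uses = defaultdict(list)

    def walk(node, step_name):
        if isinstance(node, dict):
            # A dict carrying both "Name" and "Type" is treated as a step.
            if "Name" in node and "Type" in node:
                step_name = node["Name"]
            for key, value in node.items():
                is_launcher = (
                    key == "S3Uri"
                    and isinstance(value, str)
                    and value.endswith("/runproc.sh")
                )
                if is_launcher:
                    uses[value].append(step_name)
                else:
                    walk(value, step_name)
        elif isinstance(node, list):
            for item in node:
                walk(item, step_name)

    walk(json.loads(definition_json), None)
    return {uri: names for uri, names in uses.items() if len(names) > 1}
```

Passing it the JSON string returned by pipeline.definition() should give an empty dictionary once every step uploads its launcher to a distinct folder.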

@RicardoHS
Author

Hi @brockwade633, I have tried putting every script in a separate folder, and with v2.116.0 the pipeline now throws this new error:

ClientError: Failed to invoke sagemaker:CreateProcessingJob. Error Details: Input and Output names must be globally unique: [code]

@brockwade633
Contributor

brockwade633 commented Nov 29, 2022

@RicardoHS, thanks for letting us know. We have run into a couple of issues specifically around FrameworkProcessors and SparkProcessors recently, and this looks like an example of an idempotency problem that is being tracked here: #3451. In your code, are you calling pipeline.definition(), pipeline.create(), pipeline.update(), pipeline.upsert(), or step.arguments multiple times?
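
For readers unfamiliar with this class of bug, the sketch below shows how a non-idempotent builder can lead to a duplicate-name error like the one above. It is a toy model, not SDK code: `ToyProcessor`, its methods, and `duplicate_names` are hypothetical, and the assumption is simply that every call that renders arguments appends a `code` input to shared state.

```python
class ToyProcessor:
    """Toy model of a processor whose argument rendering mutates shared state."""

    def __init__(self):
        self.inputs = []

    def render_arguments(self):
        # Bug being illustrated: appends on every call instead of building a fresh list.
        self.inputs.append({"InputName": "code"})
        return [dict(item) for item in self.inputs]

    def render_arguments_idempotent(self):
        # Fixed variant: derive the result without touching self.inputs.
        return [{"InputName": "code"}]


def duplicate_names(inputs):
    """Return input names that occur more than once, in sorted order."""
    names = [item["InputName"] for item in inputs]
    return sorted({name for name in names if names.count(name) > 1})


processor = ToyProcessor()
first = processor.render_arguments()   # one "code" input
second = processor.render_arguments()  # two "code" inputs after the second call
```

In this model, rendering the arguments once is harmless, while rendering them a second time yields two inputs with the same name.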

@brockwade633
Contributor

Hi @RicardoHS, a new sagemaker package version is available (2.121.1) that includes fixes for both the runproc.sh file issue and the idempotency problems. Can you confirm that you no longer see the runproc.sh overwriting behavior or the duplicate-input client error? Thanks.
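
A quick way to tell whether an environment is at or past that release is to compare version tuples. This is a minimal sketch: `parse_version` and `has_fix` are hypothetical helpers, and the parsing assumes plain numeric releases such as 2.121.1 (a suffix like `rc1` would need a real version parser).

```python
def parse_version(text):
    """Turn a dotted release string such as '2.121.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".")[:3])


# Release named in this thread as containing both fixes.
FIXED_IN = (2, 121, 1)


def has_fix(installed_version):
    """Return True when the installed release is 2.121.1 or later."""
    return parse_version(installed_version) >= FIXED_IN
```

The installed version string can be read with importlib.metadata.version("sagemaker") on Python 3.8 or later. The check only says whether a release is at or after 2.121.1. It does not imply that every earlier release is affected, since v2.112.0 did not show the problem.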

@RicardoHS
Author

Hi @brockwade633, after updating to 2.121.1 everything seems to work. Thank you very much.

@brockwade633
Contributor

Okay, great! You're welcome. Closing out the issue.
