
TransformStep in Pipelines mishandling source_dir attribute (PyTorch/Script Mode) #2549


Closed
marianokamp opened this issue Aug 3, 2021 · 7 comments
Labels: component: pipelines (Relates to the SageMaker Pipeline Platform), type: bug

Comments

@marianokamp
Collaborator

marianokamp commented Aug 3, 2021

When creating a Transform, the source_dir attribute points to the local directory containing the source code. That directory is packed up and uploaded to S3, and when the Transform is executed, the running Transform container uses the code that has been transferred from that S3 location.

This works with the SDK directly, but not with SM Pipelines, where source_dir and the consequent upload of the source are not handled properly.

What is working?

(Plain use of the SDK to instantiate the model and transformer)

pytorch_model = PyTorchModel(model_data=model_url,
                             role=role,
                             py_version='py3',
                             framework_version='1.8',
                             source_dir='.',
                             entry_point='inference.py')

pytorch_transformer = pytorch_model.transformer(instance_type='ml.g4dn.xlarge',
                                                instance_count=1,
                                                accept='text/csv',
                                                output_path=[..],
                                                env={[..]})

This is working in the SDK. The source dir is packaged up and transferred to S3, and the resulting model contains the property SAGEMAKER_SUBMIT_DIRECTORY = s3://.../model.tar.gz.

Output of a describe-model call and inspection of the tarball:

aws sagemaker describe-model --model-name pytorch-inference-2021-07-28-19-50-16-101
{
    "ModelName": "pytorch-inference-2021-07-28-19-50-16-101",
    "PrimaryContainer": {
        "Image": "763104351884.dkr.ecr.eu-west-1.amazonaws.com/pytorch-inference:1.8-gpu-py3",
        "Mode": "SingleModel",
        "ModelDataUrl": "s3://sagemaker-eu-west-1-XXX/pytorch-inference-2021-07-28-19-43-05-552/model.tar.gz",
        "Environment": {
            "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
            "SAGEMAKER_PROGRAM": "inference.py",
            "SAGEMAKER_REGION": "eu-west-1",
            "SAGEMAKER_SUBMIT_DIRECTORY": "s3://sagemaker-eu-west-1-XXX/pytorch-inference-2021-07-28-19-43-05-552/model.tar.gz"
        }
    },
    "ExecutionRoleArn": "arn:aws:iam::XXX:role/service-role/AmazonSageMaker-ExecutionRole-20191105T200284",
    "CreationTime": "2021-07-28T21:50:16.627000+02:00",
    "ModelArn": "arn:aws:sagemaker:eu-west-1:XXX:model/pytorch-inference-2021-07-28-19-50-16-101",
    "EnableNetworkIsolation": false
}

aws s3 cp s3://sagemaker-eu-west-1-XXX/pytorch-inference-2021-07-28-19-43-05-552/model.tar.gz /tmp/m.tgz
tar -tf /tmp/m.tgz
/
args_init.pth
classes.pth
code
[..]
code/inference.py
[..]
model.pth
training_state.pth

^^^ Note the presence of code/inference.py in the tarball and that SAGEMAKER_SUBMIT_DIRECTORY points to the model.tar.gz on S3.

What is not working?

(Equivalent steps with SM Pipelines)

pytorch_model = PyTorchModel(model_data=model_url,
                             role=role,
                             py_version='py3',
                             framework_version='1.8',
                             source_dir='.',
                             entry_point='inference.py') # same as above

model_input = CreateModelInput(instance_type=estimator.instance_type)

create_model_step = CreateModelStep(name=f'create-model-{model_name}',
                                    model=pytorch_model,
                                    inputs=model_input)

transformer = Transformer(
    model_name=create_model_step.properties.ModelName,
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
    output_path=[..],
    env={'SAGEMAKER_MODEL_SERVER_TIMEOUT' : '3600' },
    accept='text/csv'
)

transform_step = TransformStep(
    name='create-pseudo-labels',
    transformer=transformer,
    inputs=TransformInput(data=[..],
                          split_type='Line',
                          content_type='text/csv'))

When running the Transform I get the dreaded "Please provide a model_fn implementation" error.

The main difference seems to be that the SAGEMAKER_SUBMIT_DIRECTORY property has not been updated to s3://.../model.tar.gz as in the working case above, but still shows the local path file://., and the tarball does not contain the inference code.

Full output:


aws sagemaker describe-model --model-name pipelines-dk7q607at2b1-create-model-lstm-YuGjeljWaM
{
    "ModelName": "pipelines-dk7q607at2b1-create-model-lstm-YuGjeljWaM",
    "PrimaryContainer": {
        "Image": "763104351884.dkr.ecr.eu-west-1.amazonaws.com/pytorch-inference:1.8-gpu-py3",
        "Mode": "SingleModel",
        "ModelDataUrl": "s3://sagemaker-eu-west-1-XXX/pipelines-dk7q607at2b1-train-lstm-yKvVNN1ZA2/output/model.tar.gz",
        "Environment": {
            "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
            "SAGEMAKER_PROGRAM": "inference.py",
            "SAGEMAKER_REGION": "eu-west-1",
            "SAGEMAKER_SUBMIT_DIRECTORY": "file://."
        }
    },
    "ExecutionRoleArn": "arn:aws:iam::XXX:role/service-role/AmazonSageMaker-ExecutionRole-20191105T200284",
    "CreationTime": "2021-08-03T12:12:50.553000+02:00",
    "ModelArn": "arn:aws:sagemaker:eu-west-1:XXX:model/pipelines-dk7q607at2b1-create-model-lstm-yugjeljwam",
    "EnableNetworkIsolation": false
}
aws s3 cp s3://sagemaker-eu-west-1-XXX/pipelines-dk7q607at2b1-train-lstm-yKvVNN1ZA2/output/model.tar.gz /tmp/m.tgz
tar -tf /tmp/m.tgz
model.pth
args_init.pth
training_state.pth
classes.pth
tokenizer.pth
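
For reference, the manual workaround I have in mind would be to repack the training output together with the inference code and point the model at the repacked archive, mirroring what the SDK does outside Pipelines. A minimal sketch only (not run as part of the pipeline; bucket, keys, and the final PyTorchModel wiring are placeholders):

import os
import tarfile
import tempfile

import boto3

# Placeholders; adjust bucket/keys to your account.
bucket = 'sagemaker-eu-west-1-XXX'
model_key = 'some-training-job/output/model.tar.gz'   # training output without code/
repacked_key = 'repacked/model.tar.gz'                # target for the repacked archive

s3 = boto3.client('s3')

with tempfile.TemporaryDirectory() as tmp:
    local_model = os.path.join(tmp, 'model.tar.gz')
    s3.download_file(bucket, model_key, local_model)

    # Unpack the training artifacts, then repack them together with the
    # inference code under code/.
    extract_dir = os.path.join(tmp, 'model')
    with tarfile.open(local_model) as tar:
        tar.extractall(extract_dir)

    repacked = os.path.join(tmp, 'repacked.tar.gz')
    with tarfile.open(repacked, 'w:gz') as tar:
        for name in os.listdir(extract_dir):
            tar.add(os.path.join(extract_dir, name), arcname=name)
        tar.add('inference.py', arcname='code/inference.py')

    s3.upload_file(repacked, bucket, repacked_key)

# A PyTorchModel created from the repacked archive should then end up with
# SAGEMAKER_SUBMIT_DIRECTORY pointing at s3://<bucket>/<repacked_key>.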

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 2.49.2
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
  • Framework version: 1.8
  • Python version: 3.8.3
  • CPU or GPU: GPU
  • Custom Docker image (Y/N): N

You can Slack me for details; I am mkamp@.

@ahsan-z-khan added the component: pipelines (Relates to the SageMaker Pipeline Platform) label on Aug 3, 2021
@ahsan-z-khan
Member

Hi @marianokamp,

Thanks for using Amazon SageMaker.

@jerrypeng7773 Can you take a look at the issue for this pipeline bug?

@eugeneteoh

eugeneteoh commented Jan 22, 2022

Hi, I'm getting the same issue. It seems like PyTorchModel does not repack the model with the source when it is created through SageMaker Pipelines. Is there a workaround for this?

@jerrypeng7773
Contributor

We are working on this.

@rohangpatil

#3091 is possibly related to this.

@jerrypeng7773
Contributor

@marianokamp Can you please retry? I think this should be addressed now. The SAGEMAKER_SUBMIT_DIRECTORY should be consistent across models created by .transform or by pipeline execution. I'm closing this issue; feel free to re-open if you think there is any other issue.

@marianokamp
Collaborator Author

Meanwhile I created my own container and related plumbing to deal with this issue from last year. I would like to re-test, @jerrypeng7773, but my code has evolved and I am not sure what the equivalent with the TransformStep is.

My processing step, which I created as a workaround for the broken transform, creates a named output, but I leave it to Pipelines to name the location:

create_pseudo_labels_step = ProcessingStep(
    [..]
    outputs=[ProcessingOutput(output_name='pseudo_labeled',
                              source='/opt/ml/processing/output')],
    [..]
)

I then use this output as an input in a subsequent step:

[..]
TrainingStep(
    ..,
    {'train': TrainingInput(p['env/locations/prepared_reviews_labeled_data']),
     'supplementary': TrainingInput(
         s3_data=create_pseudo_labels_step.properties.ProcessingOutputConfig.Outputs['pseudo_labeled'].S3Output.S3Uri)}
)
[..]

This works. But for that I had to build my own container (no source_dir for PyTorch) and re-implement the batch transform logic.

So back to the fixed version of the batch transform step ...

I tried to replace it, but could not find much control over the outputs, especially not how to name them so that I can look them up later. It looks like I can leave output_path=None, which should work since the docs say the default bucket will be used. Except it doesn't. And anyway, output_path sits on the Transformer, while it should, IMHO, be on the Step. But the docs say no: no output control as far as I could find.


transformer = Transformer(
    model_name=teacher_create_model_step.properties.ModelName,
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
    #output_path='<some hardcoded s3 url that I don't want to use>',
    env={'SAGEMAKER_MODEL_SERVER_TIMEOUT' : '3600' },
    accept='text/csv'
)
[..]
transform_step = TransformStep(
    name='create-pseudo-labels',
    transformer=transformer,
    inputs=TransformInput(data=p['env/locations/prepared_reviews_unlabeled_data'],
                          split_type='Line',
                          content_type='text/csv'))

[..]
TrainingStep(..,
    {'train': TrainingInput(p['env/locations/prepared_reviews_labeled_data']),
     'supplementary': TrainingInput(
         s3_data=transform_step.properties.TransformOutput.S3OutputPath)})

I get the following error during the upsert.

ClientError: An error occurred (ValidationException) when calling the UpdatePipeline operation: Unable to parse pipeline definition. Property 'S3OutputPath' with value 'null' is of unexpected type. Types must be booleans, strings, numbers or SageMaker Pipelines functions
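
One thing I may try next (only a sketch; the 'pseudo-labels' prefix is made up) is to resolve the default bucket myself and pass a concrete output_path to the Transformer, so the pipeline definition does not end up with a null S3OutputPath:

import sagemaker
from sagemaker.transformer import Transformer

# Resolve the default bucket up front so the step definition carries a
# concrete S3 URI (prefix name is made up).
default_bucket = sagemaker.Session().default_bucket()
output_path = f's3://{default_bucket}/pseudo-labels'

transformer = Transformer(
    model_name=teacher_create_model_step.properties.ModelName,
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
    output_path=output_path,
    env={'SAGEMAKER_MODEL_SERVER_TIMEOUT': '3600'},
    accept='text/csv'
)

# Downstream steps can then keep referencing
# transform_step.properties.TransformOutput.S3OutputPath as before.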

@jerrypeng7773
Contributor

jerrypeng7773 commented May 25, 2022

@marianokamp sorry for the inconvenience. We released a new way to construct the step in 2.89.0; can you try it out to see if it fixes the issue?

For processing, we can do

from sagemaker.workflow.pipeline_context import PipelineSession

session = PipelineSession()

processor = FrameworkProcessor(..., sagemaker_session=session)

step_args = processor.run(inputs=..., source_dir="...")

step_sklearn = ProcessingStep(
    name="MyProcessingStep",
    step_args=step_args,
)

For transformer, you can do

transformer = Transformer(
    model_name=teacher_create_model_step.properties.ModelName,
    sagemaker_session=session,
    ...
)

step_args = transformer.transform(...)

step_transform = TransformStep(
    name='create-pseudo-labels',
    step_args=step_args
)

In summary, we introduced the PipelineSession. This special session does not trigger the job immediately when you call processor.run or transformer.transform; instead, it captures the request arguments required to run the job and delegates them to the step, which starts the job later during pipeline execution. Meanwhile, for processing, you can use .run and supply values such as "source_dir" as usual.

This works for estimator, transformer, and tuner too.
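
Putting the pieces together for the PyTorch case above, a rough sketch could look like the following (model_url, role, and the data URI are placeholders carried over from the earlier snippets; ModelStep and model.create() are not shown above and assume a recent SDK version):

from sagemaker.pytorch import PyTorchModel
from sagemaker.transformer import Transformer
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import TransformStep

session = PipelineSession()

# source_dir/entry_point are repacked at pipeline execution time because the
# model is created through the pipeline session.
pytorch_model = PyTorchModel(model_data=model_url,
                             role=role,
                             py_version='py3',
                             framework_version='1.8',
                             source_dir='.',
                             entry_point='inference.py',
                             sagemaker_session=session)

create_model_step = ModelStep(name='create-model',
                              step_args=pytorch_model.create(instance_type='ml.g4dn.xlarge'))

transformer = Transformer(model_name=create_model_step.properties.ModelName,
                          instance_type='ml.g4dn.xlarge',
                          instance_count=1,
                          accept='text/csv',
                          env={'SAGEMAKER_MODEL_SERVER_TIMEOUT': '3600'},
                          sagemaker_session=session)

transform_step = TransformStep(
    name='create-pseudo-labels',
    step_args=transformer.transform(data='s3://bucket/prefix/unlabeled.csv',  # placeholder
                                    split_type='Line',
                                    content_type='text/csv'))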
