
TransformStep in Pipelines mishandling source_dir attribute (PyTorch/Script Mode) #2549


Closed
marianokamp opened this issue Aug 3, 2021 · 7 comments
Labels: component: pipelines (Relates to the SageMaker Pipeline Platform), type: bug

Comments

@marianokamp
Collaborator

marianokamp commented Aug 3, 2021

When creating a Transform, the source_dir attribute points to the local directory containing the source code. That directory is packed up and uploaded to S3, and when the Transform is executed, the running Transform container uses the code that has been transferred from that S3 location.

This works with the SDK directly, but not with SM Pipelines, where source_dir and the consequent upload of the source are not handled properly.

What is working?

(Plain use of the SDK to instantiate the model and transformer)

pytorch_model = PyTorchModel(model_data=model_url,
                             role=role,
                             py_version='py3',
                             framework_version='1.8',
                             source_dir='.',
                             entry_point='inference.py')

pytorch_transformer = pytorch_model.transformer(instance_type='ml.g4dn.xlarge',
                                                instance_count=1,
                                                accept='text/csv',
                                                output_path=[..],
                                                env={[..]})

This is working in the SDK. The source dir is packaged up and transferred to S3, and the resulting model contains the property SAGEMAKER_SUBMIT_DIRECTORY = s3://.../model.tar.gz.

Output of a describe-model call and inspection of the tarball:

aws sagemaker describe-model --model-name pytorch-inference-2021-07-28-19-50-16-101
{
    "ModelName": "pytorch-inference-2021-07-28-19-50-16-101",
    "PrimaryContainer": {
        "Image": "763104351884.dkr.ecr.eu-west-1.amazonaws.com/pytorch-inference:1.8-gpu-py3",
        "Mode": "SingleModel",
        "ModelDataUrl": "s3://sagemaker-eu-west-1-XXX/pytorch-inference-2021-07-28-19-43-05-552/model.tar.gz",
        "Environment": {
            "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
            "SAGEMAKER_PROGRAM": "inference.py",
            "SAGEMAKER_REGION": "eu-west-1",
            "SAGEMAKER_SUBMIT_DIRECTORY": "s3://sagemaker-eu-west-1-XXX/pytorch-inference-2021-07-28-19-43-05-552/model.tar.gz"
        }
    },
    "ExecutionRoleArn": "arn:aws:iam::XXX:role/service-role/AmazonSageMaker-ExecutionRole-20191105T200284",
    "CreationTime": "2021-07-28T21:50:16.627000+02:00",
    "ModelArn": "arn:aws:sagemaker:eu-west-1:XXX:model/pytorch-inference-2021-07-28-19-50-16-101",
    "EnableNetworkIsolation": false
}

aws s3 cp s3://sagemaker-eu-west-1-XXX/pytorch-inference-2021-07-28-19-43-05-552/model.tar.gz /tmp/m.tgz
tar -tf /tmp/m.tgz
/
args_init.pth
classes.pth
code
[..]
code/inference.py
[..]
model.pth
training_state.pth

^^^ Note the presence of code/inference.py in the tarball and that SAGEMAKER_SUBMIT_DIRECTORY points to the model.tar.gz on S3.

What is not working?

(Equivalent steps with SM Pipelines)

pytorch_model = PyTorchModel(model_data=model_url,
                             role=role,
                             py_version='py3',
                             framework_version='1.8',
                             source_dir='.',
                             entry_point='inference.py') # same as above

model_input = CreateModelInput(instance_type=estimator.instance_type)

create_model_step = CreateModelStep(name=f'create-model-{model_name}',
                                    model=pytorch_model,
                                    inputs=model_input)

transformer = Transformer(
    model_name=create_model_step.properties.ModelName,
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
    output_path=[..],
    env={'SAGEMAKER_MODEL_SERVER_TIMEOUT' : '3600' },
    accept='text/csv'
)

transform_step = TransformStep(
    name='create-pseudo-labels',
    transformer=transformer,
    inputs=TransformInput(data=[..],
                          split_type='Line',
                          content_type='text/csv'))

When running the Transform I get the dreaded "Please provide a model_fn implementation" error.

The main difference seems to be that the SAGEMAKER_SUBMIT_DIRECTORY property has not been updated to s3://.../model.tar.gz as in the working case above, but still shows the local path file://., and the tarball does not contain the inference code.

Full output:


aws sagemaker describe-model --model-name pipelines-dk7q607at2b1-create-model-lstm-YuGjeljWaM
{
    "ModelName": "pipelines-dk7q607at2b1-create-model-lstm-YuGjeljWaM",
    "PrimaryContainer": {
        "Image": "763104351884.dkr.ecr.eu-west-1.amazonaws.com/pytorch-inference:1.8-gpu-py3",
        "Mode": "SingleModel",
        "ModelDataUrl": "s3://sagemaker-eu-west-1-XXX/pipelines-dk7q607at2b1-train-lstm-yKvVNN1ZA2/output/model.tar.gz",
        "Environment": {
            "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
            "SAGEMAKER_PROGRAM": "inference.py",
            "SAGEMAKER_REGION": "eu-west-1",
            "SAGEMAKER_SUBMIT_DIRECTORY": "file://."
        }
    },
    "ExecutionRoleArn": "arn:aws:iam::XXX:role/service-role/AmazonSageMaker-ExecutionRole-20191105T200284",
    "CreationTime": "2021-08-03T12:12:50.553000+02:00",
    "ModelArn": "arn:aws:sagemaker:eu-west-1:XXX:model/pipelines-dk7q607at2b1-create-model-lstm-yugjeljwam",
    "EnableNetworkIsolation": false
}
aws s3 cp s3://sagemaker-eu-west-1-XXX/pipelines-dk7q607at2b1-train-lstm-yKvVNN1ZA2/output/model.tar.gz /tmp/m.tgz
tar -tf /tmp/m.tgz
model.pth
args_init.pth
training_state.pth
classes.pth
tokenizer.pth
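
For reference, the manual workaround I have in mind would be to repack the training output together with the inference code and point the model at the repacked archive, mirroring what the SDK does outside Pipelines. A minimal sketch only (not run as part of the pipeline; bucket, keys, and the final PyTorchModel wiring are placeholders):

import os
import tarfile
import tempfile

import boto3

# Placeholders; adjust bucket/keys to your account.
bucket = 'sagemaker-eu-west-1-XXX'
model_key = 'some-training-job/output/model.tar.gz'   # training output without code/
repacked_key = 'repacked/model.tar.gz'                # target for the repacked archive

s3 = boto3.client('s3')

with tempfile.TemporaryDirectory() as tmp:
    local_model = os.path.join(tmp, 'model.tar.gz')
    s3.download_file(bucket, model_key, local_model)

    # Unpack the training artifacts, then repack them together with the
    # inference code under code/.
    extract_dir = os.path.join(tmp, 'model')
    with tarfile.open(local_model) as tar:
        tar.extractall(extract_dir)

    repacked = os.path.join(tmp, 'repacked.tar.gz')
    with tarfile.open(repacked, 'w:gz') as tar:
        for name in os.listdir(extract_dir):
            tar.add(os.path.join(extract_dir, name), arcname=name)
        tar.add('inference.py', arcname='code/inference.py')

    s3.upload_file(repacked, bucket, repacked_key)

# A PyTorchModel created from the repacked archive should then end up with
# SAGEMAKER_SUBMIT_DIRECTORY pointing at s3://<bucket>/<repacked_key>.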

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 2.49.2
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
  • Framework version: 1.8
  • Python version: 3.8.3
  • CPU or GPU: GPU
  • Custom Docker image (Y/N): N

You can Slack me for details; I am mkamp@.

@ahsan-z-khan added the component: pipelines (Relates to the SageMaker Pipeline Platform) label on Aug 3, 2021
@ahsan-z-khan
Member

Hi @marianokamp,

Thanks for using Amazon SageMaker.

@jerrypeng7773 Can you take a look at the issue for this pipeline bug?

@eugeneteoh

eugeneteoh commented Jan 22, 2022

Hi, I'm getting the same issue. It seems like PyTorchModel does not repack the model with the source when it is created through SageMaker Pipelines. Is there a workaround for this?

@jerrypeng7773
Contributor

We are working on this.

@rohangpatil

#3091 is possibly related to this.

@jerrypeng7773
Contributor

@marianokamp Can you please retry? I think this should be addressed now. The SAGEMAKER_SUBMIT_DIRECTORY should be consistent across models created by .transform or by pipeline execution. I'm closing this issue; feel free to re-open if you think there is any other issue.

@marianokamp
Collaborator Author

Meanwhile I created my own container and related plumbing to deal with this issue from last year. I would like to re-test, @jerrypeng7773, but my code has evolved and I am not sure what the equivalent with the TransformStep is.

My processing step, which I created as a workaround for the broken transform, creates a named output, but I leave it to Pipelines to name the location:

create_pseudo_labels_step = ProcessingStep(
    [..]
    outputs=[ProcessingOutput(output_name='pseudo_labeled',
                              source='/opt/ml/processing/output')],
    [..]
)

I then use this output as an input in a subsequent step:

[..]
TrainingStep(
    ..,
    {'train': TrainingInput(p['env/locations/prepared_reviews_labeled_data']),
     'supplementary': TrainingInput(
         s3_data=create_pseudo_labels_step.properties.ProcessingOutputConfig.Outputs['pseudo_labeled'].S3Output.S3Uri)}
)
[..]

This works. But for that I had to build my own container (no source_dir for PyTorch) and re-implement the batch transform logic.

So back to the fixed version of the batch transform step ...

I tried to replace it, but could not find much control over the outputs, especially not how to name them so that I can look them up later. It looks like I can leave output_path=None, which should work since the docs say the default bucket will be used. Except it doesn't. And anyway, output_path sits on the Transformer, while it should, IMHO, be on the Step. But the docs say no: no output control as far as I could find.


transformer = Transformer(
    model_name=teacher_create_model_step.properties.ModelName,
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
    #output_path='<some hardcoded s3 url that I don't want to use>',
    env={'SAGEMAKER_MODEL_SERVER_TIMEOUT' : '3600' },
    accept='text/csv'
)
[..]
transform_step = TransformStep(
    name='create-pseudo-labels',
    transformer=transformer,
    inputs=TransformInput(data=p['env/locations/prepared_reviews_unlabeled_data'],
                          split_type='Line',
                          content_type='text/csv'))

[..]
TrainingStep(..,
    {'train': TrainingInput(p['env/locations/prepared_reviews_labeled_data']),
     'supplementary': TrainingInput(
         s3_data=transform_step.properties.TransformOutput.S3OutputPath)})

I get the following error during the upsert.

ClientError: An error occurred (ValidationException) when calling the UpdatePipeline operation: Unable to parse pipeline definition. Property 'S3OutputPath' with value 'null' is of unexpected type. Types must be booleans, strings, numbers or SageMaker Pipelines functions
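
One thing I may try next (only a sketch; the 'pseudo-labels' prefix is made up) is to resolve the default bucket myself and pass a concrete output_path to the Transformer, so the pipeline definition does not end up with a null S3OutputPath:

import sagemaker
from sagemaker.transformer import Transformer

# Resolve the default bucket up front so the step definition carries a
# concrete S3 URI (prefix name is made up).
default_bucket = sagemaker.Session().default_bucket()
output_path = f's3://{default_bucket}/pseudo-labels'

transformer = Transformer(
    model_name=teacher_create_model_step.properties.ModelName,
    instance_type='ml.g4dn.xlarge',
    instance_count=1,
    output_path=output_path,
    env={'SAGEMAKER_MODEL_SERVER_TIMEOUT': '3600'},
    accept='text/csv'
)

# Downstream steps can then keep referencing
# transform_step.properties.TransformOutput.S3OutputPath as before.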

@jerrypeng7773
Contributor

jerrypeng7773 commented May 25, 2022

@marianokamp sorry for the inconvenience. We released a new way to construct the step in 2.89.0; can you try it out to see if it fixes the issue?

For processing, we can do

from sagemaker.workflow.pipeline_context import PipelineSession

session = PipelineSession()

processor = FrameworkProcessor(..., sagemaker_session=session)

step_args = processor.run(inputs=..., source_dir="...")

step_sklearn = ProcessingStep(
    name="MyProcessingStep",
    step_args=step_args,
)

For transformer, you can do

transformer = Transformer(
    model_name=teacher_create_model_step.properties.ModelName,
    sagemaker_session=session,
    ...
)

step_args = transformer.transform(...)

step_transform = TransformStep(
    name='create-pseudo-labels',
    step_args=step_args
)

In summary, we introduced the PipelineSession. This special session does not trigger the job immediately when you call processor.run or transformer.transform; instead, it captures the request arguments required to run the job and delegates them to the step, which starts the job later during pipeline execution. Meanwhile, for processing, you can use .run and supply values such as "source_dir" as usual.

This works for estimator, transformer, and tuner too.
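
Putting the pieces together for the PyTorch case above, a rough sketch could look like the following (model_url, role, and the data URI are placeholders carried over from the earlier snippets; ModelStep and model.create() are not shown above and assume a recent SDK version):

from sagemaker.pytorch import PyTorchModel
from sagemaker.transformer import Transformer
from sagemaker.workflow.model_step import ModelStep
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import TransformStep

session = PipelineSession()

# source_dir/entry_point are repacked at pipeline execution time because the
# model is created through the pipeline session.
pytorch_model = PyTorchModel(model_data=model_url,
                             role=role,
                             py_version='py3',
                             framework_version='1.8',
                             source_dir='.',
                             entry_point='inference.py',
                             sagemaker_session=session)

create_model_step = ModelStep(name='create-model',
                              step_args=pytorch_model.create(instance_type='ml.g4dn.xlarge'))

transformer = Transformer(model_name=create_model_step.properties.ModelName,
                          instance_type='ml.g4dn.xlarge',
                          instance_count=1,
                          accept='text/csv',
                          env={'SAGEMAKER_MODEL_SERVER_TIMEOUT': '3600'},
                          sagemaker_session=session)

transform_step = TransformStep(
    name='create-pseudo-labels',
    step_args=transformer.transform(data='s3://bucket/prefix/unlabeled.csv',  # placeholder
                                    split_type='Line',
                                    content_type='text/csv'))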
