Describe the bug
When deploying a packaged PyTorch model using the `PyTorchModel` class, I can successfully deploy and call the predict function. But as soon as I pass the same model to a `MultiDataModel` class, the deployment goes through, yet when I call `predictor.predict(data=data, target_model='model.tar.gz')` I get the following error:
An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from model with message "Your invocation timed out while waiting for a response from container model. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.".
I'm not sure whether this is related to the 'Please provide a model_fn implementation.' error I see in CloudWatch, but the `model_fn` function is actually implemented; `MultiDataModel` somehow doesn't load it.
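For reference, `model_fn` follows the standard hook that the SageMaker PyTorch serving container looks for. A minimal sketch of such a script (the actual inference code isn't shown in this report, and the weights filename is hypothetical):

```python
# inference.py (sketch) -- packaged under code/ inside model.tar.gz.
import os

import torch


def model_fn(model_dir):
    # Load the packaged model weights from model_dir; "model.pth" is a
    # hypothetical filename for this example.
    model = torch.jit.load(os.path.join(model_dir, "model.pth"))
    model.eval()
    return model
```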
To reproduce
1. Create a sample PyTorch model, train it, and package it.
2. Deploy the model using `PyTorchModel`. (This successfully deploys the model, and calling `predictor.predict()` returns the inference results, as shown in the sketch below.)
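A minimal sketch of the two deployment paths, assuming hypothetical S3 locations, names, and instance types (`role` and `data` come from the surrounding setup; the `MultiDataModel` reuses the `PyTorchModel` for its container and inference configuration):

```python
from sagemaker.pytorch import PyTorchModel
from sagemaker.multidatamodel import MultiDataModel

model = PyTorchModel(
    model_data="s3://my-bucket/models/model.tar.gz",  # hypothetical path
    role=role,
    entry_point="inference.py",  # implements model_fn
    framework_version="1.6.0",
    py_version="py3",
)

# Path 1: single-model endpoint -- deploys and predicts fine.
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
predictor.predict(data)

# Path 2: multi-model endpoint -- deploys, but predict() times out.
mme = MultiDataModel(
    name="pytorch-multi-model",                  # hypothetical name
    model_data_prefix="s3://my-bucket/models/",  # prefix containing model.tar.gz
    model=model,
)
mme_predictor = mme.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")
mme_predictor.predict(data=data, target_model="model.tar.gz")  # -> ModelError / timeout
```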
**Note:** To directly use training job `model.tar.gz` outputs as we do here, you'll need to make sure your training job produces results that:
- already include any required inference code in a `code/` subfolder, and
- (if you're using SageMaker PyTorch containers v1.6+) have been packaged to be compatible with TorchServe.

See the `enable_sm_oneclick_deploy()` and `enable_torchserve_multi_model()` functions in [src/train.py](src/train.py) for notes on this. Alternatively, you can perform the same steps after the fact to produce a new, serving-ready `model.tar.gz` from your raw training job result.
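For illustration, a rough after-the-fact repack along the lines the note describes (paths and filenames are hypothetical; the repo's `enable_torchserve_multi_model()` helper is the authoritative version):

```python
import os
import shutil
import tarfile

# Unpack the raw training output, drop the inference script into code/,
# and re-tar it in the layout the PyTorch >=1.6 (TorchServe) containers expect:
#   model.tar.gz
#   |-- model.pth
#   `-- code/
#       `-- inference.py   (defines model_fn etc.)
os.makedirs("repack/code", exist_ok=True)
with tarfile.open("model.tar.gz") as tar:
    tar.extractall("repack")
shutil.copy("src/inference.py", "repack/code/inference.py")  # hypothetical source path
with tarfile.open("model-serving.tar.gz", "w:gz") as tar:
    tar.add("repack", arcname=".")
```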
```python
from sagemaker.sklearn.estimator import SKLearn

# Pay attention to the code_location argument!
estimator = SKLearn(
    entry_point=TRAINING_FILE,          # script to use for the training job
    role=role,
    source_dir=SOURCE_DIR,              # location of the training scripts
    instance_count=1,
    instance_type=TRAIN_INSTANCE_TYPE,
    framework_version="1.2-1",          # 1.2-1 is the latest version
    output_path=s3_output_path,         # where to store model artifacts
    base_job_name=_job,
    code_location=code_location,        # where the .tar.gz of source_dir will be stored
    metric_definitions=[{"Name": "median-AE", "Regex": "AE-at-50th-percentile: ([0-9.]+).*$"}],
    hyperparameters={"n-estimators": 100, "min-samples-leaf": 3, "model-name": location},
)
```
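For completeness, a hypothetical launch of this estimator (the channel name and S3 path are illustrative, not from the original report):

```python
# Kick off the training job; model artifacts land under output_path,
# and the packaged source_dir lands under code_location.
estimator.fit({"train": f"s3://{bucket}/input/train"})
```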
There are many ways to make your code accessible, and here are two of them. :) I hope this is useful.
Expected behavior
`MultiDataModel` should deploy and work without any errors.
Screenshots or logs
This is what's included in the CloudWatch logs:
System information
A description of your system. Please provide:
Additional context
Add any other context about the problem here.