
estimator.deploy() always uses the same directory to load models #402


Closed
Elpiro opened this issue Sep 25, 2018 · 5 comments

Comments


Elpiro commented Sep 25, 2018

I am using PyTorch with the conda pytorch_p36 environment provided in AWS, on an ml.p2.xlarge instance.

Describe the problem

SageMaker is looking for the model in the directory that was created the first time I ran the notebook. I have specified output_path in the PyTorch estimator, but when I call deploy() on it, it still looks for the model in the folder created on that first run.

In the error log below, the URL should be something like "s3://sagemaker-eu-west-1-[my-user-id]/sagemaker-pytorch-2018-09-25-HH-MM-SS-mS/output/model.tar.gz" instead of "s3://sagemaker-eu-west-1-[my-user-id]/sagemaker-pytorch-2018-09-17-14-13-04-805/output/model.tar.gz"

Is there a way to explicitly tell the deploy() function where to look for the newly created model?

Minimal repro / logs

When calling:

pytorch_estimator.deploy(instance_type='ml.p2.xlarge',
                         initial_instance_count=1,
                         endpoint_name='sales-predict')

ValueError                                Traceback (most recent call last)
<ipython-input-19-385ef6572fdf> in <module>()
      1 predictor = pytorch_estimator.deploy(instance_type='ml.p2.xlarge',
      2                                      initial_instance_count=1,
----> 3                                      endpoint_name ='sales-predict')

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, endpoint_name, **kwargs)
    273             instance_type=instance_type,
    274             initial_instance_count=initial_instance_count,
--> 275             endpoint_name=endpoint_name)
    276 
    277     @property

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, endpoint_name, tags)
    100         production_variant = sagemaker.production_variant(model_name, instance_type, initial_instance_count)
    101         self.endpoint_name = endpoint_name or model_name
--> 102         self.sagemaker_session.endpoint_from_production_variants(self.endpoint_name, [production_variant], tags)
    103         if self.predictor_cls:
    104             return self.predictor_cls(self.endpoint_name, self.sagemaker_session)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in endpoint_from_production_variants(self, name, production_variants, tags, wait)
    754 
    755             self.sagemaker_client.create_endpoint_config(**config_options)
--> 756         return self.create_endpoint(endpoint_name=name, config_name=name, wait=wait)
    757 
    758     def expand_role(self, role):

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in create_endpoint(self, endpoint_name, config_name, wait)
    546         self.sagemaker_client.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=config_name)
    547         if wait:
--> 548             self.wait_for_endpoint(endpoint_name)
    549         return endpoint_name
    550 

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in wait_for_endpoint(self, endpoint, poll)
    643         if status != 'InService':
    644             reason = desc.get('FailureReason', None)
--> 645             raise ValueError('Error hosting endpoint {}: {} Reason: {}'.format(endpoint, status, reason))
    646         return desc
    647 

ValueError: Error hosting endpoint sales-predict: Failed Reason:  Failed to download model data for container "container_1" from URL: "s3://sagemaker-eu-west-1-[my-user-id]/sagemaker-pytorch-2018-09-17-14-13-04-805/output/model.tar.gz". Please ensure that there is an object located at the URL and that the role passed to CreateModel has permissions to download the object.
@Elpiro Elpiro changed the title estimator.deploy() uses only estimator.deploy() always uses the same directory to load models Sep 25, 2018
@prudhvigurram

@Elpiro: Did you find a solution? I am facing the same issue.


Elpiro commented Oct 1, 2018

@prudhvigurram I had to use the low-level API with boto3. You can follow the documentation for using TensorFlow with SageMaker here: https://docs.aws.amazon.com/sagemaker/latest/dg/tf-example1-train.html . You just need to change the training image link from TensorFlow to PyTorch; the rest of the configuration is the same.

Also, I used the low-level API in Lambda functions rather than a SageMaker notebook, so if you want to do it in a notebook I don't know how it will turn out. You don't have to worry about execution time, since the Lambda only starts the training job/endpoint on SageMaker. To know when training is done, you can set an S3 trigger, on the bucket that receives the training job's output, with another Lambda that starts the endpoint.
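For anyone following along, a minimal sketch of the low-level flow described above. All names here (model name, role ARN, image URI, S3 path, endpoint name) are placeholders, not values from this thread; the actual boto3 SageMaker client calls are shown in comments so the payloads can be inspected without an AWS account.

```python
# Low-level hosting flow: create a model, an endpoint config, then an endpoint.
# The real calls are boto3's sagemaker client methods, shown commented out.
# import boto3
# sm = boto3.client('sagemaker')

# 1) Register the model artifact (placeholder names throughout).
create_model_request = {
    'ModelName': 'sales-model',
    'ExecutionRoleArn': 'arn:aws:iam::<account-id>:role/<sagemaker-role>',
    'PrimaryContainer': {
        'Image': '<account>.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-pytorch:<tag>',
        'ModelDataUrl': 's3://<bucket>/<training-job>/output/model.tar.gz',
    },
}
# sm.create_model(**create_model_request)

# 2) Describe how the model should be hosted.
create_endpoint_config_request = {
    'EndpointConfigName': 'sales-predict-config',
    'ProductionVariants': [{
        'VariantName': 'AllTraffic',
        'ModelName': 'sales-model',
        'InstanceType': 'ml.p2.xlarge',
        'InitialInstanceCount': 1,
    }],
}
# sm.create_endpoint_config(**create_endpoint_config_request)

# 3) Spin up the endpoint from that config.
# sm.create_endpoint(EndpointName='sales-predict',
#                    EndpointConfigName='sales-predict-config')
```

Because each resource is named explicitly here, redeploying a new model under an old endpoint name is just a matter of pointing a fresh endpoint config at the new ModelDataUrl.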

@ChoiByungWook
Contributor

Hello @Elpiro,

I apologize for the late response. To clarify, my understanding is that you provided a custom S3 URI for the 'output_path' parameter, but SageMaker isn't saving the output into that S3 URI and is instead using the folder it first created?

Does that sound correct?

I believe it isn't possible to customize the S3 URI of the model you are trying to deploy through the 'deploy' function. However, it should be possible to spin up a TensorFlowModel object with a custom S3 URI pointing to your specified model. Then you can use 'deploy'.
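Sketching that suggestion for this thread's PyTorch case (the SDK's PyTorch counterpart is PyTorchModel). The S3 path, role ARN, and entry point below are placeholders; the SDK calls are shown in comments, and a small stdlib helper builds the artifact URI, which is the piece deploy() was resolving from the stale training job.

```python
# Sketch (assumed placeholder names): point a framework Model object at an
# explicit model.tar.gz instead of relying on estimator.deploy(), which
# reuses the estimator's remembered model location.
#
#   from sagemaker.pytorch import PyTorchModel
#   model = PyTorchModel(model_data=model_artifact_uri(bucket, job_name),
#                        role='arn:aws:iam::<account-id>:role/<sagemaker-role>',
#                        entry_point='train.py')
#   predictor = model.deploy(initial_instance_count=1,
#                            instance_type='ml.p2.xlarge',
#                            endpoint_name='sales-predict')

def model_artifact_uri(bucket, training_job_name):
    """S3 location where a SageMaker training job writes model.tar.gz."""
    return 's3://{}/{}/output/model.tar.gz'.format(bucket, training_job_name)
```

With this, the URI passed to the Model object is derived from the training job you actually want, rather than whichever job the estimator recorded first.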

@andrewcking

I ran into what I think is the same issue. Perhaps there is a reason things work this way, but as far as I can tell it is a bug. If there is a reason, I would suggest somehow alerting the user to what is happening.

The issue:

  1. Train and deploy a model, specifying the endpoint name (let's say 'test-endpoint')
  2. Delete the endpoint
  3. Train a new model and deploy an endpoint with the same name as before ('test-endpoint')

If you do the above, SageMaker will not attempt to deploy the newly trained model specified with estimator.deploy; it will instead try to deploy the old model that was first deployed under that name. You can verify this by deleting the old model, after which SageMaker will throw an error that it could not be found. For some reason it still looks for those old resources even though you have done nothing to suggest that it should.

Thoughts

The implication is that there does not appear to be a supported way of deploying a model under an endpoint name that you have used before and then deleted (update_endpoint will not work if the endpoint has been deleted, and deploy will always look in the first location). This can lead to a lot of confusion, where you deploy a model with changes and then get the same old results back.

Code Example

# create estimator
estimator = PyTorch(entry_point='train.py',
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge',
                    source_dir='source',
                    hyperparameters={'epochs': 6})

# fit estimator
estimator.fit({'train': data_dir})

# deploy estimator (except it doesn't deploy your new estimator if a previously used endpoint name is specified)
estimator.deploy(instance_type='ml.t2.medium', initial_instance_count=1, endpoint_name="test-endpoint")

Contributor

mvsusp commented Oct 27, 2018

Hi @andrewcking,

I believe there are two different use cases. Let me explain both:

Trying to deploy a model from an existing S3 object:

@Elpiro's question was how he can deploy an already created model using the Python SDK. I believe he needs to create an instance of the Model class, for example:

from sagemaker.mxnet.model import MXNetModel

sagemaker_model = MXNetModel(model_data='s3://path/to/model.tar.gz',
                             role='arn:aws:iam::accid:sagemaker-role',
                             entry_point='entry_point.py')

predictor = sagemaker_model.deploy(initial_instance_count=1,
                                   instance_type='ml.m4.xlarge')

For more information about the model class, please see: https://github.com/aws/sagemaker-python-sdk#byo-model and https://sagemaker.readthedocs.io/en/stable/sagemaker.tensorflow.html#tensorflow-model

Trying to deploy an endpoint with the same name as a previous one:

In SageMaker Hosting, creating an endpoint requires creating a model, an endpoint configuration, and an endpoint.

The SageMaker Python SDK abstracts these three layers for you. These are the lines where this process happens:

        container_def = self.prepare_container_def(instance_type)
        self.name = self.name or name_from_image(container_def['Image'])
        self.sagemaker_session.create_model(self.name, self.role, container_def, vpc_config=self.vpc_config)
        production_variant = sagemaker.production_variant(self.name, instance_type, initial_instance_count)
        self.endpoint_name = endpoint_name or self.name
        self.sagemaker_session.endpoint_from_production_variants(self.endpoint_name, [production_variant], tags)

To be able to use the Python SDK to create an endpoint with the same name, you will have to delete the model, the endpoint config, and the endpoint. The SageMaker Python SDK's delete_endpoint() only deletes the endpoint, which is why the issue is happening.
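A minimal sketch of that cleanup, assuming you know (or look up) the model and endpoint-config names, which by default match the endpoint name the SDK generated. The resource names are placeholders; the real boto3 SageMaker client calls are shown in comments.

```python
# Before reusing an endpoint name, all three hosting resources must go:
# the endpoint, the endpoint config, and the model.
def cleanup_calls(endpoint_name, config_name, model_name):
    """Return the delete operations in order: endpoint first, model last."""
    return [
        ('delete_endpoint', {'EndpointName': endpoint_name}),
        ('delete_endpoint_config', {'EndpointConfigName': config_name}),
        ('delete_model', {'ModelName': model_name}),
    ]

# import boto3
# sm = boto3.client('sagemaker')
# for api, kwargs in cleanup_calls('test-endpoint', 'test-endpoint', '<model-name>'):
#     getattr(sm, api)(**kwargs)
```

Once all three are gone, a subsequent estimator.deploy(endpoint_name='test-endpoint') creates fresh resources instead of resolving the stale ones.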

I will close this issue, since the original question is answered above, and open a separate issue for the bug that you found.

Thanks!

@mvsusp mvsusp closed this as completed Oct 27, 2018
apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this issue Nov 15, 2018
Add batch transform to image-classification notebook