
estimator.deploy() always uses the same directory to load models #402


Closed
Elpiro opened this issue Sep 25, 2018 · 5 comments

Comments


Elpiro commented Sep 25, 2018

I am using PyTorch with the conda pytorch_p36 environment provided in AWS, on an ml.p2.xlarge instance.

Describe the problem

SageMaker is looking for the model in the directory that was created the first time I ran the notebook. I have specified output_path in the PyTorch estimator, but when I call deploy() on it, it still looks for the model in the folder created on that first run.

In the error log below, the URL should be something like "s3://sagemaker-eu-west-1-[my-user-id]/sagemaker-pytorch-2018-09-25-HH-MM-SS-mS/output/model.tar.gz" instead of "s3://sagemaker-eu-west-1-[my-user-id]/sagemaker-pytorch-2018-09-17-14-13-04-805/output/model.tar.gz"

Is there a way to explicitly tell the deploy() function where to look for the newly created model?

Minimal repro / logs

When calling:

pytorch_estimator.deploy(instance_type='ml.p2.xlarge',
                         initial_instance_count=1,
                         endpoint_name='sales-predict')

ValueError                                Traceback (most recent call last)
<ipython-input-19-385ef6572fdf> in <module>()
      1 predictor = pytorch_estimator.deploy(instance_type='ml.p2.xlarge',
      2                                      initial_instance_count=1,
----> 3                                      endpoint_name ='sales-predict')

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, endpoint_name, **kwargs)
    273             instance_type=instance_type,
    274             initial_instance_count=initial_instance_count,
--> 275             endpoint_name=endpoint_name)
    276 
    277     @property

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, endpoint_name, tags)
    100         production_variant = sagemaker.production_variant(model_name, instance_type, initial_instance_count)
    101         self.endpoint_name = endpoint_name or model_name
--> 102         self.sagemaker_session.endpoint_from_production_variants(self.endpoint_name, [production_variant], tags)
    103         if self.predictor_cls:
    104             return self.predictor_cls(self.endpoint_name, self.sagemaker_session)

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in endpoint_from_production_variants(self, name, production_variants, tags, wait)
    754 
    755             self.sagemaker_client.create_endpoint_config(**config_options)
--> 756         return self.create_endpoint(endpoint_name=name, config_name=name, wait=wait)
    757 
    758     def expand_role(self, role):

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in create_endpoint(self, endpoint_name, config_name, wait)
    546         self.sagemaker_client.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=config_name)
    547         if wait:
--> 548             self.wait_for_endpoint(endpoint_name)
    549         return endpoint_name
    550 

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in wait_for_endpoint(self, endpoint, poll)
    643         if status != 'InService':
    644             reason = desc.get('FailureReason', None)
--> 645             raise ValueError('Error hosting endpoint {}: {} Reason: {}'.format(endpoint, status, reason))
    646         return desc
    647 

ValueError: Error hosting endpoint sales-predict: Failed Reason:  Failed to download model data for container "container_1" from URL: "s3://sagemaker-eu-west-1-[my-user-id]/sagemaker-pytorch-2018-09-17-14-13-04-805/output/model.tar.gz". Please ensure that there is an object located at the URL and that the role passed to CreateModel has permissions to download the object.
@Elpiro Elpiro changed the title estimator.deploy() uses only estimator.deploy() always uses the same directory to load models Sep 25, 2018
@prudhvigurram

@Elpiro: Did you find a solution? I am facing the same issue.


Elpiro commented Oct 1, 2018

@prudhvigurram I had to use the low-level API with boto3. You can follow the documentation for using TensorFlow with SageMaker here: https://docs.aws.amazon.com/sagemaker/latest/dg/tf-example1-train.html . You just need to change the training image link from TensorFlow to PyTorch; the rest of the configuration is the same.

Also, I used the low-level API in Lambda functions rather than a SageMaker notebook, so if you want to do it in a notebook I don't know how it will turn out. You don't have to worry about execution time, since the Lambda only starts the training job/endpoint on SageMaker. To know when training is done, you can set an S3 trigger, on the bucket that receives the training job's output, with another Lambda that starts the endpoint.
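For anyone following along, a minimal sketch of the low-level flow described above. All names here (model name, role ARN, image URI, S3 path, endpoint name) are placeholders, not values from this thread; the actual boto3 SageMaker client calls are shown in comments so the payloads can be inspected without an AWS account.

```python
# Low-level hosting flow: create a model, an endpoint config, then an endpoint.
# The real calls are boto3's sagemaker client methods, shown commented out.
# import boto3
# sm = boto3.client('sagemaker')

# 1) Register the model artifact (placeholder names throughout).
create_model_request = {
    'ModelName': 'sales-model',
    'ExecutionRoleArn': 'arn:aws:iam::<account-id>:role/<sagemaker-role>',
    'PrimaryContainer': {
        'Image': '<account>.dkr.ecr.eu-west-1.amazonaws.com/sagemaker-pytorch:<tag>',
        'ModelDataUrl': 's3://<bucket>/<training-job>/output/model.tar.gz',
    },
}
# sm.create_model(**create_model_request)

# 2) Describe how the model should be hosted.
create_endpoint_config_request = {
    'EndpointConfigName': 'sales-predict-config',
    'ProductionVariants': [{
        'VariantName': 'AllTraffic',
        'ModelName': 'sales-model',
        'InstanceType': 'ml.p2.xlarge',
        'InitialInstanceCount': 1,
    }],
}
# sm.create_endpoint_config(**create_endpoint_config_request)

# 3) Spin up the endpoint from that config.
# sm.create_endpoint(EndpointName='sales-predict',
#                    EndpointConfigName='sales-predict-config')
```

Because each resource is named explicitly here, redeploying a new model under an old endpoint name is just a matter of pointing a fresh endpoint config at the new ModelDataUrl.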

@ChoiByungWook
Contributor

Hello @Elpiro,

I apologize for the late response. To clarify, my understanding is that you provided a custom S3 URI for the 'output_path' parameter, but SageMaker isn't saving the output into that S3 URI and is instead using the folder it first created?

Does that sound correct?

I believe it isn't possible to customize the S3 URI of the model you are trying to deploy through the 'deploy' function. However, it should be possible to spin up a TensorFlowModel object with a custom S3 URI pointing to your specified model. Then you can use 'deploy'.
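Sketching that suggestion for this thread's PyTorch case (the SDK's PyTorch counterpart is PyTorchModel). The S3 path, role ARN, and entry point below are placeholders; the SDK calls are shown in comments, and a small stdlib helper builds the artifact URI, which is the piece deploy() was resolving from the stale training job.

```python
# Sketch (assumed placeholder names): point a framework Model object at an
# explicit model.tar.gz instead of relying on estimator.deploy(), which
# reuses the estimator's remembered model location.
#
#   from sagemaker.pytorch import PyTorchModel
#   model = PyTorchModel(model_data=model_artifact_uri(bucket, job_name),
#                        role='arn:aws:iam::<account-id>:role/<sagemaker-role>',
#                        entry_point='train.py')
#   predictor = model.deploy(initial_instance_count=1,
#                            instance_type='ml.p2.xlarge',
#                            endpoint_name='sales-predict')

def model_artifact_uri(bucket, training_job_name):
    """S3 location where a SageMaker training job writes model.tar.gz."""
    return 's3://{}/{}/output/model.tar.gz'.format(bucket, training_job_name)
```

With this, the URI passed to the Model object is derived from the training job you actually want, rather than whichever job the estimator recorded first.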

@andrewcking

I ran into what I think is the same issue. Perhaps there is a reason things work this way, but as far as I can tell it is a bug. If there is a reason, I would suggest somehow alerting the user to what is happening.

The issue:

  1. Train and deploy a model, specifying the endpoint name (let's say 'test-endpoint')
  2. Delete the endpoint
  3. Train a new model and deploy an endpoint with the same name as before ('test-endpoint')

If you do the above, SageMaker will not attempt to deploy the newly trained model specified with estimator.deploy; it will instead try to deploy the old model that was first deployed under that name. You can verify this by deleting the old model, after which SageMaker will throw an error that it could not be found. For some reason it still looks for those old resources even though you have done nothing to suggest that it should.

Thoughts

The implication is that there does not appear to be a supported way of deploying a model under an endpoint name that you have used before and then deleted (update_endpoint will not work if the endpoint has been deleted, and deploy will always look in the first location). This can lead to a lot of confusion, where you deploy a model with changes and then get the same old results back.

Code Example

# create estimator
estimator = PyTorch(entry_point='train.py',
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge',
                    source_dir='source',
                    hyperparameters={'epochs': 6})

# fit estimator
estimator.fit({'train': data_dir})

# deploy estimator (except it doesn't deploy your new estimator if a previously used endpoint name is specified)
estimator.deploy(instance_type='ml.t2.medium', initial_instance_count=1, endpoint_name="test-endpoint")

Contributor

mvsusp commented Oct 27, 2018

Hi @andrewcking,

I believe there are two different use cases. Let me explain both:

Trying to deploy a model from an existing S3 object:

@Elpiro's question was how he can deploy an already created model using the Python SDK. I believe he needs to create an instance of the Model class, for example:

from sagemaker.mxnet.model import MXNetModel

sagemaker_model = MXNetModel(model_data='s3://path/to/model.tar.gz',
                             role='arn:aws:iam::accid:sagemaker-role',
                             entry_point='entry_point.py')

predictor = sagemaker_model.deploy(initial_instance_count=1,
                                   instance_type='ml.m4.xlarge')

For more information about the model class, please see: https://github.com/aws/sagemaker-python-sdk#byo-model and https://sagemaker.readthedocs.io/en/stable/sagemaker.tensorflow.html#tensorflow-model

Trying to deploy an endpoint with the same name as a previous one:

In SageMaker Hosting, creating an endpoint requires creating a model, an endpoint configuration, and an endpoint.

The SageMaker Python SDK abstracts these three layers for you. These are the lines where this process happens:

        container_def = self.prepare_container_def(instance_type)
        self.name = self.name or name_from_image(container_def['Image'])
        self.sagemaker_session.create_model(self.name, self.role, container_def, vpc_config=self.vpc_config)
        production_variant = sagemaker.production_variant(self.name, instance_type, initial_instance_count)
        self.endpoint_name = endpoint_name or self.name
        self.sagemaker_session.endpoint_from_production_variants(self.endpoint_name, [production_variant], tags)

To be able to use the Python SDK to create an endpoint with the same name, you will have to delete the model, the endpoint config, and the endpoint. The SageMaker Python SDK's delete_endpoint() only deletes the endpoint, which is why the issue is happening.
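A minimal sketch of that cleanup, assuming you know (or look up) the model and endpoint-config names, which by default match the endpoint name the SDK generated. The resource names are placeholders; the real boto3 SageMaker client calls are shown in comments.

```python
# Before reusing an endpoint name, all three hosting resources must go:
# the endpoint, the endpoint config, and the model.
def cleanup_calls(endpoint_name, config_name, model_name):
    """Return the delete operations in order: endpoint first, model last."""
    return [
        ('delete_endpoint', {'EndpointName': endpoint_name}),
        ('delete_endpoint_config', {'EndpointConfigName': config_name}),
        ('delete_model', {'ModelName': model_name}),
    ]

# import boto3
# sm = boto3.client('sagemaker')
# for api, kwargs in cleanup_calls('test-endpoint', 'test-endpoint', '<model-name>'):
#     getattr(sm, api)(**kwargs)
```

Once all three are gone, a subsequent estimator.deploy(endpoint_name='test-endpoint') creates fresh resources instead of resolving the stale ones.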

I will close this issue, since the original question is answered above, and open a separate issue for the bug that you found.

Thanks!

@mvsusp mvsusp closed this as completed Oct 27, 2018
apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this issue Nov 15, 2018
Add batch transform to image-classification notebook