Model deploy instance_type modification failed #987


Closed

Wei-1 opened this issue Aug 17, 2019 · 27 comments

@Wei-1
Contributor

Wei-1 commented Aug 17, 2019

Reference:

  • 0409412934
  • MLFW-1638

System Information

  • Framework / Algorithm: PCA
  • Framework Version: SageMaker Aug 12, 2019 09:03 UTC
  • Python Version: 3
  • CPU or GPU: CPU
  • Python SDK Version:
  • Are you using a custom image: No

Describe the problem

Running the SageMaker example: PCA for MNIST.
If a user tries to deploy a model with an instance_type that is not available,
the user won't be able to simply replace the instance_type and deploy again.

Minimal repro / logs

While running PCA for MNIST in the example project and executing the following script:

pca_predictor = pca.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge') 

If the user doesn't have any remaining quota to launch a new instance of the assigned instance_type, they will get a ResourceLimitExceeded error.
Suppose the user then decides to change the instance_type and executes the script again:

pca_predictor = pca.deploy(initial_instance_count=1, instance_type='ml.t2.medium') 

Then the user will get this error:

--------------------------------------------------------------------------- 
ResourceLimitExceeded Traceback (most recent call last) 
<ipython-input-18-b2bb257b120c> in <module>() 
1 pca_predictor = pca.deploy(initial_instance_count=1, 
----> 2 instance_type='ml.t2.medium') 
...... 
ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateEndpoint operation: The account-level service limit 'ml.m4.xlarge for endpoint usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances... 
--------------------------------------------------------------------------- 

From this error message,
we can see that although we are trying to set a new instance_type,
it is not actually being reset: the request still fails on ml.m4.xlarge.

@ChoiByungWook
Contributor

Hello @Wei-1,

Thank you for bringing this to our attention.

Let me look into this and figure out why the instance type isn't being propagated correctly.

Thank you for your patience.

@ChoiByungWook
Contributor

Can you paste how you instantiate the PCA algorithm?

Do you specify a name in the constructor? I suspect this line causes an existing endpoint configuration name to be reused: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/session.py#L1306

That line is reached when attempting to deploy your model: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/model.py#L427

@Wei-1
Contributor Author

Wei-1 commented Aug 31, 2019

I followed the lab tutorial: link
I don't think a name is specified.
I think what's happening is that the instance_type is passed through to the model, but the instance_type in the session is not replaced, since the error log reflects this behavior.

pca = sagemaker.estimator.Estimator(container,
                                    role, 
                                    train_instance_count=1, 
                                    train_instance_type='ml.c4.xlarge',
                                    output_path=output_location,
                                    sagemaker_session=sess)
pca.set_hyperparameters(feature_dim=50000,
                        num_components=10,
                        subtract_mean=True,
                        algorithm_mode='randomized',
                        mini_batch_size=200)
pca.fit({'train': s3_train_data})
pca_predictor = pca.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge')

@ChoiByungWook
Contributor

@Wei-1,

Thank you for the clarification.

Can you do me a quick favor and check your endpoint configurations in the AWS console?

AWS Console -> Amazon SageMaker -> Endpoint configurations

Or you can use the cli: https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-endpoint-configs.html
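
For example, something like the following with boto3 directly (a sketch; the NameContains filter assumes the default "pca" name prefix from this example):

import boto3

sm = boto3.client("sagemaker")

# List endpoint configs whose names contain "pca" and print the
# instance type each one requests.
for summary in sm.list_endpoint_configs(NameContains="pca")["EndpointConfigs"]:
    detail = sm.describe_endpoint_config(
        EndpointConfigName=summary["EndpointConfigName"])
    print(summary["EndpointConfigName"],
          detail["ProductionVariants"][0]["InstanceType"])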

Can you tell me if you see any corresponding endpoint configurations that have the expected "ml.t2.medium"? Based on the code links provided, the session object should be propagating the value correctly.

Thanks!

@Wei-1
Contributor Author

Wei-1 commented Aug 31, 2019

I had rerun the whole estimator initialization process, so I can see both t2.medium and m4.xlarge.
Let me find time to remove both endpoint configs and run the whole notebook again to check.

@Wei-1
Contributor Author

Wei-1 commented Aug 31, 2019

> Can you tell me if you see any corresponding endpoint configurations that have the expected "ml.t2.medium"? Based on the code links provided, the session object should be propagating the value correctly.

When running the code the first time with ml.m4.xlarge, we can see the corresponding endpoint config.
However, after we get the error message and rerun the code with ml.t2.medium, a new endpoint config will NOT be generated.

@ChoiByungWook
Contributor

Gotcha, so from a notebook context this is still failing.

Hmmm... I wonder if there is something going on with the Python cache in the notebook environment. Thank you so much for all of this information.

I think this will require a bit of dedicated investigation. I'll bring this up with the team.

Thanks!

@laurenyu
Contributor

laurenyu commented Sep 3, 2019

hi @Wei-1, can you try adding update_endpoint=True to the deploy call?

reference: https://sagemaker.readthedocs.io/en/stable/pca.html?highlight=update_endpoint#sagemaker.PCA.deploy

@Wei-1
Contributor Author

Wei-1 commented Sep 4, 2019

@laurenyu, Now it is even more interesting...
If I modify the deploy script to this:

pca_predictor = pca.deploy(initial_instance_count=1,
                           instance_type='ml.t2.medium',
                           update_endpoint=True)

I will get this error message:

ClientError: An error occurred (ValidationException) when calling the CreateEndpointConfig operation: Cannot create already existing endpoint configuration "arn:aws:sagemaker:us-east-2:XXXXXXXXXXXX:endpoint-config/pca-2019-09-04-01-01-54-841".

It shows that the endpoint config already exists,
while the existing one still uses the old instance_type.

It will NOT create a new endpoint config with ml.t2.medium.
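
As far as I understand, what deploy(update_endpoint=True) would need to do at the raw API level is roughly the following (a boto3 sketch; the model and endpoint names are hypothetical placeholders):

import time

import boto3

sm = boto3.client("sagemaker")

# Create the config under a NEW name, then point the existing
# endpoint at it via UpdateEndpoint.
new_config_name = "pca-config-" + time.strftime("%Y-%m-%d-%H-%M-%S")
sm.create_endpoint_config(
    EndpointConfigName=new_config_name,
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "existing-pca-model",  # hypothetical placeholder
        "InitialInstanceCount": 1,
        "InstanceType": "ml.t2.medium",
    }])
sm.update_endpoint(EndpointName="existing-pca-endpoint",  # hypothetical placeholder
                   EndpointConfigName=new_config_name)

Instead, the SDK appears to re-issue CreateEndpointConfig with the old name, which is what produces the ValidationException above.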

@laurenyu
Contributor

laurenyu commented Sep 9, 2019

@Wei-1 hmm, can you try specifying a new endpoint name?

@Wei-1
Contributor Author

Wei-1 commented Sep 10, 2019

pca_predictor = pca.deploy(initial_instance_count=1,
                           instance_type='ml.t2.medium',
                           endpoint_name='NewEndpointName')

@laurenyu, with a new endpoint name, the endpoint config with t2.medium is successfully created, and there are no other errors for the entire process.

@mvsusp
Contributor

mvsusp commented Sep 11, 2019

Hi @Wei-1,

Thanks for reporting this bug. I am adding changes to our roadmap to fix the behaviour of pca.deploy for your use case. We will update this ticket as soon as the changes are pushed.

Thanks for using SageMaker!

@Wei-1
Contributor Author

Wei-1 commented Sep 12, 2019

Cool @mvsusp!
The problem seems to be in the model and the session, so it should be a general model deploy problem, not just PCA.
I can also draft a PR if you need any help.

@mvsusp
Contributor

mvsusp commented Sep 12, 2019

You are right. I believe that the issue is that session.create_model tries to create a new model with the same name as before and fails - see https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/session.py#L774

One possible solution to fix this issue is to always generate models with a new name, perhaps in the format of name + '_' + timestamp.
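
A minimal sketch of that idea (one caveat: SageMaker resource names only allow alphanumerics and hyphens, so the separator would need to be '-' rather than '_'; the SDK's sagemaker.utils.name_from_base helper does something similar):

import time

def unique_model_name(base):
    # Append a timestamp so repeated deploys never collide with an
    # existing model name. SageMaker names only allow alphanumerics
    # and hyphens, so use '-' instead of '_'.
    return "{}-{}".format(base, time.strftime("%Y-%m-%d-%H-%M-%S"))

print(unique_model_name("pca"))  # e.g. pca-2019-09-12-10-30-00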

Any contribution will be highly appreciated. Thanks @Wei-1

@Wei-1
Contributor Author

Wei-1 commented Sep 18, 2019

I will check if I am able to solve the issue this weekend.
Will let you know if I have any result.

@icywang86rui
Contributor

@Wei-1 the update_endpoint arg is designed to handle this case. Are you still seeing errors or unexpected behavior when update_endpoint is set to True and a new endpoint name is given in the second call?

@Wei-1
Contributor Author

Wei-1 commented Sep 21, 2019

@icywang86rui, if we use a new endpoint name, the module behaves as designed.
However, update_endpoint didn't act as expected, per my investigation here: #987 (comment)

@mvsusp
Contributor

mvsusp commented Sep 29, 2019

Thanks for starting work on this PR, @Wei-1. Let us know if you have any doubts.

@Wei-1
Contributor Author

Wei-1 commented Sep 30, 2019

I changed some of the naming along the model -> endpoint config -> endpoint chain:
instead of following the name of the model, I changed the endpoint config to follow the name of the endpoint, since instance_type is not part of the model.
Please help check whether the modification in #1058 makes sense.

@mvsusp
Contributor

mvsusp commented Sep 30, 2019

I will answer your observations in the PR. Thanks.

@Wei-1
Contributor Author

Wei-1 commented Oct 11, 2019

@mvsusp, I just noticed that the PR was reverted because of an error in the canaries in #1070.
The tests seem to be failing when using a hardcoded endpoint name,
but those test sections don't seem to be included in this repository.
Do you have an idea of how we can proceed with this?

@ChoiByungWook
Contributor

ChoiByungWook commented Oct 14, 2019

Hello @Wei-1,

@mvsusp just recently left Amazon.

I'll take over this issue now. I'll begin by looking at #1070.

Reference:

  • 0409412934
  • MLFW-1638

@Wei-1
Contributor Author

Wei-1 commented Nov 25, 2019

Hello @ChoiByungWook, any follow up on this?
Or is there any help that I can offer?

@AkashShukla199

I'm getting an error while deploying the trained model. It just fails partway through.

@laurenyu
Contributor

I've merged changes that were released as part of v2.0.0.rc1 to address this issue.
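
For anyone landing here later: in v2 of the SDK, update_endpoint was removed from deploy() and moved onto the predictor, so switching a live endpoint's instance type should look roughly like this:

# v2 moves endpoint updates onto the predictor object:
pca_predictor.update_endpoint(initial_instance_count=1,
                              instance_type='ml.t2.medium')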

@Wei-1
Contributor Author

Wei-1 commented Jul 10, 2020

Nice nice, should I close this issue?

@laurenyu
Contributor

Yeah, we can close this issue. Feel free to reopen/create a new issue if further issues arise :)
