
Model persists when endpoint deploy fails at 'createEndpoint' #2194


Closed
evakravi opened this issue Mar 9, 2021 · 4 comments

Comments

@evakravi
Member

evakravi commented Mar 9, 2021

I'm a software engineer at AWS who is experiencing this bug in SageMaker Studio (JumpStart). My alias is @evakravi.

Describe the bug
If you attempt to deploy a model and succeed in (a) uploading the script/dependencies to S3, (b) creating a model, and (c) creating an endpoint config, but (d) fail to create the endpoint (which can happen if an account has no GPU instances allocated), then re-deploying the same model with parameters that would allow the endpoint creation to succeed still fails.

To reproduce
Deploy a model on a disallowed instance type, then deploy the identical model to a permitted instance type.
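A minimal reproduction sketch, assuming a generic sagemaker.model.Model rather than the JumpStart flow; the image URI, model data URI, role ARN, and resource names are hypothetical placeholders:

```python
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # hypothetical role ARN

model = Model(
    image_uri="<inference-image-uri>",        # hypothetical placeholder
    model_data="s3://<bucket>/model.tar.gz",  # hypothetical placeholder
    role=role,
    name="repro-model",
    sagemaker_session=session,
)

# First attempt: instance type the account has no quota for.
# CreateEndpoint fails, but the model and endpoint config created
# earlier in the deploy flow are left behind.
try:
    model.deploy(
        initial_instance_count=1,
        instance_type="ml.g4dn.xlarge",
        endpoint_name="repro-endpoint",
    )
except Exception as e:
    print("first deploy failed:", e)

# Second attempt: permitted instance type, same names.
# Expected to succeed, but it also fails (the reported bug).
model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="repro-endpoint",
)
```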

Expected behavior
The deployment to an allowed instance type should be successful, but it ends up failing.

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 2.21.0
  • Python version: Python 3.7.10
  • CPU or GPU: SageMaker Studio Jupyter Server
  • Custom Docker image (Y/N): SageMaker Studio Jupyter Server

Additional context
While this issue is related to SageMaker JumpStart ModelHub, the underlying SageMaker API issue can be reproduced outside of SageMaker Studio.
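For reference, a hedged sketch of the same sequence at the boto3 level (all names, ARNs, and URIs below are hypothetical), showing which resources persist when CreateEndpoint fails:

```python
import boto3

sm = boto3.client("sagemaker")

# These two calls succeed and their resources persist even if the
# endpoint creation below fails.
sm.create_model(
    ModelName="repro-model",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical
    PrimaryContainer={
        "Image": "<inference-image-uri>",          # hypothetical placeholder
        "ModelDataUrl": "s3://<bucket>/model.tar.gz",
    },
)
sm.create_endpoint_config(
    EndpointConfigName="repro-endpoint-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "repro-model",
        "InstanceType": "ml.g4dn.xlarge",  # instance type with zero quota
        "InitialInstanceCount": 1,
    }],
)

# Fails with ResourceLimitExceeded; the model and endpoint config
# created above remain in the account.
sm.create_endpoint(
    EndpointName="repro-endpoint",
    EndpointConfigName="repro-endpoint-config",
)
```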

@ChoiByungWook
Contributor

What is the exact error/stack trace you are seeing when you deploy for the second time?

From my understanding, it sounds like you're running into a similar issue to the one shown in #1470?

@ChoiByungWook
Contributor

Something went wrong
We encountered an error while preparing to deploy your endpoint. You can get more details below.
operation deploy failed: An error occurred (ResourceLimitExceeded) when calling the CreateEndpoint operation: The account-level service limit 'ml.g4dn.xlarge for endpoint usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.

@ChoiByungWook
Contributor

It looks like we check to see if there is an existing endpoint_config:

if not _deployment_entity_exists(

Are you using the same name in your code for your deployments?
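To illustrate the concern, a rough sketch of how a create-if-not-exists check like this can interact with a failed first deploy. This is a simplification for discussion, not the SDK's actual code; _deployment_entity_exists below is a stand-in for the SDK helper:

```python
import botocore.exceptions


def _deployment_entity_exists(describe_fn):
    """Stand-in for the SDK helper: True if the describe call succeeds."""
    try:
        describe_fn()
        return True
    except botocore.exceptions.ClientError as e:
        if "Could not find" in e.response["Error"]["Message"]:
            return False
        raise


def deploy(name, instance_type, sagemaker_client):
    """Simplified deploy flow using one shared name for all resources."""
    if not _deployment_entity_exists(
        lambda: sagemaker_client.describe_endpoint_config(EndpointConfigName=name)
    ):
        # Only created when no config with this name exists. After a failed
        # first deploy, the leftover config (pointing at the disallowed
        # instance type) would be reused instead of this new one.
        sagemaker_client.create_endpoint_config(
            EndpointConfigName=name,
            ProductionVariants=[{
                "VariantName": "AllTraffic",
                "ModelName": name,
                "InstanceType": instance_type,
                "InitialInstanceCount": 1,
            }],
        )
    sagemaker_client.create_endpoint(EndpointName=name, EndpointConfigName=name)
```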

@marckarp
Contributor

There is not enough information provided to reproduce/classify this as a bug. Do you have a code snippet/notebook to reproduce the issue?
