Image Uri generation for Hugging Face is broken #2700

Closed
philschmid opened this issue Oct 13, 2021 · 7 comments

@philschmid
Contributor

philschmid commented Oct 13, 2021

Describe the bug
I am getting the following error:
UnexpectedStatusException: Error hosting endpoint huggingface-pytorch-inference-2021-10-13-08-17-14-036: Failed. Reason: The image '763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.8.1-transformers4.10.2-gpu-py36-cu110-ubuntu18.04' does not exist..

It occurs when running the following code:

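from sagemaker.huggingface import HuggingFaceModel

# `hub` (environment variables for the container) and `role` (IAM role)
# are assumed to be defined earlier.
huggingface_model = HuggingFaceModel(
    transformers_version='4.10.2',
    pytorch_version='1.8.1',
    py_version='py36',
    env=hub,
    role=role,
)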

Looking at the release at https://github.com/aws/deep-learning-containers/releases/tag/v1.0-hf-4.10.2-pt-1.8.1-py36, I can see that the published image uses cu111, whereas the generated tag uses cu110.

I checked the image_uri_config, and it correctly contains cu111:

"container_version": {"gpu":"cu111-ubuntu18.04", "cpu":"ubuntu18.04" }

The same happens for PyTorch 1.9 and transformers 4.11, where the Ubuntu version is wrong as well.

With PyTorch 1.9 and transformers 4.11, the SDK generates the tag 1.9-transformers4.11-gpu-py38-cu110-ubuntu18.04,
but the correct tag must include cu111-ubuntu20.04:

"container_version": {"gpu": "cu111-ubuntu20.04", "cpu": "ubuntu20.04" }

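The mismatch described above can be sketched in a few lines. This is only an illustration, assuming the tag is laid out as `<pytorch>-transformers<transformers>-<processor>-<py>-<container_version>`, which matches the tags quoted in this issue. `CONTAINER_VERSIONS` and `build_tag` are hypothetical names, not the SDK's actual implementation.

```python
# Hypothetical sketch, not the SDK's code. The tag layout is inferred from the
# image tags quoted in this issue; the container versions are the two
# image_uri_config fragments shown above.
CONTAINER_VERSIONS = {
    # (transformers_version, pytorch_version) -> container_version
    ("4.10.2", "1.8.1"): {"gpu": "cu111-ubuntu18.04", "cpu": "ubuntu18.04"},
    ("4.11", "1.9"): {"gpu": "cu111-ubuntu20.04", "cpu": "ubuntu20.04"},
}


def build_tag(transformers_version, pytorch_version, py_version, processor):
    """Assemble an image tag, taking the container suffix from the config table."""
    container = CONTAINER_VERSIONS[(transformers_version, pytorch_version)][processor]
    return (
        f"{pytorch_version}-transformers{transformers_version}"
        f"-{processor}-{py_version}-{container}"
    )


print(build_tag("4.10.2", "1.8.1", "py36", "gpu"))
# 1.8.1-transformers4.10.2-gpu-py36-cu111-ubuntu18.04
```

Reading the suffix from the per-version config, rather than from a single hard-coded value, is what yields cu111 instead of cu110 here.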
System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 2.62.0
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): Hugging Face

cc @ahsan-z-khan

@ahsan-z-khan
Member

@philschmid Thanks for addressing the issue. I am working on it.

@philschmid
Contributor Author

I tested @ahsan-z-khan's branch with

pip install git+https://github.com/ahsan-z-khan/sagemaker-python-sdk.git@HF-image-uri --upgrade

and it used the correct image.

@ahsan-z-khan
Member

The fix was released in version 2.66.1.
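A quick way to tell whether an installed SDK predates the fix is to compare version numbers. The sketch below assumes plain numeric versions such as 2.62.0; `parse_version` and `has_image_uri_fix` are hypothetical helper names.

```python
def parse_version(version):
    """Turn '2.66.1' into (2, 66, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split(".")[:3])


def has_image_uri_fix(installed_version, fixed_in="2.66.1"):
    """Return True if the installed version is at or above the fixed release."""
    return parse_version(installed_version) >= parse_version(fixed_in)


print(has_image_uri_fix("2.62.0"))  # version reported in this issue -> False
print(has_image_uri_fix("2.66.1"))  # release containing the fix -> True
```

The installed version string can be obtained with `importlib.metadata.version("sagemaker")` from the standard library.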

@TimbusCalin

TimbusCalin commented Aug 8, 2022

I trained a network (a basic DistilBERT text classifier) on my own laptop:

PT 1.12.0
PY 3.9
HF 4.21.0

I had to downgrade to these versions since this is what JNB outputs in the console (max versions displayed below).

from sagemaker.huggingface import HuggingFaceModel

# create Hugging Face Model Class (`role` is assumed to be defined earlier)
huggingface_model = HuggingFaceModel(
    model_data="s3://tfsaws/model.tar.gz",  # path to your trained sagemaker model
    role=role,  # iam role with permissions to create an Endpoint
    transformers_version="4.17",  # transformers version used
    pytorch_version="1.10.2",  # pytorch version used
    py_version="py38",  # python version of the DLC
)
# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)

Error (seems to reproduce again):

ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Requested image 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.10.2-transformers4.17-cpu-py38-ubuntu20.04 not found.

@philschmid @ahsan-z-khan do you have any idea how to tackle this?

Thank you in advance.

UPDATE:
I managed to deploy the model with the following configuration.

   transformers_version="4.12.3", # transformers version used
   pytorch_version="1.9.1", # pytorch version used
   py_version="py38", # python version of the DLC

However, I find it quite cumbersome to juggle the library versions for the Docker images, since the error does not point to a valid combination of transformers, PyTorch, and Python versions, only to the latest version available in each category.

Is there a table that lists the combinations of these libraries that are known to work?

@VikParuchuri

Since I found this issue when Googling, I'll leave the link to all of the deep learning container images here. You can find the correct combinations of Python, transformers, and PyTorch versions there.

@llealgt

llealgt commented Oct 27, 2023

I'm having a similar issue. I was trying to use the settings from the official documentation at https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/ecr-us-east-1.html, but even the default example from that page fails.

It seems to me that the URI system for Hugging Face is broken, or at least the documentation is outdated.

@dayo777

dayo777 commented Dec 1, 2024

I also ran into this issue today. The solution is to look up the correct image URI on the official https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-inference-containers page.

The AWS image URI list is outdated.
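Once a valid tag has been looked up, the full URI can be assembled by hand instead of relying on the SDK's lookup. This is a minimal sketch, assuming the registry layout seen in the error messages in this thread; the account ID and region are copied from those messages and may differ for other regions, and `huggingface_inference_image_uri` is a hypothetical helper name.

```python
def huggingface_inference_image_uri(tag, region="us-east-1", account="763104351884"):
    """Build a full ECR image URI from a tag.

    The defaults are taken from the error messages quoted in this thread;
    other regions may use a different account ID.
    """
    return (
        f"{account}.dkr.ecr.{region}.amazonaws.com/"
        f"huggingface-pytorch-inference:{tag}"
    )


# Tag with the cu111 suffix that the release and image_uri_config point to.
uri = huggingface_inference_image_uri(
    "1.8.1-transformers4.10.2-gpu-py36-cu111-ubuntu18.04"
)
print(uri)
```

To the best of my knowledge, `HuggingFaceModel` accepts an `image_uri` argument, so the resulting string could be passed there in place of the separate version arguments.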
