
amazon-sagemaker-bert-pytorch failed to run #1975

Open
yzhang-github-pub opened this issue Oct 27, 2020 · 11 comments

Comments

@yzhang-github-pub

I cloned https://github.com/aws-samples/amazon-sagemaker-bert-pytorch.git in SageMaker, ran the Jupyter notebook without any modification, and got the error below:

UnexpectedStatusException: Error for Training job pytorch-training-2020-10-27-16-28-37-955: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python train_deploy.py --backend gloo --epochs 1 --num_labels 2".

@sugspi

sugspi commented Oct 29, 2020

Hi,

I’ve also had the same kind of problem since last Monday (just this week).
I can’t run PyTorch code on a SageMaker instance (ml.p3.8xlarge and local mode).
However, I could run and complete training using the same code last Friday.

I got AlgorithmError: ExecuteUserScriptError, but my algorithm log (I use CloudWatch) is:
██████████| 207659/210805 [03:27<00:02, 1050.67it/s]#015 99%|██████████| 210753/210805 [03:30<00:00, 1048.70it/s]#015100%|██████████| 210805/210805 [03:30<00:00, 999.88it/s]
tqdm outputs garbled characters.
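For reference, the mangled runs are tqdm's Unicode block characters, and #015 is how CloudWatch renders the carriage return. A toy ASCII formatter (illustrative only, not tqdm's implementation, though passing ascii=True to tqdm avoids the block characters in a similar way) shows what each line was meant to say:

```python
def ascii_bar(n, total, width=10):
    """Render one tqdm-style progress line with ASCII characters only."""
    filled = int(width * n / total)  # how many of `width` cells are complete
    return "{}%|{}{}| {}/{}".format(
        100 * n // total, "#" * filled, "." * (width - filled), n, total)

# ascii_bar(210805, 210805) renders the final line of the log above:
# 100%|##########| 210805/210805
```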

I would like to know any workaround.
Thank you very much for your help.

@spacer730

Hi,

I have the same issue as above. Code that worked perfectly two weeks ago, and in which I changed nothing, now fails to run.

There is something seriously wrong with the PyTorch estimator in SageMaker. This should be looked into.

Thank you

@lawwu

lawwu commented Nov 13, 2020

I ran into the same issue but was able to get the training job to run by using PyTorch framework_version 1.6.0 and updating to transformers==3.5.0. But when trying to deploy an endpoint:

predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

I get this error:

---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-14-0037ac51261d> in <module>
----> 1 predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, use_compiled_model, wait, model_name, kms_key, data_capture_config, tags, **kwargs)
    801             wait=wait,
    802             kms_key=kms_key,
--> 803             data_capture_config=data_capture_config,
    804         )
    805 

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, tags, kms_key, wait, data_capture_config, **kwargs)
    528             kms_key=kms_key,
    529             wait=wait,
--> 530             data_capture_config_dict=data_capture_config_dict,
    531         )
    532 

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in endpoint_from_production_variants(self, name, production_variants, tags, kms_key, wait, data_capture_config_dict)
   3191 
   3192             self.sagemaker_client.create_endpoint_config(**config_options)
-> 3193         return self.create_endpoint(endpoint_name=name, config_name=name, tags=tags, wait=wait)
   3194 
   3195     def expand_role(self, role):

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in create_endpoint(self, endpoint_name, config_name, tags, wait)
   2709         )
   2710         if wait:
-> 2711             self.wait_for_endpoint(endpoint_name)
   2712         return endpoint_name
   2713 

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in wait_for_endpoint(self, endpoint, poll)
   2978                 ),
   2979                 allowed_statuses=["InService"],
-> 2980                 actual_status=status,
   2981             )
   2982         return desc

UnexpectedStatusException: Error hosting endpoint pytorch-training-2020-11-13-22-51-37-707: Failed. Reason:  The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint.

Pulling up the cloudwatch logs, I see:

Traceback (most recent call last):
  File "/opt/conda/bin/torch-model-archiver", line 10, in <module>
    sys.exit(generate_model_archive())
  File "/opt/conda/lib/python3.6/site-packages/model_archiver/model_packaging.py", line 60, in generate_model_archive
    package_model(args, manifest=manifest)
  File "/opt/conda/lib/python3.6/site-packages/model_archiver/model_packaging.py", line 37, in package_model
    model_path = ModelExportUtils.copy_artifacts(model_name, **artifact_files)
  File "/opt/conda/lib/python3.6/site-packages/model_archiver/model_packaging_utils.py", line 150, in copy_artifacts
    shutil.copy(path, model_path)
  File "/opt/conda/lib/python3.6/shutil.py", line 245, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/opt/conda/lib/python3.6/shutil.py", line 120, in copyfile
    with open(src, 'rb') as fsrc:

and

FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/model/model.pth'

Did something change in the PyTorch image?

@lawwu

lawwu commented Nov 14, 2020

Updating the requirements.txt to

tqdm
requests==2.22.0
regex
sacremoses
sentencepiece==0.1.91
transformers==3.5.0

and using the PyTorch 1.5.0 image worked for me.

@NeuroWinter

NeuroWinter commented Dec 9, 2020

I am getting the exact same issue: PyTorch 1.5.0 does not work for me with the above requirements, nor does 1.6.0 with the following requirements:

pycm 
tqdm 
requests==2.22.0 
regex 
sacremoses 
sentencepiece==0.1.91 
# transformers==3.5.0 does not work with PyTorch 1.5-1.6
transformers==4.0.0

The sagemaker version in the notebook is as follows:

sagemaker==2.18.0 
sagemaker-pyspark==1.4.1 

@icywang86rui
Contributor

The change to switch to TorchServe broke some workflows where the training job doesn't save the model to "/opt/ml/model/model.pth". See aws/sagemaker-pytorch-inference-toolkit@a3a08d0.
This change was released in the 1.6 PT container. The solution of providing a custom model-saving method, mentioned in https://github.com/aws/sagemaker-pytorch-inference-toolkit/issues/86, should work. Another workaround is to switch to 1.5 or an older version of the PyTorch images.
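As a minimal sketch of that workaround (SM_MODEL_DIR is the environment variable SageMaker sets in the training container; `model_save_path` is a hypothetical helper, and the trailing comment assumes the training script holds a trained `model`):

```python
import os

def model_save_path(model_dir=None):
    """Path where the PT 1.6+ TorchServe-based inference container
    expects the weights. SM_MODEL_DIR defaults to /opt/ml/model
    inside the training container."""
    model_dir = model_dir or os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    return os.path.join(model_dir, "model.pth")

# At the end of train_deploy.py's training loop (assuming `model` exists):
#     import torch
#     torch.save(model.state_dict(), model_save_path())
```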

@himswamy

Hi, I'm still facing the same error.

@himswamy

PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages
Invoking script with the following command:
/opt/conda/bin/python train_deploy.py --backend gloo --epochs 1 --num_labels 2
Traceback (most recent call last):
  File "train_deploy.py", line 14, in <module>
    from transformers import AdamW, BertForSequenceClassification, BertTokenizer
ModuleNotFoundError: No module named 'transformers'
2022-01-16 17:48:12,592 sagemaker-containers ERROR ExecuteUserScriptError:
Command "/opt/conda/bin/python train_deploy.py --backend gloo --epochs 1 --num_labels 2"
Traceback (most recent call last):
  File "train_deploy.py", line 14, in <module>
    from transformers import AdamW, BertForSequenceClassification, BertTokenizer
ModuleNotFoundError: No module named 'transformers'

2022-01-16 17:48:21 Uploading - Uploading generated training model
2022-01-16 17:48:21 Failed - Training job failed
Traceback (most recent call last):
  File "train_deploy.py", line 14, in <module>
    from transformers import AdamW, BertForSequenceClassification, BertTokenizer
ModuleNotFoundError: No module named 'transformers'
2022-01-16 17:48:11,890 sagemaker-containers ERROR ExecuteUserScriptError:
Command "/opt/conda/bin/python train_deploy.py --backend gloo --epochs 1 --num_labels 2"
Traceback (most recent call last):
  File "train_deploy.py", line 14, in <module>
    from transformers import AdamW, BertForSequenceClassification, BertTokenizer
ModuleNotFoundError: No module named 'transformers'

UnexpectedStatusException Traceback (most recent call last)
in
20 disable_profiler=True, # disable debugger
21 )
---> 22 estimator.fit({"training": inputs_train, "testing": inputs_test})

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
690 self.jobs.append(self.latest_training_job)
691 if wait:
--> 692 self.latest_training_job.wait(logs=logs)
693
694 def _compilation_job_name(self):

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
1653 # If logs are requested, call logs_for_jobs.
1654 if logs != "None":
-> 1655 self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
1656 else:
1657 self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
3777
3778 if wait:
-> 3779 self._check_job_status(job_name, description, "TrainingJobStatus")
3780 if dot:
3781 print()

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
3336 ),
3337 allowed_statuses=["Completed", "Stopped"],
-> 3338 actual_status=status,
3339 )
3340

UnexpectedStatusException: Error for Training job pytorch-training-2022-01-16-17-43-52-825: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python train_deploy.py --backend gloo --epochs 1 --num_labels 2"
Traceback (most recent call last):
  File "train_deploy.py", line 14, in <module>
    from transformers import AdamW, BertForSequenceClassification, BertTokenizer
ModuleNotFoundError: No module named 'transformers'
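For what it's worth, the training container only pip-installs requirements.txt when it is uploaded next to the entry point via source_dir, so a ModuleNotFoundError for transformers usually means the file never made it into the job. A stdlib-only sketch of the expected layout (file contents and estimator values are illustrative):

```python
import os
import tempfile

def build_source_dir(root):
    """Lay out a source_dir the way the PyTorch estimator expects:
    requirements.txt must sit beside the entry point so the container
    installs it before running the script."""
    os.makedirs(root, exist_ok=True)
    with open(os.path.join(root, "train_deploy.py"), "w") as f:
        f.write("# entry point\n")
    with open(os.path.join(root, "requirements.txt"), "w") as f:
        f.write("transformers==3.5.0\n")
    return sorted(os.listdir(root))

# The estimator then points at this directory (values illustrative):
#     PyTorch(entry_point="train_deploy.py", source_dir=root,
#             framework_version="1.5.0", py_version="py3", ...)
```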

@georgebakas

The issue has not yet been addressed. I am trying to deploy a custom DETR model with PyTorch (based on the HuggingFace implementation and trained with PyTorch for custom object detection). I am using PT v1.11 and Python 3.8. The weights are structured in the same way as stated in the AWS PyTorch documentation:
https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#for-versions-1-2-and-higher

The deployment is performed on an ml.m4.xlarge, and the error I get is that the weights are not found.

W-9000-model_1.0-stdout MODEL_LOG - FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/model/model.pt'

Could you please clarify where the model weights are expected to be? Or please provide the container Dockerfile so we can check.
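One possible cause, sketched here with the stdlib only: the endpoint extracts model.tar.gz straight into /opt/ml/model/, so the weights must sit at the archive root; a nested folder (e.g. model/model.pt) produces exactly this FileNotFoundError. `package_model` is a hypothetical helper:

```python
import os
import tarfile
import tempfile

def package_model(weights_path, out_path):
    """Archive the weights at the tar root. The hosting container
    extracts model.tar.gz into /opt/ml/model/, so arcname must be the
    bare file name, not a nested path."""
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(weights_path, arcname=os.path.basename(weights_path))
    # Return the member names so the layout can be checked before upload.
    with tarfile.open(out_path, "r:gz") as tar:
        return tar.getnames()
```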

@akrishna1995
Contributor

Thanks for raising this issue. Can anyone confirm whether this issue still persists on the latest SageMaker?

@taziamoma

I'm having this exact issue despite including transformers in my requirements.txt file. Mine is handled by SageMaker, though; I'm not uploading a model.
