
amazon-sagemaker-bert-pytorch failed to run #1975

Open
yzhang-github-pub opened this issue Oct 27, 2020 · 11 comments

Comments

@yzhang-github-pub

I cloned https://github.com/aws-samples/amazon-sagemaker-bert-pytorch.git in SageMaker, ran the Jupyter notebook without any modification, and got the error below:

UnexpectedStatusException: Error for Training job pytorch-training-2020-10-27-16-28-37-955: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python train_deploy.py --backend gloo --epochs 1 --num_labels 2".

@sugspi

sugspi commented Oct 29, 2020

Hi,

I’ve also had the same kind of problem since last Monday (just this week).
I can’t run PyTorch code on a SageMaker instance (ml.p3.8xlarge and local mode).
However, I could run and complete training using the same code last Friday.

I got AlgorithmError: ExecuteUserScriptError, but my algorithm log (I use CloudWatch) is:
██████████| 207659/210805 [03:27<00:02, 1050.67it/s]#015 99%|██████████| 210753/210805 [03:30<00:00, 1048.70it/s]#015100%|██████████| 210805/210805 [03:30<00:00, 999.88it/s]
tqdm outputs garbled characters.
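For reference, the mangled runs are tqdm's Unicode block characters, and #015 is how CloudWatch renders the carriage return. A toy ASCII formatter (illustrative only, not tqdm's implementation, though passing ascii=True to tqdm avoids the block characters in a similar way) shows what each line was meant to say:

```python
def ascii_bar(n, total, width=10):
    """Render one tqdm-style progress line with ASCII characters only."""
    filled = int(width * n / total)  # how many of `width` cells are complete
    return "{}%|{}{}| {}/{}".format(
        100 * n // total, "#" * filled, "." * (width - filled), n, total)

# ascii_bar(210805, 210805) renders the final line of the log above:
# 100%|##########| 210805/210805
```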

I would like to know any workaround.
Thank you very much for your help.

@spacer730

Hi,

I have the same issue as above. Code that worked perfectly two weeks ago, and in which I changed nothing, now fails to run.

There is something seriously wrong with the PyTorch estimator in SageMaker. This should be looked into.

Thank you

@lawwu

lawwu commented Nov 13, 2020

I ran into the same issue but was able to get the training job to run by using PyTorch framework_version 1.6.0 and updating to transformers==3.5.0. But when trying to deploy an endpoint:

predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

I get this error:

---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-14-0037ac51261d> in <module>
----> 1 predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, use_compiled_model, wait, model_name, kms_key, data_capture_config, tags, **kwargs)
    801             wait=wait,
    802             kms_key=kms_key,
--> 803             data_capture_config=data_capture_config,
    804         )
    805 

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, tags, kms_key, wait, data_capture_config, **kwargs)
    528             kms_key=kms_key,
    529             wait=wait,
--> 530             data_capture_config_dict=data_capture_config_dict,
    531         )
    532 

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in endpoint_from_production_variants(self, name, production_variants, tags, kms_key, wait, data_capture_config_dict)
   3191 
   3192             self.sagemaker_client.create_endpoint_config(**config_options)
-> 3193         return self.create_endpoint(endpoint_name=name, config_name=name, tags=tags, wait=wait)
   3194 
   3195     def expand_role(self, role):

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in create_endpoint(self, endpoint_name, config_name, tags, wait)
   2709         )
   2710         if wait:
-> 2711             self.wait_for_endpoint(endpoint_name)
   2712         return endpoint_name
   2713 

~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in wait_for_endpoint(self, endpoint, poll)
   2978                 ),
   2979                 allowed_statuses=["InService"],
-> 2980                 actual_status=status,
   2981             )
   2982         return desc

UnexpectedStatusException: Error hosting endpoint pytorch-training-2020-11-13-22-51-37-707: Failed. Reason:  The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint.

Pulling up the cloudwatch logs, I see:

Traceback (most recent call last):
  File "/opt/conda/bin/torch-model-archiver", line 10, in <module>
    sys.exit(generate_model_archive())
  File "/opt/conda/lib/python3.6/site-packages/model_archiver/model_packaging.py", line 60, in generate_model_archive
    package_model(args, manifest=manifest)
  File "/opt/conda/lib/python3.6/site-packages/model_archiver/model_packaging.py", line 37, in package_model
    model_path = ModelExportUtils.copy_artifacts(model_name, **artifact_files)
  File "/opt/conda/lib/python3.6/site-packages/model_archiver/model_packaging_utils.py", line 150, in copy_artifacts
    shutil.copy(path, model_path)
  File "/opt/conda/lib/python3.6/shutil.py", line 245, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/opt/conda/lib/python3.6/shutil.py", line 120, in copyfile
    with open(src, 'rb') as fsrc:

and

FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/model/model.pth'

Did something change in the PyTorch image?

@lawwu

lawwu commented Nov 14, 2020

Updating the requirements.txt to

tqdm
requests==2.22.0
regex
sacremoses
sentencepiece==0.1.91
transformers==3.5.0

and using the PyTorch 1.5.0 image worked for me.

@NeuroWinter

NeuroWinter commented Dec 9, 2020

I am getting the exact same issue: PyTorch 1.5.0 does not work for me with the above requirements, nor does 1.6.0 with the following requirements:

pycm 
tqdm 
requests==2.22.0 
regex 
sacremoses 
sentencepiece==0.1.91 
# transformers==3.5.0 does not work with PyTorch 1.5-1.6
transformers==4.0.0

The sagemaker version in the notebook is as follows:

sagemaker==2.18.0 
sagemaker-pyspark==1.4.1 

@icywang86rui
Contributor

The change to switch to TorchServe broke some workflows where the training job doesn't save the model to "/opt/ml/model/model.pth". See aws/sagemaker-pytorch-inference-toolkit@a3a08d0.
This change was released in the 1.6 PT container. The solution of providing a custom model-saving method, mentioned in https://github.com/aws/sagemaker-pytorch-inference-toolkit/issues/86, should work. Another workaround is to switch to 1.5 or an older version of the PyTorch images.
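As a minimal sketch of that workaround (SM_MODEL_DIR is the environment variable SageMaker sets in the training container; `model_save_path` is a hypothetical helper, and the trailing comment assumes the training script holds a trained `model`):

```python
import os

def model_save_path(model_dir=None):
    """Path where the PT 1.6+ TorchServe-based inference container
    expects the weights. SM_MODEL_DIR defaults to /opt/ml/model
    inside the training container."""
    model_dir = model_dir or os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    return os.path.join(model_dir, "model.pth")

# At the end of train_deploy.py's training loop (assuming `model` exists):
#     import torch
#     torch.save(model.state_dict(), model_save_path())
```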

@himswamy

Hi, I'm still facing the same error.

@himswamy

PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages
Invoking script with the following command:
/opt/conda/bin/python train_deploy.py --backend gloo --epochs 1 --num_labels 2
Traceback (most recent call last):
  File "train_deploy.py", line 14, in <module>
    from transformers import AdamW, BertForSequenceClassification, BertTokenizer
ModuleNotFoundError: No module named 'transformers'
2022-01-16 17:48:12,592 sagemaker-containers ERROR ExecuteUserScriptError:
Command "/opt/conda/bin/python train_deploy.py --backend gloo --epochs 1 --num_labels 2"
Traceback (most recent call last):
  File "train_deploy.py", line 14, in <module>
    from transformers import AdamW, BertForSequenceClassification, BertTokenizer
ModuleNotFoundError: No module named 'transformers'

2022-01-16 17:48:21 Uploading - Uploading generated training model
2022-01-16 17:48:21 Failed - Training job failed
Traceback (most recent call last):
  File "train_deploy.py", line 14, in <module>
    from transformers import AdamW, BertForSequenceClassification, BertTokenizer
ModuleNotFoundError: No module named 'transformers'
2022-01-16 17:48:11,890 sagemaker-containers ERROR ExecuteUserScriptError:
Command "/opt/conda/bin/python train_deploy.py --backend gloo --epochs 1 --num_labels 2"
Traceback (most recent call last):
  File "train_deploy.py", line 14, in <module>
    from transformers import AdamW, BertForSequenceClassification, BertTokenizer
ModuleNotFoundError: No module named 'transformers'

UnexpectedStatusException Traceback (most recent call last)
in
20 disable_profiler=True, # disable debugger
21 )
---> 22 estimator.fit({"training": inputs_train, "testing": inputs_test})

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
690 self.jobs.append(self.latest_training_job)
691 if wait:
--> 692 self.latest_training_job.wait(logs=logs)
693
694 def _compilation_job_name(self):

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
1653 # If logs are requested, call logs_for_jobs.
1654 if logs != "None":
-> 1655 self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
1656 else:
1657 self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
3777
3778 if wait:
-> 3779 self._check_job_status(job_name, description, "TrainingJobStatus")
3780 if dot:
3781 print()

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
3336 ),
3337 allowed_statuses=["Completed", "Stopped"],
-> 3338 actual_status=status,
3339 )
3340

UnexpectedStatusException: Error for Training job pytorch-training-2022-01-16-17-43-52-825: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python train_deploy.py --backend gloo --epochs 1 --num_labels 2"
Traceback (most recent call last):
  File "train_deploy.py", line 14, in <module>
    from transformers import AdamW, BertForSequenceClassification, BertTokenizer
ModuleNotFoundError: No module named 'transformers'
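For what it's worth, the training container only pip-installs requirements.txt when it is uploaded next to the entry point via source_dir, so a ModuleNotFoundError for transformers usually means the file never made it into the job. A stdlib-only sketch of the expected layout (file contents and estimator values are illustrative):

```python
import os
import tempfile

def build_source_dir(root):
    """Lay out a source_dir the way the PyTorch estimator expects:
    requirements.txt must sit beside the entry point so the container
    installs it before running the script."""
    os.makedirs(root, exist_ok=True)
    with open(os.path.join(root, "train_deploy.py"), "w") as f:
        f.write("# entry point\n")
    with open(os.path.join(root, "requirements.txt"), "w") as f:
        f.write("transformers==3.5.0\n")
    return sorted(os.listdir(root))

# The estimator then points at this directory (values illustrative):
#     PyTorch(entry_point="train_deploy.py", source_dir=root,
#             framework_version="1.5.0", py_version="py3", ...)
```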

@georgebakas

The issue has not yet been addressed. I am trying to deploy a custom DETR model with PyTorch (based on the HuggingFace implementation and trained with PyTorch for custom object detection). I am using PT v1.11 and Python 3.8. The weights are structured in the same way as stated in the AWS PyTorch documentation:
https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#for-versions-1-2-and-higher

The deployment is performed on an ml.m4.xlarge, and the error I get is that the weights are not found.

W-9000-model_1.0-stdout MODEL_LOG - FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/model/model.pt'

Could you please clarify where the model weights are expected to be? Or please provide the container Dockerfile so we can check.
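One possible cause, sketched here with the stdlib only: the endpoint extracts model.tar.gz straight into /opt/ml/model/, so the weights must sit at the archive root; a nested folder (e.g. model/model.pt) produces exactly this FileNotFoundError. `package_model` is a hypothetical helper:

```python
import os
import tarfile
import tempfile

def package_model(weights_path, out_path):
    """Archive the weights at the tar root. The hosting container
    extracts model.tar.gz into /opt/ml/model/, so arcname must be the
    bare file name, not a nested path."""
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(weights_path, arcname=os.path.basename(weights_path))
    # Return the member names so the layout can be checked before upload.
    with tarfile.open(out_path, "r:gz") as tar:
        return tar.getnames()
```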

@akrishna1995
Contributor

Thanks for raising this issue. Can anyone confirm whether this issue still persists on the latest SageMaker?

@taziamoma

I'm having this exact issue despite including transformers in my requirements.txt file. Mine is handled by SageMaker, though; I'm not uploading a model.
