amazon-sagemaker-bert-pytorch failed to run #1975
Comments
Hi, I’ve also had the same kind of problem since last Monday (just this week). I got AlgorithmError: ExecuteUserScriptError, but my algorithm log (I use CloudWatch) is : I would like to know of any workaround. |
Hi, I have the same issue as above. Code that worked perfectly 2 weeks ago, and in which I changed nothing, no longer runs. There is something seriously wrong with the PyTorch estimator of SageMaker. This should be looked at. Thank you |
Ran into the same issue but was able to get the training job to run using pytorch
I get this error:
---------------------------------------------------------------------------
UnexpectedStatusException Traceback (most recent call last)
<ipython-input-14-0037ac51261d> in <module>
----> 1 predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, use_compiled_model, wait, model_name, kms_key, data_capture_config, tags, **kwargs)
801 wait=wait,
802 kms_key=kms_key,
--> 803 data_capture_config=data_capture_config,
804 )
805
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, serializer, deserializer, accelerator_type, endpoint_name, tags, kms_key, wait, data_capture_config, **kwargs)
528 kms_key=kms_key,
529 wait=wait,
--> 530 data_capture_config_dict=data_capture_config_dict,
531 )
532
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in endpoint_from_production_variants(self, name, production_variants, tags, kms_key, wait, data_capture_config_dict)
3191
3192 self.sagemaker_client.create_endpoint_config(**config_options)
-> 3193 return self.create_endpoint(endpoint_name=name, config_name=name, tags=tags, wait=wait)
3194
3195 def expand_role(self, role):
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in create_endpoint(self, endpoint_name, config_name, tags, wait)
2709 )
2710 if wait:
-> 2711 self.wait_for_endpoint(endpoint_name)
2712 return endpoint_name
2713
~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/sagemaker/session.py in wait_for_endpoint(self, endpoint, poll)
2978 ),
2979 allowed_statuses=["InService"],
-> 2980 actual_status=status,
2981 )
2982 return desc
UnexpectedStatusException: Error hosting endpoint pytorch-training-2020-11-13-22-51-37-707: Failed. Reason: The primary container for production variant AllTraffic did not pass the ping health check. Please check CloudWatch logs for this endpoint.
Pulling up the CloudWatch logs, I see:
Traceback (most recent call last):
File "/opt/conda/bin/torch-model-archiver", line 10, in <module>
sys.exit(generate_model_archive())
File "/opt/conda/lib/python3.6/site-packages/model_archiver/model_packaging.py", line 60, in generate_model_archive
package_model(args, manifest=manifest)
File "/opt/conda/lib/python3.6/site-packages/model_archiver/model_packaging.py", line 37, in package_model
model_path = ModelExportUtils.copy_artifacts(model_name, **artifact_files)
File "/opt/conda/lib/python3.6/site-packages/model_archiver/model_packaging_utils.py", line 150, in copy_artifacts
shutil.copy(path, model_path)
File "/opt/conda/lib/python3.6/shutil.py", line 245, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/opt/conda/lib/python3.6/shutil.py", line 120, in copyfile
with open(src, 'rb') as fsrc:
and
FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/model/model.pth'
Did something change in the PyTorch image? |
Updating the requirements.txt to
and using the pytorch 1.5.0 image worked for me. |
I am getting the exact same issue. Using PyTorch 1.5.0 does not work for me with the above requirements, nor do the following requirements with 1.6.0:
The sagemaker version in the notebook is as follows:
|
This change to switch to TorchServe broke some workflows when the training job doesn't save the model to "/opt/ml/model/model.pth". See aws/sagemaker-pytorch-inference-toolkit@a3a08d0 |
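For anyone who wants to check whether they are hitting this, here is a minimal sketch (not part of the toolkit) that looks inside a downloaded model.tar.gz for the file the container expects. SageMaker packages whatever the training job wrote under /opt/ml/model into that archive, so the file needs to sit at the archive root. The helper name `archive_has_top_level_file` is hypothetical. The default file name is taken from the error above, and it is an assumption that your container version looks for the same name, so pass a different one if your logs show it.

```python
import tarfile


def archive_has_top_level_file(tar_path, expected_name="model.pth"):
    """Return True if `expected_name` is a regular file at the root of the archive.

    `expected_name` defaults to the file named in the FileNotFoundError above;
    this is an assumption, so adjust it to whatever your endpoint logs report.
    """
    with tarfile.open(tar_path, "r:*") as archive:
        for member in archive.getmembers():
            name = member.name
            # Archives are often created with a leading "./" on each entry.
            if name.startswith("./"):
                name = name[2:]
            if member.isfile() and name == expected_name:
                return True
    return False
```

Calling `archive_has_top_level_file("model.tar.gz")` on the artifact downloaded from the training job's S3 output location returns False when the weights were saved under another name or inside a subdirectory, which would match the failure described here.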
Hi, still facing the same error |
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages
2022-01-16 17:48:21 Uploading - Uploading generated training model
|
The issue is not yet addressed. I am trying to deploy a custom DETR model with PyTorch (based on the HuggingFace implementation and trained with PyTorch for custom object detection). I am using PT v1.11 and Python 3.8. The weights are structured in the same way that is stated in the aws/pytorch documentation: The deployment is performed on an ml.m4.xlarge and the error I get is that the weights are not found. W-9000-model_1.0-stdout MODEL_LOG - FileNotFoundError: [Errno 2] No such file or directory: '/opt/ml/model/model.pt' Could you please clarify where the model weights are expected to be? Or please provide the container Dockerfile so that I can check. |
Thanks for raising this issue. Can anyone confirm whether this issue still persists on the latest sagemaker? |
I'm having this exact issue despite including |
I cloned https://github.com/aws-samples/amazon-sagemaker-bert-pytorch.git in SageMaker, ran the Jupyter notebook without any modification, and got the error below:
"UnexpectedStatusException: Error for Training job pytorch-training-2020-10-27-16-28-37-955: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/opt/conda/bin/python train_deploy.py --backend gloo --epochs 1 --num_labels 2".