Curl returned error code 55 --- failed to save tensorflow checkpoint on S3 #454
Comments
What is the size of the checkpoint? The TensorFlow S3 file system plugin does not support checkpoints bigger than 5 GB.
@nadiaya The checkpoint is ~90 MB. The error seems to happen at random: every time I save a checkpoint to S3 there is a chance it occurs. For example, when I was saving a checkpoint every 300 training steps, the error sometimes appeared around 10,000 training steps and sometimes around 4,000 steps. The training job fails after the error.
Would it be possible to see the full stack trace with the error? Do you get the same error if running outside of SageMaker?
Full stack trace:
There's no error when running outside of SageMaker because the checkpoint doesn't have to be uploaded to an S3 bucket.
There are a few other known reasons that cause TensorFlow S3 File System Plugin issues:
@nadiaya We are getting the same problem using script mode with TF 1.11.0 and Python 3, with both the SageMaker S3 buckets and our own bucket (specified explicitly). Furthermore, the issue seems closely tied to the process of saving the model weights: a temp dir for the weights is created, but the connection seems to exit before it is cleaned up. The following error emerges after the initial train and eval construction:
Thank you so much! This is very useful information!
How exactly are you running locally? Do you run just the script, or use the SageMaker Python SDK 'local' instance type?
Dear @nadiaya, have there been any updates regarding the cause of the issue? For us this problem has been persisting for the last 3 weeks. It is hard to quantify precisely: on some rare days all training requests seem to succeed, but on others, more often than not, the jobs fail at some point during training. There are a few different cases we’ve encountered (with associated error logs).
The first two types of failures can happen at any point during training without any change to the training parameters. For example, when running the same script 3 times: the first time an error happens 60 steps into training, the second run fails at 1500 steps, and the third run finishes successfully (2000 steps). SageMaker script mode environment [EU (Ireland)]:
The checkpoints are 500-1000 MB due to embeddings trained as part of an RNN. When executed on our own machines (outside of the SageMaker environment, storing the input data and writing checkpoints locally without making any calls to S3), the training always completes. Please let me know if I can assist further.
@andremoeller Where can we get the correct SageMaker TensorFlow 1.12 release, and could you let us know what the cause is? We're also running into the exact same issue.
It seems like the other user is already using 1.12, but you would need to supply that version. I believe the 1.12 image has some fixes built into it that may reduce the occurrence of "Unable to connect to endpoint" issues.
@andremoeller Got it, thanks. Unfortunately we aren't using SageMaker right now; instead we're spinning up P3 EC2 instances with the latest NVIDIA Volta Deep Learning images, which have tensorflow-gpu==1.12.0+nv. Any recommendations? We just haven't looked into SageMaker yet.
If the problem is the "Unable to connect to endpoint" problem, and you're writing or reading checkpoints to S3, then that TensorFlow installation is probably using the standard S3 FileSystem implementation. SageMaker included some fixes to the checkpoint behavior to make it more robust. Since SageMaker is a managed service, you won't need to maintain your EC2 instances, and you can run many jobs at once, distribute training without a cluster manager, do HPO with a managed service, etc. If you're interested in trying it out, here's how you'd move over to SageMaker: https://sagemaker.readthedocs.io/en/stable/using_tf.html#training-with-tensorflow. Two tips: if you already use the Dataset API, you don't have to read from the "channels", and it's easiest to go through the SageMaker console to create the IAM Role. If you'd like to stick with EC2, you might want to look into "yas3fs" and mount your own S3 bucket -- I haven't tried it myself, but I've heard of others getting good results: https://github.com/danilop/yas3fs
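For reference, here is a minimal sketch of launching an existing training script through the SageMaker Python SDK's script mode, following the linked docs. It is an illustration only: the entry point, IAM role ARN, instance type, and S3 paths are placeholders, not values from this thread.

```python
# Minimal sketch, assuming the SageMaker Python SDK v1 script-mode API from the linked
# docs. All names below (script, role ARN, bucket, instance type) are placeholders.
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point='train.py',                               # your existing training script
    role='arn:aws:iam::111122223333:role/SageMakerRole',  # placeholder IAM role
    train_instance_count=1,
    train_instance_type='ml.p3.8xlarge',
    framework_version='1.12',   # image said to include the checkpointing fixes
    py_version='py3',
    script_mode=True,
)

# Each key becomes an input channel; its contents are downloaded from S3 into the
# training container before the job starts.
estimator.fit({'training': 's3://example-bucket/training-data/'})
```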
@andremoeller Thanks again for the detailed response! Just to make sure I understand: it seems like I can do what I'm currently doing, but on SageMaker? We already have our estimators, input_fn, and data query methods ready to go; the code makes external DB calls to fetch large volumes of data. All we need is a GPU server.
Sure thing, @albertlim. It seems like it -- SageMaker will just run the script you give it in a TensorFlow Docker container, and if you have more dependencies, the docs describe how to add those to your environment in SageMaker. About the external DB calls and permissions: if the DB is an AWS service, you can query it directly from SageMaker as long as the IAM role you give to SageMaker has the permissions to do so. If it's some other DB, you might have to get your credentials into the SageMaker container when it runs your job, before you make the query. Getting the data is easiest if it's in S3, in which case SageMaker can download your data before the job starts and you can just read from local files.
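As an illustration of the last point (not code from this thread), a script running in SageMaker can read the data that was downloaded for an input channel from a local directory. The channel name 'training' and the TFRecord layout below are assumptions.

```python
# Sketch, assuming a 'training' channel and TFRecord files; SageMaker exposes each
# channel's local download directory through an SM_CHANNEL_<NAME> environment variable.
import os
import tensorflow as tf

train_dir = os.environ.get('SM_CHANNEL_TRAINING', '/opt/ml/input/data/training')

def input_fn(batch_size=32):
    # Build a Dataset from the files SageMaker already copied from S3 to local disk.
    files = tf.data.Dataset.list_files(os.path.join(train_dir, '*.tfrecord'))
    return tf.data.TFRecordDataset(files).batch(batch_size)
```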
@andremoeller Got it. We're starting to discuss using SageMaker. Until we get the go-ahead to do so, however, would it be possible to share the commit or pull request that shows the fix?
Unfortunately, I won't be able to share the diff with you. I can tell you that it replaces the bare S3 client with the TransferManager: https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/examples-s3-transfermanager.html
Among other things, this lets the S3 plugin do multi-part uploads and retry on failed parts, rather than fail atomically.
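To make the idea concrete (this is not the actual fix, which lives in the Java-based S3 plugin): boto3's managed transfer layer in Python behaves the same way, splitting large objects into parts and retrying failed parts instead of failing the whole upload. The checkpoint file, bucket, and key below are placeholders.

```python
# Illustration only: multi-part upload with per-part retries via boto3's managed
# transfer, analogous to what the TransferManager gives the S3 plugin.
# The checkpoint file, bucket, and key are placeholders.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client('s3')

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multi-part above 64 MB
    multipart_chunksize=16 * 1024 * 1024,  # upload in 16 MB parts
    max_concurrency=4,                     # parts uploaded in parallel
)

s3.upload_file('model.ckpt-7200.data-00000-of-00001',
               'example-bucket',
               'checkpoints/model.ckpt-7200.data-00000-of-00001',
               Config=config)
```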
Thank you for your suggestion @andremoeller! Indeed, I was already using
@andremoeller Hi, I recently started using framework_version='1.14' and came across this issue. Is it resolved for 1.12 only?
@nikhila0912 Could you open a new issue in this repository? Thanks!
System Information
Describe the problem
I was training my code on a p3.8xlarge, and after a number of training steps the curl error pops up and causes the training to fail. I tried 3 times: the error occurred at 3900 steps the first time, 4200 steps the second time, and 7200 steps the third time.
Logs:
tensorflow - Saving checkpoints for 7200 into s3://....../checkpoints/model.ckpt.
tensorflow/core/platform/s3/aws_logging.cc:60] Curl returned error code 55
tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:137 : Unknown: : Unable to connect to endpoint
ERROR - container_support.training - uncaught exception during training: : Unable to connect to endpoint
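For context, a minimal sketch of the kind of setup that hits this code path: checkpoints written straight to an s3:// path through TensorFlow's S3 filesystem plugin. The model_fn, bucket path, and save interval below are placeholders, not taken from this issue.

```python
# Sketch of writing tf.estimator checkpoints directly to S3, where the
# "Curl returned error code 55" / "Unable to connect to endpoint" failure surfaces.
# The model_fn, bucket path, and save interval are placeholders.
import tensorflow as tf

def model_fn(features, labels, mode):
    # Placeholder model: one dense layer, only to make the sketch self-contained.
    logits = tf.layers.dense(features['x'], 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.AdamOptimizer().minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

run_config = tf.estimator.RunConfig(
    model_dir='s3://example-bucket/checkpoints',  # checkpoints saved straight to S3
    save_checkpoints_steps=300,                   # placeholder save interval
)

estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
```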