Tensorflow hangs after creating checkpoint #453
Comments
Those particular tensorflow/core/platform/s3 warnings and errors are unfortunately quite common, but usually do not affect training in any way.
Closing due to inactivity. Feel free to reopen if necessary.
Hi! Did you solve this issue? I seem to be having the same problem.
I'm facing the same issue while trying to launch a training job using script mode on a p-type instance. My code (truncated in the original comment) looks roughly like this:

```python
from sagemaker.tensorflow import TensorFlow

hyperparams = {
    "model_dir": "s3://mlops-data/text_classification_bert/model",
    # ... (remaining hyperparameters truncated in the original comment)
}

tf_estimator = TensorFlow(
    entry_point='task.py',
    role=role,
    # ... (remaining arguments truncated in the original comment)
)

tf_estimator.fit()
```
System Information
The model hangs after these logs are finished. The CloudWatch metrics suggest that nothing is being run on the machine.
Logs
Secondary question: which channels are available, and how does the TensorFlow model use them? Are the expected channel names "train" and "eval" for training and evaluation data?
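As a rough sketch of how channels typically work in SageMaker script mode (assuming the standard SageMaker container conventions; not confirmed anywhere in this thread): each channel name passed to `estimator.fit()` is exposed inside the training container as an `SM_CHANNEL_<NAME>` environment variable pointing at the directory where that channel's data was downloaded. The `get_channel_dir` helper below is hypothetical, written only to illustrate the convention:

```python
import os

def get_channel_dir(channel_name: str) -> str:
    """Return the local data directory for a SageMaker input channel.

    SageMaker sets SM_CHANNEL_<NAME> for every channel passed to fit();
    the fallback below is the conventional container location.
    """
    env_var = "SM_CHANNEL_" + channel_name.upper()
    return os.environ.get(env_var, "/opt/ml/input/data/" + channel_name)

# Example: channels named "train" and "eval", as would be created by
#   tf_estimator.fit({"train": "s3://.../train", "eval": "s3://.../eval"})
train_dir = get_channel_dir("train")
eval_dir = get_channel_dir("eval")
```

With that convention, the names in the dict passed to `fit()` are what determine the channel names; the entry-point script is then responsible for reading its data from the corresponding directories.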