Tensorflow hangs after creating checkpoint #453

Closed

gdj0nes opened this issue Nov 1, 2018 · 4 comments

Comments

@gdj0nes

gdj0nes commented Nov 1, 2018

System Information

  • Framework: TensorFlow
  • Framework version: 1.10.0
  • Python version: py2
  • Instance type: CPU
  • Python SDK version: 1.12.0
  • Custom image: No

The model hangs after these logs finish. The CloudWatch metrics suggest that nothing is running on the machine.

Logs

WARNING:root:pandas failed to import. Analytics features will be impaired or broken.
INFO:sagemaker:Created S3 bucket: sagemaker-us-east-1-bucket
INFO:sagemaker:Creating training-job with name: model-*****
2018-11-01 14:54:31 Starting - Starting the training job.INFO:sagemaker:TensorBoard 0.1.7 at http://localhost:6007
..
Launching requested ML instances......
Preparing the instances for training.....
2018-11-01 14:56:43,692 INFO - root - running container entrypoint
2018-11-01 14:56:43,692 INFO - root - starting train task
2018-11-01 14:56:43,699 INFO - container_support.training - Training starting
Downloading s3://sagemaker-us-east-1-****/model-****/source/sourcedir.tar.gz to /tmp/script.tar.gz
2018-11-01 14:56:47,030 INFO - container_support.environment - current Python environment
2018-11-01 14:56:47,031 INFO - container_support.environment - installing requirements in /opt/ml/code/requirements.txt via pip
2018-11-01 14:56:48,366 INFO - tf_container - ----------------------TF_CONFIG--------------------------
2018-11-01 14:56:48,367 INFO - tf_container - {"environment": "cloud", "cluster": {"master": ["algo-1:2222"]}, "task": {"index": 0, "type": "master"}}
2018-11-01 14:56:48,367 INFO - tf_container - ---------------------------------------------------------
2018-11-01 14:56:48,367 INFO - tf_container - creating RunConfig:
2018-11-01 14:56:48,367 INFO - tf_container - {u'log_step_count_steps': 10, u'save_summary_steps': 10, 'save_checkpoints_secs': 300}
2018-11-01 14:56:48,367 INFO - tensorflow - TF_CONFIG environment variable: {u'environment': u'cloud', u'cluster': {u'master': [u'algo-1:2222']}, u'task': {u'index': 0, u'type': u'master'}}
2018-11-01 14:56:48,367 INFO - tf_container - creating an estimator from the user-provided model_fn
2018-11-01 14:56:48,368 INFO - tensorflow - Using config: {'_save_checkpoints_secs': 300, '_keep_checkpoint_max': 5, '_task_type': u'master', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc0d711e1d0>, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_device_fn': None, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 10, '_evaluation_master': '', '_eval_distribute': None, '_train_distribute': None, '_session_config': device_filters: "/job:ps"
device_filters: "/job:master"
allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_global_id_in_cluster': 0, '_is_chief': True, '_protocol': None, '_save_checkpoints_steps': None, '_experimental_distribute': None, '_save_summary_steps': 10, '_model_dir': u's3://sagemaker-us-east-1-***/model-***/checkpoints', '_master': ''}
2018-11-01 14:56:48,369 INFO - tensorflow - Skip starting Tensorflow server as there is only one node in the cluster.
2018-11-01 14:56:48.441981: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-11-01 14:56:48.442028: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
in the pipeline <PipeModeDataset shapes: (), types: tf.string>

2018-11-01 14:56:37 Downloading - Downloading input data
2018-11-01 14:56:41 Training - Training image download completed. Training in progress.2018-11-01 14:56:50,689 INFO - tensorflow - Calling model_fn.
2018-11-01 14:56:54,181 INFO - tensorflow - Done calling model_fn.
2018-11-01 14:56:54,182 INFO - tensorflow - Create CheckpointSaverHook.
2018-11-01 14:56:54.192907: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-11-01 14:56:54.192936: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-11-01 14:56:54.214335: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-11-01 14:56:54.214361: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-11-01 14:56:54.233231: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-11-01 14:56:54.233256: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-11-01 14:56:56,004 INFO - tensorflow - Graph was finalized.
2018-11-01 14:56:56.011930: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-11-01 14:56:56.011963: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-11-01 14:56:58,011 INFO - tensorflow - Running local_init_op.
2018-11-01 14:56:58,079 INFO - tensorflow - Done running local_init_op.
2018-11-01 14:56:58.799410: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-11-01 14:56:58.799448: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-11-01 14:57:01,866 INFO - tensorflow - Saving checkpoints for 0 into s3://sagemaker-us-east-1-****/****/checkpoints/model.ckpt.
2018-11-01 14:57:10.640247: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-11-01 14:57:10.640285: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(role=role,
                       framework_version='1.10.0',
                       train_volume_size=200,
                       input_mode='Pipe',
                       entry_point='entrypoint.py',
                       source_dir='model/estimator',
                       training_steps=10,
                       evaluation_steps=5,
                       train_instance_count=1,
                       train_instance_type='ml.m5.xlarge',
                       base_job_name='model-name',
                       requirements_file='requirements.txt',
                       **hyp_params)
estimator.fit({'train': 's3://bucket/prefix_or_file',
               'eval':  's3://bucket/prefix_or_file'},
              run_tensorboard_locally=True)

Secondary question: which channels are available, and how does the TensorFlow model use them? Is the channel supposed to be 'training' for both "train" and "eval"?

PipeModeDataset(channel='training', record_format='TFRecord')
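
My current guess (unverified, just a sketch of an assumption) is that the channel names come from the keys of the dict passed to estimator.fit(), so with the fit() call above the channels would be 'train' and 'eval' rather than the default 'training':

# Hypothetical sketch: channel names assumed to match the fit() dict keys above.
from sagemaker_tensorflow import PipeModeDataset

train_ds = PipeModeDataset(channel='train', record_format='TFRecord')
eval_ds = PipeModeDataset(channel='eval', record_format='TFRecord')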
@nadiaya
Contributor

nadiaya commented Nov 5, 2018

Those particular tensorflow/core/platform/s3 warnings and errors are unfortunately quite common, but usually do not affect training in any way.
Does the training job just hang and never fail?

apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this issue Nov 15, 2018
@laurenyu
Contributor

closing due to inactivity. feel free to reopen if necessary.

@gautiese

Hi! Did you solve this issue? I seem to be having the same problem.

@nikhila0912

nikhila0912 commented May 22, 2020

I'm facing the same issue while trying to launch a training job using script mode on a P-type instance.
I explicitly specified a model_dir path (an S3 location) for saving checkpoints, but it tries to save to the default location mentioned above (s3://sagemaker-us-east-1-//checkpoints/model.ckpt), and after saving the initial checkpoint the training crashes.

My piece of code:

from sagemaker.tensorflow import TensorFlow

hyperparams = {
    "model_dir": "s3://mlops-data/text_classification_bert/model",
    "bucket-name": "mlops-data",
    "data_dir": "data_input",
    "strategy": "None",
}

tf_estimator = TensorFlow(entry_point='task.py',
                          role=role,
                          source_dir='text_classification/trainer_module/trainer',
                          train_instance_count=1,
                          train_instance_type='ml.p3.2xlarge',
                          hyperparameters=hyperparams,
                          framework_version='1.14',
                          py_version='py3',
                          script_mode=True)

tf_estimator.fit()
Also, data_dir and bucket_name are arguments I have defined in my script.
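
One thing that may be worth checking (a sketch based on an assumption about the SDK, not verified here): the TensorFlow estimator also accepts a model_dir argument directly, and in script mode that value is passed to the training script as a --model_dir command-line argument. The script then has to read it and hand it to its RunConfig, otherwise TensorFlow falls back to the default checkpoint location:

# Hypothetical sketch: set model_dir on the estimator (role and source_dir as above)
# and read it inside task.py.
tf_estimator = TensorFlow(entry_point='task.py',
                          role=role,
                          source_dir='text_classification/trainer_module/trainer',
                          model_dir='s3://mlops-data/text_classification_bert/model',
                          train_instance_count=1,
                          train_instance_type='ml.p3.2xlarge',
                          framework_version='1.14',
                          py_version='py3',
                          script_mode=True)

# Inside task.py (sketch): pick up the --model_dir argument the SDK passes in
# and use it as the estimator's checkpoint directory.
import argparse
import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument('--model_dir', type=str)
args, _ = parser.parse_known_args()
config = tf.estimator.RunConfig(model_dir=args.model_dir)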
