Tensorflow hangs after creating checkpoint #453

Closed

gdj0nes opened this issue Nov 1, 2018 · 4 comments

Comments

@gdj0nes

gdj0nes commented Nov 1, 2018

System Information

  • Framework: TensorFlow
  • Framework version: 1.10.0
  • Python version: py2
  • Instance type: CPU
  • Python SDK version: 1.12.0
  • Custom image: No

The model hangs after these logs finish. The CloudWatch metrics suggest that nothing is running on the machine.

Logs

WARNING:root:pandas failed to import. Analytics features will be impaired or broken.
INFO:sagemaker:Created S3 bucket: sagemaker-us-east-1-bucket
INFO:sagemaker:Creating training-job with name: model-*****
2018-11-01 14:54:31 Starting - Starting the training job.INFO:sagemaker:TensorBoard 0.1.7 at http://localhost:6007
..
Launching requested ML instances......
Preparing the instances for training.....
2018-11-01 14:56:43,692 INFO - root - running container entrypoint
2018-11-01 14:56:43,692 INFO - root - starting train task
2018-11-01 14:56:43,699 INFO - container_support.training - Training starting
Downloading s3://sagemaker-us-east-1-****/model-****/source/sourcedir.tar.gz to /tmp/script.tar.gz
2018-11-01 14:56:47,030 INFO - container_support.environment - current Python environment
2018-11-01 14:56:47,031 INFO - container_support.environment - installing requirements in /opt/ml/code/requirements.txt via pip
2018-11-01 14:56:48,366 INFO - tf_container - ----------------------TF_CONFIG--------------------------
2018-11-01 14:56:48,367 INFO - tf_container - {"environment": "cloud", "cluster": {"master": ["algo-1:2222"]}, "task": {"index": 0, "type": "master"}}
2018-11-01 14:56:48,367 INFO - tf_container - ---------------------------------------------------------
2018-11-01 14:56:48,367 INFO - tf_container - creating RunConfig:
2018-11-01 14:56:48,367 INFO - tf_container - {u'log_step_count_steps': 10, u'save_summary_steps': 10, 'save_checkpoints_secs': 300}
2018-11-01 14:56:48,367 INFO - tensorflow - TF_CONFIG environment variable: {u'environment': u'cloud', u'cluster': {u'master': [u'algo-1:2222']}, u'task': {u'index': 0, u'type': u'master'}}
2018-11-01 14:56:48,367 INFO - tf_container - creating an estimator from the user-provided model_fn
2018-11-01 14:56:48,368 INFO - tensorflow - Using config: {'_save_checkpoints_secs': 300, '_keep_checkpoint_max': 5, '_task_type': u'master', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc0d711e1d0>, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_device_fn': None, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 10, '_evaluation_master': '', '_eval_distribute': None, '_train_distribute': None, '_session_config': device_filters: "/job:ps"
device_filters: "/job:master"
allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_global_id_in_cluster': 0, '_is_chief': True, '_protocol': None, '_save_checkpoints_steps': None, '_experimental_distribute': None, '_save_summary_steps': 10, '_model_dir': u's3://sagemaker-us-east-1-***/model-***/checkpoints', '_master': ''}
2018-11-01 14:56:48,369 INFO - tensorflow - Skip starting Tensorflow server as there is only one node in the cluster.
2018-11-01 14:56:48.441981: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-11-01 14:56:48.442028: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
in the pipeline <PipeModeDataset shapes: (), types: tf.string>

2018-11-01 14:56:37 Downloading - Downloading input data
2018-11-01 14:56:41 Training - Training image download completed. Training in progress.2018-11-01 14:56:50,689 INFO - tensorflow - Calling model_fn.
2018-11-01 14:56:54,181 INFO - tensorflow - Done calling model_fn.
2018-11-01 14:56:54,182 INFO - tensorflow - Create CheckpointSaverHook.
2018-11-01 14:56:54.192907: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-11-01 14:56:54.192936: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-11-01 14:56:54.214335: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-11-01 14:56:54.214361: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-11-01 14:56:54.233231: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-11-01 14:56:54.233256: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-11-01 14:56:56,004 INFO - tensorflow - Graph was finalized.
2018-11-01 14:56:56.011930: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-11-01 14:56:56.011963: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-11-01 14:56:58,011 INFO - tensorflow - Running local_init_op.
2018-11-01 14:56:58,079 INFO - tensorflow - Done running local_init_op.
2018-11-01 14:56:58.799410: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-11-01 14:56:58.799448: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-11-01 14:57:01,866 INFO - tensorflow - Saving checkpoints for 0 into s3://sagemaker-us-east-1-****/****/checkpoints/model.ckpt.
2018-11-01 14:57:10.640247: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-11-01 14:57:10.640285: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(role=role,
                       framework_version='1.10.0',
                       train_volume_size=200,
                       input_mode='Pipe',
                       entry_point='entrypoint.py',
                       source_dir='model/estimator',
                       training_steps=10,
                       evaluation_steps=5,
                       train_instance_count=1,
                       train_instance_type='ml.m5.xlarge',
                       base_job_name='model-name',
                       requirements_file='requirements.txt',
                       **hyp_params)
estimator.fit({'train': 's3://bucket/prefix_or_file',
               'eval':  's3://bucket/prefix_or_file'},
              run_tensorboard_locally=True)

Secondary question: which channels are available, and how does the TensorFlow model use them? Is the channel supposed to be 'training' for both "train" and "eval"?

PipeModeDataset(channel='training', record_format='TFRecord')
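
My current guess (unverified, just a sketch of an assumption) is that the channel names come from the keys of the dict passed to estimator.fit(), so with the fit() call above the channels would be 'train' and 'eval' rather than the default 'training':

# Hypothetical sketch: channel names assumed to match the fit() dict keys above.
from sagemaker_tensorflow import PipeModeDataset

train_ds = PipeModeDataset(channel='train', record_format='TFRecord')
eval_ds = PipeModeDataset(channel='eval', record_format='TFRecord')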
@nadiaya
Contributor

nadiaya commented Nov 5, 2018

Those particular tensorflow/core/platform/s3 warnings and errors are unfortunately quite common, but usually do not affect training in any way.
Does the training job just hang and never fail?

apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this issue Nov 15, 2018
@laurenyu
Contributor

closing due to inactivity. feel free to reopen if necessary.

@gautiese

Hi! Did you solve this issue? I seem to be having the same problem.

@nikhila0912

nikhila0912 commented May 22, 2020

I'm facing the same issue while trying to launch a training job using script mode on a P-type instance.
I explicitly specified a model_dir path (an S3 location) for saving checkpoints, but it tries to save to the default location mentioned above (s3://sagemaker-us-east-1-//checkpoints/model.ckpt), and after saving the initial checkpoint the training crashes.

My piece of code:

from sagemaker.tensorflow import TensorFlow

hyperparams = {
    "model_dir": "s3://mlops-data/text_classification_bert/model",
    "bucket-name": "mlops-data",
    "data_dir": "data_input",
    "strategy": "None",
}

tf_estimator = TensorFlow(entry_point='task.py',
                          role=role,
                          source_dir='text_classification/trainer_module/trainer',
                          train_instance_count=1,
                          train_instance_type='ml.p3.2xlarge',
                          hyperparameters=hyperparams,
                          framework_version='1.14',
                          py_version='py3',
                          script_mode=True)

tf_estimator.fit()
Also, data_dir and bucket_name are arguments I have defined in my script.
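
One thing that may be worth checking (a sketch based on an assumption about the SDK, not verified here): the TensorFlow estimator also accepts a model_dir argument directly, and in script mode that value is passed to the training script as a --model_dir command-line argument. The script then has to read it and hand it to its RunConfig, otherwise TensorFlow falls back to the default checkpoint location:

# Hypothetical sketch: set model_dir on the estimator (role and source_dir as above)
# and read it inside task.py.
tf_estimator = TensorFlow(entry_point='task.py',
                          role=role,
                          source_dir='text_classification/trainer_module/trainer',
                          model_dir='s3://mlops-data/text_classification_bert/model',
                          train_instance_count=1,
                          train_instance_type='ml.p3.2xlarge',
                          framework_version='1.14',
                          py_version='py3',
                          script_mode=True)

# Inside task.py (sketch): pick up the --model_dir argument the SDK passes in
# and use it as the estimator's checkpoint directory.
import argparse
import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument('--model_dir', type=str)
args, _ = parser.parse_known_args()
config = tf.estimator.RunConfig(model_dir=args.model_dir)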
