TensorFlow #655

Closed
prithuraj opened this issue Feb 23, 2019 · 2 comments

Comments

@prithuraj

Please fill out the form below.

System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans):
  • Framework Version:
  • Python Version:
  • CPU or GPU:
  • Python SDK Version:
  • Are you using a custom image:

Describe the problem

Describe the problem or feature request clearly here.

Minimal repro / logs

from sagemaker.tensorflow import TensorFlow

iris_estimator = TensorFlow(entry_point='keras_input.py',
                            role=role,
                            framework_version='1.12.0',
                            output_path=model_artifacts_location,
                            code_location=custom_code_upload_location,
                            train_instance_count=1,
                            train_instance_type='ml.m5.24xlarge',
                            hyperparameters={'learning_rate': 0.001},
                            training_steps=100,
                            evaluation_steps=2)

%%time
import boto3

# s3://fc-uk-data/datalake/blue-tigers/saimadhu-test/model_sample_datasets/inputs/iris_data.csv

# use the region-specific sample data bucket
region = boto3.Session().region_name

# datalake/blue-tigers/green-spiders/data-analysis/modelling_data/xgboost_model_data_2018_oct_to_dec/train_data
# s3://sagemaker-sample-data-eu-west-1/tensorflow/iris

train_data = 'datalake/blue-tigers/green-spiders/data-analysis/modelling_data/xgboost_model_data_2018_oct_to_dec/train_data'
train_data_location = 's3://{}/{}'.format(bucket, train_data)

iris_estimator.fit(train_data_location,
                   job_name='bluetigers-sagemaker-keras-20190220-8',
                   run_tensorboard_locally=True)

  • Exact command to reproduce:

Error Message:

File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1770, in init
self._traceback = tf_stack.extract_stack()

AbortedError (see above for traceback): All 10 retry attempts failed. The last failure: Unknown: AccessDenied: Access Denied
#11 [[node save/MergeV2Checkpoints (defined at /usr/local/lib/python2.7/dist-packages/tf_container/trainer.py:73) = MergeV2Checkpoints[delete_old_dirs=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](save/MergeV2Checkpoints/checkpoint_prefixes, _arg_save/Const_0_1)]]

Graph was finalized.
Running local_init_op.
Done running local_init_op.
Saving checkpoints for 0 into s3://fc-uk-data/datalake/blue-tigers/saimadhu-test/model_sample_datasets/outputs/output_dir/bluetigers-sagemaker-keras-20190220-5/checkpoints/model.ckpt.

@mvsusp
Contributor

mvsusp commented Feb 26, 2019

Hi @prithuraj ,

Looking at the error logs, it seems that TF is failing to write its checkpoints to the S3 location s3://fc-uk-data/datalake/blue-tigers/saimadhu-test/model_sample_datasets/outputs/output_dir/bluetigers-sagemaker-keras-20190220-5/checkpoints/model.ckpt.

Please double-check that your role has permission to write to that bucket. Note as well that the bucket needs to include sagemaker in its name (https://docs.aws.amazon.com/sagemaker/latest/dg/gs-config-permissions.html).
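As a quick way to check the first point, a minimal sketch along these lines, run with the same role you pass to the estimator, should reproduce the AccessDenied outside of training. The bucket and prefix are copied from the error message above, and _permission_check is just a throwaway key:

# Quick check that the execution role can write under the checkpoint prefix.
# Bucket and prefix are taken from the error message above; adjust as needed.
import boto3
from botocore.exceptions import ClientError

bucket = 'fc-uk-data'
prefix = 'datalake/blue-tigers/saimadhu-test/model_sample_datasets/outputs/output_dir'

s3 = boto3.client('s3')
try:
    # Write and delete a small marker object to exercise s3:PutObject / s3:DeleteObject.
    s3.put_object(Bucket=bucket, Key=prefix + '/_permission_check', Body=b'')
    s3.delete_object(Bucket=bucket, Key=prefix + '/_permission_check')
    print('Write access to s3://{}/{} looks OK'.format(bucket, prefix))
except ClientError as e:
    # An AccessDenied here corresponds to the failure TF hits while saving checkpoints.
    print('S3 write check failed:', e.response['Error']['Code'])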

Alternatively, you can pass the argument model_dir='/opt/ml/model' to the TensorFlow estimator, which keeps the checkpoints inside the training container instead of streaming them to S3.
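For example, a sketch of that alternative with the same arguments as the repro above (role, model_artifacts_location, and custom_code_upload_location are assumed to be defined earlier in the notebook):

from sagemaker.tensorflow import TensorFlow

iris_estimator = TensorFlow(entry_point='keras_input.py',
                            role=role,
                            framework_version='1.12.0',
                            model_dir='/opt/ml/model',  # checkpoints stay local to the container
                            output_path=model_artifacts_location,
                            code_location=custom_code_upload_location,
                            train_instance_count=1,
                            train_instance_type='ml.m5.24xlarge',
                            hyperparameters={'learning_rate': 0.001},
                            training_steps=100,
                            evaluation_steps=2)

Anything written under /opt/ml/model is packaged into the model.tar.gz that SageMaker uploads to output_path at the end of the job, so the final artifacts still end up in S3.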

Thanks for using SageMaker!

@laurenyu
Contributor

Closing due to inactivity. Feel free to reopen if necessary!
