Training data not loaded from S3 when training locally #137

Closed
jsaedtler opened this issue Apr 10, 2018 · 3 comments · Fixed by #144

Comments

jsaedtler commented Apr 10, 2018

I want to run a TensorFlow estimator training job locally, with the training data located in an S3 bucket. Remote training works, but when I change the instance type to "local", the training fails.

The error message indicates a problem loading the training data, and when I inspect the container I can't see any files in /opt/ml/input/data/training/. Permissions seem to be fine: when I start a shell in the container, I can download the training data via:

aws s3 cp s3://BUCKET/PREFIX/training-data.json .

I have already deleted all related containers and images, but it didn't help. Where does the download happen? How can I debug it?
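A minimal boto3 check like the following (bucket and key are placeholders for the real names) at least confirms that the credentials available in the environment can read the object directly:

    # Sketch: verify the training object is readable with the same environment
    # credentials the container picks up (bucket/key below are placeholders).
    import boto3

    s3 = boto3.client('s3')
    head = s3.head_object(Bucket='XXX-training-data', Key='PREFIX/training-data.json')
    print('object size in bytes:', head['ContentLength'])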

Training code:

    TRAINING_DATA_BUCKET = 's3://XXX-training-data/PREFIX/'

    estimator = TensorFlow(
        entry_point='model_with_fx_keras.py',
        source_dir='v20180404/',
        role='SageMakerFullAccess',
        training_steps=50000,
        evaluation_steps=10,
        hyperparameters={'learning_rate': 1e-03},
        train_instance_count=1,
        train_instance_type='local',
        base_job_name='model-v20180404')

    estimator.fit(TRAINING_DATA_BUCKET)

Output:

INFO:sagemaker:Creating training-job with name: model-v20180404-2018-04-10-16-34-22-191
Creating tmpmlqw4e_algo-1-U6A07_1 ... done
Attaching to tmpmlqw4e_algo-1-U6A07_1
algo-1-U6A07_1  | 2018-04-10 16:34:37,279 INFO - root - running container entrypoint
algo-1-U6A07_1  | 2018-04-10 16:34:37,280 INFO - root - starting train task
algo-1-U6A07_1  | 2018-04-10 16:34:37,292 INFO - container_support.training - Training starting
algo-1-U6A07_1  | /usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
algo-1-U6A07_1  |   from ._conv import register_converters as _register_converters
algo-1-U6A07_1  | 2018-04-10 16:34:37,884 INFO - botocore.credentials - Found credentials in environment variables.
algo-1-U6A07_1  | Downloading s3://XXX/model-v20180404-2018-04-10-16-34-22-191/source/sourcedir.tar.gz to /tmp/script.tar.gz
algo-1-U6A07_1  | 2018-04-10 16:34:37,965 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): XXX.s3.amazonaws.com
algo-1-U6A07_1  | 2018-04-10 16:34:38,118 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (2): XXX.s3.amazonaws.com
algo-1-U6A07_1  | 2018-04-10 16:34:38,259 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): XXX.s3.eu-west-1.amazonaws.com
algo-1-U6A07_1  | 2018-04-10 16:34:38,413 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (2): XXX.s3.eu-west-1.amazonaws.com
algo-1-U6A07_1  | 2018-04-10 16:34:38,656 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): s3.amazonaws.com
algo-1-U6A07_1  | 2018-04-10 16:34:39,176 INFO - tf_container - ----------------------TF_CONFIG--------------------------
algo-1-U6A07_1  | 2018-04-10 16:34:39,176 INFO - tf_container - {"environment": "cloud", "cluster": {"master": ["algo-1-U6A07:2222"]}, "task": {"index": 0, "type": "master"}}
algo-1-U6A07_1  | 2018-04-10 16:34:39,176 INFO - tf_container - ---------------------------------------------------------
algo-1-U6A07_1  | 2018-04-10 16:34:39,176 INFO - tf_container - creating RunConfig:
algo-1-U6A07_1  | 2018-04-10 16:34:39,177 INFO - tf_container - {'save_checkpoints_secs': 300}
algo-1-U6A07_1  | 2018-04-10 16:34:39,177 INFO - tensorflow - TF_CONFIG environment variable: {u'environment': u'cloud', u'cluster': {u'master': [u'algo-1-U6A07:2222']}, u'task': {u'index': 0, u'type': u'master'}}
algo-1-U6A07_1  | 2018-04-10 16:34:39,177 INFO - tf_container - creating the estimator
algo-1-U6A07_1  | 2018-04-10 16:34:39,178 INFO - tensorflow - Using config: {'_save_checkpoints_secs': 300, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': None, '_task_type': u'master', '_global_id_in_cluster': 0, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f945c20d190>, '_model_dir': u's3://XXX/model-v20180404-2018-04-10-16-34-22-191/checkpoints', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_service': None, '_save_summary_steps': 100, '_num_ps_replicas': 0}
algo-1-U6A07_1  | 2018-04-10 16:34:39,179 INFO - tensorflow - Skip starting Tensorflow server as there is only one node in the cluster.
algo-1-U6A07_1  | 2018-04-10 16:34:39.179845: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing config loader against fileName /root//.aws/config and using profilePrefix = 1
algo-1-U6A07_1  | 2018-04-10 16:34:39.179924: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing config loader against fileName /root//.aws/credentials and using profilePrefix = 0
algo-1-U6A07_1  | 2018-04-10 16:34:39.180037: I tensorflow/core/platform/s3/aws_logging.cc:54] Setting provider to read credentials from /root//.aws/credentials for credentials file and /root//.aws/config for the config file , for use with profile default
algo-1-U6A07_1  | 2018-04-10 16:34:39.180075: I tensorflow/core/platform/s3/aws_logging.cc:54] Creating HttpClient with max connections2 and scheme http
algo-1-U6A07_1  | 2018-04-10 16:34:39.180091: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing CurlHandleContainer with size 2
algo-1-U6A07_1  | 2018-04-10 16:34:39.180102: I tensorflow/core/platform/s3/aws_logging.cc:54] Creating Instance with default EC2MetadataClient and refresh rate 900000
algo-1-U6A07_1  | 2018-04-10 16:34:39.180117: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
algo-1-U6A07_1  | 2018-04-10 16:34:39.180176: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing CurlHandleContainer with size 25
algo-1-U6A07_1  | 2018-04-10 16:34:39.180238: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
algo-1-U6A07_1  | 2018-04-10 16:34:39.180408: I tensorflow/core/platform/s3/aws_logging.cc:54] Pool grown by 2
algo-1-U6A07_1  | 2018-04-10 16:34:39.180543: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
algo-1-U6A07_1  | 2018-04-10 16:34:39.332669: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
algo-1-U6A07_1  | 2018-04-10 16:34:39.332829: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
algo-1-U6A07_1  | 2018-04-10 16:34:39.332919: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
algo-1-U6A07_1  | 2018-04-10 16:34:39.333024: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
algo-1-U6A07_1  | -------------- debug -----------/opt/ml/input/data/training/training-data.json
algo-1-U6A07_1  | /opt/ml/input/data/training
algo-1-U6A07_1  | training-data.json
algo-1-U6A07_1  | 2018-04-10 16:34:39,393 ERROR - container_support.training - uncaught exception during training: Expected object or value
algo-1-U6A07_1  | Traceback (most recent call last):
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/container_support/training.py", line 38, in start
algo-1-U6A07_1  |     fw.train()
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/tf_container/train.py", line 139, in train
algo-1-U6A07_1  |     train_wrapper.train()
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/tf_container/trainer.py", line 73, in train
algo-1-U6A07_1  |     tf.estimator.train_and_evaluate(estimator=estimator, train_spec=train_spec, eval_spec=eval_spec)
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 421, in train_and_evaluate
algo-1-U6A07_1  |     executor.run()
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 522, in run
algo-1-U6A07_1  |     getattr(self, task_to_run)()
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 577, in run_master
algo-1-U6A07_1  |     self._start_distributed_training(saving_listeners=saving_listeners)
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 715, in _start_distributed_training
algo-1-U6A07_1  |     saving_listeners=saving_listeners)
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 352, in train
algo-1-U6A07_1  |     loss = self._train_model(input_fn, hooks, saving_listeners)
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 809, in _train_model
algo-1-U6A07_1  |     input_fn, model_fn_lib.ModeKeys.TRAIN))
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 668, in _get_features_and_labels_from_input_fn
algo-1-U6A07_1  |     result = self._call_input_fn(input_fn, mode)
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 760, in _call_input_fn
algo-1-U6A07_1  |     return input_fn(**kwargs)
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/tf_container/trainer.py", line 112, in <lambda>
algo-1-U6A07_1  |     train_input_fn = lambda: self.customer_script.train_input_fn(**invoke_args)
algo-1-U6A07_1  |   File "/opt/ml/code/model_with_fx_keras.py", line 79, in train_input_fn
algo-1-U6A07_1  |     return _input_fn(training_dir, 'training-data.json', input_tensor_name = INPUT_TENSOR_NAME)
algo-1-U6A07_1  |   File "/opt/ml/code/model_with_fx_keras.py", line 123, in _input_fn
algo-1-U6A07_1  |     X, y = _load_data(training_dir, training_filename)
algo-1-U6A07_1  |   File "/opt/ml/code/model_with_fx_keras.py", line 103, in _load_data
algo-1-U6A07_1  |     df = pd.read_json(file, orient='records')
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/pandas/io/json/json.py", line 366, in read_json
algo-1-U6A07_1  |     return json_reader.read()
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/pandas/io/json/json.py", line 467, in read
algo-1-U6A07_1  |     obj = self._get_object_parser(self.data)
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/pandas/io/json/json.py", line 484, in _get_object_parser
algo-1-U6A07_1  |     obj = FrameParser(json, **kwargs).parse()
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/pandas/io/json/json.py", line 576, in parse
algo-1-U6A07_1  |     self._parse_no_numpy()
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/pandas/io/json/json.py", line 806, in _parse_no_numpy
algo-1-U6A07_1  |     loads(json, precise_float=self.precise_float), dtype=None)
algo-1-U6A07_1  | ValueError: Expected object or value

professoroakz commented Apr 12, 2018

Hey @jsaedtler, a separate question: how are you able to run this locally? I haven't been able to so far. When I specify train_instance_type='local', I get this error:

ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: 1 validation error detected: Value 'local' at 'resourceConfig.instanceType' failed to satisfy constraint: Member must satisfy enum value set: [ml.p2.xlarge, ml.m5.4xlarge, ml.m4.16xlarge, ml.p3.16xlarge, ml.m5.large, ml.p2.16xlarge, ml.c4.2xlarge, ml.c5.2xlarge, ml.c4.4xlarge, ml.c5.4xlarge, ml.c4.8xlarge, ml.c5.9xlarge, ml.c5.xlarge, ml.c4.xlarge, ml.c5.18xlarge, ml.p3.2xlarge, ml.m5.xlarge, ml.m4.10xlarge, ml.m5.12xlarge, ml.m4.xlarge, ml.m5.24xlarge, ml.m4.2xlarge, ml.p2.8xlarge, ml.m5.2xlarge, ml.p3.8xlarge, ml.m4.4xlarge]

iquintero (Contributor) commented

@OktayGardener I think you may be running an old version of the SDK. Can you please update to the latest? I will reply soon in the issue you created.
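For reference, one quick way to check which SDK version is installed:

    # Print the installed version of the sagemaker Python SDK
    import pkg_resources
    print(pkg_resources.get_distribution('sagemaker').version)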

@jsaedtler I have confirmed there is a bug here when your data is not in the default bucket 👎 I am working on a fix and will submit a PR today.
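Until the fix is merged, a possible workaround might be to stage the training data in the default bucket first, roughly along these lines (untested sketch; the file name and key prefix are placeholders, and estimator is the one from the snippet above):

    # Workaround sketch: upload the data to the SDK's default bucket via
    # Session.upload_data, then point fit() at the returned S3 URI.
    import sagemaker

    session = sagemaker.Session()
    inputs = session.upload_data(path='training-data.json', key_prefix='PREFIX')
    estimator.fit(inputs)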

apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this issue Nov 15, 2018
Update linear_time_series_forecast.ipynb
suryadeepti commented

This looks like a bucket location (region) error. It can be solved by using a bucket that already lives in your region, or by creating a new bucket in your region, as sketched below.
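For example, a bucket in the training job's region (eu-west-1, going by the endpoints in the log above) could be created with boto3 along these lines (the bucket name is a placeholder):

    # Sketch: create a bucket in the same region as the training job.
    # The bucket name is a placeholder; eu-west-1 matches the log endpoints.
    import boto3

    s3 = boto3.client('s3', region_name='eu-west-1')
    s3.create_bucket(
        Bucket='XXX-training-data-eu-west-1',
        CreateBucketConfiguration={'LocationConstraint': 'eu-west-1'},
    )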
