Training data not loaded from S3 when training locally #137

Closed
jsaedtler opened this issue Apr 10, 2018 · 3 comments · Fixed by #144

Comments

jsaedtler commented Apr 10, 2018

I want to run a TensorFlow estimator training job locally, with the training data located in an S3 bucket. Remote training works, but when I change the instance type to "local", the training fails.

The error message indicates a problem loading the training data, and when I inspect the container I can't see any files in /opt/ml/input/data/training/. Permissions seem to be fine: when I start a shell in the container, I can download the training data via:

aws s3 cp s3://BUCKET/PREFIX/training-data.json .

I have already deleted all related containers and images, but it didn't help. Where does the download happen? How can I debug it?
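A minimal boto3 check like the following (bucket and key are placeholders for the real names) at least confirms that the credentials available in the environment can read the object directly:

    # Sketch: verify the training object is readable with the same environment
    # credentials the container picks up (bucket/key below are placeholders).
    import boto3

    s3 = boto3.client('s3')
    head = s3.head_object(Bucket='XXX-training-data', Key='PREFIX/training-data.json')
    print('object size in bytes:', head['ContentLength'])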

Training code:

    TRAINING_DATA_BUCKET = 's3://XXX-training-data/PREFIX/'

    estimator = TensorFlow(
        entry_point='model_with_fx_keras.py',
        source_dir='v20180404/',
        role='SageMakerFullAccess',
        training_steps=50000,
        evaluation_steps=10,
        hyperparameters={'learning_rate': 1e-03},
        train_instance_count=1,
        train_instance_type='local',
        base_job_name='model-v20180404')

    estimator.fit(TRAINING_DATA_BUCKET)

Output:

INFO:sagemaker:Creating training-job with name: model-v20180404-2018-04-10-16-34-22-191
Creating tmpmlqw4e_algo-1-U6A07_1 ... done
Attaching to tmpmlqw4e_algo-1-U6A07_1
algo-1-U6A07_1  | 2018-04-10 16:34:37,279 INFO - root - running container entrypoint
algo-1-U6A07_1  | 2018-04-10 16:34:37,280 INFO - root - starting train task
algo-1-U6A07_1  | 2018-04-10 16:34:37,292 INFO - container_support.training - Training starting
algo-1-U6A07_1  | /usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
algo-1-U6A07_1  |   from ._conv import register_converters as _register_converters
algo-1-U6A07_1  | 2018-04-10 16:34:37,884 INFO - botocore.credentials - Found credentials in environment variables.
algo-1-U6A07_1  | Downloading s3://XXX/model-v20180404-2018-04-10-16-34-22-191/source/sourcedir.tar.gz to /tmp/script.tar.gz
algo-1-U6A07_1  | 2018-04-10 16:34:37,965 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): XXX.s3.amazonaws.com
algo-1-U6A07_1  | 2018-04-10 16:34:38,118 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (2): XXX.s3.amazonaws.com
algo-1-U6A07_1  | 2018-04-10 16:34:38,259 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): XXX.s3.eu-west-1.amazonaws.com
algo-1-U6A07_1  | 2018-04-10 16:34:38,413 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (2): XXX.s3.eu-west-1.amazonaws.com
algo-1-U6A07_1  | 2018-04-10 16:34:38,656 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): s3.amazonaws.com
algo-1-U6A07_1  | 2018-04-10 16:34:39,176 INFO - tf_container - ----------------------TF_CONFIG--------------------------
algo-1-U6A07_1  | 2018-04-10 16:34:39,176 INFO - tf_container - {"environment": "cloud", "cluster": {"master": ["algo-1-U6A07:2222"]}, "task": {"index": 0, "type": "master"}}
algo-1-U6A07_1  | 2018-04-10 16:34:39,176 INFO - tf_container - ---------------------------------------------------------
algo-1-U6A07_1  | 2018-04-10 16:34:39,176 INFO - tf_container - creating RunConfig:
algo-1-U6A07_1  | 2018-04-10 16:34:39,177 INFO - tf_container - {'save_checkpoints_secs': 300}
algo-1-U6A07_1  | 2018-04-10 16:34:39,177 INFO - tensorflow - TF_CONFIG environment variable: {u'environment': u'cloud', u'cluster': {u'master': [u'algo-1-U6A07:2222']}, u'task': {u'index': 0, u'type': u'master'}}
algo-1-U6A07_1  | 2018-04-10 16:34:39,177 INFO - tf_container - creating the estimator
algo-1-U6A07_1  | 2018-04-10 16:34:39,178 INFO - tensorflow - Using config: {'_save_checkpoints_secs': 300, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': None, '_task_type': u'master', '_global_id_in_cluster': 0, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f945c20d190>, '_model_dir': u's3://XXX/model-v20180404-2018-04-10-16-34-22-191/checkpoints', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_service': None, '_save_summary_steps': 100, '_num_ps_replicas': 0}
algo-1-U6A07_1  | 2018-04-10 16:34:39,179 INFO - tensorflow - Skip starting Tensorflow server as there is only one node in the cluster.
algo-1-U6A07_1  | 2018-04-10 16:34:39.179845: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing config loader against fileName /root//.aws/config and using profilePrefix = 1
algo-1-U6A07_1  | 2018-04-10 16:34:39.179924: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing config loader against fileName /root//.aws/credentials and using profilePrefix = 0
algo-1-U6A07_1  | 2018-04-10 16:34:39.180037: I tensorflow/core/platform/s3/aws_logging.cc:54] Setting provider to read credentials from /root//.aws/credentials for credentials file and /root//.aws/config for the config file , for use with profile default
algo-1-U6A07_1  | 2018-04-10 16:34:39.180075: I tensorflow/core/platform/s3/aws_logging.cc:54] Creating HttpClient with max connections2 and scheme http
algo-1-U6A07_1  | 2018-04-10 16:34:39.180091: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing CurlHandleContainer with size 2
algo-1-U6A07_1  | 2018-04-10 16:34:39.180102: I tensorflow/core/platform/s3/aws_logging.cc:54] Creating Instance with default EC2MetadataClient and refresh rate 900000
algo-1-U6A07_1  | 2018-04-10 16:34:39.180117: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
algo-1-U6A07_1  | 2018-04-10 16:34:39.180176: I tensorflow/core/platform/s3/aws_logging.cc:54] Initializing CurlHandleContainer with size 25
algo-1-U6A07_1  | 2018-04-10 16:34:39.180238: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
algo-1-U6A07_1  | 2018-04-10 16:34:39.180408: I tensorflow/core/platform/s3/aws_logging.cc:54] Pool grown by 2
algo-1-U6A07_1  | 2018-04-10 16:34:39.180543: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
algo-1-U6A07_1  | 2018-04-10 16:34:39.332669: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
algo-1-U6A07_1  | 2018-04-10 16:34:39.332829: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
algo-1-U6A07_1  | 2018-04-10 16:34:39.332919: I tensorflow/core/platform/s3/aws_logging.cc:54] Found secret key
algo-1-U6A07_1  | 2018-04-10 16:34:39.333024: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
algo-1-U6A07_1  | -------------- debug -----------/opt/ml/input/data/training/training-data.json
algo-1-U6A07_1  | /opt/ml/input/data/training
algo-1-U6A07_1  | training-data.json
algo-1-U6A07_1  | 2018-04-10 16:34:39,393 ERROR - container_support.training - uncaught exception during training: Expected object or value
algo-1-U6A07_1  | Traceback (most recent call last):
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/container_support/training.py", line 38, in start
algo-1-U6A07_1  |     fw.train()
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/tf_container/train.py", line 139, in train
algo-1-U6A07_1  |     train_wrapper.train()
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/tf_container/trainer.py", line 73, in train
algo-1-U6A07_1  |     tf.estimator.train_and_evaluate(estimator=estimator, train_spec=train_spec, eval_spec=eval_spec)
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 421, in train_and_evaluate
algo-1-U6A07_1  |     executor.run()
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 522, in run
algo-1-U6A07_1  |     getattr(self, task_to_run)()
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 577, in run_master
algo-1-U6A07_1  |     self._start_distributed_training(saving_listeners=saving_listeners)
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 715, in _start_distributed_training
algo-1-U6A07_1  |     saving_listeners=saving_listeners)
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 352, in train
algo-1-U6A07_1  |     loss = self._train_model(input_fn, hooks, saving_listeners)
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 809, in _train_model
algo-1-U6A07_1  |     input_fn, model_fn_lib.ModeKeys.TRAIN))
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 668, in _get_features_and_labels_from_input_fn
algo-1-U6A07_1  |     result = self._call_input_fn(input_fn, mode)
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 760, in _call_input_fn
algo-1-U6A07_1  |     return input_fn(**kwargs)
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/tf_container/trainer.py", line 112, in <lambda>
algo-1-U6A07_1  |     train_input_fn = lambda: self.customer_script.train_input_fn(**invoke_args)
algo-1-U6A07_1  |   File "/opt/ml/code/model_with_fx_keras.py", line 79, in train_input_fn
algo-1-U6A07_1  |     return _input_fn(training_dir, 'training-data.json', input_tensor_name = INPUT_TENSOR_NAME)
algo-1-U6A07_1  |   File "/opt/ml/code/model_with_fx_keras.py", line 123, in _input_fn
algo-1-U6A07_1  |     X, y = _load_data(training_dir, training_filename)
algo-1-U6A07_1  |   File "/opt/ml/code/model_with_fx_keras.py", line 103, in _load_data
algo-1-U6A07_1  |     df = pd.read_json(file, orient='records')
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/pandas/io/json/json.py", line 366, in read_json
algo-1-U6A07_1  |     return json_reader.read()
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/pandas/io/json/json.py", line 467, in read
algo-1-U6A07_1  |     obj = self._get_object_parser(self.data)
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/pandas/io/json/json.py", line 484, in _get_object_parser
algo-1-U6A07_1  |     obj = FrameParser(json, **kwargs).parse()
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/pandas/io/json/json.py", line 576, in parse
algo-1-U6A07_1  |     self._parse_no_numpy()
algo-1-U6A07_1  |   File "/usr/local/lib/python2.7/dist-packages/pandas/io/json/json.py", line 806, in _parse_no_numpy
algo-1-U6A07_1  |     loads(json, precise_float=self.precise_float), dtype=None)
algo-1-U6A07_1  | ValueError: Expected object or value

professoroakz commented Apr 12, 2018

Hey @jsaedtler, a separate question: how are you able to run this locally? I haven't been able to so far. When I specify train_instance_type='local', I get this error:

ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: 1 validation error detected: Value 'local' at 'resourceConfig.instanceType' failed to satisfy constraint: Member must satisfy enum value set: [ml.p2.xlarge, ml.m5.4xlarge, ml.m4.16xlarge, ml.p3.16xlarge, ml.m5.large, ml.p2.16xlarge, ml.c4.2xlarge, ml.c5.2xlarge, ml.c4.4xlarge, ml.c5.4xlarge, ml.c4.8xlarge, ml.c5.9xlarge, ml.c5.xlarge, ml.c4.xlarge, ml.c5.18xlarge, ml.p3.2xlarge, ml.m5.xlarge, ml.m4.10xlarge, ml.m5.12xlarge, ml.m4.xlarge, ml.m5.24xlarge, ml.m4.2xlarge, ml.p2.8xlarge, ml.m5.2xlarge, ml.p3.8xlarge, ml.m4.4xlarge]

iquintero (Contributor) commented

@OktayGardener I think you may be running an old version of the SDK. Can you please update to the latest? I will reply soon in the issue you created.
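For reference, one quick way to check which SDK version is installed:

    # Print the installed version of the sagemaker Python SDK
    import pkg_resources
    print(pkg_resources.get_distribution('sagemaker').version)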

@jsaedtler I have confirmed there is a bug here when your data is not in the default bucket 👎 I am working on a fix and will submit a PR today.
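Until the fix is merged, a possible workaround might be to stage the training data in the default bucket first, roughly along these lines (untested sketch; the file name and key prefix are placeholders, and estimator is the one from the snippet above):

    # Workaround sketch: upload the data to the SDK's default bucket via
    # Session.upload_data, then point fit() at the returned S3 URI.
    import sagemaker

    session = sagemaker.Session()
    inputs = session.upload_data(path='training-data.json', key_prefix='PREFIX')
    estimator.fit(inputs)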

apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this issue Nov 15, 2018
Update linear_time_series_forecast.ipynb
suryadeepti commented

This looks like a bucket location (region) error. It can be solved by using a bucket that already lives in your region, or by creating a new bucket in your region, as sketched below.
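For example, a bucket in the training job's region (eu-west-1, going by the endpoints in the log above) could be created with boto3 along these lines (the bucket name is a placeholder):

    # Sketch: create a bucket in the same region as the training job.
    # The bucket name is a placeholder; eu-west-1 matches the log endpoints.
    import boto3

    s3 = boto3.client('s3', region_name='eu-west-1')
    s3.create_bucket(
        Bucket='XXX-training-data-eu-west-1',
        CreateBucketConfiguration={'LocationConstraint': 'eu-west-1'},
    )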
