Unable to invoke SageMaker API endpoint - Vague Error: KeyError: u'' #413

tlelson · 2018-10-01T03:59:03Z

This seems very likely conected to Issues: #99, #269 and #100.

System Information

Tensorflow / DNNClassifier
TensorFlow v 1.10, sagemaker==1.11.0
conda_tensorflow_p27 on hosted notebook

Describe the problem

I'm running my own Iris classifier comparing Sagemaker to Google MLEngine. I have taken standard TensorFlow code known to work, that deploys and predicts from MLE and repeated the steps in Sagemaker. Everything goes as expected up until I invoke the endpoint. At this stage I receive the following errors:

Minimal repro / logs

Notebook Error

ModelErrorTraceback (most recent call last)
<ipython-input-92-07d9ea9b1239> in <module>()
      2 # iris_predictor.predict({u'' : list(sample0.values)})
      3 # iris_predictor.predict({ u'instances': [dict(sample0)]})
----> 4 iris_predictor.predict({ u'': [dict(sample0)]})

/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/predictor.pyc in predict(self, data)
     84             request_args['Accept'] = self.accept
     85 
---> 86         response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
     87 
     88         response_body = response['Body']

/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/botocore/client.pyc in _api_call(self, *args, **kwargs)
    312                     "%s() only accepts keyword arguments." % py_operation_name)
    313             # The "self" in this scope is referring to the BaseClient.
--> 314             return self._make_api_call(operation_name, kwargs)
    315 
    316         _api_call.__name__ = str(py_operation_name)

/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/botocore/client.pyc in _make_api_call(self, operation_name, api_params)
    610             error_code = parsed_response.get("Error", {}).get("Code")
    611             error_class = self.exceptions.from_code(error_code)
--> 612             raise error_class(parsed_response, operation_name)
    613         else:
    614             return parsed_response

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "". See https://ap-southeast-2.console.aws.amazon.com/cloudwatch/home?region=ap-southeast-2#logEventViewer:group=/aws/sagemaker/Endpoints/sagemaker-tensorflow-2018-10-01-02-17-54-088 in account XXXXX for more information.

Cloudwatch Errors:

...
10.32.0.2 - - [01/Oct/2018:03:23:33 +0000] "GET /ping HTTP/1.1" 200 0 "-" "AHC/2.0"
[2018-10-01 03:23:37,290] ERROR in serving: u''
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/container_support/serving.py", line 182, in _invoke
self.transformer.transform(content, input_content_type, requested_output_content_type)
File "/usr/local/lib/python2.7/dist-packages/tf_container/serve.py", line 281, in transform
return self.transform_fn(data, content_type, accepts), accepts
File "/usr/local/lib/python2.7/dist-packages/tf_container/serve.py", line 208, in f
prediction = self.predict_fn(input)
File "/usr/local/lib/python2.7/dist-packages/tf_container/serve.py", line 223, in predict_fn
return self.proxy_client.request(data)
File "/usr/local/lib/python2.7/dist-packages/tf_container/proxy_client.py", line 66, in request
request_fn = self.request_fn_map[self.prediction_type]
KeyError: u''
[2018-10-01 03:23:37,290] ERROR in serving: u''
10.32.0.2 - - [01/Oct/2018:03:23:37 +0000] "POST /invocations HTTP/1.1" 500 0 "-" "AHC/2.0"
...

Reproducing the Error

The full model definition passed to Sagemaker for training and evaluation is here

The notebook currently produces the error.

Create the estimator with

estimator = TensorFlow(entry_point='../package/trainer/model.py',
                       role=execution_role,
                       framework_version='1.10',
                       output_path=model_artifacts_location,
                       code_location=custom_code_upload_location,
                       train_instance_count=1,
                       training_steps=100,
                       evaluation_steps=10,
                       train_instance_type='ml.c4.xlarge')

# Fit
estimator.fit(s3_data)

# Deploy
iris_predictor = estimator.deploy(initial_instance_count=1,
                                  instance_type='ml.t2.medium')

test0 = {'PetalLength': 1.7, 'PetalWidth': 0.5, 'SepalLength': 5.1, 'SepalWidth': 3.3}

# Predict (ERROR)
# iris_predictor.predict(test0) 
# iris_predictor.predict([test0]) 
# iris_predictor.predict({ u'instances': [test0]}) 
iris_predictor.predict({ u'': [test0]})

... all produce the vague error.

Thanks for your help.

The text was updated successfully, but these errors were encountered:

ChoiByungWook · 2018-10-04T01:13:03Z

Hello @eL0ck,

Thank you for bringing this to our attention. I'll begin by attempting to reproduce this error.

tlelson · 2018-10-04T07:08:47Z

Hi @ChoiByungWook ,
I'm happy to take a look at fixing this. Are you are able to approve PRs on this repo ?

P.S given that the issue is likely just in the way I am passing the test instance data; if you or anyone else can suggest a way to call predict that works, that would be much appreciated.

ChoiByungWook · 2018-10-04T20:24:13Z

Hello @eL0ck,

Yes I can approve PRs. Please feel free to submit a PR.

Our TensorFlow container that your DNNClassifier is running is open sourced. Here is the line where it failed, maybe we can debug this together.

Is your model expecting a tensor with a blank label?

Also, there is a chance that maybe the serialization of the array is failing and might have to be a list? Perhaps this PR solves this? #404

tlelson · 2018-10-08T05:45:47Z

Thanks @ChoiByungWook, I've pulled those updates to the serialisation and reproduced the issue. But as you mentioned, the problem lies with the sagemaker container.

I have started developing this locally but have found it to be very difficult to work on. I've build the latest version as per the instructions and have attempted to run sagemaker.tensorflow.TensorFlow locally using the MNIST example.

from sagemaker.tensorflow import TensorFlow

mnist_estimator = TensorFlow(entry_point='mnist.py',
                             role=role,
                             framework_version='1.10.0',
                             training_steps=10, 
                             evaluation_steps=10,
                             train_instance_count=2,
                             train_instance_type='local',
                             image_name='my-sm-tensorflow:1.10.0-cpu-py2',
                            )

# mnist_estimator.fit(inputs) 
local_inputs = 'file://{}/data/'.format(os.getcwd())
mnist_estimator.fit(local_inputs)

I have built the container with an ~/.aws/credentials file that has keys for a user that can assume the Sagemaker Execution Role, however when running the fit it the container fails to get the model source from s3. I believe the local container is not assuming the Execution role.

Are you aware of any documentation of how to work on the sagemaker container locally? By that I mean, without having to push every change to ECR.

ChoiByungWook · 2018-10-08T18:47:00Z

Can you please check to see what policies are attached to your role? Does it have FullSageMakerAccess? The AWS console should be able to show you.

You shouldn't have to push the image to ECR to test locally. As long as you have docker-compose installed on your local machine and the image-name matches the name of the image on your machine it should be able to execute.

tlelson · 2018-10-09T01:42:46Z

Thanks @ChoiByungWook.

The role is fine. I can assume it myself from the command line and have verified that the role has access to the specific s3 artefact that causes the container run code to fail. The role given to the TensorFlow object is not being assumed by my local container.

The reason I would like to see the recommended method for developing container code is that it might make it more clear how the container assumes the role. There are a number of possibilities. For instance, the sagemaker container might be performing sts:AssumeRole in which case the role it starts up under must have permissions to perform that action. Another option might be that the docker container comes up with environment variables of the parent process, populated by something like:

docker \
....
-e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
-e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
-e AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN \
...
<dockerimage:tag>

My point here is that the mechanics of it is all quite unclear to a potential contributor like myself. As such it is difficult to debug elementary issues such as these.

I have raised an issue in the container repo.

ChoiByungWook · 2018-10-09T19:42:37Z

I apologize for the experience. We will make sure to update the docs to allow for a better experience.

icywang86rui · 2018-10-10T17:02:34Z

The role used in the container is actually not the one passed in to the TensorFlow estimator. It's using the one set on your box in ~/.aws/credentials. As you can see here: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/local/image.py#L392 localmode is using the credentials from the boto session which by default uses the credentials set locally.

@eL0ck Are you still experiencing the same problem? Please double check the policy attached to the role/user in ~/.aws/credentials.

In the meantime I will update our documentation to make this more clear.

tlelson · 2018-10-12T00:57:00Z

Excellent @icywang86rui !

This answers my question above about how the credentials are being provided. Effectively the container is run with environment variables equivalent to using -e flags. This also explains why I could not find any sts::AssumeRole operations in the container code. I assume then that when the container runs in Sagemaker, that behind the scenes you are creating an ECS task Definition with a 'Task Role' matching the role provided by the TensorFlow constructor??

Has this been something discussed there? I think there would be a reasonable argument to say the container should run under the same role in local testing as it does at full-scale in the cloud.

icywang86rui · 2018-10-15T18:03:42Z

@eL0ck I think when you run a training job in Sagemaker, the training services uses EC2MetadataService to deploy credentials of the role you specified in the Estimator constructor. We can't really do that locally. So if you want the permissions to match you will have to create a user with the same permission of the role you are running sagemaker jobs with. I hope that answers your question.

Add batch transform to image-classification notebook - Part 2

tlelson · 2018-11-20T13:33:46Z

WIP

So it looks like because the container is built with tensorflow-serving-api=1.7.0 (current version: 1.12.0) the export can't be configured.

...
algo-1-X9QR3_1  | 2018-11-20 13:19:51,972 INFO - tensorflow - Signatures EXCLUDED from export because they cannot be be served via TensorFlow Serving APIs:
algo-1-X9QR3_1  | 2018-11-20 13:19:51,972 INFO - tensorflow - 'serving_default' : Classification input must be a single string Tensor; got {'SepalLength': <tf.Tensor 'SepalLength:0' shape=(?,) dtype=float32>, 'PetalLength': <tf.Tensor 'PetalLength:0' shape=(?,) dtype=float32>, 'PetalWidth': <tf.Tensor 'PetalWidth:0' shape=(?,) dtype=float32>, 'SepalWidth': <tf.Tensor 'SepalWidth:0' shape=(?,) dtype=float32>}
algo-1-X9QR3_1  | 2018-11-20 13:19:51,973 INFO - tensorflow - 'classification' : Classification input must be a single string Tensor; got {'SepalLength': <tf.Tensor 'SepalLength:0' shape=(?,) dtype=float32>, 'PetalLength': <tf.Tensor 'PetalLength:0' shape=(?,) dtype=float32>, 'PetalWidth': <tf.Tensor 'PetalWidth:0' shape=(?,) dtype=float32>, 'SepalWidth': <tf.Tensor 'SepalWidth:0' shape=(?,) dtype=float32>}
algo-1-X9QR3_1  | 2018-11-20 13:19:51,974 WARNING - tensorflow - Export includes no default signature!
...

Now I understand why all the examples have the arbitary constant INPUT_TENSOR_NAME = 'inputs' in them.

Thoughts:

Perhaps this should not be an INFO but maybe an ERROR or at least a WARNING
Maybe its time to update the version of tensorflow-serving-api

tensorflow/tensorflow#22901

from sagemaker-tensorflow-container: docker/1.10.0/final/py2/Dockerfile.cpu

...
d442ac7e (yangaws           2018-09-06 10:07:48 -0700)|   53 # TODO: upgrade to tf serving 1.8, which requires more work with updating
d442ac7e (yangaws           2018-09-06 10:07:48 -0700)|   54 # dependencies. See current work in progress in tfserving-1.8 branch.
d442ac7e (yangaws           2018-09-06 10:07:48 -0700)|   55 ENV TF_SERVING_VERSION=1.7.0
...

However there is no such branch.

Will try upgrade it now. @yangaws do you remember what happened to that branch?

zjost · 2019-01-17T21:21:10Z

Any updates on this? Are there any easy work-arounds?

mvsusp · 2019-03-08T18:07:45Z

Hi @eL0ck and @zjost,

You can use the TensorFlow Serving Container. In this container, you can use the TFS REST API to make predictions against the container: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst#making-predictions-against-a-sagemaker-endpoint. That allows you to make predictions the same way that you would do outside SageMaker:

iris_predictor.predict({ 'instances': [test0]})

laurenyu · 2020-05-21T15:43:45Z

closing due to lack of activity (and because we no longer releases fixes for TensorFlow "legacy mode")

tlelson mentioned this issue Oct 5, 2018

Using customized tensor name for prediction after deploying the model #394

Closed

tlelson mentioned this issue Oct 9, 2018

Add Instructions for Contributing to the project aws/sagemaker-tensorflow-training-toolkit#85

Open

ChoiByungWook added type: bug type: documentation labels Oct 9, 2018

apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this issue Nov 15, 2018

Merge pull request aws#413 from Aloha106/master

dd9f4a0

Add batch transform to image-classification notebook - Part 2

laurenyu closed this as completed May 21, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unable to invoke SageMaker API endpoint - Vague Error: KeyError: u'' #413

Unable to invoke SageMaker API endpoint - Vague Error: KeyError: u'' #413

tlelson commented Oct 1, 2018

ChoiByungWook commented Oct 4, 2018

tlelson commented Oct 4, 2018 •

edited

Loading

ChoiByungWook commented Oct 4, 2018

tlelson commented Oct 8, 2018

ChoiByungWook commented Oct 8, 2018

tlelson commented Oct 9, 2018 •

edited

Loading

ChoiByungWook commented Oct 9, 2018

icywang86rui commented Oct 10, 2018

tlelson commented Oct 12, 2018

icywang86rui commented Oct 15, 2018

tlelson commented Nov 20, 2018 •

edited

Loading

zjost commented Jan 17, 2019

mvsusp commented Mar 8, 2019

laurenyu commented May 21, 2020

Unable to invoke SageMaker API endpoint - Vague Error: KeyError: u'' #413

Unable to invoke SageMaker API endpoint - Vague Error: KeyError: u'' #413

Comments

tlelson commented Oct 1, 2018

System Information

Describe the problem

Minimal repro / logs

Notebook Error

Cloudwatch Errors:

Reproducing the Error

ChoiByungWook commented Oct 4, 2018

tlelson commented Oct 4, 2018 • edited Loading

ChoiByungWook commented Oct 4, 2018

tlelson commented Oct 8, 2018

ChoiByungWook commented Oct 8, 2018

tlelson commented Oct 9, 2018 • edited Loading

ChoiByungWook commented Oct 9, 2018

icywang86rui commented Oct 10, 2018

tlelson commented Oct 12, 2018

icywang86rui commented Oct 15, 2018

tlelson commented Nov 20, 2018 • edited Loading

zjost commented Jan 17, 2019

mvsusp commented Mar 8, 2019

laurenyu commented May 21, 2020

tlelson commented Oct 4, 2018 •

edited

Loading

tlelson commented Oct 9, 2018 •

edited

Loading

tlelson commented Nov 20, 2018 •

edited

Loading