Skip to content

Unable to invoke SageMaker API endpoint - Vague Error: KeyError: u'' #413

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
tlelson opened this issue Oct 1, 2018 · 14 comments
Closed

Comments

@tlelson
Copy link

tlelson commented Oct 1, 2018

This seems very likely conected to Issues: #99, #269 and #100.

System Information

  • Tensorflow / DNNClassifier
  • TensorFlow v 1.10, sagemaker==1.11.0
  • conda_tensorflow_p27 on hosted notebook

Describe the problem

I'm running my own Iris classifier comparing Sagemaker to Google MLEngine. I have taken standard TensorFlow code known to work, that deploys and predicts from MLE and repeated the steps in Sagemaker. Everything goes as expected up until I invoke the endpoint. At this stage I receive the following errors:

Minimal repro / logs

Notebook Error

ModelErrorTraceback (most recent call last)
<ipython-input-92-07d9ea9b1239> in <module>()
      2 # iris_predictor.predict({u'' : list(sample0.values)})
      3 # iris_predictor.predict({ u'instances': [dict(sample0)]})
----> 4 iris_predictor.predict({ u'': [dict(sample0)]})

/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/sagemaker/predictor.pyc in predict(self, data)
     84             request_args['Accept'] = self.accept
     85 
---> 86         response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
     87 
     88         response_body = response['Body']

/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/botocore/client.pyc in _api_call(self, *args, **kwargs)
    312                     "%s() only accepts keyword arguments." % py_operation_name)
    313             # The "self" in this scope is referring to the BaseClient.
--> 314             return self._make_api_call(operation_name, kwargs)
    315 
    316         _api_call.__name__ = str(py_operation_name)

/home/ec2-user/anaconda3/envs/tensorflow_p27/lib/python2.7/site-packages/botocore/client.pyc in _make_api_call(self, operation_name, api_params)
    610             error_code = parsed_response.get("Error", {}).get("Code")
    611             error_class = self.exceptions.from_code(error_code)
--> 612             raise error_class(parsed_response, operation_name)
    613         else:
    614             return parsed_response

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from model with message "". See https://ap-southeast-2.console.aws.amazon.com/cloudwatch/home?region=ap-southeast-2#logEventViewer:group=/aws/sagemaker/Endpoints/sagemaker-tensorflow-2018-10-01-02-17-54-088 in account XXXXX for more information.

Cloudwatch Errors:

...
10.32.0.2 - - [01/Oct/2018:03:23:33 +0000] "GET /ping HTTP/1.1" 200 0 "-" "AHC/2.0"
[2018-10-01 03:23:37,290] ERROR in serving: u''
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/container_support/serving.py", line 182, in _invoke
self.transformer.transform(content, input_content_type, requested_output_content_type)
File "/usr/local/lib/python2.7/dist-packages/tf_container/serve.py", line 281, in transform
return self.transform_fn(data, content_type, accepts), accepts
File "/usr/local/lib/python2.7/dist-packages/tf_container/serve.py", line 208, in f
prediction = self.predict_fn(input)
File "/usr/local/lib/python2.7/dist-packages/tf_container/serve.py", line 223, in predict_fn
return self.proxy_client.request(data)
File "/usr/local/lib/python2.7/dist-packages/tf_container/proxy_client.py", line 66, in request
request_fn = self.request_fn_map[self.prediction_type]
KeyError: u''
[2018-10-01 03:23:37,290] ERROR in serving: u''
10.32.0.2 - - [01/Oct/2018:03:23:37 +0000] "POST /invocations HTTP/1.1" 500 0 "-" "AHC/2.0"
...

Reproducing the Error

The full model definition passed to Sagemaker for training and evaluation is here

The notebook currently produces the error.

Create the estimator with

estimator = TensorFlow(entry_point='../package/trainer/model.py',
                       role=execution_role,
                       framework_version='1.10',
                       output_path=model_artifacts_location,
                       code_location=custom_code_upload_location,
                       train_instance_count=1,
                       training_steps=100,
                       evaluation_steps=10,
                       train_instance_type='ml.c4.xlarge')

# Fit
estimator.fit(s3_data)

# Deploy
iris_predictor = estimator.deploy(initial_instance_count=1,
                                  instance_type='ml.t2.medium')

test0 = {'PetalLength': 1.7, 'PetalWidth': 0.5, 'SepalLength': 5.1, 'SepalWidth': 3.3}

# Predict (ERROR)
# iris_predictor.predict(test0) 
# iris_predictor.predict([test0]) 
# iris_predictor.predict({ u'instances': [test0]}) 
iris_predictor.predict({ u'': [test0]}) 

... all produce the vague error.

Thanks for your help.

@ChoiByungWook
Copy link
Contributor

Hello @eL0ck,

Thank you for bringing this to our attention. I'll begin by attempting to reproduce this error.

@tlelson
Copy link
Author

tlelson commented Oct 4, 2018

Hi @ChoiByungWook ,
I'm happy to take a look at fixing this. Are you are able to approve PRs on this repo ?

P.S given that the issue is likely just in the way I am passing the test instance data; if you or anyone else can suggest a way to call predict that works, that would be much appreciated.

@ChoiByungWook
Copy link
Contributor

Hello @eL0ck,

Yes I can approve PRs. Please feel free to submit a PR.

Our TensorFlow container that your DNNClassifier is running is open sourced. Here is the line where it failed, maybe we can debug this together.

Is your model expecting a tensor with a blank label?

Also, there is a chance that maybe the serialization of the array is failing and might have to be a list? Perhaps this PR solves this? #404

@tlelson
Copy link
Author

tlelson commented Oct 8, 2018

Thanks @ChoiByungWook, I've pulled those updates to the serialisation and reproduced the issue. But as you mentioned, the problem lies with the sagemaker container.

I have started developing this locally but have found it to be very difficult to work on. I've build the latest version as per the instructions and have attempted to run sagemaker.tensorflow.TensorFlow locally using the MNIST example.

from sagemaker.tensorflow import TensorFlow

mnist_estimator = TensorFlow(entry_point='mnist.py',
                             role=role,
                             framework_version='1.10.0',
                             training_steps=10, 
                             evaluation_steps=10,
                             train_instance_count=2,
                             train_instance_type='local',
                             image_name='my-sm-tensorflow:1.10.0-cpu-py2',
                            )

# mnist_estimator.fit(inputs) 
local_inputs = 'file://{}/data/'.format(os.getcwd())
mnist_estimator.fit(local_inputs)

I have built the container with an ~/.aws/credentials file that has keys for a user that can assume the Sagemaker Execution Role, however when running the fit it the container fails to get the model source from s3. I believe the local container is not assuming the Execution role.

Are you aware of any documentation of how to work on the sagemaker container locally? By that I mean, without having to push every change to ECR.

@ChoiByungWook
Copy link
Contributor

Can you please check to see what policies are attached to your role? Does it have FullSageMakerAccess? The AWS console should be able to show you.

You shouldn't have to push the image to ECR to test locally. As long as you have docker-compose installed on your local machine and the image-name matches the name of the image on your machine it should be able to execute.

@tlelson
Copy link
Author

tlelson commented Oct 9, 2018

Thanks @ChoiByungWook.

The role is fine. I can assume it myself from the command line and have verified that the role has access to the specific s3 artefact that causes the container run code to fail. The role given to the TensorFlow object is not being assumed by my local container.

The reason I would like to see the recommended method for developing container code is that it might make it more clear how the container assumes the role. There are a number of possibilities. For instance, the sagemaker container might be performing sts:AssumeRole in which case the role it starts up under must have permissions to perform that action. Another option might be that the docker container comes up with environment variables of the parent process, populated by something like:

docker \
....
-e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
-e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
-e AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN \
...
<dockerimage:tag>

My point here is that the mechanics of it is all quite unclear to a potential contributor like myself. As such it is difficult to debug elementary issues such as these.


I have raised an issue in the container repo.

@ChoiByungWook
Copy link
Contributor

I apologize for the experience. We will make sure to update the docs to allow for a better experience.

@icywang86rui
Copy link
Contributor

The role used in the container is actually not the one passed in to the TensorFlow estimator. It's using the one set on your box in ~/.aws/credentials. As you can see here: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/local/image.py#L392 localmode is using the credentials from the boto session which by default uses the credentials set locally.

@eL0ck Are you still experiencing the same problem? Please double check the policy attached to the role/user in ~/.aws/credentials.

In the meantime I will update our documentation to make this more clear.

@tlelson
Copy link
Author

tlelson commented Oct 12, 2018

Excellent @icywang86rui !

This answers my question above about how the credentials are being provided. Effectively the container is run with environment variables equivalent to using -e flags. This also explains why I could not find any sts::AssumeRole operations in the container code. I assume then that when the container runs in Sagemaker, that behind the scenes you are creating an ECS task Definition with a 'Task Role' matching the role provided by the TensorFlow constructor??

Has this been something discussed there? I think there would be a reasonable argument to say the container should run under the same role in local testing as it does at full-scale in the cloud.

@icywang86rui
Copy link
Contributor

@eL0ck I think when you run a training job in Sagemaker, the training services uses EC2MetadataService to deploy credentials of the role you specified in the Estimator constructor. We can't really do that locally. So if you want the permissions to match you will have to create a user with the same permission of the role you are running sagemaker jobs with. I hope that answers your question.

apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this issue Nov 15, 2018
Add batch transform to image-classification notebook - Part 2
@tlelson
Copy link
Author

tlelson commented Nov 20, 2018

WIP

So it looks like because the container is built with tensorflow-serving-api=1.7.0 (current version: 1.12.0) the export can't be configured.

...
algo-1-X9QR3_1  | 2018-11-20 13:19:51,972 INFO - tensorflow - Signatures EXCLUDED from export because they cannot be be served via TensorFlow Serving APIs:
algo-1-X9QR3_1  | 2018-11-20 13:19:51,972 INFO - tensorflow - 'serving_default' : Classification input must be a single string Tensor; got {'SepalLength': <tf.Tensor 'SepalLength:0' shape=(?,) dtype=float32>, 'PetalLength': <tf.Tensor 'PetalLength:0' shape=(?,) dtype=float32>, 'PetalWidth': <tf.Tensor 'PetalWidth:0' shape=(?,) dtype=float32>, 'SepalWidth': <tf.Tensor 'SepalWidth:0' shape=(?,) dtype=float32>}
algo-1-X9QR3_1  | 2018-11-20 13:19:51,973 INFO - tensorflow - 'classification' : Classification input must be a single string Tensor; got {'SepalLength': <tf.Tensor 'SepalLength:0' shape=(?,) dtype=float32>, 'PetalLength': <tf.Tensor 'PetalLength:0' shape=(?,) dtype=float32>, 'PetalWidth': <tf.Tensor 'PetalWidth:0' shape=(?,) dtype=float32>, 'SepalWidth': <tf.Tensor 'SepalWidth:0' shape=(?,) dtype=float32>}
algo-1-X9QR3_1  | 2018-11-20 13:19:51,974 WARNING - tensorflow - Export includes no default signature!
...

Now I understand why all the examples have the arbitary constant INPUT_TENSOR_NAME = 'inputs' in them.


Thoughts:

  • Perhaps this should not be an INFO but maybe an ERROR or at least a WARNING
  • Maybe its time to update the version of tensorflow-serving-api

tensorflow/tensorflow#22901

from sagemaker-tensorflow-container: docker/1.10.0/final/py2/Dockerfile.cpu

...
d442ac7e (yangaws           2018-09-06 10:07:48 -0700)|   53 # TODO: upgrade to tf serving 1.8, which requires more work with updating
d442ac7e (yangaws           2018-09-06 10:07:48 -0700)|   54 # dependencies. See current work in progress in tfserving-1.8 branch.
d442ac7e (yangaws           2018-09-06 10:07:48 -0700)|   55 ENV TF_SERVING_VERSION=1.7.0
...

However there is no such branch.

Will try upgrade it now. @yangaws do you remember what happened to that branch?

@zjost
Copy link

zjost commented Jan 17, 2019

Any updates on this? Are there any easy work-arounds?

@mvsusp
Copy link
Contributor

mvsusp commented Mar 8, 2019

Hi @eL0ck and @zjost,

You can use the TensorFlow Serving Container. In this container, you can use the TFS REST API to make predictions against the container: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst#making-predictions-against-a-sagemaker-endpoint. That allows you to make predictions the same way that you would do outside SageMaker:

iris_predictor.predict({ 'instances': [test0]}) 

@laurenyu
Copy link
Contributor

closing due to lack of activity (and because we no longer releases fixes for TensorFlow "legacy mode")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants