Unable to invoke SageMaker API endpoint - Vague Error: KeyError: u'' #413
Hello @eL0ck, Thank you for bringing this to our attention. I'll begin by attempting to reproduce this error.
Hi @ChoiByungWook , P.S given that the issue is likely just in the way I am passing the test instance data; if you or anyone else can suggest a way to call |
Hello @eL0ck, Yes, I can approve PRs. Please feel free to submit a PR. The TensorFlow container that your DNNClassifier is running in is open source. Here is the line where it failed; maybe we can debug this together. Is your model expecting a tensor with a blank label? There is also a chance that the serialization of the array is failing and that the input has to be a list. Perhaps this PR solves this? #404
Thanks @ChoiByungWook, I've pulled those updates to the serialisation and reproduced the issue. But as you mentioned, the problem lies with the sagemaker container. I have started developing this locally but have found it very difficult to work on. I've built the latest version as per the instructions and have attempted to run:
import os
from sagemaker.tensorflow import TensorFlow
mnist_estimator = TensorFlow(
    entry_point='mnist.py',
    role=role,
    framework_version='1.10.0',
    training_steps=10,
    evaluation_steps=10,
    train_instance_count=2,
    train_instance_type='local',
    image_name='my-sm-tensorflow:1.10.0-cpu-py2',
)
# mnist_estimator.fit(inputs)
local_inputs = 'file://{}/data/'.format(os.getcwd())
mnist_estimator.fit(local_inputs)
I have built the container with an
Are you aware of any documentation on how to work on the sagemaker container locally? By that I mean, without having to push every change to ECR.
Can you please check which policies are attached to your role? Does it have AmazonSageMakerFullAccess? The AWS console should be able to show you. You shouldn't have to push the image to ECR to test locally. As long as you have docker-compose installed on your local machine and image_name matches the name of the image on your machine, it should be able to execute.
Thanks @ChoiByungWook. The role is fine. I can assume it myself from the command line and have verified that the role has access to the specific s3 artefact that causes the container run code to fail. The role given to the
The reason I would like to see the recommended method for developing container code is that it might make it clearer how the container assumes the role. There are a number of possibilities. For instance, the sagemaker container might be performing
docker \
    ....
    -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
    -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
    -e AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN \
    ...
    <dockerimage:tag>
My point here is that the mechanics of it are all quite unclear to a potential contributor like myself. As such, it is difficult to debug elementary issues such as these. I have raised an issue in the container repo.
I apologize for the experience. We will make sure to update the docs to allow for a better experience.
The role used in the container is actually not the one passed in to the TensorFlow estimator. It's using the one set on your box in ~/.aws/credentials. As you can see here: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/local/image.py#L392 local mode uses the credentials from the boto session, which by default uses the credentials set locally. @eL0ck Are you still experiencing the same problem? Please double check the policy attached to the role/user in ~/.aws/credentials. In the meantime I will update our documentation to make this clearer.
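To make the mechanism described in the comment above concrete, here is a minimal sketch of how a set of locally resolved credentials could be turned into the `-e` arguments shown in the earlier `docker` command. The helper name `credential_env_args` is hypothetical and this is not the SDK's actual implementation (see the linked image.py for that); it only assumes that the three standard AWS environment variables are what reach the container.

```python
from typing import List, Optional


def credential_env_args(access_key: str,
                        secret_key: str,
                        session_token: Optional[str] = None) -> List[str]:
    """Build `docker run` style -e arguments from AWS credentials.

    Hypothetical helper for illustration only; not part of the SageMaker SDK.
    """
    env = {
        "AWS_ACCESS_KEY_ID": access_key,
        "AWS_SECRET_ACCESS_KEY": secret_key,
    }
    # A session token is only present for temporary credentials
    # (for example, after assuming a role), so it is optional here.
    if session_token:
        env["AWS_SESSION_TOKEN"] = session_token

    args: List[str] = []
    for name, value in env.items():
        args.extend(["-e", "{}={}".format(name, value)])
    return args
```

The point of the sketch is that whatever identity the local boto session resolves to is the identity the container runs as, regardless of the `role` passed to the estimator.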
Excellent @icywang86rui! This answers my question above about how the credentials are being provided. Effectively the container is run with environment variables equivalent to using
Has this been something discussed there? I think there would be a reasonable argument to say the container should run under the same role in local testing as it does at full scale in the cloud.
@eL0ck I think when you run a training job in SageMaker, the training service uses EC2MetadataService to deploy credentials of the role you specified in the Estimator constructor. We can't really do that locally. So if you want the permissions to match, you will have to create a user with the same permissions as the role you are running SageMaker jobs with. I hope that answers your question.
WIP So it looks like because the container is built with
Now I understand why all the examples have the arbitrary constant Thoughts:
from
However, there is no such branch. Will try to upgrade it now. @yangaws do you remember what happened to that branch?
Any updates on this? Are there any easy work-arounds?
Hi @eL0ck and @zjost, You can use the TensorFlow Serving Container. In this container, you can use the TFS REST API to make predictions against the container: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst#making-predictions-against-a-sagemaker-endpoint. That allows you to make predictions the same way that you would outside SageMaker: iris_predictor.predict({'instances': [test0]})
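As a rough illustration of the request shape used in the comment above, the sketch below builds a TFS REST style payload with an `instances` key from rows of feature values. The helper name `build_tfs_request` and the sample Iris row are assumptions made for the example; converting each value with `float` also covers the earlier suggestion that an array may need to be sent as a plain list.

```python
import json


def build_tfs_request(rows):
    """Serialize feature rows into a {"instances": [...]} JSON body.

    Hypothetical helper for illustration only. Each row may be any
    iterable of numbers (a list, a tuple, or a 1-D array); values are
    converted to plain Python floats so that json can encode them.
    """
    instances = [[float(value) for value in row] for row in rows]
    return json.dumps({"instances": instances})


# Example row with four Iris features (values are made up for the sketch).
test0 = [5.1, 3.5, 1.4, 0.2]
body = build_tfs_request([test0])
```

With the predictor from the comment, the equivalent dictionary, `json.loads(body)`, is what would be passed to `predict`.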
Closing due to lack of activity (and because we no longer release fixes for TensorFlow "legacy mode").
This seems very likely connected to issues #99, #269 and #100.
System Information
DNNClassifier
sagemaker==1.11.0
conda_tensorflow_p27
on hosted notebook
Describe the problem
I'm running my own Iris classifier comparing Sagemaker to Google MLEngine. I have taken standard TensorFlow code known to work, that deploys and predicts from MLE and repeated the steps in Sagemaker. Everything goes as expected up until I invoke the endpoint. At this stage I receive the following errors:
Minimal repro / logs
Notebook Error
Cloudwatch Errors:
Reproducing the Error
The full model definition passed to Sagemaker for training and evaluation is here
The notebook currently produces the error.
Create the estimator with
... all produce the vague error.
Thanks for your help.