502 bad gateway? #1485

Closed
Xixiong-Guo opened this issue May 11, 2020 · 16 comments

Comments

@Xixiong-Guo

Following https://aws.amazon.com/blogs/machine-learning/deploy-trained-keras-or-tensorflow-models-using-amazon-sagemaker/, I tried to save two different models (a sentiment analysis model and a simple regression model) trained with TensorFlow+Keras and upload them to SageMaker, but encountered the same 502 error for both, which is seldom reported here or on Stack Overflow. Any thoughts?

# `runtime` is a boto3 SageMaker runtime client, e.g. boto3.client('sagemaker-runtime')
Body_review = ','.join([str(val) for val in padded_pred]).encode('utf-8')

response = runtime.invoke_endpoint(EndpointName=predictor.endpoint,
                                   ContentType='text/csv',
                                   Body=Body_review)

An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (502) from model with message "<title>502 Bad Gateway</title> 502 Bad Gateway nginx/1.16.1"

I searched CloudWatch and found the following:

2020/05/10 15:53:27 [error] 35#35: *187 connect() failed (111: Connection refused) while connecting to upstream, client: 10.32.0.1, server: , request: "POST /invocations HTTP/1.1", subrequest: "/v1/models/export:predict", upstream: "http://127.0.0.1:27001/v1/models/export:predict", host: "model.aws.local:8080"

I tried another regression model (trained outside SageMaker, saved and uploaded to S3, and loaded into SageMaker, following https://aws.amazon.com/blogs/machine-learning/deploy-trained-keras-or-tensorflow-models-using-amazon-sagemaker/).

Still the same issue when using the predictor:

from sagemaker.predictor import csv_serializer

predictor.content_type = 'text/csv'

predictor.serializer = csv_serializer

Y_pred = predictor.predict(test.tolist())

Error:

---------------------------------------------------------------------------
ModelError Traceback (most recent call last)
in ()
4 predictor.serializer = csv_serializer
5
----> 6 Y_pred = predictor.predict(test)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/predictor.py in predict(self, data, initial_args, target_model)
108
109 request_args = self._create_request_args(data, initial_args, target_model)
--> 110 response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
111 return self._handle_response(response)
112

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
314 "%s() only accepts keyword arguments." % py_operation_name)
315 # The "self" in this scope is referring to the BaseClient.
--> 316 return self._make_api_call(operation_name, kwargs)
317
318 _api_call.__name__ = str(py_operation_name)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
624 error_code = parsed_response.get("Error", {}).get("Code")
625 error_class = self.exceptions.from_code(error_code)
--> 626 raise error_class(parsed_response, operation_name)
627 else:
628 return parsed_response

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (502) from model with message "<title>502 Bad Gateway</title> 502 Bad Gateway nginx/1.16.1".

@Xixiong-Guo
Author

Actually, the 502 error trouble was already there when running predictor.predict(test) before deploy. But my model performed well on my own machine and was saved exactly the same way as in https://aws.amazon.com/blogs/machine-learning/deploy-trained-keras-or-tensorflow-models-using-amazon-sagemaker/

@chuyang-deng
Contributor

Hi @Xixiong-Guo, if you are following the example from https://aws.amazon.com/blogs/machine-learning/deploy-trained-keras-or-tensorflow-models-using-amazon-sagemaker/, it's likely that you've used the wrong Model class in step 5.

For framework versions 1.11 and above, we've split the TensorFlow container into training and serving. For deploying the model, please use this class instead: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/serving.py#L121
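
For example, a minimal sketch of deploying with that class (the S3 URI and instance type below are placeholders):

from sagemaker import get_execution_role
from sagemaker.tensorflow.serving import Model  # serving-only Model class

role = get_execution_role()

# model_data points to your model.tar.gz in S3 (placeholder bucket/key)
model = Model(model_data='s3://<your-bucket>/model/model.tar.gz', role=role)

# Deploys a sagemaker-tensorflow-serving endpoint and returns a Predictor
predictor = model.deploy(initial_instance_count=1, instance_type='ml.c5.xlarge')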

@Xixiong-Guo
Author

Hi @ChuyangDeng, thanks for your reply.

I did encounter this problem, and then I found this reference (https://sagemaker.readthedocs.io/en/stable/using_tf.html#deploying-directly-from-model-artifacts). I've changed to:

from sagemaker.tensorflow.serving import Model
model = Model(model_data='s3://sagemaker-us-east-1-665159495798/model/model.tar.gz', role=role)

So I guess this should not be the reason for the 502 error now? Thanks!

@chuyang-deng
Contributor

Hi @Xixiong-Guo,

How did you tar the model? When you tar your model, please make sure to use the -C option so that the tar.gz does not add an extra layer to your folder. When it extracts, the outermost layer should be the version number, something like:

$ ls -al 00000123 # version number (not model name)
total 24
drwxr-xr-x .
drwx------ ..
drwxr-xr-x assets
-rw-r--r-- saved_model.pb
drwxr-xr-x variables

@Xixiong-Guo
Author

Hi @ChuyangDeng
I used:

import tarfile
with tarfile.open('model.tar.gz', mode='w:gz') as archive:
    archive.add('export', recursive=True)

My tar.gz looks like:
model.tar.gz\export\Servo\1\saved_model.pb
model.tar.gz\export\Servo\1\variables\

You mean the directory should look like:
model.tar.gz\1\saved_model.pb
model.tar.gz\1\variables\ ? (If the version number is 1)

Thanks.

@chuyang-deng
Contributor

Yes, SageMaker expects the model to be extracted directly under the "opt/ml/<model_name>/" directory inside the container. The sagemaker-tensorflow-serving container will look for the model version directly under "<model_name>/". So your tar structure should be:

model.tar.gz\1\saved_model.pb
model.tar.gz\1\variables...
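
If you are building the archive with Python's tarfile module (as in your earlier snippet), one way to get that layout is to add the version directory itself and strip the export/Servo prefix with arcname. A minimal sketch, assuming the SavedModel was exported locally to export/Servo/1:

import tarfile

# Add the local export/Servo/1 directory to the archive as just "1", so extracting
# model.tar.gz yields 1/saved_model.pb and 1/variables/... at the top level.
with tarfile.open('model.tar.gz', mode='w:gz') as archive:
    archive.add('export/Servo/1', arcname='1', recursive=True)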

@Xixiong-Guo
Author

Hi @ChuyangDeng
Unfortunately it is still not working; the error is still the same as before. The tar structure is attached:
(screenshot: tar_structure)

Code and errors are as follows:

import boto3, re
from sagemaker import get_execution_role

role = get_execution_role()
import tarfile
with tarfile.open('model.tar.gz', mode='w:gz') as archive:
    archive.add('1', recursive=True)

import sagemaker
sagemaker_session = sagemaker.Session()
inputs = sagemaker_session.upload_data(path='model.tar.gz', key_prefix='model')

from sagemaker.tensorflow.serving import Model
model = Model(model_data='s3://sagemaker-us-east-1-665159495798/model/model.tar.gz', role=role)
predictor = model.deploy(initial_instance_count=1, instance_type='ml.c5.xlarge')

result = predictor.predict(input)


ModelError Traceback (most recent call last)
in ()
----> 1 result = predictor.predict(input)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/tensorflow/serving.py in predict(self, data, initial_args)
116 args["CustomAttributes"] = self._model_attributes
117
--> 118 return super(Predictor, self).predict(data, args)
119
120

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/predictor.py in predict(self, data, initial_args, target_model)
108
109 request_args = self._create_request_args(data, initial_args, target_model)
--> 110 response = self.sagemaker_session.sagemaker_runtime_client.invoke_endpoint(**request_args)
111 return self._handle_response(response)
112

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
314 "%s() only accepts keyword arguments." % py_operation_name)
315 # The "self" in this scope is referring to the BaseClient.
--> 316 return self._make_api_call(operation_name, kwargs)
317
318 _api_call.__name__ = str(py_operation_name)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
624 error_code = parsed_response.get("Error", {}).get("Code")
625 error_class = self.exceptions.from_code(error_code)
--> 626 raise error_class(parsed_response, operation_name)
627 else:
628 return parsed_response

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (502) from model with message "<title>502 Bad Gateway</title> 502 Bad Gateway nginx/1.16.1". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/sagemaker-tensorflow-serving-2020-05-16-13-45-23-880 in account 665159495798 for more information.

@Sbrikky

Sbrikky commented May 18, 2020

I've been having the same issue after following the same examples. I've also checked my tar and I'm using the serving Model class.

The CloudWatch log is as follows, from the moment I invoke the endpoint until it goes back to the regular pinging. (I used container_log_level = logging.DEBUG.)

  • 2020-05-18 08:13:49.621487: F external/org_tensorflow/tensorflow/core/util/tensor_format.h:426] Check failed: index >= 0 && index < num_total_dims Invalid index from the dimension: 3, 0, C
  • 2020/05/18 08:13:49 [error] 18#18: *96 upstream prematurely closed connection while reading response header from upstream, client: 10.32.0.2, server: , request: "POST /invocations HTTP/1.1", subrequest: "/v1/models/model:predict", upstream: "http://127.0.0.1:27001/v1/models/model:predict", host: "model.aws.local:8080"
  • 2020/05/18 08:13:49 [warn] 18#18: *96 upstream server temporarily disabled while reading response header from upstream, client: 10.32.0.2, server: , request: "POST /invocations HTTP/1.1", subrequest: "/v1/models/model:predict", upstream: "http://127.0.0.1:27001/v1/models/model:predict", host: "model.aws.local:8080"
  • 10.32.0.2 - - [18/May/2020:08:13:49 +0000] "POST /invocations HTTP/1.1" 502 157 "-" "AHC/2.0"
  • WARNING:__main__:unexpected tensorflow serving exit (status: 6). restarting.
  • INFO:__main__:tensorflow version info:
  • TensorFlow ModelServer: 2.1.0-rc1+dev.sha.075ffcf
  • TensorFlow Library: 2.1.0
  • INFO:__main__:tensorflow serving command: tensorflow_model_server --port=27000 --rest_api_port=27001 --model_config_file=/sagemaker/model-config.cfg --max_num_load_retries=0
  • INFO:__main__:started tensorflow serving (pid: 131)
  • 2020-05-18 08:13:50.045247: I tensorflow_serving/model_servers/server_core.cc:462] Adding/updating models.
  • 2020-05-18 08:13:50.045284: I tensorflow_serving/model_servers/server_core.cc:573] (Re-)adding model: model
  • 2020-05-18 08:13:50.145608: I tensorflow_serving/util/retrier.cc:46] Retrying of Reserving resources for servable: {name: model version: 1} exhausted max_num_retries: 0
  • 2020-05-18 08:13:50.145645: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: model version: 1}
  • 2020-05-18 08:13:50.145657: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: model version: 1}
  • 2020-05-18 08:13:50.145670: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: model version: 1}
  • 2020-05-18 08:13:50.145704: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /opt/ml/model/1
  • 2020-05-18 08:13:50.154042: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
  • 2020-05-18 08:13:50.154072: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:264] Reading SavedModel debug info (if present) from: /opt/ml/model/1
  • 2020-05-18 08:13:50.155009: I external/org_tensorflow/tensorflow/core/common_runtime/process_util.cc:147] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
  • 2020-05-18 08:13:50.194113: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:203] Restoring SavedModel bundle.
  • 2020-05-18 08:13:50.252623: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:333] SavedModel load for tags { serve }; Status: success: OK. Took 106920 microseconds.
  • 2020-05-18 08:13:50.254566: I tensorflow_serving/servables/tensorflow/saved_model_warmup.cc:105] No warmup data file found at /opt/ml/model/1/assets.extra/tf_serving_warmup_requests
  • 2020-05-18 08:13:50.255778: I tensorflow_serving/util/retrier.cc:46] Retrying of Loading servable: {name: model version: 1} exhausted max_num_retries: 0
  • 2020-05-18 08:13:50.255798: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: model version: 1}
  • 2020-05-18 08:13:50.257993: I tensorflow_serving/model_servers/server.cc:362] Running gRPC ModelServer at 0.0.0.0:27000 ...
  • [warn] getaddrinfo: address family for nodename not supported
  • 2020-05-18 08:13:50.259143: I tensorflow_serving/model_servers/server.cc:382] Exporting HTTP/REST API at:localhost:27001 ...
  • [evhttp_server.cc : 238] NET_LOG: Entering the event loop ...

@Sbrikky

Sbrikky commented May 18, 2020

For me this ended up being an issue with the shape of the input. I was uploading an individual sample, but the endpoint expects a batch, so I needed to make my input one layer deeper (as described here). Could this be happening for you as well, @Xixiong-Guo?

@Xixiong-Guo
Author

Hi @Sbrikky, did you encounter the same 502 issue?
I have no idea if it is caused by the shape of the input. I can make predictions outside of SageMaker in Keras when a list representing an embedded sentence is used as input, but it fails inside SageMaker.

@Sbrikky

Sbrikky commented May 18, 2020

@Xixiong-Guo Yes, I had the exact same error in my notebook as the one you posted, so I didn't bother posting it again.
If you look at the first line of my CloudWatch log, you see it says:

2020-05-18 08:13:49.621487: F external/org_tensorflow/tensorflow/core/util/tensor_format.h:426] Check failed: index >= 0 && index < num_total_dims Invalid index from the dimension: 3, 0, C

This suggested that maybe there was something wrong with the shape of the request. Why on earth it ends up throwing this as a 502, I have no clue.
For me, I had to put my input into another list. So instead of an array with all my pixel values, I had a list of one element, and that one element was my array with all my pixel values. So instead of predict(input) I had to do predict([input.tolist()]).
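
A minimal sketch of that shape difference (assuming predictor is the object returned by model.deploy(...) earlier in the thread; the sample shape here is just a made-up example):

import numpy as np

# Hypothetical single sample; the shape is only an example, not from the original model.
sample = np.random.rand(28, 28, 3)

# Sending the bare, un-batched sample is what crashed TF Serving and surfaced as a 502:
# predictor.predict(sample.tolist())

# Wrapping it in an outer list adds the batch dimension TF Serving expects
# (a batch containing exactly one sample):
result = predictor.predict([sample.tolist()])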

@Xixiong-Guo
Author

Hi @Sbrikky, I got it. In your case, was there any difference in the error info when you tried predict(input) versus predict([input.tolist()])?

@Sbrikky

Sbrikky commented May 19, 2020

When I use predict([input.tolist()]) it works and I get a prediction back. No 502.

@chuyang-deng
Contributor

Hi @Xixiong-Guo, it looks like you are using csv_serializer; note here (https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/predictor.py#L325) that the serializer will serialize your input row by row, delimited by ",", if you are using a Python list: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/predictor.py#L363
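
For illustration, this is roughly what the serializer produces (a sketch based on the SDK 1.x behavior; exact string formatting may vary slightly by version):

from sagemaker.predictor import csv_serializer

# A flat Python list is serialized as a single CSV row:
csv_serializer([1.0, 2.0, 3.0])           # -> '1.0,2.0,3.0'

# A list of lists is serialized as one CSV row per inner list, separated by newlines:
csv_serializer([[1.0, 2.0], [3.0, 4.0]])  # -> '1.0,2.0\n3.0,4.0'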

@MohGhaziAlZeyadi

Hi all,
I am having the same problem with the 502 error:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (502) from model with message "<title>502 Bad Gateway</title> 502 Bad Gateway nginx/1.16.1"

@alsulke

alsulke commented Jul 7, 2020

For me this ended up being an issue with the directory structure of the saved model.
As per the latest version, SageMaker expects the model to be extracted directly under model_name/version/..

So your tar structure should be:
model.tar.gz\export\1\saved_model.pb
model.tar.gz\export\1\variables\

@aws locked and limited conversation to collaborators May 20, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
