Fix: Don't load default model in MME mode #130


Merged — 16 commits into aws:master, Nov 7, 2022

Conversation

nikhil-sk (Contributor)

Issue #, if available:

Description of changes:

  1. In MME (multi-model endpoint) mode, no default model should be loaded. Currently, the torchserve command attempts to load a default 'model' from the path /opt/ml/models.
  2. This change omits that command-line argument when the container is running in MME mode:
    Failure log
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,758 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - model_name: model, batchSize: 1
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,808 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Backend worker process died.
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,808 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - Traceback (most recent call last):
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,809 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 210, in <module>
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,809 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - worker.run_server()
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,809 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 181, in run_server
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,809 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self.handle_connection(cl_socket)
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,809 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 139, in handle_connection
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,809 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - service, result, code = self.load_model(msg)
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,809 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_service_worker.py", line 104, in load_model
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,809 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - service = model_loader.load(
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,809 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/ts/model_loader.py", line 151, in load
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,809 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - initialize_fn(service.context)
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,810 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/sagemaker_pytorch_serving_container/handler_service.py", line 51, in initialize
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,810 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - super().initialize(context)
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,810 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/sagemaker_inference/default_handler_service.py", line 66, in initialize
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,810 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self._service.validate_and_initialize(model_dir=model_dir, context=context)
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,810 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/sagemaker_inference/transformer.py", line 178, in validate_and_initialize
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,810 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - self._model = self._run_handler_function(self._model_fn, *(model_dir,))
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,810 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/sagemaker_inference/transformer.py", line 266, in _run_handler_function
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,810 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - result = func(*argv)
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,811 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - File "/opt/conda/lib/python3.8/site-packages/sagemaker_pytorch_serving_container/default_pytorch_inference_handler.py", line 73, in default_model_fn
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,811 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - raise ValueError(
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,811 [INFO ] W-9000-model_1.0-stdout MODEL_LOG - ValueError: Exactly one .pth or .pt file is required for PyTorch models: []
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,817 [INFO ] epollEventLoopGroup-5-1 org.pytorch.serve.wlm.WorkerThread - 9000 Worker disconnected. WORKER_STARTED
    2022-09-07T21:03:56.608+02:00   2022-09-07T19:03:55,818 [WARN ] W-9000-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died.
...

Fixed log
(No default model loaded when torchserve starts)

Metrics report format: prometheus
Enable metrics API: true
Workflow Store: /
Model config: N/A
2022-10-31T07:12:55,633 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager -  Loading snapshot serializer plugin...
2022-10-31T07:12:55,651 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2022-10-31T07:12:55,696 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://0.0.0.0:8080
2022-10-31T07:12:55,697 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2022-10-31T07:12:55,698 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2022-10-31T07:12:55,914 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:container-1.local,timestamp:1667200375
2022-10-31T07:12:55,914 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:40.57777786254883|#Level:Host|#hostname:container-1.local,timestamp:1667200375
2022-10-31T07:12:55,914 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:11.410484313964844|#Level:Host|#hostname:container-1.local,timestamp:1667200375
2022-10-31T07:12:55,914 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:21.9|#Level:Host|#hostname:container-1.local,timestamp:1667200375
2022-10-31T07:12:55,914 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:6150.69921875|#Level:Host|#hostname:container-1.local,timestamp:1667200375
2022-10-31T07:12:55,915 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:1175.19921875|#Level:Host|#hostname:container-1.local,timestamp:1667200375
2022-10-31T07:12:55,915 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:19.3|#Level:Host|#hostname:container-1.local,timestamp:1667200375
2022-10-31T07:12:57,847 [INFO ] pool-2-thread-1 ACCESS_LOG - /169.254.178.2:35152 "GET /ping HTTP/1.1" 200 13
2022-10-31T07:12:57,847 [INFO ] pool-2-thread-1 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:container-1.local,timestamp:1667200377
2022-10-31T07:12:57,866 [INFO ] epollEventLoopGroup-3-1 ACCESS_LOG - /169.254.178.2:35152 "GET /models HTTP/1.1" 200 2
2022-10-31T07:12:57,866 [INFO ] epollEventLoopGroup-3-1 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:container-1.local,timestamp:1667200377
2022-10-31T07:13:02,752 [INFO ] pool-2-thread-1 ACCESS_LOG - /169.254.178.2:35152 "GET /ping HTTP/1.1" 200 1
2022-10-31T07:13:02,752 [INFO ] pool-2-thread-1 TS_METRICS - Requests2XX.Count:1|#Level:Host|#hostname:container-1.local,timestamp:1667200377
... (subsequent "GET /ping" requests continue to return 200 every 5 seconds)
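The conditional described in item 2 of the change description can be sketched as follows. This is an illustrative sketch only: the helper name and the exact way the toolkit detects MME mode are assumptions, not the toolkit's actual code (the `--start`, `--model-store`, and `--models` flags are standard torchserve CLI options).

```python
DEFAULT_MODEL_DIR = "/opt/ml/model"
MME_MODEL_DIR = "/opt/ml/models"


def build_torchserve_command(multi_model: bool) -> list:
    """Build the torchserve launch command, omitting the default model
    argument when running as a multi-model endpoint (MME)."""
    cmd = [
        "torchserve",
        "--start",
        "--model-store", MME_MODEL_DIR if multi_model else DEFAULT_MODEL_DIR,
    ]
    if not multi_model:
        # Only single-model endpoints preload a default model. In MME mode
        # models are loaded dynamically via the management API, so passing
        # --models here would trigger the worker crash shown above.
        cmd += ["--models", f"model={DEFAULT_MODEL_DIR}"]
    return cmd
```

With `multi_model=True` the command contains no `--models` argument, matching the fixed log above where torchserve starts with no default model loaded.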


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@sagemaker-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-inference-toolkit-pr
  • Commit ID: 69eeea9
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-inference-toolkit-pr
  • Commit ID: 8b210dd
  • Result: FAILED
  • Build Logs (available for 30 days)


@sagemaker-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-inference-toolkit-pr
  • Commit ID: 8514322
  • Result: FAILED
  • Build Logs (available for 30 days)


@sagemaker-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-inference-toolkit-pr
  • Commit ID: 70b1278
  • Result: FAILED
  • Build Logs (available for 30 days)


@sagemaker-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-inference-toolkit-pr
  • Commit ID: b67f7fa
  • Result: FAILED
  • Build Logs (available for 30 days)


@sagemaker-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-inference-toolkit-pr
  • Commit ID: 17094ed
  • Result: FAILED
  • Build Logs (available for 30 days)


@sagemaker-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-inference-toolkit-pr
  • Commit ID: 260288f
  • Result: FAILED
  • Build Logs (available for 30 days)


@sagemaker-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-inference-toolkit-pr
  • Commit ID: bb2945f
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)


@@ -70,6 +70,7 @@ deps =
six
future
pyyaml
protobuf == 3.19.6
@nikhil-sk (Contributor, Author) — Oct 31, 2022

This pin is currently required; otherwise, SageMaker imports fail with the following error on py37 only:

    import sagemaker.amazon.common
.tox/py37/lib/python3.7/site-packages/sagemaker/amazon/common.py:23: in <module>
  from sagemaker.amazon.record_pb2 import Record
.tox/py37/lib/python3.7/site-packages/sagemaker/amazon/record_pb2.py:52: in <module>
    file=DESCRIPTOR,
.tox/py37/lib/python3.7/site-packages/google/protobuf/descriptor.py:560: in __new__
    _message.Message._CheckCalledFromGeneratedFile()
E   TypeError: Descriptors cannot not be created directly.
E   If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
E   If you cannot immediately regenerate your protos, some other possible workarounds are:
E    1. Downgrade the protobuf package to 3.20.x or lower.
E    2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).
E
E   More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

Upgrading the SageMaker version did not resolve the issue, so for now we pin the protobuf version and will consider a complete upgrade of dependencies in a separate PR.
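As a sanity check, the compatibility boundary from the error message ("Downgrade the protobuf package to 3.20.x or lower") can be expressed as a small helper. This is an illustrative sketch, not part of the PR:

```python
def protobuf_accepts_legacy_pb2(protobuf_version: str) -> bool:
    """Return True if this protobuf release still accepts _pb2 modules
    generated with protoc < 3.19, per the guidance in the error message
    ("3.20.x or lower")."""
    major, minor = (int(part) for part in protobuf_version.split(".")[:2])
    return (major, minor) <= (3, 20)
```

The pinned 3.19.6 falls inside this range, while the protobuf 4.x releases that raised the `TypeError` above do not.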

@sagemaker-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-inference-toolkit-pr
  • Commit ID: ce931fd
  • Result: FAILED
  • Build Logs (available for 30 days)


@sagemaker-bot (Collaborator)

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-inference-toolkit-pr
  • Commit ID: 604e65b
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)


maaquib previously approved these changes Nov 1, 2022
@nikhil-sk nikhil-sk merged commit 1daa4c1 into aws:master Nov 7, 2022