doc: more documentation for serverless inference #2859

Merged
merged 1 commit into from Jan 20, 2022
9 changes: 9 additions & 0 deletions doc/api/inference/serverless.rst
@@ -0,0 +1,9 @@
Serverless Inference
---------------------

This module contains classes related to Amazon SageMaker Serverless Inference.

.. automodule:: sagemaker.serverless.serverless_inference_config
   :members:
   :undoc-members:
   :show-inheritance:
57 changes: 57 additions & 0 deletions doc/overview.rst
@@ -684,6 +684,63 @@ For more detailed explanations of the classes that this library provides for aut
- `API docs for HyperparameterTuner and parameter range classes <https://sagemaker.readthedocs.io/en/stable/tuner.html>`__
- `API docs for analytics classes <https://sagemaker.readthedocs.io/en/stable/analytics.html>`__

*******************************
SageMaker Serverless Inference
*******************************
Amazon SageMaker Serverless Inference enables you to deploy machine learning models for inference without having
to configure or manage the underlying infrastructure. After you have trained a model, you can deploy it to an Amazon
SageMaker serverless endpoint and then invoke the endpoint to get inference results back. More information about
SageMaker Serverless Inference can be found in the `AWS documentation <https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html>`__.

To deploy a serverless endpoint, you will need to create a ``ServerlessInferenceConfig``.
If you create a ``ServerlessInferenceConfig`` without specifying its arguments, the default ``MemorySizeInMB`` will be **2048** and
the default ``MaxConcurrency`` will be **5**:

.. code:: python

   from sagemaker.serverless import ServerlessInferenceConfig

   # Create an empty ServerlessInferenceConfig object to use default values
   serverless_config = ServerlessInferenceConfig()

Or you can specify ``MemorySizeInMB`` and ``MaxConcurrency`` in the ``ServerlessInferenceConfig``:

.. code:: python

   # Specify MemorySizeInMB and MaxConcurrency in the serverless config object
   serverless_config = ServerlessInferenceConfig(
       memory_size_in_mb=4096,
       max_concurrency=10,
   )

Then use the ``ServerlessInferenceConfig`` in the estimator's ``deploy()`` method to deploy a serverless endpoint:

.. code:: python

   # Deploys the model that was generated by fit() to a SageMaker serverless endpoint
   serverless_predictor = estimator.deploy(serverless_inference_config=serverless_config)

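For context, here is a minimal end-to-end sketch of the train-then-deploy flow described above. The
``SKLearn`` estimator, entry-point script, IAM role, and S3 input path are illustrative assumptions for
this sketch, not part of the original example; any SageMaker estimator can be used in the same way:

.. code:: python

   from sagemaker.serverless import ServerlessInferenceConfig
   from sagemaker.sklearn import SKLearn

   # Illustrative estimator; substitute your own framework, script, and role
   estimator = SKLearn(
       entry_point="train.py",  # hypothetical training script
       role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical IAM role
       framework_version="0.23-1",
       instance_type="ml.m5.large",
       instance_count=1,
   )
   estimator.fit({"train": "s3://my-bucket/train"})  # hypothetical S3 input

   # Deploy the trained model to a serverless endpoint
   serverless_config = ServerlessInferenceConfig(
       memory_size_in_mb=4096,
       max_concurrency=10,
   )
   serverless_predictor = estimator.deploy(
       serverless_inference_config=serverless_config
   )
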
After deployment is complete, you can use the predictor's ``predict()`` method to invoke the serverless endpoint,
just as you would with a real-time endpoint:

.. code:: python

   # Serializes data and makes a prediction request to the SageMaker serverless endpoint
   response = serverless_predictor.predict(data)

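The accepted format of ``data`` depends on the serializer and deserializer attached to the predictor. As an
assumed example (the CSV/JSON choice and the feature values below are illustrative, not part of the original
example), a numeric feature row can be sent as CSV and the response parsed as JSON:

.. code:: python

   from sagemaker.serializers import CSVSerializer
   from sagemaker.deserializers import JSONDeserializer

   # Assumed setup: send request rows as CSV, parse JSON responses
   serverless_predictor.serializer = CSVSerializer()
   serverless_predictor.deserializer = JSONDeserializer()

   # Illustrative feature vector; the shape depends on your model
   response = serverless_predictor.predict([1.5, 2.0, 3.25])
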
After inference, clean up the endpoint and model if they are no longer needed:

.. code:: python

   # Tears down the SageMaker endpoint and endpoint configuration
   serverless_predictor.delete_endpoint()

   # Deletes the SageMaker model
   serverless_predictor.delete_model()

For more details about ``ServerlessInferenceConfig``,
see the API docs for `Serverless Inference <https://sagemaker.readthedocs.io/en/stable/api/inference/serverless.html>`__.

*************************
SageMaker Batch Transform
*************************