#######################################
Use DJL with the SageMaker Python SDK
#######################################

With the SageMaker Python SDK, you can use the Deep Java Library to host models on Amazon SageMaker.

`Deep Java Library (DJL) Serving <https://docs.djl.ai/docs/serving/index.html>`_ is a high-performance, universal, stand-alone model serving solution powered by `DJL <https://docs.djl.ai/index.html>`_.
DJL Serving supports loading models trained with a variety of frameworks. With the SageMaker Python SDK, you can
use DJL Serving to host large models using backends such as DeepSpeed and HuggingFace Accelerate.

For information about supported versions of DJL Serving, see the `AWS documentation <https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html>`_.
We recommend that you use the latest supported version because that's where we focus our development efforts.

For general information about using the SageMaker Python SDK, see :ref:`overview:Using the SageMaker Python SDK`.

.. contents::

*******************
Deploy DJL models
*******************

With the SageMaker Python SDK, you can use DJL Serving to host models that have been saved in the HuggingFace pretrained format.
These can be models you have trained or fine-tuned yourself, or models that are publicly available on the HuggingFace Hub.
DJL Serving in the SageMaker Python SDK supports hosting models for the popular HuggingFace NLP tasks, as well as Stable Diffusion.

You can either deploy your model using the DeepSpeed or HuggingFace Accelerate backend explicitly, or let DJL Serving determine the best backend based on your model architecture and configuration.

.. code:: python

    # Import the DJL model class from the SageMaker Python SDK
    from sagemaker.djl_inference import DJLModel

    # Create a DJL Model; the backend is chosen automatically
    djl_model = DJLModel(
        "s3://my_bucket/my_saved_model_artifacts/",
        "my_sagemaker_role",
        data_type="fp16",
        task="text-generation",
        number_of_partitions=2,  # number of GPUs to partition the model across
    )

    # Deploy the model to an Amazon SageMaker Endpoint and get a Predictor
    predictor = djl_model.deploy("ml.g5.12xlarge",
                                 initial_instance_count=1)

If you want to use a specific backend, you can create an instance of the corresponding model class directly.

.. code:: python

    from sagemaker.djl_inference import DeepSpeedModel, HuggingFaceAccelerateModel

    # Create a model using the DeepSpeed backend
    deepspeed_model = DeepSpeedModel(
        "s3://my_bucket/my_saved_model_artifacts/",
        "my_sagemaker_role",
        data_type="bf16",
        task="text-generation",
        tensor_parallel_degree=2,  # number of GPUs to partition the model across using tensor parallelism
    )

    # Create a model using the HuggingFace Accelerate backend
    hf_accelerate_model = HuggingFaceAccelerateModel(
        "s3://my_bucket/my_saved_model_artifacts/",
        "my_sagemaker_role",
        data_type="fp16",
        task="text-generation",
        number_of_partitions=2,  # number of GPUs to partition the model across
    )

    # Deploy each model to an Amazon SageMaker Endpoint and get a Predictor
    deepspeed_predictor = deepspeed_model.deploy("ml.g5.12xlarge",
                                                 initial_instance_count=1)
    hf_accelerate_predictor = hf_accelerate_model.deploy("ml.g5.12xlarge",
                                                         initial_instance_count=1)

Regardless of which way you choose to create your model, ``deploy`` returns a ``Predictor`` object. You can use this ``Predictor``
to do inference against the endpoint hosting your DJLModel.

Each ``Predictor`` provides a ``predict`` method, which can do inference with JSON data, NumPy arrays, or Python lists.
Inference data is serialized and sent to the DJL Serving model server by an ``InvokeEndpoint`` SageMaker operation. The
``predict`` method returns the result of inference against your model.

By default, the inference data is serialized to a JSON string, and the inference result is a Python dictionary.

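For example, you can call ``predict`` with a Python dictionary. The exact payload schema depends on the task and on any custom inference handler you provide, so treat the request below as an illustrative sketch for a text-generation endpoint rather than a fixed contract:

.. code:: python

    # Illustrative payload; adjust the keys to match your task and handler.
    data = {
        "inputs": "Deploying large language models on SageMaker is",
        "parameters": {"max_new_tokens": 64},
    }

    # predict() serializes the dictionary to JSON, invokes the endpoint,
    # and returns the deserialized response as a Python dictionary.
    result = predictor.predict(data)
    print(result)
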
Model Directory Structure
=========================

There are two components needed to deploy DJL Serving models on SageMaker:

1. Model artifacts (required)
2. Inference code and model server properties (optional)

These are stored and handled separately. Model artifacts should not be stored with the custom inference code and
model server configuration.

Model Artifacts
---------------

DJL Serving models expect a different model structure than most of the other frameworks in the SageMaker Python SDK.
Specifically, DJLModels do not support loading models stored in tar.gz format.
You must provide an Amazon S3 URL pointing to uncompressed model artifacts (bucket and prefix).
This is because DJL Serving is optimized for large models, and it implements a fast downloading mechanism for large models that requires the artifacts to be uncompressed.

For example, let's say you want to deploy the EleutherAI/gpt-j-6B model available on the HuggingFace Hub.
You can download the model and upload it to S3 like this:

.. code::

    # Requires Git LFS
    git clone https://huggingface.co/EleutherAI/gpt-j-6B

    # Upload to S3
    aws s3 sync gpt-j-6B s3://my_bucket/gpt-j-6B

You would then pass "s3://my_bucket/gpt-j-6B" as ``model_s3_uri`` to the ``DJLModel``.

For language models, we expect that the model weights, model config, and tokenizer config are provided in S3. The model
should be loadable with the HuggingFace Transformers ``AutoModelFor<Task>.from_pretrained`` API, where ``<Task>``
is the NLP task you want to host the model for. The weights must be stored as PyTorch compatible checkpoints.

Example:

.. code::

    my_bucket/my_model/
    |- config.json
    |- added_tokens.json
    |- pytorch_model-*-of-*.bin  # model weights can be partitioned into multiple checkpoints
    |- tokenizer.json
    |- tokenizer_config.json
    |- vocab.json

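If you fine-tuned the model yourself, one way to produce artifacts in this layout is to save the model and tokenizer with the Transformers ``save_pretrained`` API and sync the resulting directory to S3. The model path and bucket below are placeholders; this is a minimal sketch, not part of the SageMaker Python SDK itself:

.. code:: python

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholders: substitute your own fine-tuned model and output path.
    model = AutoModelForCausalLM.from_pretrained("path/to/my-finetuned-model")
    tokenizer = AutoTokenizer.from_pretrained("path/to/my-finetuned-model")

    # Writes config.json, the tokenizer files, and PyTorch checkpoint shards
    # in the layout shown above.
    model.save_pretrained("my_model")
    tokenizer.save_pretrained("my_model")

You can then upload the directory with ``aws s3 sync my_model s3://my_bucket/my_model`` and pass that S3 URL to the ``DJLModel``.
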
For Stable Diffusion models, the model should be loadable from the HuggingFace Diffusers ``DiffusionPipeline.from_pretrained`` API.

Inference code and Model Server Properties
------------------------------------------

You can provide custom inference code and model server configuration by specifying the ``source_dir`` and
``entry_point`` arguments of the ``DJLModel``. These are not required. The model server configuration can be generated
based on the arguments passed to the constructor, and we provide default inference handler code for DeepSpeed,
HuggingFace Accelerate, and Stable Diffusion. You can find these handler implementations in the `DJL Serving GitHub repository <https://github.com/deepjavalibrary/djl-serving/tree/master/engines/python/setup/djl_python>`_.

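For example, here is a minimal sketch of wiring custom code into a ``DJLModel``. The directory and file names are placeholders, and ``source_dir`` and ``entry_point`` are the constructor arguments described above:

.. code:: python

    djl_model = DJLModel(
        "s3://my_bucket/my_saved_model_artifacts/",
        "my_sagemaker_role",
        task="text-generation",
        source_dir="sourcedir",   # local directory (or S3 location) with custom code and config
        entry_point="script.py",  # inference handler script within source_dir
    )
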
You can find documentation for the model server configurations on the `DJL Serving Docs website <https://docs.djl.ai/docs/serving/serving/docs/configurations.html>`_.

The code and configuration you want to deploy can be stored either locally or in S3. These files are bundled into
a tar.gz file that is uploaded to SageMaker.

For example:

.. code::

    sourcedir/
    |- script.py          # Inference handler code
    |- serving.properties # Model server configuration file
    |- requirements.txt   # Additional Python requirements, installed at runtime from PyPI

In the above example, ``sourcedir`` is bundled and compressed into a tar.gz file and uploaded as part of creating the inference endpoint.

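As a rough illustration, a ``serving.properties`` file for a DeepSpeed deployment might look like the sketch below. The key names follow the DJL Serving configuration documentation linked above; verify them against the container version you deploy before relying on them:

.. code::

    # Sketch of a serving.properties file for a DeepSpeed deployment
    engine=DeepSpeed
    option.entryPoint=djl_python.deepspeed
    option.tensor_parallel_degree=2
    option.dtype=fp16
    option.task=text-generation
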
The DJL Serving Model Server
============================

The endpoint you create with ``deploy`` runs the DJL Serving model server.
The model server loads the model from S3 and performs inference on the model in response to SageMaker ``InvokeEndpoint`` API calls.

DJL Serving is highly customizable. You can control aspects of both model loading and model serving. Most of the model server
configuration is exposed through the ``DJLModel`` API. The SageMaker Python SDK uses the values it is passed to
create the proper configuration file used when creating the inference endpoint. You can optionally provide your own
``serving.properties`` file via the ``source_dir`` argument. You can find documentation about ``serving.properties`` in the
`DJL Serving documentation on model-specific settings <https://docs.djl.ai/docs/serving/serving/docs/configurations.html#model-specific-settings>`_.

Within the SageMaker Python SDK, DJL Serving is used in Python mode. This allows you to provide your inference and
data processing scripts in Python. For details on how to write custom inference and data processing code, see
the `DJL Serving documentation on Python mode <https://docs.djl.ai/docs/serving/serving/docs/modes.html#python-mode>`_.

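As a rough sketch, a Python mode handler is a script that exposes a ``handle`` function built on the ``Input`` and ``Output`` helpers from the ``djl_python`` package that ships in the DJL Serving container. The model-loading logic below is a placeholder; see the Python mode documentation linked above for the authoritative interface:

.. code:: python

    # script.py -- sketch of a custom Python mode handler for DJL Serving
    from djl_python import Input, Output

    model = None  # loaded lazily on the first request


    def load_model(properties):
        # Placeholder: load your model here using the properties passed in
        # by the model server (for example, model location and data type).
        ...


    def handle(inputs: Input):
        global model
        if model is None:
            model = load_model(inputs.get_properties())

        if inputs.is_empty():
            # Warm-up request from the model server; nothing to return.
            return None

        data = inputs.get_as_json()
        # Placeholder: run inference with your model on `data` here.
        prediction = {"generated_text": "..."}
        return Output().add_as_json(prediction)
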
For more information about DJL Serving, see the `DJL Serving documentation <https://docs.djl.ai/docs/serving/index.html>`_.

***********************
SageMaker DJL Classes
***********************

For information about the different DJL Serving related classes in the SageMaker Python SDK, see https://sagemaker.readthedocs.io/en/stable/sagemaker.djl_inference.html.

********************************
SageMaker DJL Serving Containers
********************************

For information about the SageMaker DJL Serving containers, see:

- `Deep Learning Container (DLC) Images <https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html>`_ and `release notes <https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/dlc-release-notes.html>`_