
Commit 86d96e0

update DJLModel documentation on docs site
1 parent e7b8154 commit 86d96e0

5 files changed · +27 −242 lines changed

doc/frameworks/djl/sagemaker.djl_inference.rst

+2 −26

@@ -5,39 +5,15 @@ DJL Classes
 DJLModel
 ---------------------------
 
-.. autoclass:: sagemaker.djl_inference.model.DJLModel
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
-DeepSpeedModel
----------------------------
-
-.. autoclass:: sagemaker.djl_inference.model.DeepSpeedModel
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
-HuggingFaceAccelerateModel
----------------------------
-
-.. autoclass:: sagemaker.djl_inference.model.HuggingFaceAccelerateModel
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
-FasterTransformerModel
----------------------------
-
-.. autoclass:: sagemaker.djl_inference.model.FasterTransformerModel
+.. autoclass:: sagemaker.djl_inference.DJLModel
     :members:
     :undoc-members:
     :show-inheritance:
 
 DJLPredictor
 ---------------------------
 
-.. autoclass:: sagemaker.djl_inference.model.DJLPredictor
+.. autoclass:: sagemaker.djl_inference.DJLPredictor
     :members:
     :undoc-members:
     :show-inheritance:
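Note: the directives now document ``sagemaker.djl_inference.DJLModel`` and ``sagemaker.djl_inference.DJLPredictor`` directly, rather than the ``.model`` submodule, which implies the classes can be imported from the package itself. A minimal sketch, assuming that import path:

.. code:: python

    # Assumed import path implied by the updated autoclass targets above
    from sagemaker.djl_inference import DJLModel, DJLPredictor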

doc/frameworks/djl/using_djl.rst

+21 −210

@@ -2,14 +2,11 @@
 Use DJL with the SageMaker Python SDK
 #######################################
 
-With the SageMaker Python SDK, you can use Deep Java Library to host models on Amazon SageMaker.
-
 `Deep Java Library (DJL) Serving <https://docs.djl.ai/docs/serving/index.html>`_ is a high performance universal stand-alone model serving solution powered by `DJL <https://docs.djl.ai/index.html>`_.
 DJL Serving supports loading models trained with a variety of different frameworks. With the SageMaker Python SDK you can
-use DJL Serving to host large models using backends like DeepSpeed and HuggingFace Accelerate.
+use DJL Serving to host large language models for text-generation and text-embedding use-cases.
 
-For information about supported versions of DJL Serving, see the `AWS documentation <https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html>`_.
-We recommend that you use the latest supported version because that's where we focus our development efforts.
+You can learn more about Large Model Inference using DJLServing on the `docs site <https://docs.djl.ai/docs/serving/serving/docs/lmi/index.html>`_.
 
 For general information about using the SageMaker Python SDK, see :ref:`overview:Using the SageMaker Python SDK`.
 
@@ -19,238 +16,52 @@ For general information about using the SageMaker Python SDK, see :ref:`overview
 Deploy DJL models
 *******************
 
-With the SageMaker Python SDK, you can use DJL Serving to host models that have been saved in the HuggingFace pretrained format.
+With the SageMaker Python SDK, you can use DJL Serving to host text-generation and text-embedding models that have been saved in the HuggingFace pretrained format.
 These can either be models you have trained/fine-tuned yourself, or models available publicly from the HuggingFace Hub.
-DJL Serving in the SageMaker Python SDK supports hosting models for the popular HuggingFace NLP tasks, as well as Stable Diffusion.
-
-You can either deploy your model using DeepSpeed, FasterTransformer, or HuggingFace Accelerate, or let DJL Serving determine the best backend based on your model architecture and configuration.
 
 .. code:: python
 
-    # Create a DJL Model, backend is chosen automatically
+    # DJLModel will infer which container to use, and apply some starter configuration
     djl_model = DJLModel(
-        "s3://my_bucket/my_saved_model_artifacts/", # This can also be a HuggingFace Hub model id
-        "my_sagemaker_role",
-        dtype="fp16",
+        model_id="<hf hub id | s3 uri>",
+        role="my_sagemaker_role",
         task="text-generation",
-        number_of_partitions=2 # number of gpus to partition the model across
     )
 
     # Deploy the model to an Amazon SageMaker Endpoint and get a Predictor
     predictor = djl_model.deploy("ml.g5.12xlarge",
                                  initial_instance_count=1)
 
-If you want to use a specific backend, then you can create an instance of the corresponding model directly.
+Alternatively, you can provide full specifications to the DJLModel to have full control over the model configuration:
 
 .. code:: python
 
-    # Create a model using the DeepSpeed backend
-    deepspeed_model = DeepSpeedModel(
-        "s3://my_bucket/my_saved_model_artifacts/", # This can also be a HuggingFace Hub model id
-        "my_sagemaker_role",
-        dtype="bf16",
-        task="text-generation",
-        tensor_parallel_degree=2, # number of gpus to partition the model across using tensor parallelism
-    )
-
-    # Create a model using the HuggingFace Accelerate backend
-
-    hf_accelerate_model = HuggingFaceAccelerateModel(
-        "s3://my_bucket/my_saved_model_artifacts/", # This can also be a HuggingFace Hub model id
-        "my_sagemaker_role",
-        dtype="fp16",
-        task="text-generation",
-        number_of_partitions=2, # number of gpus to partition the model across
-    )
-
-    # Create a model using the FasterTransformer backend
-
-    fastertransformer_model = FasterTransformerModel(
-        "s3://my_bucket/my_saved_model_artifacts/", # This can also be a HuggingFace Hub model id
-        "my_sagemaker_role",
-        data_type="fp16",
+    djl_model = DJLModel(
+        model_id="<hf hub id | s3 uri>",
+        role="my_sagemaker_role",
         task="text-generation",
-        tensor_parallel_degree=2, # number of gpus to partition the model across
+        engine="Python",
+        env={
+            "OPTION_ROLLING_BATCH": "lmi-dist",
+            "OPTION_DTYPE": "bf16",
+            "MAX_BATCH_SIZE": "256",
+        },
+        image_uri=<djl lmi image uri>,
     )
 
-    # Deploy the model to an Amazon SageMaker Endpoint and get a Predictor
-    deepspeed_predictor = deepspeed_model.deploy("ml.g5.12xlarge",
-                                                 initial_instance_count=1)
-    hf_accelerate_predictor = hf_accelerate_model.deploy("ml.g5.12xlarge",
-                                                         initial_instance_count=1)
-    fastertransformer_predictor = fastertransformer_model.deploy("ml.g5.12xlarge",
-                                                                 initial_instance_count=1)
-
-Regardless of which way you choose to create your model, a ``Predictor`` object is returned. You can use this ``Predictor``
-to do inference on the endpoint hosting your DJLModel.
-
 Each ``Predictor`` provides a ``predict`` method, which can do inference with json data, numpy arrays, or Python lists.
 Inference data are serialized and sent to the DJL Serving model server by an ``InvokeEndpoint`` SageMaker operation. The
 ``predict`` method returns the result of inference against your model.
 
 By default, the inference data is serialized to a json string, and the inference result is a Python dictionary.
 
-Model Directory Structure
-=========================
-
-There are two components that are needed to deploy DJL Serving Models on Sagemaker.
-1. Model Artifacts (required)
-2. Inference code and Model Server Properties (optional)
-
-These are stored and handled separately. Model artifacts should not be stored with the custom inference code and
-model server configuration.
-
-Model Artifacts
----------------
-
-DJL Serving supports two ways to load models for inference.
-1. A HuggingFace Hub model id.
-2. Uncompressed model artifacts stored in a S3 bucket.
-
-HuggingFace Hub model id
-^^^^^^^^^^^^^^^^^^^^^^^^
-
-Using a HuggingFace Hub model id is the easiest way to get started with deploying Large Models via DJL Serving on SageMaker.
-DJL Serving will use this model id to download the model at runtime via the HuggingFace Transformers ``from_pretrained`` API.
-This method makes it easy to deploy models quickly, but for very large models the download time can become unreasonable.
-
-For example, you can deploy the EleutherAI gpt-j-6B model like this:
-
-.. code::
-
-    model = DJLModel(
-        "EleutherAI/gpt-j-6B",
-        "my_sagemaker_role",
-        dtype="fp16",
-        number_of_partitions=2
-    )
-
-    predictor = model.deploy("ml.g5.12xlarge")
-
-Uncompressed Model Artifacts stored in a S3 bucket
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-For models that are larger than 20GB (total checkpoint size), we recommend that you store the model in S3.
-Download times will be much faster compared to downloading from the HuggingFace Hub at runtime.
-DJL Serving Models expect a different model structure than most of the other frameworks in the SageMaker Python SDK.
-Specifically, DJLModels do not support loading models stored in tar.gz format.
-This is because DJL Serving is optimized for large models, and it implements a fast downloading mechanism for large models that require the artifacts be uncompressed.
-
-For example, lets say you want to deploy the EleutherAI/gpt-j-6B model available on the HuggingFace Hub.
-You can download the model and upload to S3 like this:
-
-.. code::
-
-    # Requires Git LFS
-    git clone https://huggingface.co/EleutherAI/gpt-j-6B
-
-    # Upload to S3
-    aws s3 sync gpt-j-6B s3://my_bucket/gpt-j-6B
-
-You would then pass "s3://my_bucket/gpt-j-6B" as ``model_id`` to the ``DJLModel`` like this:
-
-.. code::
-
-    model = DJLModel(
-        "s3://my_bucket/gpt-j-6B",
-        "my_sagemaker_role",
-        dtype="fp16",
-        number_of_partitions=2
-    )
-
-    predictor = model.deploy("ml.g5.12xlarge")
-
-For language models we expect that the model weights, model config, and tokenizer config are provided in S3. The model
-should be loadable from the HuggingFace Transformers AutoModelFor<Task>.from_pretrained API, where task
-is the NLP task you want to host the model for. The weights must be stored as PyTorch compatible checkpoints.
-
-Example:
-
-.. code::
-
-    my_bucket/my_model/
-    |- config.json
-    |- added_tokens.json
-    |- config.json
-    |- pytorch_model-*-of-*.bin # model weights can be partitioned into multiple checkpoints
-    |- tokenizer.json
-    |- tokenizer_config.json
-    |- vocab.json
-
-For Stable Diffusion models, the model should be loadable from the HuggingFace Diffusers DiffusionPipeline.from_pretrained API.
-
-Inference code and Model Server Properties
-------------------------------------------
-
-You can provide custom inference code and model server configuration by specifying the ``source_dir`` and
-``entry_point`` arguments of the ``DJLModel``. These are not required. The model server configuration can be generated
-based on the arguments passed to the constructor, and we provide default inference handler code for DeepSpeed,
-HuggingFaceAccelerate, and Stable Diffusion. You can find these handler implementations in the `DJL Serving Github repository. <https://github.com/deepjavalibrary/djl-serving/tree/master/engines/python/setup/djl_python>`_
-
-You can find documentation for the model server configurations on the `DJL Serving Docs website <https://docs.djl.ai/docs/serving/serving/docs/configurations.html>`_.
-
-The code and configuration you want to deploy can either be stored locally or in S3. These files will be bundled into
-a tar.gz file that will be uploaded to SageMaker.
-
-For example:
-
-.. code::
-
-    sourcedir/
-    |- script.py # Inference handler code
-    |- serving.properties # Model Server configuration file
-    |- requirements.txt # Additional Python requirements that will be installed at runtime via PyPi
-
-In the above example, sourcedir will be bundled and compressed into a tar.gz file and uploaded as part of creating the Inference Endpoint.
-
-The DJL Serving Model Server
-============================
-
-The endpoint you create with ``deploy`` runs the DJL Serving model server.
-The model server loads the model from S3 and performs inference on the model in response to SageMaker ``InvokeEndpoint`` API calls.
-
-DJL Serving is highly customizable. You can control aspects of both model loading and model serving. Most of the model server
-configuration are exposed through the ``DJLModel`` API. The SageMaker Python SDK will use the values it is passed to
-create the proper configuration file used when creating the inference endpoint. You can optionally provide your own
-``serving.properties`` file via the ``source_dir`` argument. You can find documentation about serving.properties in the
-`DJL Serving Documentation for model specific settings. <https://docs.djl.ai/docs/serving/serving/docs/configurations.html#model-specific-settings>`_
-
-Within the SageMaker Python SDK, DJL Serving is used in Python mode. This allows users to provide their inference script,
-and data processing scripts in python. For details on how to write custom inference and data processing code, please
-see the `DJL Serving Documentation on Python Mode. <https://docs.djl.ai/docs/serving/serving/docs/modes.html#python-mode>`_
-
-For more information about DJL Serving, see the `DJL Serving documentation. <https://docs.djl.ai/docs/serving/index.html>`_
-
-**************************
-Ahead of time partitioning
-**************************
-
-To optimize the deployment of large models that do not fit in a single GPU, the model’s tensor weights are partitioned at
-runtime and each partition is loaded in individual GPU. But runtime partitioning takes significant amount of time and
-memory on model loading. So, DJLModel offers an ahead of time partitioning capability for DeepSpeed and FasterTransformer
-engines, which lets you partition your model weights and save them before deployment. HuggingFace does not support
-tensor parallelism, so ahead of time partitioning cannot be done for it. In our experiment with GPT-J model, loading
-this model with partitioned checkpoints increased the model loading time by 40%.
-
-`partition` method invokes an Amazon SageMaker Training job to partition the model and upload those partitioned
-checkpoints to S3 bucket. You can either provide your desired S3 bucket to upload the partitioned checkpoints or it will be
-uploaded to the default SageMaker S3 bucket. Please note that this S3 bucket will be remembered for deployment. When you
-call `deploy` method after partition, DJLServing downloads the partitioned model checkpoints directly from the uploaded
-s3 url, if available.
-
-.. code::
-
-    # partitions the model using Amazon Sagemaker Training Job.
-    djl_model.partition("ml.g5.12xlarge")
+**************************************
+DJL Serving for Large Model Inference
+**************************************
 
-    predictor = deepspeed_model.deploy("ml.g5.12xlarge",
-                                       initial_instance_count=1)
+You can learn more about using DJL Serving for Large Model Inference use-cases on our `documentation site <https://docs.djl.ai/docs/serving/serving/docs/lmi/index.html>`_.
 
-***********************
-SageMaker DJL Classes
-***********************
 
-For information about the different DJL Serving related classes in the SageMaker Python SDK, see https://sagemaker.readthedocs.io/en/stable/frameworks/djl/sagemaker.djl_inference.html.
 
 ********************************
 SageMaker DJL Serving Containers
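The rewritten guide keeps the ``Predictor``/``predict`` description from the previous text. As a rough illustration only (not part of this commit), the endpoint created in the new example might be queried as in the sketch below; the ``inputs``/``parameters`` payload shape is an assumption based on common DJL LMI text-generation handlers, so check the documentation for the container you deploy:

.. code:: python

    # Hypothetical follow-up to the deploy() call shown in the updated guide.
    # The Predictor serializes this dict to JSON and, by default, returns the
    # inference result as a Python dictionary.
    result = predictor.predict(
        {
            "inputs": "What is Deep Java Library?",
            "parameters": {"max_new_tokens": 128, "temperature": 0.7},
        }
    )
    print(result)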

src/sagemaker/djl_inference/model.py

−2

@@ -108,8 +108,6 @@ def __init__(
             **kwargs: Keyword arguments passed to the superclass
                 :class:`~sagemaker.model.FrameworkModel` and, subsequently, its
                 superclass :class:`~sagemaker.model.Model`.
-
-        .. tip::
         """
         super(DJLModel, self).__init__(predictor_cls=predictor_cls, **kwargs)
         self.model_id = model_id
tests/unit/sagemaker/serve/model_server/djl_serving/test_djl_prepare.py

+2 −2

@@ -13,7 +13,7 @@
 from __future__ import absolute_import
 
 from unittest import TestCase
-from unittest.mock import Mock, PropertyMock, patch, mock_open, ANY
+from unittest.mock import Mock, PropertyMock, patch, mock_open
 
 from sagemaker.serve.model_server.djl_serving.prepare import (
     _copy_jumpstart_artifacts,
@@ -186,4 +186,4 @@ def test_extract_js_resources_success(self, mock_tarfile, mock_path):
 
         mock_path.assert_called_once_with(js_model_dir)
         mock_path_obj.joinpath.assert_called_once_with(f"infer-prepack-{MOCK_JUMPSTART_ID}.tar.gz")
-        mock_resource_obj.extractall.assert_called_once_with(path=code_dir, members=ANY)
+        mock_resource_obj.extractall.assert_called_once_with(path=code_dir, filter="data")

tests/unit/sagemaker/serve/model_server/tgi/test_tgi_prepare.py

+2 −2

@@ -13,7 +13,7 @@
 from __future__ import absolute_import
 
 from unittest import TestCase
-from unittest.mock import Mock, PropertyMock, patch, ANY
+from unittest.mock import Mock, PropertyMock, patch
 
 from sagemaker.serve.model_server.tgi.prepare import (
     _create_dir_structure,
@@ -168,4 +168,4 @@ def test_extract_js_resources_success(self, mock_tarfile, mock_path):
 
         mock_path.assert_called_once_with(js_model_dir)
         mock_path_obj.joinpath.assert_called_once_with(f"infer-prepack-{MOCK_JUMPSTART_ID}.tar.gz")
-        mock_resource_obj.extractall.assert_called_once_with(path=code_dir, members=ANY)
+        mock_resource_obj.extractall.assert_called_once_with(path=code_dir, filter="data")
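Both test files now expect ``extractall`` to be called with ``filter="data"`` rather than matching ``members=ANY``. For context, a minimal sketch of the underlying tarfile call (paths are placeholders, not from this commit; the ``data`` filter is the standard-library extraction filter introduced in Python 3.12 and backported to earlier maintenance releases):

.. code:: python

    import tarfile

    # Sketch only: extract a prepacked model archive into a code directory.
    # The "data" filter rejects unsafe members (absolute paths, links escaping
    # the destination, device files) during extraction.
    with tarfile.open("infer-prepack-model.tar.gz") as archive:
        archive.extractall(path="code_dir", filter="data")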
