@@ -221,6 +221,31 @@ see the `DJL Serving Documentation on Python Mode. <https://docs.djl.ai/docs/ser
For more information about DJL Serving, see the `DJL Serving documentation <https://docs.djl.ai/docs/serving/index.html>`_.
+ **************************
+ Ahead of time partitioning
+ **************************
+
+ To optimize the deployment of large models that do not fit on a single GPU, the model's tensor weights are partitioned at
+ runtime and each partition is loaded onto an individual GPU. However, runtime partitioning adds a significant amount of
+ time and memory to model loading. To avoid this cost, ``DJLModel`` offers an ahead-of-time partitioning capability for the
+ DeepSpeed and FasterTransformer engines, which lets you partition your model weights and save them before deployment. The
+ HuggingFace engine does not support tensor parallelism, so ahead-of-time partitioning is not available for it. In our
+ experiment with the GPT-J model, loading the model from partitioned checkpoints reduced the model loading time by 40%.
+
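+ The partitioning workflow below operates on a ``DJLModel`` object named ``djl_model``. As a point of reference, the
+ following is a minimal sketch of one way such a model might be constructed; the model artifact location, role name, and
+ keyword arguments (``dtype``, ``task``, ``number_of_partitions``) are illustrative assumptions and may differ depending on
+ your SDK version.
+
+ .. code::
+
+     from sagemaker.djl_inference import DJLModel
+
+     # Illustrative values only: substitute your own model artifacts and IAM role,
+     # and check your SDK version for the supported constructor arguments.
+     djl_model = DJLModel(
+         "s3://my_bucket/my_model_artifacts/",  # uncompressed model artifacts in S3
+         "my_sagemaker_execution_role",         # IAM role with SageMaker permissions
+         dtype="fp16",                          # load the weights in half precision
+         task="text-generation",
+         number_of_partitions=4,                # number of GPUs to shard the weights across
+     )
+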
+ The ``partition`` method invokes an Amazon SageMaker training job to partition the model and upload the partitioned
+ checkpoints to an S3 bucket. You can either provide the S3 bucket to which the partitioned checkpoints should be uploaded,
+ or they will be uploaded to the default SageMaker S3 bucket. Note that this S3 bucket is remembered for deployment: when
+ you call the ``deploy`` method after partitioning, DJL Serving downloads the partitioned model checkpoints directly from
+ the uploaded S3 URL, if available.
+
+ .. code::
+
+     # Partition the model using an Amazon SageMaker training job and upload
+     # the partitioned checkpoints to S3.
+     djl_model.partition("ml.g5.12xlarge")
+
+     # Deploy the model using the partitioned checkpoints.
+     predictor = djl_model.deploy("ml.g5.12xlarge",
+                                  initial_instance_count=1)
+
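+ If you want the partitioned checkpoints uploaded to a specific S3 location rather than the default SageMaker bucket, you
+ can pass your own destination to ``partition``. The keyword argument name used below (``s3_output_uri``) is an assumption
+ for illustration; consult the ``partition`` signature in your SDK version for the exact parameter.
+
+ .. code::
+
+     # The keyword name below is hypothetical; check DJLModel.partition in your
+     # SDK version for the actual parameter that sets the S3 upload location.
+     djl_model.partition(
+         "ml.g5.12xlarge",
+         s3_output_uri="s3://my_bucket/aot-partitioned-checkpoints/",
+     )
+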
***********************
SageMaker DJL Classes
***********************