
Commit 3bce287

schinmayee authored and pintaoz-aws committed
Usage docs for training recipes (#1630)
* Feature: Support GPU training recipes with SageMaker Python SDK (#1516)
* v0 estimator for launching kandinsky training
* code cleanup
* option to override git repos for kandinsky for testing purposes
* update dependencies
* update comment
* formatting fixes
* style fixes
* code cleanup
* Add warning messages for ignored arguments
* cleanup, address comments
* fix
* clone launcher repo only if necessary
* add a cleanup method to call after fit
* fix docstring
* fix warning
* cleanup update
* fix
* code style fix
* rename cleanup method for clarity
* missed change
* move cleanup to when object is destroyed
* add unit tests
* formatting fix
* removing tests which don't work as recipe repos are private
* removing tests which don't work as recipe repos are private
* resolve comments
* resolve comments
* Feature: Support Neuron training recipes. (#1526)
* Feature: Resolve recipes correctly before launching (#1529)
* fix to work with launcher recipes
* fix suffix for temp file
* fix path and error message
* fix for recipes from launcher
* resolve recipes correctly
* fix imports
* reformat message to avoid code-doc test issue
* code style fix
* code style fix
* code style fix
* code style fix
* code style fix
* code style fix
* code style fix
* code style fix
* code style fix
* doc formatting
* check if resolver exists before registering
* Feature: Add unit tests for recipes and minor bug fixes. (#1532)
* basic checks and unit test for recipes
* More testing for recipes. Move recipe overrides to top before accessing any recipe fields.
* check that we use customer provided image uri if it is set
* reformat
* test fixes
* update git urls for recipes
* revert to ssh git urls for recipes
* Feature: Move image uris and git repos for training recipes to json (#1547)
* Update MANIFEST.in so that wheel builds correctly (#1563)
* Remove default values for fields in recipe_overrides and fix recipe path. (#1566)
* add optional source dir for recipes, copy training code and requirements to source dir
* diff names for recipe file and local script option
* format and add unit test
* make entry point script and recipe file temp files that can be gc'ed
* formatting and fix
* test fix
* test fixes
* format fix
* break function up because it is too long
* fixes
* fix
* fix
* remove references to launcher and adapter dir as we copy out everything needed into source dir
* reformat
* copy all directory contents for trainium as there is more than one source file
* fix
* fix
* remove debugging message
* Change default source directory to current, add option to specify source dir (#1593)
* update to public uris for hyperpod recipe repos and smp image
* fixes
* remove debug copies
* change caps for env vars
* skip some tests for now
* format
* neuron json for retrieving images
* update training_recipes.json
* add unit test
* reformat
* fix long line
* add source dir check when using training recipe
* adding more regions
* reformat
* doc update
* doc update
* doc update
* doc update
* fix capitalization issues
* fix capitalization issues
* doc check issue
1 parent fdf2e9a commit 3bce287

File tree

2 files changed: +131 −12 lines changed


doc/frameworks/pytorch/using_pytorch.rst (+125 −7)
@@ -21,12 +21,9 @@ To train a PyTorch model by using the SageMaker Python SDK:
 .. |create pytorch estimator| replace:: Create a ``sagemaker.pytorch.PyTorch`` Estimator
 .. _create pytorch estimator: #create-an-estimator
 
-.. |call fit| replace:: Call the estimator's ``fit`` method
-.. _call fit: #call-the-fit-method
-
-1. `Prepare a training script <#prepare-a-pytorch-training-script>`_
+1. `Prepare a training script <#prepare-a-pytorch-training-script>`_ OR `Choose an Amazon SageMaker HyperPod recipe`_
 2. |create pytorch estimator|_
-3. |call fit|_
+3. `Call the estimator's fit method or ModelTrainer's train method`_
 
 Prepare a PyTorch Training Script
 =================================
@@ -175,6 +172,16 @@ see `AWS Deep Learning Containers <https://github.com/aws/deep-learning-containe
 - `Images for HuggingFace <https://github.com/aws/deep-learning-containers/tree/master/huggingface>`__
 
 
+Choose an Amazon SageMaker HyperPod recipe
+==========================================
+
+Alternatively, instead of using your own training script, you can choose an
+`Amazon SageMaker HyperPod recipe <https://github.com/aws/sagemaker-hyperpod-recipes>`_ to launch training for a supported model.
+If you use a recipe, you do not need to provide your own training script; you only need to determine
+which recipe you want to run. You can modify a recipe as explained in the next section.
+
+
+
 Create an Estimator
 ===================
 
@@ -196,10 +203,121 @@ directories ('train' and 'test').
                            'test': 's3://my-data-bucket/path/to/my/test/data'})
 
 
+Amazon SageMaker HyperPod recipes
+---------------------------------
+Alternatively, if you are using Amazon SageMaker HyperPod recipes, follow these instructions:
 
+Prerequisites: you need ``git`` installed on your client to access the Amazon SageMaker HyperPod recipes code.
 
-Call the fit Method
-===================
+When using a recipe, you must set the ``training_recipe`` arg in place of providing a training script.
+This can be a recipe from `here <https://github.com/aws/sagemaker-hyperpod-recipes>`_,
+a local file, or a custom URL. Please note that you must override the following using
+``recipe_overrides``:
+
+* directory paths for the local container in the recipe as appropriate for the Python SDK
+* the output S3 URIs
+* the Hugging Face access token
+* any other recipe fields you wish to edit
+
+The code snippet below shows an example.
+Please refer to `SageMaker docs <https://docs.aws.amazon.com/sagemaker/latest/dg/model-train-storage.html>`_
+for more details about the expected local paths in the container, and see the Amazon SageMaker
+HyperPod recipes tutorial for more examples.
+You can override the fields by either setting ``recipe_overrides`` or by
+providing a modified ``training_recipe`` through a local file or a custom URL.
+When using a recipe, any provided ``entry_point`` will be ignored.
+
+SageMaker will automatically set up the distribution args.
+It will also determine the image to use for your model and device type,
+but you can override this with the ``image_uri`` arg.
+
+You can also override the number of nodes in the recipe with the ``instance_count`` arg to the estimator.
+``source_dir`` will default to the current working directory unless specified.
+A local copy of the training scripts and the recipe will be saved in the ``source_dir``.
+You can specify any additional packages you want to install for training in an optional ``requirements.txt`` in the ``source_dir``.
+
+Note: for Llama 3.2 multi-modal models, you need to upgrade the transformers library by providing a ``requirements.txt`` in the ``source_dir`` with ``transformers==4.45.2``.
+Please refer to the Amazon SageMaker HyperPod recipes documentation for more details.
+
+
+Here is an example usage for the recipe ``hf_llama3_8b_seq8k_gpu_p5x16_pretrain``.
+
+.. code:: python
+
+    recipe_overrides = {
+        "run": {
+            "results_dir": "/opt/ml/model",
+        },
+        "exp_manager": {
+            "exp_dir": "",
+            "explicit_log_dir": "/opt/ml/output/tensorboard",
+            "checkpoint_dir": "/opt/ml/checkpoints",
+        },
+        "model": {
+            "data": {
+                "train_dir": "/opt/ml/input/data/train",
+                "val_dir": "/opt/ml/input/data/val",
+            },
+        },
+    }
+    pytorch_estimator = PyTorch(
+        output_path=output_path,
+        base_job_name="llama-recipe",
+        role=role,
+        instance_type="ml.p5.48xlarge",
+        training_recipe="hf_llama3_8b_seq8k_gpu_p5x16_pretrain",
+        recipe_overrides=recipe_overrides,
+        sagemaker_session=sagemaker_session,
+        tensorboard_output_config=tensorboard_output_config,
+    )
+    pytorch_estimator.fit({'train': 's3://my-data-bucket/path/to/my/training/data',
+                           'test': 's3://my-data-bucket/path/to/my/test/data'})
+
+    # Or alternatively with ModelTrainer
+    recipe_overrides = {
+        "run": {
+            "results_dir": "/opt/ml/model",
+        },
+        "exp_manager": {
+            "exp_dir": "",
+            "explicit_log_dir": "/opt/ml/output/tensorboard",
+            "checkpoint_dir": "/opt/ml/checkpoints",
+        },
+        "model": {
+            "data": {
+                "train_dir": "/opt/ml/input/data/train",
+                "val_dir": "/opt/ml/input/data/val",
+            },
+        },
+    }
+
+    model_trainer = ModelTrainer.from_recipe(
+        output_path=output_path,
+        base_job_name="llama-recipe",
+        training_recipe="training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain",
+        recipe_overrides=recipe_overrides,
+        compute=Compute(instance_type="ml.p5.48xlarge"),
+        sagemaker_session=sagemaker_session,
+    ).with_tensorboard_output_config(
+        tensorboard_output_config=tensorboard_output_config
+    )
+
+    train_input = Input(
+        channel_name="train",
+        data_source="s3://my-data-bucket/path/to/my/training/data",
+    )
+
+    test_input = Input(
+        channel_name="test",
+        data_source="s3://my-data-bucket/path/to/my/test/data",
+    )
+
+    model_trainer.train(input_data_config=[train_input, test_input])
+
+
+Call the estimator's fit method or ModelTrainer's train method
+==============================================================
 
 You start your training script by calling ``fit`` on a ``PyTorch`` Estimator. ``fit`` takes both required and optional
 arguments.
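
For readers following the documentation added above, here is a minimal sketch of the full recipe flow with a ``source_dir`` and a ``requirements.txt``. The directory layout, ``role``, and ``sagemaker_session`` values are illustrative assumptions, not part of this commit:

.. code:: python

    from sagemaker.pytorch import PyTorch

    # Assumed local layout (illustrative):
    #   my_source/                  <- passed as source_dir; the SDK saves a local
    #                                  copy of the recipe and training scripts here
    #   my_source/requirements.txt  <- optional extra packages, e.g.
    #                                  transformers==4.45.2 for Llama 3.2 multi-modal

    recipe_overrides = {
        "run": {"results_dir": "/opt/ml/model"},
        "exp_manager": {
            "exp_dir": "",
            "explicit_log_dir": "/opt/ml/output/tensorboard",
            "checkpoint_dir": "/opt/ml/checkpoints",
        },
        "model": {
            "data": {
                "train_dir": "/opt/ml/input/data/train",
                "val_dir": "/opt/ml/input/data/val",
            },
        },
    }

    estimator = PyTorch(
        role=role,                            # assumed to be defined earlier
        instance_type="ml.p5.48xlarge",
        instance_count=2,                     # overrides the node count in the recipe
        training_recipe="hf_llama3_8b_seq8k_gpu_p5x16_pretrain",
        recipe_overrides=recipe_overrides,
        source_dir="my_source",               # defaults to the current working directory
        sagemaker_session=sagemaker_session,  # assumed to be defined earlier
    )
    estimator.fit({"train": "s3://my-data-bucket/path/to/my/training/data"})

Because ``training_recipe`` replaces the training script, no ``entry_point`` is passed; per the documentation above, the SDK resolves the image and distribution settings from the recipe.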

src/sagemaker/pytorch/estimator.py (+6 −5)
@@ -334,12 +334,13 @@ def __init__(
             compiler_config (:class:`~sagemaker.pytorch.TrainingCompilerConfig`):
                 Configures SageMaker Training Compiler to accelerate training.
 
-            training_recipe (str): Training recipe to use. This is a local file path,
-                a url to fetch, or a recipe provided by Saagemaker
-                training.
-
+            training_recipe (str): Training recipe to use. This is a local file path, a URL,
+                or a recipe provided by Amazon SageMaker HyperPod recipes,
+                such as training/llama/hf_llama3_70b_seq8k_gpu_p5x64_pretrain.
+                This is required when using recipes.
             recipe_overrides (Dict): Dictionary specifying key values to override in the
-                training_recipe.
+                training_recipe. This is optional when using
+                Amazon SageMaker HyperPod recipes.
 
             **kwargs: Additional kwargs passed to the :class:`~sagemaker.estimator.Framework`
                 constructor.
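
To make the revised docstring concrete, here is a short sketch of the three accepted forms of ``training_recipe``; the local path and URL values are hypothetical placeholders, and ``role`` is assumed to be defined earlier:

.. code:: python

    from sagemaker.pytorch import PyTorch

    # 1) A recipe provided by Amazon SageMaker HyperPod recipes:
    recipe = "training/llama/hf_llama3_70b_seq8k_gpu_p5x64_pretrain"

    # 2) A local file path (hypothetical):
    # recipe = "./my_recipe.yaml"

    # 3) A URL to fetch (hypothetical):
    # recipe = "https://example.com/recipes/my_recipe.yaml"

    estimator = PyTorch(
        role=role,
        instance_type="ml.p5.48xlarge",
        training_recipe=recipe,
        # recipe_overrides is optional; its keys mirror the recipe's structure
        recipe_overrides={"run": {"results_dir": "/opt/ml/model"}},
    )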
