
Commit 3bce287

schinmayee authored and pintaoz-aws committed
Usage docs for training recipes (#1630)
* Feature: Support GPU training recipes with SageMaker Python SDK (#1516)
* v0 estimator for launching kandinsky training
* code cleanup
* option to override git repos for kandinsky for testing purposes
* update dependencies
* update comment
* formatting fixes
* style fixes
* code cleanup
* Add warning messages for ignored arguments
* cleanup, address comments
* fix
* clone launcher repo only if necessary
* add a cleanup method to call after fit
* fix docstring
* fix warning
* cleanup update
* fix
* code style fix
* rename cleanup method for clarity
* missed change
* move cleanup to when object is destroyed
* add unit tests
* formatting fix
* removing tests which don't work as recipe repos are private
* removing tests which don't work as recipe repos are private
* resolve comments
* resolve comments
* Feature: Support Neuron training recipes. (#1526)
* Feature: Resolve recipes correctly before launching (#1529)
* fix to work with launcher recipes
* fix suffix for temp file
* fix path and error message
* fix for recipes from launcher
* resolve recipes correctly
* fix imports
* reformat message to avoid code-doc test issue
* code style fix
* code style fix
* code style fix
* code style fix
* code style fix
* code style fix
* code style fix
* code style fix
* code style fix
* doc formatting
* check if resolver exists before registering
* Feature: Add unit tests for recipes and minor bug fixes. (#1532)
* basic checks and unit test for recipes
* More testing for recipes. Move recipe overrides to top before accessing any recipe fields.
* check that we use customer provided image uri if it is set
* reformat
* test fixes
* update git urls for recipes
* revert to ssh git urls for recipes
* Feature: Move image uris and git repos for training recipes to json (#1547)
* Update MANIFEST.in so that wheel builds correctly (#1563)
* Remove default values for fields in recipe_overrides and fix recipe path. (#1566)
* add optional source dir for recipes, copy training code and requirements to source dir
* diff names for recipe file and local script option
* format and add unit test
* make entry point script and recipe file temp files that can be gc'ed
* formatting and fix
* test fix
* test fixes
* format fix
* break function up because it is too long
* fixes
* fix
* fix
* remove references to launcher and adapter dir as we copy out everything needed into source dir
* reformat
* copy all directory contents for trainium as there is more than one source file
* fix
* fix
* remove debugging message
* Change default source directory to current, add option to specify source dir (#1593)
* update to public uris for hyperpod recipe repos and smp image
* fixes
* remove debug copies
* change caps for env vars
* skip some tests for now
* format
* neuron json for retrieving images
* update training_recipes.json
* add unit test
* reformat
* fix long line
* add source dir check when using training recipe
* adding more regions
* reformat
* doc update
* doc update
* doc update
* doc update
* fix capitalization issues
* fix capitalization issues
* doc check issue
1 parent fdf2e9a commit 3bce287

File tree

2 files changed: +131 −12 lines changed


doc/frameworks/pytorch/using_pytorch.rst (+125 −7)
@@ -21,12 +21,9 @@ To train a PyTorch model by using the SageMaker Python SDK:
 .. |create pytorch estimator| replace:: Create a ``sagemaker.pytorch.PyTorch`` Estimator
 .. _create pytorch estimator: #create-an-estimator
 
-.. |call fit| replace:: Call the estimator's ``fit`` method
-.. _call fit: #call-the-fit-method
-
-1. `Prepare a training script <#prepare-a-pytorch-training-script>`_
+1. `Prepare a training script <#prepare-a-pytorch-training-script>`_ OR `Choose an Amazon SageMaker HyperPod recipe`_
 2. |create pytorch estimator|_
-3. |call fit|_
+3. `Call the estimator's fit method or ModelTrainer's train method`_
 
 Prepare a PyTorch Training Script
 =================================
@@ -175,6 +172,16 @@ see `AWS Deep Learning Containers <https://github.com/aws/deep-learning-containe
 - `Images for HuggingFace <https://github.com/aws/deep-learning-containers/tree/master/huggingface>`__
 
 
+Choose an Amazon SageMaker HyperPod recipe
+==========================================
+
+Alternatively, instead of using your own training script, you can choose an
+`Amazon SageMaker HyperPod recipe <https://github.com/aws/sagemaker-hyperpod-recipes>`_ to launch training for a supported model.
+If you use a recipe, you do not need to provide your own training script; you only need to determine
+which recipe you want to run. You can modify a recipe as explained in the next section.
+
+
+
 Create an Estimator
 ===================
 
@@ -196,10 +203,121 @@ directories ('train' and 'test').
                            'test': 's3://my-data-bucket/path/to/my/test/data'})
 
 
+Amazon SageMaker HyperPod recipes
+---------------------------------
+Alternatively, if you are using Amazon SageMaker HyperPod recipes, follow these instructions:
 
+Prerequisites: you need ``git`` installed on your client to access the Amazon SageMaker HyperPod recipes code.
 
-Call the fit Method
-===================
+When using a recipe, you must set the ``training_recipe`` arg in place of providing a training script.
+This can be a recipe from `here <https://github.com/aws/sagemaker-hyperpod-recipes>`_,
+a local file, or a custom URL. Please note that you must override the following using
+``recipe_overrides``:
+
+* directory paths for the local container in the recipe as appropriate for the Python SDK
+* the output S3 URIs
+* the Hugging Face access token
+* any other recipe fields you wish to edit
+
+The code snippet below shows an example.
+Please refer to `SageMaker docs <https://docs.aws.amazon.com/sagemaker/latest/dg/model-train-storage.html>`_
+for more details about the expected local paths in the container, and see the Amazon SageMaker
+HyperPod recipes tutorial for more examples.
+You can override the fields by either setting ``recipe_overrides`` or by
+providing a modified ``training_recipe`` through a local file or a custom URL.
+When using a recipe, any provided ``entry_point`` will be ignored.
+
+SageMaker will automatically set up the distribution args.
+It will also determine the image to use for your model and device type,
+but you can override this with the ``image_uri`` arg.
+
+You can also override the number of nodes in the recipe with the ``instance_count`` arg to the estimator.
+``source_dir`` will default to the current working directory unless specified.
+A local copy of the training scripts and the recipe will be saved in the ``source_dir``.
+You can specify any additional packages you want to install for training in an optional ``requirements.txt`` in the ``source_dir``.
+
+Note: for Llama 3.2 multi-modal models, you need to upgrade the transformers library by providing a ``requirements.txt`` in the ``source_dir`` with ``transformers==4.45.2``.
+Please refer to the Amazon SageMaker HyperPod recipes documentation for more details.
+
+
+Here is an example usage for the recipe ``hf_llama3_8b_seq8k_gpu_p5x16_pretrain``.
+
+.. code:: python
+
+    recipe_overrides = {
+        "run": {
+            "results_dir": "/opt/ml/model",
+        },
+        "exp_manager": {
+            "exp_dir": "",
+            "explicit_log_dir": "/opt/ml/output/tensorboard",
+            "checkpoint_dir": "/opt/ml/checkpoints",
+        },
+        "model": {
+            "data": {
+                "train_dir": "/opt/ml/input/data/train",
+                "val_dir": "/opt/ml/input/data/val",
+            },
+        },
+    }
+    pytorch_estimator = PyTorch(
+        output_path=output_path,
+        base_job_name="llama-recipe",
+        role=role,
+        instance_type="ml.p5.48xlarge",
+        training_recipe="hf_llama3_8b_seq8k_gpu_p5x16_pretrain",
+        recipe_overrides=recipe_overrides,
+        sagemaker_session=sagemaker_session,
+        tensorboard_output_config=tensorboard_output_config,
+    )
+    pytorch_estimator.fit({'train': 's3://my-data-bucket/path/to/my/training/data',
+                           'test': 's3://my-data-bucket/path/to/my/test/data'})
+
+    # Or alternatively with ModelTrainer
+    recipe_overrides = {
+        "run": {
+            "results_dir": "/opt/ml/model",
+        },
+        "exp_manager": {
+            "exp_dir": "",
+            "explicit_log_dir": "/opt/ml/output/tensorboard",
+            "checkpoint_dir": "/opt/ml/checkpoints",
+        },
+        "model": {
+            "data": {
+                "train_dir": "/opt/ml/input/data/train",
+                "val_dir": "/opt/ml/input/data/val",
+            },
+        },
+    }
+
+    model_trainer = ModelTrainer.from_recipe(
+        output_path=output_path,
+        base_job_name="llama-recipe",
+        training_recipe="training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain",
+        recipe_overrides=recipe_overrides,
+        compute=Compute(instance_type="ml.p5.48xlarge"),
+        sagemaker_session=sagemaker_session,
+    ).with_tensorboard_output_config(
+        tensorboard_output_config=tensorboard_output_config
+    )
+
+    train_input = Input(
+        channel_name="train",
+        data_source="s3://my-data-bucket/path/to/my/training/data",
+    )
+
+    test_input = Input(
+        channel_name="test",
+        data_source="s3://my-data-bucket/path/to/my/test/data",
+    )
+
+    model_trainer.train(input_data_config=[train_input, test_input])
+
+
+Call the estimator's fit method or ModelTrainer's train method
+==============================================================
 
 You start your training script by calling ``fit`` on a ``PyTorch`` Estimator. ``fit`` takes both required and optional
 arguments.
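
For readers following the documentation added above, here is a minimal sketch of the full recipe flow with a ``source_dir`` and a ``requirements.txt``. The directory layout, ``role``, and ``sagemaker_session`` values are illustrative assumptions, not part of this commit:

.. code:: python

    from sagemaker.pytorch import PyTorch

    # Assumed local layout (illustrative):
    #   my_source/                  <- passed as source_dir; the SDK saves a local
    #                                  copy of the recipe and training scripts here
    #   my_source/requirements.txt  <- optional extra packages, e.g.
    #                                  transformers==4.45.2 for Llama 3.2 multi-modal

    recipe_overrides = {
        "run": {"results_dir": "/opt/ml/model"},
        "exp_manager": {
            "exp_dir": "",
            "explicit_log_dir": "/opt/ml/output/tensorboard",
            "checkpoint_dir": "/opt/ml/checkpoints",
        },
        "model": {
            "data": {
                "train_dir": "/opt/ml/input/data/train",
                "val_dir": "/opt/ml/input/data/val",
            },
        },
    }

    estimator = PyTorch(
        role=role,                            # assumed to be defined earlier
        instance_type="ml.p5.48xlarge",
        instance_count=2,                     # overrides the node count in the recipe
        training_recipe="hf_llama3_8b_seq8k_gpu_p5x16_pretrain",
        recipe_overrides=recipe_overrides,
        source_dir="my_source",               # defaults to the current working directory
        sagemaker_session=sagemaker_session,  # assumed to be defined earlier
    )
    estimator.fit({"train": "s3://my-data-bucket/path/to/my/training/data"})

Because ``training_recipe`` replaces the training script, no ``entry_point`` is passed; per the documentation above, the SDK resolves the image and distribution settings from the recipe.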

src/sagemaker/pytorch/estimator.py (+6 −5)
@@ -334,12 +334,13 @@ def __init__(
             compiler_config (:class:`~sagemaker.pytorch.TrainingCompilerConfig`):
                 Configures SageMaker Training Compiler to accelerate training.
 
-            training_recipe (str): Training recipe to use. This is a local file path,
-                a url to fetch, or a recipe provided by Saagemaker
-                training.
-
+            training_recipe (str): Training recipe to use. This is a local file path, a URL,
+                or a recipe provided by Amazon SageMaker HyperPod recipes,
+                such as training/llama/hf_llama3_70b_seq8k_gpu_p5x64_pretrain.
+                This is required when using recipes.
             recipe_overrides (Dict): Dictionary specifying key values to override in the
-                training_recipe.
+                training_recipe. This is optional when using
+                Amazon SageMaker HyperPod recipes.
 
             **kwargs: Additional kwargs passed to the :class:`~sagemaker.estimator.Framework`
                 constructor.
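
To make the revised docstring concrete, here is a short sketch of the three accepted forms of ``training_recipe``; the local path and URL values are hypothetical placeholders, and ``role`` is assumed to be defined earlier:

.. code:: python

    from sagemaker.pytorch import PyTorch

    # 1) A recipe provided by Amazon SageMaker HyperPod recipes:
    recipe = "training/llama/hf_llama3_70b_seq8k_gpu_p5x64_pretrain"

    # 2) A local file path (hypothetical):
    # recipe = "./my_recipe.yaml"

    # 3) A URL to fetch (hypothetical):
    # recipe = "https://example.com/recipes/my_recipe.yaml"

    estimator = PyTorch(
        role=role,
        instance_type="ml.p5.48xlarge",
        training_recipe=recipe,
        # recipe_overrides is optional; its keys mirror the recipe's structure
        recipe_overrides={"run": {"results_dir": "/opt/ml/model"}},
    )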
