* Feature: Support GPU training recipes with Sagemaker Python SDK (#1516)
* v0 estimator for launching kandinsky training
* code cleanup
* option to over-ride git repos for kandinsky for testing purposes
* update dependencies
* update comment
* formatting fixes
* style fixes
* code cleanup
* Add warning messages for ignored arguments
* cleanup, address comments
* fix
* clone launcher repo only if necessary
* add a cleanup method to call after fit
* fix docstring
* fix warning
* cleanup update
* fix
* code style fix
* rename cleanup method for clarity
* missed change
* move cleanup to when object is destroyed
* add unit tests
* formatting fix
* removing tests which don't work as recipe repos are private
* removing tests which don't work as recipe repos are private
* resolve comments
* resolve comments
* Feature: Support Neuron training recipes. (#1526)
* Feature: Resolve recipes correctly before launching (#1529)
* fix to work with launcher recipes
* fix suffix for temp file
* fix path and error message
* fix for recipes from launcher
* resolve recipes correctly
* fix imports
* reformat message to avoid code-doc test issue
* code style fix
* code style fix
* code style fix
* code style fix
* code style fix
* code style fix
* code style fix
* code style fix
* code style fix
* doc formatting
* check if resolver exists before registering
* Feature: Add unit tests for recipes and minor bug fixes. (#1532)
* basic checks and unit test for recipes
* More testing for recipes. Move recipe overrides to top before accessing any recipe fields.
* check that we use customer provided image uri if it is set
* reformat
* test fixes
* update git urls for recipes
* revert to ssh git urls for recipes
* Feature: Move image uris and git repos for training recipes to json (#1547)
* Update MANIFEST.in so that wheel builds correctly (#1563)
* Remove default values for fields in recipe_overrides and fix recipe path. (#1566)
* add optional source dir for recipes, copy training code and requirements to source dir
* diff names for recipe file and local script option
* format and add unit test
* make entry point script and recipe file temp files that can be gced
* formatting and fix
* test fix
* test fixes
* format fix
* break function up because it is too long
* fixes
* fix
* fix
* remove references to launcher and adapter dir as we copy out everything needed into source dir
* reformat
* copy all directory contents for trainium as there is more than one source file
* fix
* fix
* remove debugging message
* Change default source directory to current, add option to specify source dir (#1593)
* update to public uris for hyperpod recipe repos and smp image
* fixes
* remove debug copies
* change caps for env vars
* skip some tests for now
* format
* neuron json for retrieving images
* update training_recipes.json
* add unit test
* reformat
* fix long line
* add source dir check when using training recipe
* adding more regions
* reformat
* doc update
* doc update
* doc update
* doc update
* fix capitalization issues
* fix capitalization issues
* doc check issue
Alternatively, if you are using Amazon SageMaker HyperPod recipes, you can follow the following instructions:
 
+Prerequisites: you need ``git`` installed on your client to access Amazon SageMaker HyperPod recipes code.
 
-Call the fit Method
-===================
+When using a recipe, you must set the ``training_recipe`` arg in place of providing a training script.
+This can be a recipe from `here <https://github.com/aws/sagemaker-hyperpod-recipes>`_
+or a local file or a custom url. Please note that you must override the following using
+``recipe_overrides``:
+
+* directory paths for the local container in the recipe as appropriate for Python SDK
+* the output s3 URIs
+* Huggingface access token
+* any other recipe fields you wish to edit
+
+The code snippet below shows an example.
+Please refer to `SageMaker docs <https://docs.aws.amazon.com/sagemaker/latest/dg/model-train-storage.html>`_
+for more details about the expected local paths in the container and the Amazon SageMaker
+HyperPod recipes tutorial for more examples.
+You can override the fields by either setting ``recipe_overrides`` or
+providing a modified ``training_recipe`` through a local file or a custom url.
+When using the recipe, any provided ``entry_point`` will be ignored.
+
+SageMaker will automatically set up the distribution args.
+It will also determine the image to use for your model and device type,
+but you can override this with the ``image_uri`` arg.
+
+You can also override the number of nodes in the recipe with the ``instance_count`` arg to estimator.
+``source_dir`` will default to current working directory unless specified.
+A local copy of training scripts and recipe will be saved in the ``source_dir``.
+You can specify any additional packages you want to install for training in an optional ``requirements.txt`` in the ``source_dir``.
+
+Note for llama3.2 multi-modal models, you need to upgrade transformers library by providing a ``requirements.txt`` in the source file with ``transformers==4.45.2``.
+Please refer to the Amazon SageMaker HyperPod recipes documentation for more details.
+
+
+Here is an example usage for recipe ``hf_llama3_8b_seq8k_gpu_p5x16_pretrain``.
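
The example this line introduces is not part of the diff shown here. As a rough sketch only, and not the example from the commit, the arguments described in the added text might be assembled as below. The names ``training_recipe``, ``recipe_overrides``, ``image_uri``, ``instance_count``, ``source_dir`` and ``entry_point`` come from the text. Everything else is an assumption made for illustration: the keys inside the overrides dict, the instance type and count (guessed from ``p5x16`` in the recipe name), the role ARN, the bucket, and the helper function names. The recipe's path inside the recipes repository is not shown here, so the bare recipe name may need a prefix.

```python
import tempfile
from pathlib import Path

# Recipe name taken from the text; its path inside the recipes
# repository is not shown there, so adjust this value as needed.
RECIPE = "hf_llama3_8b_seq8k_gpu_p5x16_pretrain"


def build_recipe_overrides(output_s3_uri, hf_token):
    """Build a nested ``recipe_overrides`` dict.

    All key names are hypothetical placeholders; use the field names
    from the recipe you launch. The /opt/ml paths are the standard
    SageMaker training container locations.
    """
    return {
        "run": {"results_dir": "/opt/ml/model"},
        "exp_manager": {"checkpoint_dir": "/opt/ml/checkpoints"},
        "model": {
            "hf_access_token": hf_token,
            "data": {"train_dir": "/opt/ml/input/data/train"},
        },
        "output_s3_uri": output_s3_uri,
    }


def write_requirements(source_dir, packages):
    """Write an optional requirements.txt into ``source_dir``."""
    path = Path(source_dir) / "requirements.txt"
    path.write_text("\n".join(packages) + "\n")
    return path


def build_estimator_kwargs(role, output_s3_uri, hf_token, source_dir):
    """Collect keyword arguments for ``sagemaker.pytorch.PyTorch``."""
    return {
        "role": role,
        "instance_type": "ml.p5.48xlarge",  # assumed from "p5" in the name
        "instance_count": 16,  # assumed; overrides the recipe's node count
        "output_path": output_s3_uri,
        "source_dir": str(source_dir),
        "training_recipe": RECIPE,
        "recipe_overrides": build_recipe_overrides(output_s3_uri, hf_token),
        # No entry_point: it would be ignored when a recipe is used.
        # image_uri is left out so that SageMaker chooses the image.
    }


def launch(kwargs, inputs=None):
    """Create the estimator and start training (needs AWS credentials)."""
    from sagemaker.pytorch import PyTorch  # imported lazily on purpose

    estimator = PyTorch(**kwargs)
    estimator.fit(inputs=inputs)
    return estimator


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # The pin below is the one the text gives for llama3.2
        # multi-modal models; it is used here only as sample content.
        write_requirements(tmp, ["transformers==4.45.2"])
        kwargs = build_estimator_kwargs(
            role="arn:aws:iam::111122223333:role/ExampleRole",  # placeholder
            output_s3_uri="s3://example-bucket/output",  # placeholder
            hf_token="<your-token>",  # placeholder
            source_dir=tmp,
        )
        print(sorted(kwargs))
```

Keeping the argument assembly separate from ``launch`` lets the overrides be inspected or unit tested without AWS access. Only ``launch`` needs the ``sagemaker`` package and credentials.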