Commit 0ebc6d8

benieric authored and pintaoz-aws committed
Update ModelTrainer Notebooks (#1597)
1 parent 7b564ad commit 0ebc6d8

4 files changed: +41 -26 lines changed

src/sagemaker/modules/testing_notebooks/base_model_trainer.ipynb

+37-25
Original file line number | Diff line number | Diff line change
@@ -10,28 +10,24 @@
1010
]
1111
},
1212
{
13-
"cell_type": "code",
14-
"execution_count": null,
15-
"metadata": {},
16-
"outputs": [],
17-
"source": [
18-
"! pip install \"pydantic>=2.0.0\" sagemaker-core"
19-
]
20-
},
21-
{
22-
"cell_type": "code",
23-
"execution_count": null,
13+
"cell_type": "markdown",
2414
"metadata": {},
25-
"outputs": [],
2615
"source": [
27-
"! pip install sagemaker-2.232.4.dev0.tar.gz"
16+
"# ModelTrainer\n",
17+
"The ModelTrainer is a new training interface designed to address many of the challenges in today's Estimator class. Key features include:\n",
18+
"\n",
19+
"1. Improved design principles that make the interface more user-friendly and approachable than the previous Estimator.\n",
20+
"2. Removal of the dependency on the Training Toolkit for Script Mode and BYOC, bringing the training toolkit’s runtime driver code into PySDK for a smoother one-stop experience and an extensible distributed runner.\n",
21+
"3. Training Recipe support for easier setup of LLM training with PySDK."
2822
]
2923
},
3024
{
3125
"cell_type": "markdown",
3226
"metadata": {},
3327
"source": [
34-
"## Script Mode Case - 1"
28+
"## ModelTrainer - Script Mode Case - 1\n",
29+
"\n",
30+
"This case showcases the minimal setup for a ModelTrainer. A user need only provide a training image and the commands to execute in the container via the `SourceCode` config object.\n"
3531
]
3632
},
3733
{
@@ -68,7 +64,9 @@
6864
"cell_type": "markdown",
6965
"metadata": {},
7066
"source": [
71-
"## Script Mode Case - 2"
67+
"## ModelTrainer - Script Mode Case - 2\n",
68+
"\n",
69+
"This case showcases an abstracted setup for script mode, where a user provides their training image and a `SourceCode` config object with the path to their `source_dir`, their `entry_script`, and any additional `requirements` to install in the training container."
7270
]
7371
},
7472
{
@@ -77,8 +75,11 @@
7775
"metadata": {},
7876
"outputs": [],
7977
"source": [
78+
"from sagemaker.modules.train import ModelTrainer\n",
8079
"from sagemaker.modules.configs import SourceCode\n",
8180
"\n",
81+
"pytorch_image = \"763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310\"\n",
82+
"\n",
8283
"source_code = SourceCode(\n",
8384
" source_dir=\"basic-script-mode\",\n",
8485
" requirements=\"requirements.txt\",\n",
@@ -244,7 +245,9 @@
244245
"cell_type": "markdown",
245246
"metadata": {},
246247
"source": [
247-
"## Distributed Training - Case 1"
248+
"## ModelTrainer - Distributed Training - Case 1\n",
249+
"\n",
250+
"This use case shows how a user could perform a more complex setup for distributed training using `torchrun`, specifying the commands to execute directly via the `command` parameter of the `SourceCode` class."
248251
]
249252
},
250253
{
@@ -262,8 +265,6 @@
262265
"env[\"NCCL_SOCKET_IFNAME\"] = \"eth0\"\n",
263266
"env[\"NCCL_IB_DISABLE\"] = \"1\"\n",
264267
"env[\"NCCL_DEBUG\"] = \"WARN\"\n",
265-
"env[\"FI_EFA_USE_DEVICE_RDMA\"] = \"1\"\n",
266-
"env[\"RDMAV_FORK_SAFE\"] = \"1\"\n",
267268
"\n",
268269
"compute = Compute(\n",
269270
" instance_count=1,\n",
@@ -332,7 +333,9 @@
332333
"cell_type": "markdown",
333334
"metadata": {},
334335
"source": [
335-
"## Distributed Training - Case 2"
336+
"## ModelTrainer - Distributed Training - Case 2\n",
337+
"\n",
338+
"This distributed training case showcases how a user could perform distributed training using an abstracted approach provided by the PySDK via the `Torchrun` distributed runner object."
336339
]
337340
},
338341
{
@@ -421,7 +424,9 @@
421424
"cell_type": "markdown",
422425
"metadata": {},
423426
"source": [
424-
"## Distributed Training - Case 3"
427+
"## ModelTrainer - Distributed Training - Case 3\n",
428+
"\n",
429+
"In this case, we show how a distributed training job can be set up and run using a more generic `MPI` distributed runner, with additional MPI options set through the `mpi_additional_options` parameter."
425430
]
426431
},
427432
{
@@ -472,7 +477,6 @@
472477
" entry_script=\"run_clm_no_trainer.py\",\n",
473478
")\n",
474479
"\n",
475-
"\n",
476480
"# Run using MPI\n",
477481
"mpi = MPI(\n",
478482
" mpi_additional_options=[\n",
@@ -510,14 +514,18 @@
510514
"cell_type": "markdown",
511515
"metadata": {},
512516
"source": [
513-
"# ModelTrainer Recipes"
517+
"# ModelTrainer Recipes\n",
518+
"\n",
519+
"Training recipes are an abstracted way to create a ModelTrainer for training LLMs via a recipe.yaml that holds configuration for the Trainer, model weights, etc."
514520
]
515521
},
516522
{
517523
"cell_type": "markdown",
518524
"metadata": {},
519525
"source": [
520-
"## Training Recipes - Case 1"
526+
"## ModelTrainer - Training Recipes - Case 1\n",
527+
"\n",
528+
"This example showcases how a user could use the SageMaker pre-defined training recipe `training/llama/hf_llama3_8b_seq8192_gpu` to train a Llama 3 model using synthetic data."
521529
]
522530
},
523531
{
@@ -580,7 +588,9 @@
580588
"cell_type": "markdown",
581589
"metadata": {},
582590
"source": [
583-
"## Training Recipes - Case 2"
591+
"## ModelTrainer - Training Recipes - Case 2\n",
592+
"\n",
593+
"This example showcases how a user can use the SageMaker recipe adapters to train a model with configurations from a custom local `custom-recipe.yaml`."
584594
]
585595
},
586596
{
@@ -615,7 +625,9 @@
615625
"cell_type": "markdown",
616626
"metadata": {},
617627
"source": [
618-
"## Training Recipes - Case 3"
628+
"## ModelTrainer - Training Recipes - Case 3\n",
629+
"\n",
630+
"This use case shows how a `neuronx-distributed-training` recipe from a GitHub URL can be trained on a Trainium instance using custom data in an S3 bucket."
619631
]
620632
},
621633
{
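The Distributed Training - Case 1 cell above sets NCCL environment variables and runs `torchrun` through the `command` parameter of `SourceCode`. A rough, standard-library-only sketch of that pattern follows. The helper names `nccl_env` and `torchrun_command`, the script name `train.py`, and the process count are hypothetical and not taken from the notebook; only the three environment variables appear in the diff.

```python
import shlex


def nccl_env(interface="eth0", debug_level="WARN"):
    """Return the NCCL settings kept by this commit (EFA/RDMA entries were dropped)."""
    return {
        "NCCL_SOCKET_IFNAME": interface,  # network interface NCCL should use
        "NCCL_IB_DISABLE": "1",           # disable the InfiniBand transport
        "NCCL_DEBUG": debug_level,        # NCCL log verbosity
    }


def torchrun_command(entry_script, nproc_per_node, nnodes=1, script_args=()):
    """Build a shell-safe torchrun invocation as a single string.

    Hypothetical helper: the notebook's real command is not visible in the diff.
    """
    parts = [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        entry_script,
        *script_args,
    ]
    # shlex.join quotes any argument that contains shell-sensitive characters.
    return shlex.join(parts)


if __name__ == "__main__":
    print(nccl_env())
    print(torchrun_command("train.py", nproc_per_node=8))
```

A string built this way could be supplied as the `command` value, with the dictionary passed as the job environment. How the notebook wires these together is not shown in this diff.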

src/sagemaker/modules/testing_notebooks/install_docker.sh renamed to src/sagemaker/modules/testing_notebooks/bootstrap.sh

+4-1
Original file line number | Diff line number | Diff line change
@@ -17,4 +17,7 @@ VERSION_STRING=5:20.10.24~3-0~ubuntu-jammy
1717
sudo apt-get install docker-ce-cli=$VERSION_STRING docker-compose-plugin -y
1818

1919
# validate the Docker Client is able to access Docker Server at [unix:///docker/proxy.sock]
20-
docker version
20+
docker version
21+
22+
pip install "pydantic>=2.0.0"
23+
pip install sagemaker-2.232.4.dev0.tar.gz
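Since bootstrap.sh now installs `pydantic>=2.0.0`, one way to confirm that constraint from Python after the script runs might look like the sketch below. The helper names `release_tuple`, `meets_minimum`, and `installed_meets_minimum` are hypothetical, and the comparison only looks at the leading numeric release components, so it is a simplification of full version semantics.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_string):
    """Return the leading numeric components, e.g. '2.232.4.dev0' -> (2, 232, 4)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"no numeric release found in {version_string!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """Compare two version strings by their numeric release components only."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '2.0' and '2.0.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_meets_minimum(package, minimum):
    """Return False when the package is missing or older than the minimum."""
    try:
        return meets_minimum(version(package), minimum)
    except PackageNotFoundError:
        return False


if __name__ == "__main__":
    print("pydantic>=2.0.0 satisfied:", installed_meets_minimum("pydantic", "2.0.0"))
```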
Binary file not shown.
Binary file not shown.