|
10 | 10 | ]
|
11 | 11 | },
|
12 | 12 | {
|
13 |
| - "cell_type": "code", |
14 |
| - "execution_count": null, |
15 |
| - "metadata": {}, |
16 |
| - "outputs": [], |
17 |
| - "source": [ |
18 |
| - "! pip install \"pydantic>=2.0.0\" sagemaker-core" |
19 |
| - ] |
20 |
| - }, |
21 |
| - { |
22 |
| - "cell_type": "code", |
23 |
| - "execution_count": null, |
| 13 | + "cell_type": "markdown", |
24 | 14 | "metadata": {},
|
25 |
| - "outputs": [], |
26 | 15 | "source": [
|
27 |
| - "! pip install sagemaker-2.232.4.dev0.tar.gz" |
| 16 | + "# ModelTrainer\n", |
| 17 | + "The ModelTrainer is a new interface for training designed to tackle many of the challenges that exist in todays Estimator class. Some key features include:\n", |
| 18 | + "\n", |
| 19 | + "1. Improved Design Principles making the interface much more user friendly and approachable than previous Estimator.\n", |
| 20 | + "2. Remove Dependency on Training-Toolkit for Script Mode and BYOC, and bring the training toolkit’s runtime driver code within PySDK for a smoother 1-stop and an extendible distributed runner.\n", |
| 21 | + "3. Training Recipe support for easier setup of training for LLMs with PySDK" |
28 | 22 | ]
|
29 | 23 | },
|
30 | 24 | {
|
31 | 25 | "cell_type": "markdown",
|
32 | 26 | "metadata": {},
|
33 | 27 | "source": [
|
34 |
| - "## Script Mode Case - 1" |
| 28 | + "## ModelTrainer - Script Mode Case - 1\n", |
| 29 | + "\n", |
| 30 | + "This case show cases the minimal setup for a ModelTrainer. A user need only to provide a desired training image and the commands they wish to execute in the container using the `SourceCode` class object config.\n" |
35 | 31 | ]
|
36 | 32 | },
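The code cell for this first case is elided from the hunk above. A minimal sketch of what it might contain, reusing the PyTorch image URI and `SourceCode`/`ModelTrainer` names that appear in the later cells; the `command` value and the bare `train()` call are illustrative assumptions, not the notebook's actual code:

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode

# Any pullable training image works; this PyTorch DLC URI comes from a later cell.
pytorch_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310"

# Minimal setup: just the commands to execute inside the container (illustrative).
source_code = SourceCode(command="echo 'Hello World' && env")

model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
)

model_trainer.train()
```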
|
37 | 33 | {
|
|
68 | 64 | "cell_type": "markdown",
|
69 | 65 | "metadata": {},
|
70 | 66 | "source": [
|
71 |
| - "## Script Mode Case - 2" |
| 67 | + "## ModelTrainer - Script Mode Case - 2\n", |
| 68 | + "\n", |
| 69 | + "This case show cases an abstracted setup for script mode where a user can provide their training image and a `SourceCode` object config with path to their `source_dir`, `enty_script`, and any additional `requirements` to install in the training container." |
72 | 70 | ]
|
73 | 71 | },
|
74 | 72 | {
|
|
77 | 75 | "metadata": {},
|
78 | 76 | "outputs": [],
|
79 | 77 | "source": [
|
| 78 | + "from sagemaker.modules.train import ModelTrainer\n", |
80 | 79 | "from sagemaker.modules.configs import SourceCode\n",
|
81 | 80 | "\n",
|
| 81 | + "pytorch_image = \"763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310\"\n", |
| 82 | + "\n", |
82 | 83 | "source_code = SourceCode(\n",
|
83 | 84 | " source_dir=\"basic-script-mode\",\n",
|
84 | 85 | " requirements=\"requirements.txt\",\n",
|
|
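The hunk above truncates mid-cell; presumably the remainder wires this `source_code` into a `ModelTrainer` and starts the job. A sketch of that remainder under stated assumptions (the entry-script name is a placeholder, since the actual name is elided above):

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode

pytorch_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310"

# Abstracted script mode: a source directory, an entry script, and requirements.
source_code = SourceCode(
    source_dir="basic-script-mode",
    requirements="requirements.txt",
    entry_script="train.py",  # placeholder; the real script name is elided above
)

model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
)

model_trainer.train()
```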
244 | 245 | "cell_type": "markdown",
|
245 | 246 | "metadata": {},
|
246 | 247 | "source": [
|
247 |
| - "## Distributed Training - Case 1" |
| 248 | + "## ModelTrainer - Distributed Training - Case 1\n", |
| 249 | + "\n", |
| 250 | + "This use cases shows how a user could perform a more complex setup for DistributedTraining using `torchrun` and setting up commands for the execution directly using the `command` parameter in the `SourceCode` class" |
248 | 251 | ]
|
249 | 252 | },
|
250 | 253 | {
|
|
262 | 265 | "env[\"NCCL_SOCKET_IFNAME\"] = \"eth0\"\n",
|
263 | 266 | "env[\"NCCL_IB_DISABLE\"] = \"1\"\n",
|
264 | 267 | "env[\"NCCL_DEBUG\"] = \"WARN\"\n",
|
265 |
| - "env[\"FI_EFA_USE_DEVICE_RDMA\"] = \"1\"\n", |
266 |
| - "env[\"RDMAV_FORK_SAFE\"] = \"1\"\n", |
267 | 268 | "\n",
|
268 | 269 | "compute = Compute(\n",
|
269 | 270 | " instance_count=1,\n",
|
|
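The visible hunk covers the NCCL environment variables and the start of the `Compute` config; the `command`-based `torchrun` wiring the markdown describes is elided. A sketch under stated assumptions (source dir, script name, and instance type are placeholders):

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute, SourceCode

pytorch_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310"

# NCCL settings carried over from the hunk above.
env = {
    "NCCL_SOCKET_IFNAME": "eth0",
    "NCCL_IB_DISABLE": "1",
    "NCCL_DEBUG": "WARN",
}

compute = Compute(
    instance_count=1,
    instance_type="ml.g5.12xlarge",  # placeholder GPU instance
)

# Drive torchrun directly through the SourceCode command.
source_code = SourceCode(
    source_dir="distributed-training",  # placeholder
    command="torchrun --nnodes 1 --nproc_per_node 4 train.py",  # placeholder
)

model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
    compute=compute,
    environment=env,
)

model_trainer.train()
```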
332 | 333 | "cell_type": "markdown",
|
333 | 334 | "metadata": {},
|
334 | 335 | "source": [
|
335 |
| - "## Distributed Training - Case 2" |
| 336 | + "## ModelTrainer - Distributed Training - Case 2\n", |
| 337 | + "\n", |
| 338 | + "This distributed training case showcases how a user could perform distributed training using an abstracted approach provided by the PySDK via the `Torchrun` distributed runner object." |
336 | 339 | ]
|
337 | 340 | },
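The code cell for this case is elided from the diff. A sketch of the abstracted runner approach, assuming the runner is passed through a `distributed` parameter on `ModelTrainer`; names other than `Torchrun` itself are placeholders:

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute, SourceCode
from sagemaker.modules.distributed import Torchrun

source_code = SourceCode(
    source_dir="distributed-training",  # placeholder
    entry_script="train.py",            # placeholder
)

# Let the PySDK's Torchrun runner set up torchrun across the instances.
model_trainer = ModelTrainer(
    training_image="763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310",
    source_code=source_code,
    compute=Compute(instance_count=2, instance_type="ml.g5.12xlarge"),  # placeholders
    distributed=Torchrun(),
)

model_trainer.train()
```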
|
338 | 341 | {
|
|
421 | 424 | "cell_type": "markdown",
|
422 | 425 | "metadata": {},
|
423 | 426 | "source": [
|
424 |
| - "## Distributed Training - Case 3" |
| 427 | + "## ModelTrainer - Distributed Training - Case 3\n", |
| 428 | + "\n", |
| 429 | + "In this case, we show how distributed training job can be setup and ran using a more generic `MPI` distributed runner with some additional mpi options set through the `mpi_additional_options` parameter." |
425 | 430 | ]
|
426 | 431 | },
|
427 | 432 | {
|
|
472 | 477 | " entry_script=\"run_clm_no_trainer.py\",\n",
|
473 | 478 | ")\n",
|
474 | 479 | "\n",
|
475 |
| - "\n", |
476 | 480 | "# Run using MPI\n",
|
477 | 481 | "mpi = MPI(\n",
|
478 | 482 | " mpi_additional_options=[\n",
|
|
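The hunk cuts off inside `mpi_additional_options`. A sketch of how this case might be completed; the mpirun flags shown are illustrative placeholders, not the notebook's actual values:

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute, SourceCode
from sagemaker.modules.distributed import MPI

source_code = SourceCode(
    source_dir="mpi-training",             # placeholder
    entry_script="run_clm_no_trainer.py",  # from the hunk above
)

# Run using MPI; these mpirun flags are illustrative only.
mpi = MPI(
    mpi_additional_options=[
        "-x",
        "NCCL_DEBUG=WARN",
    ]
)

model_trainer = ModelTrainer(
    training_image="763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310",
    source_code=source_code,
    compute=Compute(instance_count=2, instance_type="ml.p4d.24xlarge"),  # placeholder
    distributed=mpi,
)

model_trainer.train()
```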
510 | 514 | "cell_type": "markdown",
|
511 | 515 | "metadata": {},
|
512 | 516 | "source": [
|
513 |
| - "# ModelTrainer Recipes" |
| 517 | + "# ModelTrainer Recipes\n", |
| 518 | + "\n", |
| 519 | + "Training recipes is an abstracted way to create a ModelTrainer for training of LLM models via a recipe.yaml with configuration for the Trainer, model weights, etc." |
514 | 520 | ]
|
515 | 521 | },
|
516 | 522 | {
|
517 | 523 | "cell_type": "markdown",
|
518 | 524 | "metadata": {},
|
519 | 525 | "source": [
|
520 |
| - "## Training Recipes - Case 1" |
| 526 | + "## ModelTrainer - Training Recipes - Case 1\n", |
| 527 | + "\n", |
| 528 | + "This example showcases how a user could leverage SageMaker pre-defined training recipe `training/llama/hf_llama3_8b_seq8192_gpu` for training a llama3 model using synthetic data." |
521 | 529 | ]
|
522 | 530 | },
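The code cell for this recipe case is elided from the diff. A sketch assuming the PySDK exposes a `ModelTrainer.from_recipe` constructor that takes the recipe name and a compute config; the instance type is a placeholder:

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute

# Build a trainer from the pre-defined recipe; synthetic data means no input
# channels are needed.
model_trainer = ModelTrainer.from_recipe(
    training_recipe="training/llama/hf_llama3_8b_seq8192_gpu",
    compute=Compute(instance_count=1, instance_type="ml.p5.48xlarge"),  # placeholder
)

model_trainer.train()
```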
|
523 | 531 | {
|
|
580 | 588 | "cell_type": "markdown",
|
581 | 589 | "metadata": {},
|
582 | 590 | "source": [
|
583 |
| - "## Training Recipes - Case 2" |
| 591 | + "## ModelTrainer - Training Recipes - Case 2\n", |
| 592 | + "\n", |
| 593 | + "This example showcases how a user can leverage the sagemaker recipe adaptors to train a model with configurations in a custom local `custom-recipe.yaml`." |
584 | 594 | ]
|
585 | 595 | },
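Analogously, a local YAML file can presumably be passed where the built-in recipe name went. A sketch, with the same `from_recipe` assumption as above:

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute

# Point the recipe adapter at a local YAML file instead of a built-in recipe.
model_trainer = ModelTrainer.from_recipe(
    training_recipe="custom-recipe.yaml",
    compute=Compute(instance_count=1, instance_type="ml.p5.48xlarge"),  # placeholder
)

model_trainer.train()
```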
|
586 | 596 | {
|
|
615 | 625 | "cell_type": "markdown",
|
616 | 626 | "metadata": {},
|
617 | 627 | "source": [
|
618 |
| - "## Training Recipes - Case 3" |
| 628 | + "## ModelTrainer - Training Recipes - Case 3\n", |
| 629 | + "\n", |
| 630 | + "This usecase shows how a `neuronx-distributed-training` recipe from github url can be training on a trainum instance using custom data in an S3 bucket." |
619 | 631 | ]
|
620 | 632 | },
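A sketch for the Trainium case; the recipe URL, the S3 path, and the `InputData` channel wiring are all placeholders/assumptions:

```python
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import Compute, InputData

# Recipe URL and S3 path are placeholders; substitute real values.
model_trainer = ModelTrainer.from_recipe(
    training_recipe="https://github.com/<org>/neuronx-distributed-training/blob/main/<recipe>.yaml",
    compute=Compute(instance_count=1, instance_type="ml.trn1.32xlarge"),
)

model_trainer.train(
    input_data_config=[
        InputData(channel_name="train", data_source="s3://<bucket>/train/"),
    ],
)
```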
|
621 | 633 | {
|
|