Reinvent 2024 early #4946

pintaoz-aws · 2024-12-04T10:17:03Z

Issue #, if available:

Description of changes:

Testing done:

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

I have read the CONTRIBUTING doc
I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the Python SDK team
I used the commit message format described in CONTRIBUTING
I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
I have updated any necessary documentation, including READMEs and API docs (if appropriate)

Tests

I have added tests that prove my fix is effective or that my feature works (if appropriate)
I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes
I have checked that my tests are not configured for a specific region or account (if appropriate)
I have used unique_name_from_base to create resource names in integ tests (if appropriate)
If adding any dependency in requirements.txt files, I have spell-checked them and ensured they exist on PyPI

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

* feature: support script mode with local train.sh * Stop tracking train.sh and add it to .gitignore * update message * make dir if not exist * fix docs * fix: docstyle * Address comments * fix hyperparams * Revert pydantic custom error * pylint

* add in-process mode for DJL server * fix format * add inference_spec as a member of DJL * add the validations for model server * fix typo * fix test assertion * add unit-testing * have a common server for inprocess mode * fix failing tests * add support to torchserve * fix tests to include torchserve servers * use custom inference_spec code instead of HF pipelines * fix tests for app.py * fix unit test failure * fix format * use schema_builder for serialization and deserialization * remove task field * remove unused import

* Base model trainer (#1521) * Base model trainer * flake8 * add testing notebook * add param validation & set defaults * Implement simple train method * feature: support script mode with local train.sh (#1523) * feature: support script mode with local train.sh * Stop tracking train.sh and add it to .gitignore * update message * make dir if not exist * fix docs * fix: docstyle * Address comments * fix hyperparams * Revert pydantic custom error * pylint * Image Spec refactoring and updates (#1525) * Image Spec refactoring and updates * Unit tests and update function for Image Spec * Fix hugging face test * Fix Tests * Add unit tests for ModelTrainer (#1527) * Add unit tests for ModelTrainer * Flake8 * format * Add example notebook (#1528) * Add testing notebook * format * use smaller data * remove large dataset * update * pylint * flake8 * ignore docstyle in directories with test * format * format * Add enviornment variable bootstrapping script (#1530) * Add enviornment variables scripts * format * fix comment * add docstrings * fix comment * feature: add utility function to capture local snapshot (#1524) * local snapshot * Update pip list command * Remove function calls * Address comments * Address comments * Change to make Model Trainer return a Model Object * Fix * Cleanup * Support intelligent parameters (#1540) * Support intelligent parameters * fix codestyle * Revert Image Spec (#1541) * Cleanup ModelTrainer (#1542) * General image builder (#1546) * General image builder * General image builder * Fix codestyle * Fix codestyle * Move location * Add warnings * Add integ tests * Fix integ test * Fix integ test * Fix region error * Add region * Latest Container Image (#1545) * Latest Container Image * Test Fixes * Parameterized tests and some logic updates * Test fixes * Move to Image URI * Fixes for unit test * Fixes for unit test * Fix codestyle error checks * Cleanup ModelTrainer code (#1552) * Updates * feat: add pre-processing and post-processing logic to 
inference_spec (#1560) * add pre-processing and post-processing logic to inference_spec * fix format * make accept_type and content_type optional * remove accept_type and content_type from pre/post processing * correct typo * Add Distributed Training Support Model Trainer (#1536) * Add path to set Additional Settings in ModelTrainer (#1555) * Updates * Mask Sensitive Env Logs in Container (#1568) * Cleanup PR * Codestyle fixes * Update logic to use model parameter instead of model_path * Fixes * Fixes * Tests * Codestyle Fixes * Codestyle Fixes * Codestyle Fixes * Codestyle Fixes --------- Co-authored-by: Erick Benitez-Ramos <[email protected]> Co-authored-by: pintaoz-aws <[email protected]> Co-authored-by: Pravali Uppugunduri <[email protected]>

* Base model trainer (#1521) * Base model trainer * flake8 * add testing notebook * add param validation & set defaults * Implement simple train method * feature: support script mode with local train.sh (#1523) * feature: support script mode with local train.sh * Stop tracking train.sh and add it to .gitignore * update message * make dir if not exist * fix docs * fix: docstyle * Address comments * fix hyperparams * Revert pydantic custom error * pylint * Image Spec refactoring and updates (#1525) * Image Spec refactoring and updates * Unit tests and update function for Image Spec * Fix hugging face test * Fix Tests * Add unit tests for ModelTrainer (#1527) * Add unit tests for ModelTrainer * Flake8 * format * Add example notebook (#1528) * Add testing notebook * format * use smaller data * remove large dataset * update * pylint * flake8 * ignore docstyle in directories with test * format * format * Add enviornment variable bootstrapping script (#1530) * Add enviornment variables scripts * format * fix comment * add docstrings * fix comment * feature: add utility function to capture local snapshot (#1524) * local snapshot * Update pip list command * Remove function calls * Address comments * Address comments * Support intelligent parameters (#1540) * Support intelligent parameters * fix codestyle * Revert Image Spec (#1541) * Cleanup ModelTrainer (#1542) * General image builder (#1546) * General image builder * General image builder * Fix codestyle * Fix codestyle * Move location * Add warnings * Add integ tests * Fix integ test * Fix integ test * Fix region error * Add region * Latest Container Image (#1545) * Latest Container Image * Test Fixes * Parameterized tests and some logic updates * Test fixes * Move to Image URI * Fixes for unit test * Fixes for unit test * Fix codestyle error checks * Cleanup ModelTrainer code (#1552) * feat: add pre-processing and post-processing logic to inference_spec (#1560) * add pre-processing and post-processing logic to 
inference_spec * fix format * make accept_type and content_type optional * remove accept_type and content_type from pre/post processing * correct typo * Add Distributed Training Support Model Trainer (#1536) * Add path to set Additional Settings in ModelTrainer (#1555) * Support building image from Dockerfile * Fix test * Fix test * Rename functions --------- Co-authored-by: Erick Benitez-Ramos <[email protected]> Co-authored-by: Gokul Anantha Narayanan <[email protected]> Co-authored-by: Pravali Uppugunduri <[email protected]>

* Base model trainer (#1521) * Base model trainer * flake8 * add testing notebook * add param validation & set defaults * Implement simple train method * feature: support script mode with local train.sh (#1523) * feature: support script mode with local train.sh * Stop tracking train.sh and add it to .gitignore * update message * make dir if not exist * fix docs * fix: docstyle * Address comments * fix hyperparams * Revert pydantic custom error * pylint * Image Spec refactoring and updates (#1525) * Image Spec refactoring and updates * Unit tests and update function for Image Spec * Fix hugging face test * Fix Tests * Add unit tests for ModelTrainer (#1527) * Add unit tests for ModelTrainer * Flake8 * format * Add example notebook (#1528) * Add testing notebook * format * use smaller data * remove large dataset * update * pylint * flake8 * ignore docstyle in directories with test * format * format * Add enviornment variable bootstrapping script (#1530) * Add enviornment variables scripts * format * fix comment * add docstrings * fix comment * feature: add utility function to capture local snapshot (#1524) * local snapshot * Update pip list command * Remove function calls * Address comments * Address comments * Support intelligent parameters (#1540) * Support intelligent parameters * fix codestyle * Revert Image Spec (#1541) * Cleanup ModelTrainer (#1542) * Initial Prototype * General image builder (#1546) * General image builder * General image builder * Fix codestyle * Fix codestyle * Move location * Add warnings * Add integ tests * Fix integ test * Fix integ test * Fix region error * Add region * Unified deploying in ModelBuilder * Latest Container Image (#1545) * Latest Container Image * Test Fixes * Parameterized tests and some logic updates * Test fixes * Move to Image URI * Fixes for unit test * Fixes for unit test * Fix codestyle error checks * Address PR comments * Address Codestyle errors * Cleanup ModelTrainer code (#1552) * Black format * Codestyle 
changes * Codestyle changes * from __future__ import absolute_import * DocString formatting * Black formatting * Address PR comments * Noteboook changes and fixes * feat: add pre-processing and post-processing logic to inference_spec (#1560) * add pre-processing and post-processing logic to inference_spec * fix format * make accept_type and content_type optional * remove accept_type and content_type from pre/post processing * correct typo * Add Distributed Training Support Model Trainer (#1536) * Add path to set Additional Settings in ModelTrainer (#1555) * Checkstyle Fixes * Address PR comments * Fixes * Merge Fixes * Codestyle Fixes * Codestyle Fixes * Codestyle Fixes * Codestyle Fixes * Codestyle Fixes * Update Docstring --------- Co-authored-by: Erick Benitez-Ramos <[email protected]> Co-authored-by: pintaoz-aws <[email protected]> Co-authored-by: Pravali Uppugunduri <[email protected]>

* Parameterized intelligent defaults tests * Parameterized intelligent defaults tests * Parameterized intelligent defaults tests * Tests for all Model Builder deployment modes * Fix * CodeStyle Fixes * CodeStyle Fixes * Add Deepdiff dependency * Add Deepdiff dependency * Add Codestyle fix

* change: fix the file uploading signature verification error **Description** The URL contains a character (+) that was not escaped properly. Fixed by removing the conditional logic for escaping the character. **Testing** 1. Changed UT passed 2. Tested in sample notebook * **Description** Changed from x-mlapp-sm-app-server-arn to x-sagemaker-partner-app-server-arn. Also made some small formatting adjustments to the signing context information. **Testing Done** UT passed --------- Co-authored-by: Edward Sun <[email protected]>
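
For reviewers unfamiliar with the escaping issue described in the commit above, here is a minimal sketch of why an unescaped `+` can break signature verification. It is not the SDK's actual code: `encode_component` is a hypothetical helper, and unconditional percent-encoding is only one reading of the fix. The point is that a literal `+` in a URL may be decoded as a space on the receiving side, so the signed string and the received string can differ unless `+` is always sent as `%2B`.

```python
from urllib.parse import quote, unquote_plus


def encode_component(value: str) -> str:
    """Percent-encode every reserved character, including '+' and '/'.

    Hypothetical helper for illustration; safe="" means no reserved
    character is left unescaped.
    """
    return quote(value, safe="")


raw = "a+b/c d"
encoded = encode_component(raw)
print(encoded)  # a%2Bb%2Fc%20d

# Form-style decoding turns an unescaped '+' into a space, which would
# change the string being verified; the escaped form round-trips intact.
print(unquote_plus(raw))      # 'a b/c d' (the '+' is lost)
print(unquote_plus(encoded))  # 'a+b/c d'
```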

* v0 estimator for launching kandinksy training * code cleanup * option to over-ride git repos for kandinsky for testing purposes * update dependencies * update comment * formatting fixes * style fixes * code cleanup * Add warning messages for ingored arguments * cleanup, address comments * fix * clone launcher repo only if necessary * add a cleanup method to call after fit * fix docstring * fix warning * cleanup update * fix * code style fix * rename cleanup method for clarity * missed change * move cleanup to when object is destroyed * add unit tests * formatting fix * removing tests which don't work as recipe repos are private * removing tests which don't work as recipe repos are private * resolve comments * resolve comments

* fix to work with launcher recipes * fix suffix for temp file * fix path and error message * fix for recipes from launcher * resolve recipes correctly * fix imports * reformat message to avoid code-doc test issue * code style fix * code style fix * code style fix * code style fix * code style fix * code style fix * code style fix * code style fix * code style fix * doc formatting * check if resolver exists before registering

* basic checks and unit test for recipes * More testing for recipes. Move recipe overrides to top before accessing any recipe fields. * check that we use customer provided image uri if it is set * reformat * test fixes * update git urls for recipes * revert to ssh git urls for recipes

* Feature: Support GPU training recipes with Sagemaker Python SDK (#1516) * v0 estimator for launching kandinksy training * code cleanup * option to over-ride git repos for kandinsky for testing purposes * update dependencies * update comment * formatting fixes * style fixes * code cleanup * Add warning messages for ingored arguments * cleanup, address comments * fix * clone launcher repo only if necessary * add a cleanup method to call after fit * fix docstring * fix warning * cleanup update * fix * code style fix * rename cleanup method for clarity * missed change * move cleanup to when object is destroyed * add unit tests * formatting fix * removing tests which don't work as recipe repos are private * removing tests which don't work as recipe repos are private * resolve comments * resolve comments * Feature: Support Neuron training recipes. (#1526) * Feature: Resolve recipes correctly before launching (#1529) * fix to work with launcher recipes * fix suffix for temp file * fix path and error message * fix for recipes from launcher * resolve recipes correctly * fix imports * reformat message to avoid code-doc test issue * code style fix * code style fix * code style fix * code style fix * code style fix * code style fix * code style fix * code style fix * code style fix * doc formatting * check if resolver exists before registering * Feature: Add unit tests for recipes and minor bug fixes. (#1532) * basic checks and unit test for recipes * More testing for recipes. Move recipe overrides to top before accessing any recipe fields. * check that we use customer provided image uri if it is set * reformat * test fixes * update git urls for recipes * revert to ssh git urls for recipes * Feature: Move image uris and git repos for training recipes to json (#1547) * Update MANIFEST.in so that wheel builds correctly (#1563) * Remove default values for fields in recipe_overrides and fix recipe path. 
(#1566) * add optional source dir for recipes, copy training code and requirements to source dir * diff names for recipe file and local script option * format and add unit test * make entry point script and recipe file temp files that can be gced * formatting and fix * test fix * test fixes * format fix * break function up because it is too long * fixes * fix * fix * remove references to launcher and adapter dir as we copy out everything needed into source dir * reformat * copy all directory contents for trainium as there is more than one source file * fix * fix * remove debugging message * Change default source directory to current, add option to specify source dir (#1593) * update to public uris for hyperpod recipe repos and smp image * fixes * remove debug copies * change caps for env vars * skip some tests for now * format * neuron json for retrieving images * update training_recipes.json * add unit test * reformat * fix long line * add source dir check when using training recipe * adding more regions * reformat * doc update * doc update * doc update * doc update * fix capitalization issues * fix capitalization issues * doc check issue

codecov · 2024-12-04T11:18:09Z

Codecov Report

Attention: Patch coverage is 68.52130% with 628 lines in your changes missing coverage. Please review.

Project coverage is 86.54%. Comparing base (6333914) to head (ad3538b).
Report is 187 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...rc/sagemaker/modules/local_core/local_container.py | 50.20% | 124 Missing ⚠️ |
| ...maker/modules/train/container_drivers/mpi_utils.py | 0.00% | 119 Missing ⚠️ |
| src/sagemaker/pytorch/estimator.py | 67.33% | 49 Missing ⚠️ |
| src/sagemaker/modules/train/model_trainer.py | 86.11% | 44 Missing ⚠️ |
| ...sagemaker/modules/train/container_drivers/utils.py | 58.16% | 41 Missing ⚠️ |
| .../serve/model_server/in_process_model_server/app.py | 56.04% | 40 Missing ⚠️ |
| ...les/train/container_drivers/scripts/environment.py | 71.73% | 39 Missing ⚠️ |
| src/sagemaker/modules/train/sm_recipes/utils.py | 74.82% | 35 Missing ⚠️ |
| ...les/train/container_drivers/basic_script_driver.py | 0.00% | 29 Missing ⚠️ |
| src/sagemaker/serve/builder/model_builder.py | 85.35% | 23 Missing ⚠️ |
| ... and 8 more | | |

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #4946      +/-   ##
==========================================
- Coverage   87.35%   86.54%   -0.81%     
==========================================
  Files         418      438      +20     
  Lines       40549    42374    +1825     
==========================================
+ Hits        35421    36673    +1252     
- Misses       5128     5701     +573
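
The percentages in the diff above can be reproduced from the hit and line counts. The sketch below assumes coverage is hits divided by lines and that the report truncates to two decimals rather than rounding. Truncation is an inference from the figures shown, since rounding the head value would give 86.55 instead of 86.54. `truncated_pct` is a hypothetical helper, not part of Codecov.

```python
def truncated_pct(hits: int, lines: int) -> float:
    """Coverage percentage truncated (not rounded) to two decimals.

    Integer arithmetic avoids floating-point surprises: the count is
    scaled to hundredths of a percent first, then divided back down.
    """
    return (hits * 10000 // lines) / 100


base = truncated_pct(35421, 40549)  # master
head = truncated_pct(36673, 42374)  # this PR
delta = round(head - base, 2)

print(base, head, delta)  # 87.35 86.54 -0.81

# Misses are simply lines minus hits, matching the last row of the diff.
print(40549 - 35421, 42374 - 36673)  # 5128 5701
```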


benieric and others added 30 commits December 4, 2024 01:26

Base model trainer (#1521)

e267244

* Base model trainer * flake8 * add testing notebook * add param validation & set defaults * Implement simple train method

Image Spec refactoring and updates (#1525)

fba3285

* Image Spec refactoring and updates * Unit tests and update function for Image Spec * Fix hugging face test * Fix Tests

Add unit tests for ModelTrainer (#1527)

6a0224f

* Add unit tests for ModelTrainer * Flake8 * format

Add example notebook (#1528)

7446b09

* Add testing notebook * format * use smaller data * remove large dataset * update * pylint * flake8 * ignore docstyle in directories with test * format * format

Add enviornment variable bootstrapping script (#1530)

cb7af78

* Add enviornment variables scripts * format * fix comment * add docstrings * fix comment

feature: add utility function to capture local snapshot (#1524)

80a1b89

* local snapshot * Update pip list command * Remove function calls * Address comments * Address comments

Support intelligent parameters (#1540)

93a3c6d

* Support intelligent parameters * fix codestyle

Revert Image Spec (#1541)

4fe8738

Cleanup ModelTrainer (#1542)

72e4266

General image builder (#1546)

89edb6d

* General image builder * General image builder * Fix codestyle * Fix codestyle * Move location * Add warnings * Add integ tests * Fix integ test * Fix integ test * Fix region error * Add region

Latest Container Image (#1545)

b40a499

* Latest Container Image * Test Fixes * Parameterized tests and some logic updates * Test fixes * Move to Image URI * Fixes for unit test * Fixes for unit test * Fix codestyle error checks

Cleanup ModelTrainer code (#1552)

c3f432c

feat: add pre-processing and post-processing logic to inference_spec (#…

2e17bcb

…1560) * add pre-processing and post-processing logic to inference_spec * fix format * make accept_type and content_type optional * remove accept_type and content_type from pre/post processing * correct typo

Add Distributed Training Support Model Trainer (#1536)

21a11a9

Add path to set Additional Settings in ModelTrainer (#1555)

a406f64

Mask Sensitive Env Logs in Container (#1568)

8cc19a3

Fix bug in script mode setup ModelTrainer (#1575)

a8ed4ec

Feature: ModelBuilder supports HuggingFace Models with benchmark data…

ce55d45

… and deployment configs (#1572)

Simplify Config Class Names and DistributedRunner structures (#1573)

1ad75c9

Remove ignored files

2aad9cd

Pass hyperparameters as CLI args (#1577)

24b0dc0

Add Support for Training Recipes (#1565)

debcdc2

Co-authored-by: Gokul Anantha Narayanan <[email protected]>

Use exact python path in trainer template (#1584)

51fb427

Add recipes examples (#1582)

6b90f89

update notebooks (#1588)

67f535d

nargokul and others added 18 commits December 4, 2024 01:49

remove example notebooks artifacts (#1634)

98d1d23

feat: Partner App Auth Provider for SDK support (#1548)

96db5c7

Co-authored-by: Edward Sun <[email protected]>

Feature: Support Neuron training recipes. (#1526)

13e10c9

Feature: Move image uris and git repos for training recipes to json (#…

2cc2caf

…1547)

Update MANIFEST.in so that wheel builds correctly (#1563)

9480ee0

Remove default values for fields in recipe_overrides and fix recipe p…

30dfdca

…ath. (#1566)

Change default source directory to current, add option to specify sou…

c0e3958

…rce dir (#1593)

Changes for SMP v2.7.0 (#1609)

ce2376f

Co-authored-by: Tian <[email protected]>

Update URIs to public for training recipes (#1621)

74d6b7c

Resolve recipes correctly before launching (#1529) fixes. (#1532) fix recipe path. (#1566)

Neuron URIs update (#1626)

fdf2e9a

Resolve recipes correctly before launching (#1529) fixes. (#1532) fix recipe path. (#1566)

Add model trainer documentation (#1639)

bd4a6cc

Enable the Recipe tests marked with @pytest.mark.skip(reason="Hyperpo…

9a5b32f

…d recipe code unavailable" (#1642)

pintaoz-aws requested a review from a team as a code owner December 4, 2024 10:17

pintaoz-aws requested a review from knikure December 4, 2024 10:17

pintaoz-aws temporarily deployed to auto-approve December 4, 2024 10:17 — with GitHub Actions Inactive

github-advanced-security bot found potential problems Dec 4, 2024

src/sagemaker/modules/train/container_drivers/mpi_utils.py (dismissed)

Add graphne to the doc requirements

659244b

pintaoz-aws temporarily deployed to auto-approve December 4, 2024 10:34 — with GitHub Actions Inactive

Add graphene to doc requirements

ad3538b

pintaoz-aws temporarily deployed to auto-approve December 4, 2024 10:45 — with GitHub Actions Inactive

pintaoz-aws merged commit 9371aee into master Dec 4, 2024
7 of 14 checks passed

pintaoz-aws deleted the reinvent-2024-early branch December 4, 2024 12:38

sage-maker mentioned this pull request Dec 16, 2024

Fix ssh host policy #4966

Merged

10 tasks
