Reinvent 2024 early #4946

Merged
pintaoz-aws merged 79 commits into master from reinvent-2024-early on Dec 4, 2024

Conversation

pintaoz-aws
Contributor

Issue #, if available:

Description of changes:

Testing done:

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the CONTRIBUTING doc
  • I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the Python SDK team
  • I used the commit message format described in CONTRIBUTING
  • I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
  • I have updated any necessary documentation, including READMEs and API docs (if appropriate)

Tests

  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes
  • I have checked that my tests are not configured for a specific region or account (if appropriate)
  • I have used unique_name_from_base to create resource names in integ tests (if appropriate)
  • If adding any dependency in requirements.txt files, I have spell checked and ensured they exist in PyPi

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

benieric and others added 30 commits December 4, 2024 01:26
* Base model trainer

* flake8

* add testing notebook

* add param validation & set defaults

* Implement simple train method
* feature: support script mode with local train.sh

* Stop tracking train.sh and add it to .gitignore

* update message

* make dir if not exist

* fix docs

* fix: docstyle

* Address comments

* fix hyperparams

* Revert pydantic custom error

* pylint
* Image Spec refactoring and updates

* Unit tests and update function for Image Spec

* Fix hugging face test

* Fix Tests
* Add unit tests for ModelTrainer

* Flake8

* format
* Add testing notebook

* format

* use smaller data

* remove large dataset

* update

* pylint

* flake8

* ignore docstyle in directories with test

* format

* format
* Add environment variables scripts

* format

* fix comment

* add docstrings

* fix comment
* local snapshot

* Update pip list command

* Remove function calls

* Address comments

* Address comments
* Support intelligent parameters

* fix codestyle
* General image builder

* General image builder

* Fix codestyle

* Fix codestyle

* Move location

* Add warnings

* Add integ tests

* Fix integ test

* Fix integ test

* Fix region error

* Add region
* Latest Container Image

* Test Fixes

* Parameterized tests and some logic updates

* Test fixes

* Move to Image URI

* Fixes for unit test

* Fixes for unit test

* Fix codestyle error checks
…1560)

* add pre-processing and post-processing logic to inference_spec

* fix format

* make  accept_type and content_type optional

* remove accept_type and content_type from pre/post processing

* correct typo
* add in-process mode for DJL server

* fix format

* add inference_spec as a member of DJL

* add the validations for model server

* fix typo

* fix test assertion

* add unit-testing

* have a common server for inprocess mode

* fix failing tests

* add support to torchserve

* fix tests to include torchserve servers

* use custom inference_spec code instead of HF pipelines

* fix tests for app.py

* fix unit test failure

* fix format

* use schema_builder for serialization and deserialization

* remove task field

* remove unused import
* Base model trainer (#1521)

* Base model trainer

* flake8

* add testing notebook

* add param validation & set defaults

* Implement simple train method

* feature: support script mode with local train.sh (#1523)

* feature: support script mode with local train.sh

* Stop tracking train.sh and add it to .gitignore

* update message

* make dir if not exist

* fix docs

* fix: docstyle

* Address comments

* fix hyperparams

* Revert pydantic custom error

* pylint

* Image Spec refactoring and updates (#1525)

* Image Spec refactoring and updates

* Unit tests and update function for Image Spec

* Fix hugging face test

* Fix Tests

* Add unit tests for ModelTrainer (#1527)

* Add unit tests for ModelTrainer

* Flake8

* format

* Add example notebook (#1528)

* Add testing notebook

* format

* use smaller data

* remove large dataset

* update

* pylint

* flake8

* ignore docstyle in directories with test

* format

* format

* Add environment variable bootstrapping script (#1530)

* Add environment variables scripts

* format

* fix comment

* add docstrings

* fix comment

* feature: add utility function to capture local snapshot (#1524)

* local snapshot

* Update pip list command

* Remove function calls

* Address comments

* Address comments

* Change to make Model Trainer return a Model Object

* Fix

* Cleanup

* Support intelligent parameters (#1540)

* Support intelligent parameters

* fix codestyle

* Revert Image Spec (#1541)

* Cleanup ModelTrainer (#1542)

* General image builder (#1546)

* General image builder

* General image builder

* Fix codestyle

* Fix codestyle

* Move location

* Add warnings

* Add integ tests

* Fix integ test

* Fix integ test

* Fix region error

* Add region

* Latest Container Image (#1545)

* Latest Container Image

* Test Fixes

* Parameterized tests and some logic updates

* Test fixes

* Move to Image URI

* Fixes for unit test

* Fixes for unit test

* Fix codestyle error checks

* Cleanup ModelTrainer code (#1552)

* Updates

* feat: add pre-processing and post-processing logic to inference_spec (#1560)

* add pre-processing and post-processing logic to inference_spec

* fix format

* make  accept_type and content_type optional

* remove accept_type and content_type from pre/post processing

* correct typo

* Add Distributed Training Support Model Trainer (#1536)

* Add path to set Additional Settings in ModelTrainer (#1555)

* Updates

* Mask Sensitive Env Logs in Container (#1568)

* Cleanup PR

* Codestyle fixes

* Update logic to use model parameter instead of model_path

* Fixes

* Fixes

* Tests

* Codestyle Fixes

* Codestyle Fixes

* Codestyle Fixes

* Codestyle Fixes

---------

Co-authored-by: Erick Benitez-Ramos <[email protected]>
Co-authored-by: pintaoz-aws <[email protected]>
Co-authored-by: Pravali Uppugunduri <[email protected]>
Co-authored-by: Gokul Anantha Narayanan <[email protected]>
* Base model trainer (#1521)

* Base model trainer

* flake8

* add testing notebook

* add param validation & set defaults

* Implement simple train method

* feature: support script mode with local train.sh (#1523)

* feature: support script mode with local train.sh

* Stop tracking train.sh and add it to .gitignore

* update message

* make dir if not exist

* fix docs

* fix: docstyle

* Address comments

* fix hyperparams

* Revert pydantic custom error

* pylint

* Image Spec refactoring and updates (#1525)

* Image Spec refactoring and updates

* Unit tests and update function for Image Spec

* Fix hugging face test

* Fix Tests

* Add unit tests for ModelTrainer (#1527)

* Add unit tests for ModelTrainer

* Flake8

* format

* Add example notebook (#1528)

* Add testing notebook

* format

* use smaller data

* remove large dataset

* update

* pylint

* flake8

* ignore docstyle in directories with test

* format

* format

* Add environment variable bootstrapping script (#1530)

* Add environment variables scripts

* format

* fix comment

* add docstrings

* fix comment

* feature: add utility function to capture local snapshot (#1524)

* local snapshot

* Update pip list command

* Remove function calls

* Address comments

* Address comments

* Support intelligent parameters (#1540)

* Support intelligent parameters

* fix codestyle

* Revert Image Spec (#1541)

* Cleanup ModelTrainer (#1542)

* General image builder (#1546)

* General image builder

* General image builder

* Fix codestyle

* Fix codestyle

* Move location

* Add warnings

* Add integ tests

* Fix integ test

* Fix integ test

* Fix region error

* Add region

* Latest Container Image (#1545)

* Latest Container Image

* Test Fixes

* Parameterized tests and some logic updates

* Test fixes

* Move to Image URI

* Fixes for unit test

* Fixes for unit test

* Fix codestyle error checks

* Cleanup ModelTrainer code (#1552)

* feat: add pre-processing and post-processing logic to inference_spec (#1560)

* add pre-processing and post-processing logic to inference_spec

* fix format

* make  accept_type and content_type optional

* remove accept_type and content_type from pre/post processing

* correct typo

* Add Distributed Training Support Model Trainer (#1536)

* Add path to set Additional Settings in ModelTrainer (#1555)

* Support building image from Dockerfile

* Fix test

* Fix test

* Rename functions

---------

Co-authored-by: Erick Benitez-Ramos <[email protected]>
Co-authored-by: Gokul Anantha Narayanan <[email protected]>
Co-authored-by: Pravali Uppugunduri <[email protected]>
* Base model trainer (#1521)

* Base model trainer

* flake8

* add testing notebook

* add param validation & set defaults

* Implement simple train method

* feature: support script mode with local train.sh (#1523)

* feature: support script mode with local train.sh

* Stop tracking train.sh and add it to .gitignore

* update message

* make dir if not exist

* fix docs

* fix: docstyle

* Address comments

* fix hyperparams

* Revert pydantic custom error

* pylint

* Image Spec refactoring and updates (#1525)

* Image Spec refactoring and updates

* Unit tests and update function for Image Spec

* Fix hugging face test

* Fix Tests

* Add unit tests for ModelTrainer (#1527)

* Add unit tests for ModelTrainer

* Flake8

* format

* Add example notebook (#1528)

* Add testing notebook

* format

* use smaller data

* remove large dataset

* update

* pylint

* flake8

* ignore docstyle in directories with test

* format

* format

* Add environment variable bootstrapping script (#1530)

* Add environment variables scripts

* format

* fix comment

* add docstrings

* fix comment

* feature: add utility function to capture local snapshot (#1524)

* local snapshot

* Update pip list command

* Remove function calls

* Address comments

* Address comments

* Support intelligent parameters (#1540)

* Support intelligent parameters

* fix codestyle

* Revert Image Spec (#1541)

* Cleanup ModelTrainer (#1542)

* Initial Prototype

* General image builder (#1546)

* General image builder

* General image builder

* Fix codestyle

* Fix codestyle

* Move location

* Add warnings

* Add integ tests

* Fix integ test

* Fix integ test

* Fix region error

* Add region

* Unified deploying in ModelBuilder

* Latest Container Image (#1545)

* Latest Container Image

* Test Fixes

* Parameterized tests and some logic updates

* Test fixes

* Move to Image URI

* Fixes for unit test

* Fixes for unit test

* Fix codestyle error checks

* Address PR comments

* Address Codestyle errors

* Cleanup ModelTrainer code (#1552)

* Black format

* Codestyle changes

* Codestyle changes

* from __future__ import absolute_import

* DocString formatting

* Black formatting

* Address PR comments

* Notebook changes and fixes

* feat: add pre-processing and post-processing logic to inference_spec (#1560)

* add pre-processing and post-processing logic to inference_spec

* fix format

* make  accept_type and content_type optional

* remove accept_type and content_type from pre/post processing

* correct typo

* Add Distributed Training Support Model Trainer (#1536)

* Add path to set Additional Settings in ModelTrainer (#1555)

* Checkstyle Fixes

* Address PR comments

* Fixes

* Merge Fixes

* Codestyle Fixes

* Codestyle Fixes

* Codestyle Fixes

* Codestyle Fixes

* Codestyle Fixes

* Update Docstring

---------

Co-authored-by: Erick Benitez-Ramos <[email protected]>
Co-authored-by: pintaoz-aws <[email protected]>
Co-authored-by: Pravali Uppugunduri <[email protected]>
nargokul and others added 18 commits December 4, 2024 01:49
* Parameterized intelligent defaults tests

* Parameterized intelligent defaults tests

* Parameterized intelligent defaults tests

* Tests for all Model Builder deployment modes

* Fix

* CodeStyle Fixes

* CodeStyle Fixes

* Add Deepdiff dependency

* Add Deepdiff dependency

* Add Codestyle fix
* change: fix the file uploading signature verification error

**Description**
The URL contains a character (+) that was not escaped properly. Fixed by removing the conditional logic for escaping that character.

**Testing**
1. The changed unit tests passed
2. Tested in a sample notebook
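
A minimal sketch of the escaping issue described above, not the SDK's actual code. It assumes the signed string must percent-encode every character outside the RFC 3986 unreserved set, so a literal `+` in a path has to become `%2B` on both the signing and the verifying side. The helper name `canonical_path` is hypothetical.

```python
from urllib.parse import quote


def canonical_path(path: str) -> str:
    """Percent-encode a URL path unconditionally (hypothetical helper).

    Only "/" and the RFC 3986 unreserved characters stay literal, so a
    "+" is always emitted as "%2B" regardless of where it appears.
    """
    return quote(path, safe="/-_.~")


# A path containing "+" is escaped; a path without special characters is unchanged.
escaped = canonical_path("/files/a+b.txt")
unchanged = canonical_path("/files/plain.txt")
print(escaped, unchanged)
```

Encoding unconditionally avoids a branch that escapes only in some cases, which is one way a signer and a verifier can end up hashing different strings.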

* **Description**
Changed the header from x-mlapp-sm-app-server-arn to x-sagemaker-partner-app-server-arn.
Also made some small formatting adjustments to the signing context information.

**Testing Done**
UT passed

---------

Co-authored-by: Edward Sun <[email protected]>
* v0 estimator for launching kandinsky training

* code cleanup

* option to over-ride git repos for kandinsky for testing purposes

* update dependencies

* update comment

* formatting fixes

* style fixes

* code cleanup

* Add warning messages for ignored arguments

* cleanup, address comments

* fix

* clone launcher repo only if necessary

* add a cleanup method to call after fit

* fix docstring

* fix warning

* cleanup update

* fix

* code style fix

* rename cleanup method for clarity

* missed change

* move cleanup to when object is destroyed

* add unit tests

* formatting fix

* removing tests which don't work as recipe repos are private

* removing tests which don't work as recipe repos are private

* resolve comments

* resolve comments
* fix to work with launcher recipes

* fix suffix for temp file

* fix path and error message

* fix for recipes from launcher

* resolve recipes correctly

* fix imports

* reformat message to avoid code-doc test issue

* code style fix

* code style fix

* code style fix

* code style fix

* code style fix

* code style fix

* code style fix

* code style fix

* code style fix

* doc formatting

* check if resolver exists before registering
* basic checks and unit test for recipes

* More testing for recipes. Move recipe overrides to top before accessing any recipe fields.

* check that we use customer provided image uri if it is set

* reformat

* test fixes

* update git urls for recipes

* revert to ssh git urls for recipes
Resolve recipes correctly before launching (#1529)
fixes. (#1532)
fix recipe path. (#1566)
Resolve recipes correctly before launching (#1529)
fixes. (#1532)
fix recipe path. (#1566)
* Feature: Support GPU training recipes with Sagemaker Python SDK (#1516)

* v0 estimator for launching kandinsky training

* code cleanup

* option to over-ride git repos for kandinsky for testing purposes

* update dependencies

* update comment

* formatting fixes

* style fixes

* code cleanup

* Add warning messages for ignored arguments

* cleanup, address comments

* fix

* clone launcher repo only if necessary

* add a cleanup method to call after fit

* fix docstring

* fix warning

* cleanup update

* fix

* code style fix

* rename cleanup method for clarity

* missed change

* move cleanup to when object is destroyed

* add unit tests

* formatting fix

* removing tests which don't work as recipe repos are private

* removing tests which don't work as recipe repos are private

* resolve comments

* resolve comments

* Feature: Support Neuron training recipes. (#1526)

* Feature: Resolve recipes correctly before launching (#1529)

* fix to work with launcher recipes

* fix suffix for temp file

* fix path and error message

* fix for recipes from launcher

* resolve recipes correctly

* fix imports

* reformat message to avoid code-doc test issue

* code style fix

* code style fix

* code style fix

* code style fix

* code style fix

* code style fix

* code style fix

* code style fix

* code style fix

* doc formatting

* check if resolver exists before registering

* Feature: Add unit tests for recipes and minor bug fixes. (#1532)

* basic checks and unit test for recipes

* More testing for recipes. Move recipe overrides to top before accessing any recipe fields.

* check that we use customer provided image uri if it is set

* reformat

* test fixes

* update git urls for recipes

* revert to ssh git urls for recipes

* Feature: Move image uris and git repos for training recipes to json (#1547)

* Update MANIFEST.in so that wheel builds correctly (#1563)

* Remove default values for fields in recipe_overrides and fix recipe path. (#1566)

* add optional source dir for recipes, copy training code and requirements to source dir

* diff names for recipe file and local script option

* format and add unit test

* make entry point script and recipe file temp files that can be gced

* formatting and fix

* test fix

* test fixes

* format fix

* break function up because it is too long

* fixes

* fix

* fix

* remove references to launcher and adapter dir as we copy out everything needed into source dir

* reformat

* copy all directory contents for trainium as there is more than one source file

* fix

* fix

* remove debugging message

* Change default source directory to current, add option to specify source dir (#1593)

* update to public uris for hyperpod recipe repos and smp image

* fixes

* remove debug copies

* change caps for env vars

* skip some tests for now

* format

* neuron json for retrieving images

* update training_recipes.json

* add unit test

* reformat

* fix long line

* add source dir check when using training recipe

* adding more regions

* reformat

* doc update

* doc update

* doc update

* doc update

* fix capitalization issues

* fix capitalization issues

* doc check issue
pintaoz-aws requested a review from a team as a code owner December 4, 2024 10:17
pintaoz-aws requested a review from knikure December 4, 2024 10:17

codecov bot commented Dec 4, 2024

Codecov Report

Attention: Patch coverage is 68.52130% with 628 lines in your changes missing coverage. Please review.

Project coverage is 86.54%. Comparing base (6333914) to head (ad3538b).
Report is 187 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...rc/sagemaker/modules/local_core/local_container.py | 50.20% | 124 Missing ⚠️ |
| ...maker/modules/train/container_drivers/mpi_utils.py | 0.00% | 119 Missing ⚠️ |
| src/sagemaker/pytorch/estimator.py | 67.33% | 49 Missing ⚠️ |
| src/sagemaker/modules/train/model_trainer.py | 86.11% | 44 Missing ⚠️ |
| ...sagemaker/modules/train/container_drivers/utils.py | 58.16% | 41 Missing ⚠️ |
| .../serve/model_server/in_process_model_server/app.py | 56.04% | 40 Missing ⚠️ |
| ...les/train/container_drivers/scripts/environment.py | 71.73% | 39 Missing ⚠️ |
| src/sagemaker/modules/train/sm_recipes/utils.py | 74.82% | 35 Missing ⚠️ |
| ...les/train/container_drivers/basic_script_driver.py | 0.00% | 29 Missing ⚠️ |
| src/sagemaker/serve/builder/model_builder.py | 85.35% | 23 Missing ⚠️ |

... and 8 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4946      +/-   ##
==========================================
- Coverage   87.35%   86.54%   -0.81%     
==========================================
  Files         418      438      +20     
  Lines       40549    42374    +1825     
==========================================
+ Hits        35421    36673    +1252     
- Misses       5128     5701     +573     
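
The figures in the coverage diff above can be cross-checked with a few lines of arithmetic. This is a sketch only: the assumption that the displayed percentages are truncated to two decimals (rather than rounded) is inferred from these numbers, and the helper name `coverage_percent` is hypothetical.

```python
import math


def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage, truncated (not rounded) to two decimals."""
    return math.floor(hits / lines * 10_000) / 100


# Values copied from the "Coverage Diff" table (base = master, head = #4946).
base = {"lines": 40549, "hits": 35421, "misses": 5128}
head = {"lines": 42374, "hits": 36673, "misses": 5701}

# Every tracked line is either hit or missed.
for side in (base, head):
    assert side["hits"] + side["misses"] == side["lines"]

base_pct = coverage_percent(base["hits"], base["lines"])
head_pct = coverage_percent(head["hits"], head["lines"])
delta = round(head_pct - base_pct, 2)
print(base_pct, head_pct, delta)
```

With truncation, the computed values agree with the 87.35%, 86.54%, and -0.81% shown in the report.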


pintaoz-aws merged commit 9371aee into master Dec 4, 2024
7 of 14 checks passed
pintaoz-aws deleted the reinvent-2024-early branch December 4, 2024 12:38
sage-maker mentioned this pull request Dec 16, 2024
10 tasks