-
Notifications
You must be signed in to change notification settings - Fork 1.2k
feature: Adding support in HuggingFace estimator for Training Compiler enhanced PyTorch 1.11 #3307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -24,7 +26,7 @@ | |||
class TrainingCompilerConfig(BaseConfig): | |||
"""The SageMaker Training Compiler configuration class.""" | |||
|
|||
SUPPORTED_INSTANCE_CLASS_PREFIXES = ["p3", "g4dn", "p4"] | |||
SUPPORTED_INSTANCE_CLASS_PREFIXES = ["p3", "g4dn", "p4d", "g5"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you mean to remove p4? is it intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
p4d
is the only available flavor of p4
AFAIK. Just making the check more specific.
if estimator.pytorch_version: | ||
if Version(estimator.pytorch_version) in SpecifierSet(">= 1.11"): | ||
error_helper_string = ( | ||
"'pytorch_xla' is the only distribution mechanism currently supported " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can still run without specifying distribution, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is just blocking customers from using other built-in distribution mechanisms like mpi or pytorch ddp.
Co-authored-by: Miyoung <[email protected]>
Co-authored-by: Miyoung <[email protected]>
Co-authored-by: Miyoung <[email protected]>
Co-authored-by: Miyoung <[email protected]>
Co-authored-by: Miyoung <[email protected]>
Co-authored-by: Miyoung <[email protected]>
Co-authored-by: Miyoung <[email protected]>
Co-authored-by: Miyoung <[email protected]>
Co-authored-by: Miyoung <[email protected]>
Co-authored-by: Miyoung <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/bot run all
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
for i in cls.SUPPORTED_INSTANCE_CLASS_PREFIXES | ||
] | ||
): | ||
error_helper_string = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add info
logging to inform customers that p3dn.24xlarge
& p4d.24xlarge
instances offer the best performance for distributed training?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we just want to generalize to EFA enabled instances or what is special about these 2 instance types ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea, we eventually would be to recommend all EFA instances that improve performance. We should start with recommending EFA instances we benchmarked thoroughly, we can add other instances to the list once we benchmark them internally.
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
Co-authored-by: Miyoung <[email protected]>
Co-authored-by: Miyoung <[email protected]>
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
…r enhanced PyTorch 1.11 (aws#3307) * feature: Adding support in HuggingFace estimator for Training Compiler enhanced PyTorch 1.11 * Update src/sagemaker/huggingface/training_compiler/config.py Co-authored-by: Miyoung <[email protected]> * Update src/sagemaker/huggingface/training_compiler/config.py Co-authored-by: Miyoung <[email protected]> * Update src/sagemaker/huggingface/training_compiler/config.py Co-authored-by: Miyoung <[email protected]> * Update src/sagemaker/training_compiler/config.py Co-authored-by: Miyoung <[email protected]> * fix: renaming distribution parameters pytorch_xla -> pytorchxla * Update src/sagemaker/huggingface/training_compiler/config.py Co-authored-by: Miyoung <[email protected]> * Update src/sagemaker/huggingface/estimator.py Co-authored-by: Miyoung <[email protected]> * Update src/sagemaker/huggingface/training_compiler/config.py Co-authored-by: Miyoung <[email protected]> * Update src/sagemaker/huggingface/training_compiler/config.py Co-authored-by: Miyoung <[email protected]> * Update src/sagemaker/huggingface/estimator.py Co-authored-by: Miyoung <[email protected]> * Update src/sagemaker/huggingface/estimator.py Co-authored-by: Miyoung <[email protected]> * Fix: syntax error in trcomp tests * fix: linting * fix: linting to break up long lines * fix: fixture scoping issue in integ test * fix: broken unit tests for trcomp * fix: broken skip logic in version fixtures * fix: update test and version compatibility * feature: added warning recommending EFA instances with training compiler * Update src/sagemaker/huggingface/estimator.py Co-authored-by: Miyoung <[email protected]> * Update src/sagemaker/training_compiler/config.py Co-authored-by: Miyoung <[email protected]> Co-authored-by: Miyoung <[email protected]>
Description of changes:
Testing done:
Merge Checklist
Put an
x
in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.General
Tests
unique_name_from_base
to create resource names in integ tests (if appropriate)By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.