fix: derive master node from training environment #238

Merged
satishpasumarthi merged 2 commits into aws:master on Jul 21, 2022

Conversation

satishpasumarthi
Contributor

Issue #, if available:
With the latest heterogeneous cluster changes, the master node is not always algo-1.

Description of changes:
Derive the master node from the training environment rather than assuming it is the first entry in the hosts list.
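
A minimal sketch of the idea, not the actual diff from this PR. `TrainingEnv`, its `master_hostname` field, the function name, and the default port are hypothetical stand-ins chosen for illustration; `MASTER_ADDR` and `MASTER_PORT` are the environment variables PyTorch reads for `env://` initialization.

```python
import os
from dataclasses import dataclass
from typing import List, MutableMapping


@dataclass
class TrainingEnv:
    """Hypothetical stand-in for the toolkit's training environment object."""

    hosts: List[str]       # all hosts in the cluster, in no guaranteed order
    master_hostname: str   # the host the environment designates as master


def set_distributed_environment(
    training_env: TrainingEnv,
    environ: MutableMapping[str, str] = os.environ,
    port: str = "7777",  # assumed port, for illustration only
) -> None:
    """Export the rendezvous address for distributed training."""
    # Old behavior described in this PR: assume the first host is the master.
    #   environ["MASTER_ADDR"] = training_env.hosts[0]
    # New behavior: use the master node reported by the training environment.
    environ["MASTER_ADDR"] = training_env.master_hostname
    environ["MASTER_PORT"] = port
```

With a heterogeneous cluster whose host list starts with `algo-2`, the sketch still points workers at the designated master instead of `hosts[0]`.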

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-training-container-pr
  • Commit ID: ae2344e
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

josephevans
josephevans previously approved these changes Jul 21, 2022
@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-training-container-gpu-tests
  • Commit ID: ae2344e
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-training-toolkit-pr
  • Commit ID: ae2344e
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-training-container-unit-tests
  • Commit ID: ae2344e
  • Result: FAILED
  • Build Logs (available for 30 days)

rahul003
rahul003 previously approved these changes Jul 21, 2022
@satishpasumarthi satishpasumarthi dismissed stale reviews from rahul003 and josephevans via 5d63df5 July 21, 2022 20:59
@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-training-container-pr
  • Commit ID: 5d63df5
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-training-container-gpu-tests
  • Commit ID: 5d63df5
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-training-toolkit-pr
  • Commit ID: 5d63df5
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-pytorch-training-container-unit-tests
  • Commit ID: 5d63df5
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@satishpasumarthi satishpasumarthi merged commit a12bc7d into aws:master Jul 21, 2022
@satishpasumarthi satishpasumarthi changed the title fix: deriver master node from training environment fix: derive master node from training environment Jul 21, 2022
6 participants