Support Distributed Training Strategies #62

Closed · andrewortman opened this issue on Aug 5, 2018 · 6 comments

@andrewortman

In TF 1.8, you can specify a distributed training strategy as part of the estimator's RunConfig. One strategy I'd like to use is MirroredStrategy, which would allow me to train my model across multiple GPUs on an instance that has more than one GPU. Right now, it seems the only way to do this is to manually build towers for each GPU in the model_fn.

Please consider making the strategy configurable. The easiest approach may be to specify the strategy as a string passed via a customer-provided hyperparameter. However, that would require manual remapping for new strategies, and it gets complicated whenever additional parameters are involved (like the number of devices in the MirroredStrategy). Instead of the hyperparameter approach, an alternative implementation could be an additional (optional) interface function (say, distribution_fn) that returns a configured distribution strategy.

You may be able to take this even further and allow customers to provide a tf_runspec_fn that returns a custom RunConfig instead of building one for them from the SageMaker hyperparameters.
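A rough sketch of what the proposed hook could look like (distribution_fn is the hypothetical interface function suggested above, my_model_fn is a placeholder for the customer's existing model_fn; the wiring uses the train_distribute argument that TF 1.8's RunConfig exposes):

```python
import tensorflow as tf


def distribution_fn(hyperparameters):
    # Hypothetical customer-provided hook: build a configured distribution
    # strategy from the training job's hyperparameters.
    num_gpus = int(hyperparameters.get("num_gpus", 2))
    return tf.contrib.distribute.MirroredStrategy(num_gpus=num_gpus)


# The framework side would then attach the returned strategy to the
# RunConfig it builds for the estimator, instead of hard-coding one.
strategy = distribution_fn({"num_gpus": 2})
run_config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=run_config)
```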

@ChoiByungWook
Contributor

Hello,

Thanks for the suggestion! I'll mark this down as a feature request, but I can't commit to when we'll be able to deliver this. If you need this feature and can send us a pull request, we can also work with you on getting it merged in. Thanks!

@laurenyu
Contributor

laurenyu commented Jun 7, 2019

The TensorFlow Script Mode images now give you much more flexibility in your training script, which should let you specify your desired distributed training strategy. You can read more about that here: https://sagemaker.readthedocs.io/en/stable/using_tf.html#preparing-a-script-mode-training-script
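For example, with Script Mode the entry point owns the TensorFlow code, so it can construct the strategy itself. An illustrative sketch (not taken from the linked docs; assumes a TF version that provides tf.distribute.MirroredStrategy, and uses synthetic data plus SageMaker's SM_MODEL_DIR environment variable):

```python
# train.py - illustrative Script Mode entry point
import os

import numpy as np
import tensorflow as tf

# The script owns the TensorFlow code, so it chooses the strategy directly.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="adam", loss="mse")

# Synthetic data just to keep the sketch self-contained.
x, y = np.random.rand(256, 10), np.random.rand(256, 1)
model.fit(x, y, epochs=1, batch_size=32)

# SageMaker exposes the model output directory via SM_MODEL_DIR.
model.save(os.path.join(os.environ.get("SM_MODEL_DIR", "/opt/ml/model"), "model.h5"))
```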

laurenyu closed this as completed on Jun 7, 2019
@anirudhacharya

anirudhacharya commented Jun 23, 2020

@laurenyu TF script mode allows for asynchronous distributed training using a parameter server. But the latest TF2 comes with many distribution strategies, like MultiWorkerMirroredStrategy, which uses synchronous communication. For MultiWorkerMirroredStrategy we need to set TF_CONFIG (https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras), and each node will have a different role in the distributed training.

I am not sure whether script mode allows us to use MultiWorkerMirroredStrategy. Can you please reopen this issue?
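For reference, this is roughly what each worker needs (following the linked tutorial; host names and ports are placeholders):

```python
import json
import os

import tensorflow as tf

# Each node gets the same cluster definition but a different task index,
# which is what gives each node its own role in the synchronous job.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["algo-1:2222", "algo-2:2222"]},
    "task": {"type": "worker", "index": 0},  # index differs per node
})

# TF_CONFIG must be set before the strategy is created.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="adam", loss="mse")
```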

@laurenyu
Contributor

@anirudhacharya can you open a new issue? That'll help with our internal tracking.

The TF images should allow for this, but you'll have to set things up on your own (e.g., configure your own TF_CONFIG).
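For example (a sketch, not documented SDK behavior), the training script could build TF_CONFIG from SageMaker's standard SM_HOSTS and SM_CURRENT_HOST environment variables before creating the strategy:

```python
import json
import os

hosts = json.loads(os.environ["SM_HOSTS"])    # e.g. ["algo-1", "algo-2"]
current_host = os.environ["SM_CURRENT_HOST"]  # e.g. "algo-1"

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["{}:2222".format(h) for h in hosts]},
    "task": {"type": "worker", "index": hosts.index(current_host)},
})
```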

@anirudhacharya

anirudhacharya commented Jun 23, 2020

> @anirudhacharya can you open a new issue? That'll help with our internal tracking.

I will open an issue.

> The TF images should allow for this, but you'll have to set things up on your own (e.g., configure your own TF_CONFIG).

@laurenyu If I have to set things up myself, how can I use the estimator.fit() API to launch the training job? Can't TF_CONFIG be configured with the distributions parameter (like here)?
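For context, a minimal sketch of the notebook-side launch (argument names follow the v2 Python SDK; the role, instance type, and S3 path are placeholders). The training script itself would still set up TF_CONFIG, as sketched above:

```python
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train.py",  # script that builds TF_CONFIG and the strategy
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=2,
    instance_type="ml.p3.2xlarge",
    framework_version="2.2",
    py_version="py37",
)
estimator.fit("s3://my-bucket/training-data")
```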

@anirudhacharya

> @anirudhacharya can you open a new issue? That'll help with our internal tracking.

Issue to track: #391
