Support Distributed Training Strategies #62
Comments
Hello! Thanks for the suggestion. I'll mark this down as a feature request, but I can't commit to when we'll be able to deliver this. If you need this feature and can send us a pull request, we can also work with you on getting it merged in. Thanks!
The TensorFlow Script Mode images now allow for greater flexibility with training scripts, which should allow you to specify your desired distributed training strategy. You can read more about that here: https://sagemaker.readthedocs.io/en/stable/using_tf.html#preparing-a-script-mode-training-script
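For reference, here is roughly how a distributed Script Mode training job is launched with the v1 Python SDK's parameter server support. This is a sketch: the role ARN, S3 path, and instance settings are placeholders, and argument names differ between SDK versions.

```python
# Sketch using SageMaker Python SDK v1-style arguments; names such as
# `distributions` and `train_instance_count` changed in SDK v2.
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train.py",                                # Script Mode entry point
    role="arn:aws:iam::123456789012:role/SageMakerRole",   # placeholder role ARN
    train_instance_count=2,
    train_instance_type="ml.p3.2xlarge",
    framework_version="1.15",
    py_version="py3",
    distributions={"parameter_server": {"enabled": True}},
)

estimator.fit("s3://my-bucket/training-data")              # placeholder S3 prefix
```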
@laurenyu TF Script Mode allows for asynchronous distributed training using a parameter server. But the latest TF2 comes with many distribution strategies, like the multi-worker mirrored strategy, which uses synchronous communication. For the multi-worker mirrored strategy we need to set TF_CONFIG (https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras), and each node has a different role in the distributed training. I am not sure if Script Mode allows us to use the multi-worker mirrored strategy. Can you please reopen this issue?
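For anyone landing here later, below is a minimal sketch of how a Script Mode entry point could assemble TF_CONFIG itself and then use MultiWorkerMirroredStrategy. SM_HOSTS and SM_CURRENT_HOST are environment variables the training container sets; the port and the model are placeholders.

```python
# Sketch: build TF_CONFIG from the SageMaker-provided host list, then create
# the strategy. The strategy must be created after TF_CONFIG is set.
import json
import os

import tensorflow as tf

hosts = json.loads(os.environ.get("SM_HOSTS", '["algo-1", "algo-2"]'))
current_host = os.environ.get("SM_CURRENT_HOST", "algo-1")

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": [f"{host}:2222" for host in hosts]},      # placeholder port
    "task": {"type": "worker", "index": hosts.index(current_host)},
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Placeholder model; a real script would build and fit the actual model.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")
```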
@anirudhacharya can you open a new issue? That'll help with our internal tracking. The TF images should allow for this, but you'll have to set up things on your own (i.e., configure your own …)
I will open an issue.
@laurenyu If I have to set up things myself, then how can I use …
Issue to track: #391
In TF 1.8, you can specify a distributed training strategy as part of the estimator's run config. One strategy I'd like to use is MirroredStrategy, which would let me train my model across multiple GPUs on an instance that has more than one GPU. Right now, the only way to achieve this seems to be manual: building towers for each GPU in the model_fn.
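For context, this is roughly what that looks like in TF 1.8; the model_fn below is a trivial placeholder, and num_gpus is just an example value.

```python
import tensorflow as tf

def model_fn(features, labels, mode):
    # Trivial placeholder model_fn; a real one would build the actual model.
    predictions = tf.layers.dense(features["x"], 1)
    loss = tf.losses.mean_squared_error(labels, predictions)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(
        mode, loss=loss, train_op=train_op, predictions=predictions)

# In TF 1.8 the strategy lives in tf.contrib and is attached via RunConfig.
distribution = tf.contrib.distribute.MirroredStrategy(num_gpus=2)
run_config = tf.estimator.RunConfig(train_distribute=distribution)

estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)
```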
Please consider making the strategy configurable somehow. The easiest approach may be to specify the strategy as a string that can be changed with a customer-provided hyperparameter. However, this would require manual remapping for new strategies, and it gets complicated whenever additional parameters are involved (like the number of devices in MirroredStrategy). An alternative to the hyperparameter approach could be an additional (optional) interface function, say distribution_fn, that returns a configured distribution strategy.
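Something like the following is what I have in mind for that interface. This is purely hypothetical; nothing like distribution_fn exists in the container today, and the hyperparameter names are made up.

```python
import tensorflow as tf

def distribution_fn(hyperparameters):
    """Hypothetical user hook: the container would call this and pass the
    result as train_distribute when it builds the RunConfig."""
    num_gpus = int(hyperparameters.get("num_gpus", 1))
    if num_gpus > 1:
        return tf.contrib.distribute.MirroredStrategy(num_gpus=num_gpus)
    return None  # None would mean "keep the default, non-distributed setup"
```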
You may be able to scope this out further and allow customers to provide a tf_runspec_fn that returns a custom runspec, instead of building one for them from the SageMaker hyperparameters.
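The broader version of that idea, equally hypothetical, would let the script hand back the whole run config:

```python
import tensorflow as tf

def tf_runspec_fn(hyperparameters):
    """Hypothetical hook returning a fully built RunConfig, used in place of
    the one the container would otherwise construct from hyperparameters."""
    distribution = tf.contrib.distribute.MirroredStrategy(
        num_gpus=int(hyperparameters.get("num_gpus", 2)))
    return tf.estimator.RunConfig(
        train_distribute=distribution,
        save_checkpoints_steps=int(hyperparameters.get("save_steps", 500)),
    )
```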