Support Distributed Training Strategies #62

Closed · andrewortman opened this issue on Aug 5, 2018 · 6 comments

@andrewortman

In TF 1.8, you can specify a distributed training strategy as part of the estimator's RunConfig. One strategy I'd like to use is MirroredStrategy, which would allow me to train my model across multiple GPUs on an instance that has more than one GPU. Right now, it seems the only way to do this is to manually build towers for each GPU in the model_fn.

Please consider making the strategy configurable. The easiest approach may be to specify the strategy as a string passed via a customer-provided hyperparameter. However, that would require manual remapping for new strategies, and it gets complicated whenever additional parameters are involved (like the number of devices in the MirroredStrategy). Instead of the hyperparameter approach, an alternative implementation could be an additional (optional) interface function (say, distribution_fn) that returns a configured distribution strategy.

You may be able to take this even further and allow customers to provide a tf_runspec_fn that returns a custom RunConfig instead of building one for them from the SageMaker hyperparameters.
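A rough sketch of what the proposed hook could look like (distribution_fn is the hypothetical interface function suggested above, my_model_fn is a placeholder for the customer's existing model_fn; the wiring uses the train_distribute argument that TF 1.8's RunConfig exposes):

```python
import tensorflow as tf


def distribution_fn(hyperparameters):
    # Hypothetical customer-provided hook: build a configured distribution
    # strategy from the training job's hyperparameters.
    num_gpus = int(hyperparameters.get("num_gpus", 2))
    return tf.contrib.distribute.MirroredStrategy(num_gpus=num_gpus)


# The framework side would then attach the returned strategy to the
# RunConfig it builds for the estimator, instead of hard-coding one.
strategy = distribution_fn({"num_gpus": 2})
run_config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=run_config)
```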

@ChoiByungWook
Contributor

Hello,

Thanks for the suggestion! I'll mark this down as a feature request, but I can't commit to when we'll be able to deliver this. If you need this feature and can send us a pull request, we can also work with you on getting it merged in. Thanks!

@laurenyu
Contributor

laurenyu commented Jun 7, 2019

The TensorFlow Script Mode images now give you much more flexibility in your training script, which should let you specify your desired distributed training strategy. You can read more about that here: https://sagemaker.readthedocs.io/en/stable/using_tf.html#preparing-a-script-mode-training-script
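For example, with Script Mode the entry point owns the TensorFlow code, so it can construct the strategy itself. An illustrative sketch (not taken from the linked docs; assumes a TF version that provides tf.distribute.MirroredStrategy, and uses synthetic data plus SageMaker's SM_MODEL_DIR environment variable):

```python
# train.py - illustrative Script Mode entry point
import os

import numpy as np
import tensorflow as tf

# The script owns the TensorFlow code, so it chooses the strategy directly.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="adam", loss="mse")

# Synthetic data just to keep the sketch self-contained.
x, y = np.random.rand(256, 10), np.random.rand(256, 1)
model.fit(x, y, epochs=1, batch_size=32)

# SageMaker exposes the model output directory via SM_MODEL_DIR.
model.save(os.path.join(os.environ.get("SM_MODEL_DIR", "/opt/ml/model"), "model.h5"))
```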

laurenyu closed this as completed on Jun 7, 2019
@anirudhacharya

anirudhacharya commented Jun 23, 2020

@laurenyu TF script mode allows for asynchronous distributed training using a parameter server. But the latest TF2 comes with many distribution strategies, like MultiWorkerMirroredStrategy, which uses synchronous communication. For MultiWorkerMirroredStrategy we need to set TF_CONFIG (https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras), and each node will have a different role in the distributed training.

I am not sure whether script mode allows us to use MultiWorkerMirroredStrategy. Can you please reopen this issue?
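For reference, this is roughly what each worker needs (following the linked tutorial; host names and ports are placeholders):

```python
import json
import os

import tensorflow as tf

# Each node gets the same cluster definition but a different task index,
# which is what gives each node its own role in the synchronous job.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["algo-1:2222", "algo-2:2222"]},
    "task": {"type": "worker", "index": 0},  # index differs per node
})

# TF_CONFIG must be set before the strategy is created.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="adam", loss="mse")
```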

@laurenyu
Contributor

@anirudhacharya can you open a new issue? That'll help with our internal tracking.

The TF images should allow for this, but you'll have to set things up on your own (e.g., configure your own TF_CONFIG).
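For example (a sketch, not documented SDK behavior), the training script could build TF_CONFIG from SageMaker's standard SM_HOSTS and SM_CURRENT_HOST environment variables before creating the strategy:

```python
import json
import os

hosts = json.loads(os.environ["SM_HOSTS"])    # e.g. ["algo-1", "algo-2"]
current_host = os.environ["SM_CURRENT_HOST"]  # e.g. "algo-1"

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["{}:2222".format(h) for h in hosts]},
    "task": {"type": "worker", "index": hosts.index(current_host)},
})
```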

@anirudhacharya

anirudhacharya commented Jun 23, 2020

> @anirudhacharya can you open a new issue? That'll help with our internal tracking.

I will open an issue.

> The TF images should allow for this, but you'll have to set things up on your own (e.g., configure your own TF_CONFIG).

@laurenyu If I have to set things up myself, how can I use the estimator.fit() API to launch the training job? Can't TF_CONFIG be configured with the distributions parameter (like here)?
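For context, a minimal sketch of the notebook-side launch (argument names follow the v2 Python SDK; the role, instance type, and S3 path are placeholders). The training script itself would still set up TF_CONFIG, as sketched above:

```python
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train.py",  # script that builds TF_CONFIG and the strategy
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=2,
    instance_type="ml.p3.2xlarge",
    framework_version="2.2",
    py_version="py37",
)
estimator.fit("s3://my-bucket/training-data")
```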

@anirudhacharya

> @anirudhacharya can you open a new issue? That'll help with our internal tracking.

Issue to track: #391
