
Commit ec2894c

documentation: add notes for allreduce and oob_allreduce apis
1 parent 38c49bd commit ec2894c

File tree

1 file changed (+37, -18 lines)

doc/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.rst

@@ -243,16 +243,25 @@ TensorFlow API
 
 .. function:: smdistributed.dataparallel.tensorflow.allreduce(tensor, param_index, num_params, compression=Compression.none, op=ReduceOp.AVERAGE)
 
-Performs an all-reduce operation on a tensor (``tf.Tensor``).
+Performs an ``allreduce`` operation on a tensor (``tf.Tensor``).
+
+This is the ``smdistributed.dataparallel`` package's AllReduce API for TensorFlow,
+used to allreduce gradient tensors. By default, ``smdistributed.dataparallel``
+allreduce averages the gradient tensors across participating workers.
+
+.. note::
+
+   :class:`smdistributed.dataparallel.tensorflow.allreduce()` should
+   only be used to allreduce gradient tensors.
+   For other (non-gradient) tensors, you must use
+   :class:`smdistributed.dataparallel.tensorflow.oob_allreduce()`.
+   If you use :class:`smdistributed.dataparallel.tensorflow.allreduce()`
+   for non-gradient tensors,
+   the distributed training job might stall or stop.
 
-``smdistributed.dataparallel`` AllReduce API can be used for all
-reducing gradient tensors or any other tensors. By
-default, ``smdistributed.dataparallel`` AllReduce averages the
-tensors across the participating workers.
-
 **Inputs:**
 
-- ``tensor (tf.Tensor)(required)``: The tensor to be all-reduced. The shape of the input must be identical across all ranks.
+- ``tensor (tf.Tensor)(required)``: The tensor to be allreduced. The shape of the input must be identical across all ranks.
 - ``param_index (int)(required):`` 0 if you are reducing a single tensor. Index of the tensor if you are reducing a list of tensors.
 - ``num_params (int)(required):`` len(tensor).
 - ``compression (smdistributed.dataparallel.tensorflow.Compression)(optional)``: Compression algorithm used to reduce the amount of data sent and received by each worker node. Defaults to not using compression.
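For context beyond the diff, here is a minimal sketch of how the documented ``allreduce`` call might be used in a custom TensorFlow training step. The gradient-tape loop, the ``sdp.init()`` call, and the model and variable names are illustrative assumptions and are not part of this commit.

.. code-block:: python

   # Illustrative sketch only -- assumes a SageMaker data parallel training job
   # where smdistributed.dataparallel is available and initialized.
   import tensorflow as tf
   import smdistributed.dataparallel.tensorflow as sdp

   sdp.init()  # assumed initialization call for the library

   @tf.function
   def training_step(model, loss_fn, optimizer, images, labels):
       with tf.GradientTape() as tape:
           predictions = model(images, training=True)
           loss = loss_fn(labels, predictions)

       grads = tape.gradient(loss, model.trainable_variables)
       # Per the note above, allreduce is applied only to gradient tensors;
       # each gradient is averaged across the participating workers.
       grads = [
           sdp.allreduce(g, param_index=i, num_params=len(grads))
           for i, g in enumerate(grads)
       ]
       optimizer.apply_gradients(zip(grads, model.trainable_variables))
       return loss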
@@ -306,9 +315,9 @@ TensorFlow API
 
 .. function:: smdistributed.dataparallel.tensorflow.oob_allreduce(tensor, compression=Compression.none, op=ReduceOp.AVERAGE)
 
-OutOfBand (oob) AllReduce is simplified AllReduce function for use cases
+Out-of-band (oob) AllReduce is a simplified AllReduce function for use cases
 such as calculating total loss across all the GPUs in the training.
-oob_allreduce average the tensors, as reduction operation, across the
+``oob_allreduce`` averages the tensors, as the reduction operation, across the
 worker nodes.
 
 **Inputs:**
@@ -326,15 +335,25 @@ TensorFlow API
 
 - ``None``
 
-.. rubric:: Notes
-
-``smdistributed.dataparallel.tensorflow.oob_allreduce``, in most
-cases, is ~2x slower
-than ``smdistributed.dataparallel.tensorflow.allreduce``  so it is not
-recommended to be used for performing gradient reduction during the
-training
-process. ``smdistributed.dataparallel.tensorflow.oob_allreduce`` internally
-uses NCCL AllReduce with ``ncclSum`` as the reduction operation.
+.. note::
+
+   In most cases, the :class:`smdistributed.dataparallel.tensorflow.oob_allreduce()`
+   function is ~2x slower
+   than :class:`smdistributed.dataparallel.tensorflow.allreduce()`. It is not
+   recommended to use the :class:`smdistributed.dataparallel.tensorflow.oob_allreduce()`
+   function for performing gradient
+   reduction during the training process.
+   ``smdistributed.dataparallel.tensorflow.oob_allreduce`` internally
+   uses NCCL AllReduce with ``ncclSum`` as the reduction operation.
+
+.. note::
+
+   :class:`smdistributed.dataparallel.tensorflow.oob_allreduce()` should
+   only be used to allreduce non-gradient tensors.
+   If you use :class:`smdistributed.dataparallel.tensorflow.allreduce()`
+   for non-gradient tensors,
+   the distributed training job might stall or stop.
+   To allreduce gradients, use :class:`smdistributed.dataparallel.tensorflow.allreduce()`.
 
 
 .. function:: smdistributed.dataparallel.tensorflow.overlap(tensor)
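Similarly, a minimal sketch of how ``oob_allreduce`` might be used for the loss-averaging use case mentioned above; the helper function, the ``sdp.rank()`` check, and the variable names are illustrative assumptions, not part of this commit.

.. code-block:: python

   # Illustrative sketch only -- assumes the same initialized
   # smdistributed.dataparallel environment as the allreduce sketch above.
   import tensorflow as tf
   import smdistributed.dataparallel.tensorflow as sdp

   def report_mean_loss(local_loss):
       # oob_allreduce is reserved for non-gradient tensors such as this
       # scalar loss; with the default op it returns the cross-worker average.
       mean_loss = sdp.oob_allreduce(tf.convert_to_tensor(local_loss))
       if sdp.rank() == 0:
           tf.print("Mean loss across workers:", mean_loss)
       return mean_loss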
