diff --git a/doc/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.rst b/doc/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.rst
index c615ad67aa..329788c41d 100644
--- a/doc/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.rst
+++ b/doc/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.rst
@@ -243,16 +243,25 @@ TensorFlow API
 
 .. function:: smdistributed.dataparallel.tensorflow.allreduce(tensor, param_index, num_params, compression=Compression.none, op=ReduceOp.AVERAGE)
 
-   Performs an all-reduce operation on a tensor (``tf.Tensor``).
+   Performs an ``allreduce`` operation on a tensor (``tf.Tensor``).
+
+   This is the ``smdistributed.dataparallel`` package's AllReduce API for TensorFlow;
+   use it to allreduce gradient tensors. By default, ``smdistributed.dataparallel``
+   allreduce averages the gradient tensors across the participating workers.
+
+   .. note::
+
+      :class:`smdistributed.dataparallel.tensorflow.allreduce()` should
+      only be used to allreduce gradient tensors.
+      For other (non-gradient) tensors, you must use
+      :class:`smdistributed.dataparallel.tensorflow.oob_allreduce()`.
+      If you use :class:`smdistributed.dataparallel.tensorflow.allreduce()`
+      for non-gradient tensors,
+      the distributed training job might stall or stop.
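+
+   The following is a minimal sketch of how the gradient allreduce step might look
+   in a TensorFlow 2 training script. It is an illustrative example, not a complete
+   script: it assumes ``smdistributed.dataparallel.tensorflow.init()`` has already
+   been called, and ``model``, ``loss_fn``, ``optimizer``, ``images``, and ``labels``
+   are placeholders supplied by the caller.
+
+   .. code:: python
+
+      import tensorflow as tf
+      import smdistributed.dataparallel.tensorflow as sdp
+
+      def training_step(model, loss_fn, optimizer, images, labels):
+          # model, loss_fn, and optimizer are placeholders defined by the caller.
+          with tf.GradientTape() as tape:
+              probs = model(images, training=True)
+              loss_value = loss_fn(labels, probs)
+
+          # Compute local gradients, then average each gradient tensor
+          # across all participating workers.
+          grads = tape.gradient(loss_value, model.trainable_variables)
+          grads = [
+              sdp.allreduce(grad, i, len(grads))  # tensor, param_index, num_params
+              for i, grad in enumerate(grads)
+          ]
+          optimizer.apply_gradients(zip(grads, model.trainable_variables))
+          return loss_value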
 
-   ``smdistributed.dataparallel`` AllReduce API can be used for all
-   reducing gradient tensors or any other tensors. By
-   default, ``smdistributed.dataparallel`` AllReduce averages the
-   tensors across the participating workers.
-   ​
    **Inputs:**
 
-   - ``tensor (tf.Tensor)(required)``: The tensor to be all-reduced. The shape of the input must be identical across all ranks.
+   - ``tensor (tf.Tensor)(required)``: The tensor to be allreduced. The shape of the input must be identical across all ranks.
    - ``param_index (int)(required):`` 0 if you are reducing a single tensor. Index of the tensor if you are reducing a list of tensors.
    - ``num_params (int)(required):`` len(tensor).
    - ``compression (smdistributed.dataparallel.tensorflow.Compression)(optional)``: Compression algorithm used to reduce the amount of data sent and received by each worker node. Defaults to not using compression.
@@ -306,9 +315,9 @@ TensorFlow API
 
 .. function:: smdistributed.dataparallel.tensorflow.oob_allreduce(tensor, compression=Compression.none, op=ReduceOp.AVERAGE)
 
-   OutOfBand (oob) AllReduce is simplified AllReduce function for use cases
+   Out-of-band (oob) AllReduce is a simplified AllReduce function for use cases
    such as calculating total loss across all the GPUs in the training.
-   oob_allreduce average the tensors, as reduction operation, across the
+   ``oob_allreduce`` averages the tensors, as the reduction operation, across the
    worker nodes.
 
    **Inputs:**
@@ -326,15 +335,25 @@ TensorFlow API
 
    - ``None``
 
-   .. rubric:: Notes
-
-   ``smdistributed.dataparallel.tensorflow.oob_allreduce``, in most
-   cases, is ~2x slower
-   than ``smdistributed.dataparallel.tensorflow.allreduce``  so it is not
-   recommended to be used for performing gradient reduction during the
-   training
-   process. ``smdistributed.dataparallel.tensorflow.oob_allreduce`` internally
-   uses NCCL AllReduce with ``ncclSum`` as the reduction operation.
+   .. note::
+
+      In most cases, the :class:`smdistributed.dataparallel.tensorflow.oob_allreduce()`
+      function is ~2x slower
+      than :class:`smdistributed.dataparallel.tensorflow.allreduce()`. It is not
+      recommended to use the :class:`smdistributed.dataparallel.tensorflow.oob_allreduce()`
+      function to perform gradient
+      reduction during the training process.
+      ``smdistributed.dataparallel.tensorflow.oob_allreduce`` internally
+      uses NCCL AllReduce with ``ncclSum`` as the reduction operation.
+
+   .. note::
+
+      :class:`smdistributed.dataparallel.tensorflow.oob_allreduce()` should
+      only be used to allreduce non-gradient tensors.
+      If you use :class:`smdistributed.dataparallel.tensorflow.allreduce()`
+      for non-gradient tensors,
+      the distributed training job might stall or stop.
+      To allreduce gradients, use :class:`smdistributed.dataparallel.tensorflow.allreduce()`.
 
 .. function:: smdistributed.dataparallel.tensorflow.overlap(tensor)