- If ``True``, the library shards the optimizer state of all parameters across
  the data parallel processes which hold the same parameter.
  This optimizer state sharding happens in a balanced manner.
  Note that when sharding optimizer state, saving the full optimizer state is not currently supported;
  save the partial optimizer state instead. For more information about saving and loading checkpoints with
  optimizer state sharding, see `Instructions for Checkpointing with Tensor Parallelism <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-extended-features-pytorch-saving-loading-checkpoints.html>`_.
- Skips the initial tracing step. This can be useful in very large models
  where even model tracing on the CPU is not possible due to memory constraints.
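
For illustration, the following is a minimal sketch of how options such as the two described above might be passed to a SageMaker PyTorch estimator through its ``distribution`` argument. The keys ``shard_optimizer_state`` and ``skip_tracing`` are assumed here to correspond to these descriptions, and the entry point, instance type, versions, and counts are placeholders rather than recommendations.

.. code-block:: python

   # Sketch only: enabling the model parallelism library on a SageMaker
   # PyTorch estimator. The keys "shard_optimizer_state" and "skip_tracing"
   # are assumed to map to the options described above.
   from sagemaker.pytorch import PyTorch

   smp_options = {
       "enabled": True,
       "parameters": {
           "partitions": 2,                # number of model partitions
           "microbatches": 4,
           "optimize": "speed",
           "shard_optimizer_state": True,  # shard optimizer state across data-parallel ranks
           "skip_tracing": True,           # skip the initial CPU tracing step
       },
   }

   estimator = PyTorch(
       entry_point="train.py",             # placeholder training script
       role="<your-iam-role>",
       instance_type="ml.p3.16xlarge",
       instance_count=2,
       framework_version="1.8.1",          # adjust to a version supported by the library
       py_version="py36",
       distribution={
           "smdistributed": {"modelparallel": smp_options},
           "mpi": {"enabled": True, "processes_per_host": 8},
       },
   )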
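
Because full optimizer saving is not supported while the optimizer state is sharded, the training script saves partial (per-rank) state instead. Below is a minimal sketch of that idea, assuming ``model`` and ``optimizer`` are the library-wrapped objects; follow the linked checkpointing instructions for the authoritative pattern, including which ranks should write files.

.. code-block:: python

   # Sketch only, inside the training script: saving a *partial* checkpoint
   # when optimizer state sharding is enabled. Path and file layout are
   # placeholders; see the linked checkpointing instructions for details.
   import smdistributed.modelparallel.torch as smp

   def save_partial_checkpoint(model, optimizer, path="/opt/ml/checkpoints/ckpt.pt"):
       checkpoint = {
           "model": model.local_state_dict(),          # this rank's model partition
           "optimizer": optimizer.local_state_dict(),  # this rank's optimizer shard
       }
       # partial=True marks this as a per-rank (sharded) checkpoint
       smp.save(checkpoint, path, partial=True)
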

TensorFlow-specific Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

PyTorch-specific Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
   :widths: 10 20 10 60
   :header-rows: 1

   * - Parameter
     - Type / Valid values
     - Default
     - Description
   * - ``memory_weight``
     - float [0.0, 1.0]
     - ``0.2`` if ``optimize`` is ``"speed"``, else ``0.8``
     - The weight of memory balancing in the auto-partitioning objective, as opposed to balancing computational load. If 0.0, the library only tries to balance computation; if 1.0, the library only tries to balance memory use. Any value in between interpolates between these extremes.
   * - ``ddp``
     - bool
     - ``False``
     - Must be set to ``True`` if hybrid model/data parallelism is used with ``DistributedDataParallel``. ``DistributedDataParallel`` is used with the NCCL backend, and uses the ``MASTER_PORT`` provided by SageMaker.

- This is the maximum number of microbatches that are simultaneously in execution during pipelining. Jointly scaling the batch size and the number of microbatches can often mitigate the pipeline bubble overhead, but that can lead to increased memory usage if too many microbatches are simultaneously in execution. In such cases, setting the number of active microbatches to a lower value can help control memory usage. By default, this is set to two plus the number of partitions of the model.

- ... of pipeline tasks. This determines how early the activations should be loaded back to the GPU, expressed in number of pipeline tasks. A smaller value indicates that activations are loaded closer in time to when they are needed for the backward pass. Setting this value too small might improve memory usage, but might potentially cause throughput loss and GPU bottlenecks during the CPU-to-GPU data transfer.
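
The sketch below illustrates, under stated assumptions, how the PyTorch-specific options above might appear together in the library's ``parameters`` dictionary. ``memory_weight`` and ``ddp`` come from the table; the keys ``active_microbatches`` and ``activation_loading_horizon`` are assumed names for the last two descriptions and may differ in your version of the library.

.. code-block:: python

   # Sketch only: a possible "parameters" block combining the PyTorch-specific
   # options described above. Keys marked as assumed may differ.
   smp_parameters = {
       "partitions": 4,        # four model partitions in the pipeline
       "optimize": "memory",   # with this setting, memory_weight defaults to 0.8
       "memory_weight": 0.8,   # favor balancing memory use over computational load
       "ddp": True,            # required for hybrid model/data parallelism with DistributedDataParallel
       # Assumed key: with 4 partitions the default would be 4 + 2 = 6 concurrently
       # executing microbatches; a lower value trades pipeline utilization for
       # reduced activation memory.
       "active_microbatches": 4,
       # Assumed key: how many pipeline tasks ahead to start loading offloaded
       # activations back to the GPU; smaller values save memory but risk
       # CPU-to-GPU transfer bottlenecks.
       "activation_loading_horizon": 2,
   }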