Describe the bug
Upon testing local-session [sagemaker], single-instance, multi-GPU distributed training, it fails at the stack trace below.

Stack trace
def warn_if_parameter_server_with_multi_gpu(training_instance_type, distributions):
    """Warn the user that training will not fully leverage all the GPU
    cores if parameter server is enabled and a multi-GPU instance is selected.
    Distributed training with the default parameter server setup doesn't
    support multi-GPU instances.

    Args:
        training_instance_type (str): A string representing the type of training instance selected.
        distributions (dict): A dictionary with information to enable distributed training.
            (Defaults to None if distributed training is not enabled.) For example:

            .. code:: python

                {
                    'parameter_server':
                        {
                            'enabled': True
                        }
                }
    """
    if training_instance_type == "local" or distributions is None:
        return
    is_multi_gpu_instance = (
>       training_instance_type.split(".")[1].startswith("p")
        and training_instance_type not in SINGLE_GPU_INSTANCE_TYPES
    )
E       IndexError: list index out of range

.tox/py37/lib/python3.7/site-packages/sagemaker/fw_utils.py:620: IndexError

To reproduce
https://github.com/ChaiBapchya/sagemaker-mxnet-training-toolkit/tree/mx_hvd_mpi
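For context, here is a minimal sketch of the kind of estimator configuration that reaches this code path. It is an illustrative guess at the setup, not code from the linked branch; the entry point, role, and image name are placeholders (SageMaker Python SDK 1.x API).

# Hypothetical local-mode repro: single instance, multi-GPU, Horovod via MPI.
from sagemaker.mxnet import MXNet

estimator = MXNet(
    entry_point="train.py",            # placeholder training script
    role="SageMakerRole",              # placeholder IAM role
    train_instance_count=1,
    train_instance_type="local_gpu",   # local mode on a multi-GPU machine
    framework_version="1.6.0",
    py_version="py3",
    image_name="custom-mxnet:latest",  # custom Docker image (Y)
    distributions={"mpi": {"enabled": True}},  # Horovod/MPI, no parameter server
)

# fit() calls warn_if_parameter_server_with_multi_gpu("local_gpu", distributions),
# where "local_gpu".split(".") == ["local_gpu"], so index [1] raises IndexError.
# estimator.fit("file:///tmp/training-data")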
Expected behavior
When running distributed training with MPI-based Horovod on a single multi-GPU instance in local mode, I encounter this error: "local_gpu" contains no ".", so training_instance_type.split(".")[1] raises IndexError before the warning logic even runs. Since I'm using Horovod [MPI], the parameter server warning isn't relevant to my setup anyway.
I suggest we also accept local_gpu here:
if training_instance_type in ["local", "local_gpu"] or distributions is None:
    return
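A minimal standalone sketch of how the guard would read with that change. The surrounding logic and constants are paraphrased from the traceback for illustration; this is not the actual fw_utils.py patch.

import logging

# Stub for illustration; the real constant lives in sagemaker.fw_utils.
SINGLE_GPU_INSTANCE_TYPES = ("ml.p2.xlarge", "ml.p3.2xlarge")

def warn_if_parameter_server_with_multi_gpu(training_instance_type, distributions):
    # Local-mode types ("local", "local_gpu") lack the "ml.<family>.<size>"
    # structure, so skip the check for them instead of indexing split(".")[1].
    if training_instance_type in ["local", "local_gpu"] or distributions is None:
        return
    is_multi_gpu_instance = (
        training_instance_type.split(".")[1].startswith("p")
        and training_instance_type not in SINGLE_GPU_INSTANCE_TYPES
    )
    ps_enabled = distributions.get("parameter_server", {}).get("enabled", False)
    if is_multi_gpu_instance and ps_enabled:
        logging.warning(
            "Parameter server does not fully leverage multi-GPU instances."
        )

# Previously raised IndexError; now returns quietly for Horovod/MPI in local mode:
warn_if_parameter_server_with_multi_gpu("local_gpu", {"mpi": {"enabled": True}})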
System information
SageMaker Python SDK version: built from source [1.62.1.dev0]
Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): MXNet
Framework version: 1.6.0
Python version: 3
CPU or GPU: GPU
Custom Docker image (Y/N): Y