
local_gpu with distributed training for single instance multi-gpu distributed training #1582


Closed
ChaiBapchya opened this issue Jun 12, 2020 · 1 comment

ChaiBapchya (Contributor) commented Jun 12, 2020

Describe the bug
When testing single-instance, multi-GPU distributed training with a SageMaker local session, the job fails with the following input.

Input
training_instance_type = 'local_gpu', distributions = {'mpi': {'enabled': True, 'processes_per_host': 4}}
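
For context, a minimal estimator along these lines triggers the failure. This is a sketch: the entry point, role, and image name are placeholders, and the parameter names follow the 1.x SDK.

    from sagemaker.mxnet import MXNet

    # Hypothetical entry point, role, and image; only train_instance_type
    # and distributions matter for reproducing the IndexError.
    estimator = MXNet(
        entry_point="train.py",
        role="SageMakerRole",
        train_instance_count=1,
        train_instance_type="local_gpu",  # local mode, all GPUs on this host
        image_name="custom-mxnet-horovod:latest",  # custom Docker image
        framework_version="1.6.0",
        py_version="py3",
        distributions={"mpi": {"enabled": True, "processes_per_host": 4}},
    )
    estimator.fit()  # raises IndexError in warn_if_parameter_server_with_multi_gpu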

Stack Trace

    def warn_if_parameter_server_with_multi_gpu(training_instance_type, distributions):
        """Warn the user that training will not fully leverage all the GPU
        cores if parameter server is enabled and a multi-GPU instance is selected.
        Distributed training with the default parameter server setup doesn't
        support multi-GPU instances.

        Args:
            training_instance_type (str): A string representing the type of training instance selected.
            distributions (dict): A dictionary with information to enable distributed training.
                (Defaults to None if distributed training is not enabled.) For example:

                .. code:: python

                    {
                        'parameter_server':
                        {
                            'enabled': True
                        }
                    }


        """
        if training_instance_type == "local" or distributions is None:
            return

        is_multi_gpu_instance = (
>           training_instance_type.split(".")[1].startswith("p")
            and training_instance_type not in SINGLE_GPU_INSTANCE_TYPES
        )
E       IndexError: list index out of range

.tox/py37/lib/python3.7/site-packages/sagemaker/fw_utils.py:620: IndexError
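
The root cause: local-mode instance types contain no ".", so the split produces a single-element list and index [1] is out of range. A quick illustration:

    >>> "ml.p3.8xlarge".split(".")  # hosted multi-GPU instance type
    ['ml', 'p3', '8xlarge']
    >>> "local_gpu".split(".")      # local mode has no "." to split on
    ['local_gpu']                   # so [1] raises IndexError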

To reproduce

tox -e py37 -- tests/integ/test_horovod_mx.py

Expected behavior
When running distributed training on a single multi-GPU instance with MPI-based Horovod, I hit this error.
Since I'm using Horovod (MPI), the parameter-server warning isn't relevant here.
I suggest we also accept local_gpu in the early return:

    if training_instance_type in ["local", "local_gpu"] or distributions is None:
        return
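
Alternatively, a sketch of a more defensive check (not the shipped fix) that treats any instance type without a "." as non-multi-GPU instead of raising:

    parts = training_instance_type.split(".")
    is_multi_gpu_instance = (
        len(parts) > 1                # hosted types look like "ml.p3.8xlarge"
        and parts[1].startswith("p")  # p-family instances carry GPUs
        and training_instance_type not in SINGLE_GPU_INSTANCE_TYPES
    )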

System information

  • SageMaker Python SDK version: built from source [1.62.1.dev0]
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): MXNet
  • Framework version: 1.6.0
  • Python version: 3
  • CPU or GPU: GPU
  • Custom Docker image (Y/N): Y
laurenyu (Contributor) commented

The warning-message update has been released in https://github.com/aws/sagemaker-python-sdk/releases/tag/v1.65.0

pintaoz-aws pushed a commit that referenced this issue Dec 4, 2024