
local_gpu with distributed training for single instance multi-gpu distributed training #1582


Closed
ChaiBapchya opened this issue Jun 12, 2020 · 1 comment

ChaiBapchya (Contributor) commented Jun 12, 2020

Describe the bug
When testing single-instance, multi-GPU distributed training with a SageMaker local session, the job fails with the following input.

Input
training_instance_type = 'local_gpu', distributions = {'mpi': {'enabled': True, 'processes_per_host': 4}}
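
For context, a minimal estimator along these lines triggers the failure. This is a sketch: the entry point, role, and image name are placeholders, and the parameter names follow the 1.x SDK.

    from sagemaker.mxnet import MXNet

    # Hypothetical entry point, role, and image; only train_instance_type
    # and distributions matter for reproducing the IndexError.
    estimator = MXNet(
        entry_point="train.py",
        role="SageMakerRole",
        train_instance_count=1,
        train_instance_type="local_gpu",  # local mode, all GPUs on this host
        image_name="custom-mxnet-horovod:latest",  # custom Docker image
        framework_version="1.6.0",
        py_version="py3",
        distributions={"mpi": {"enabled": True, "processes_per_host": 4}},
    )
    estimator.fit()  # raises IndexError in warn_if_parameter_server_with_multi_gpu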

Stack Trace

    def warn_if_parameter_server_with_multi_gpu(training_instance_type, distributions):
        """Warn the user that training will not fully leverage all the GPU
        cores if parameter server is enabled and a multi-GPU instance is selected.
        Distributed training with the default parameter server setup doesn't
        support multi-GPU instances.

        Args:
            training_instance_type (str): A string representing the type of training instance selected.
            distributions (dict): A dictionary with information to enable distributed training.
                (Defaults to None if distributed training is not enabled.) For example:

                .. code:: python

                    {
                        'parameter_server':
                        {
                            'enabled': True
                        }
                    }


        """
        if training_instance_type == "local" or distributions is None:
            return

        is_multi_gpu_instance = (
>           training_instance_type.split(".")[1].startswith("p")
            and training_instance_type not in SINGLE_GPU_INSTANCE_TYPES
        )
E       IndexError: list index out of range

.tox/py37/lib/python3.7/site-packages/sagemaker/fw_utils.py:620: IndexError
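
The root cause: local-mode instance types contain no ".", so the split produces a single-element list and index [1] is out of range. A quick illustration:

    >>> "ml.p3.8xlarge".split(".")  # hosted multi-GPU instance type
    ['ml', 'p3', '8xlarge']
    >>> "local_gpu".split(".")      # local mode has no "." to split on
    ['local_gpu']                   # so [1] raises IndexError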

To reproduce

tox -e py37 -- tests/integ/test_horovod_mx.py

Expected behavior
When running distributed training on a single multi-GPU instance with MPI-based Horovod, I hit this error.
Since I'm using Horovod (MPI), the parameter-server warning isn't relevant here.
I suggest we also accept local_gpu in the early return:

    if training_instance_type in ["local", "local_gpu"] or distributions is None:
        return
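
Alternatively, a sketch of a more defensive check (not the shipped fix) that treats any instance type without a "." as non-multi-GPU instead of raising:

    parts = training_instance_type.split(".")
    is_multi_gpu_instance = (
        len(parts) > 1                # hosted types look like "ml.p3.8xlarge"
        and parts[1].startswith("p")  # p-family instances carry GPUs
        and training_instance_type not in SINGLE_GPU_INSTANCE_TYPES
    )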

System information

  • SageMaker Python SDK version: built from source [1.62.1.dev0]
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): MXNet
  • Framework version: 1.6.0
  • Python version: 3
  • CPU or GPU: GPU
  • Custom Docker image (Y/N): Y
laurenyu (Contributor) commented

The warning-message update has been released in https://github.com/aws/sagemaker-python-sdk/releases/tag/v1.65.0

pintaoz-aws pushed a commit that referenced this issue Dec 4, 2024