Local mode failure for MXNet estimator #1349

Closed
ehsanmok opened this issue Mar 11, 2020 · 3 comments

@ehsanmok

ehsanmok commented Mar 11, 2020

Describe the bug
I'm on an ml.p3.2xlarge instance in the mxnet_p36 conda env, and I installed local mode support with python -m pip install "sagemaker[local]". Training with the estimator below fails in local mode (train_instance_type='local' or 'local_gpu') but works with any non-local instance type:

from sagemaker.mxnet import MXNet

estimator = MXNet(entry_point='main.py',
                  source_dir='code',
                  role=role,
                  train_instance_count=1,
                  train_instance_type='local',  # 'ml.c4.2xlarge'
                  framework_version="1.4.1",
                  py_version='py3',
                  hyperparameters=hyperparameters,
                  output_path=train_output,
                  code_location=code_location,
                  sagemaker_session=session,
                  )

Screenshots or logs

ClientError                               Traceback (most recent call last)
<ipython-input-11-2c228015dcf6> in <module>()
     15                  )
     16 
---> 17 estimator.fit(input_data)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    460         self._prepare_for_training(job_name=job_name)
    461 
--> 462         self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
    463         self.jobs.append(self.latest_training_job)
    464         if wait:

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/estimator.py in start_new(cls, estimator, inputs, experiment_config)
   1008             train_args["enable_sagemaker_metrics"] = estimator.enable_sagemaker_metrics
   1009 
-> 1010         estimator.sagemaker_session.train(**train_args)
   1011 
   1012         return cls(estimator.sagemaker_session, estimator._current_job_name)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/session.py in train(self, input_mode, input_config, role, job_name, output_config, resource_config, vpc_config, hyperparameters, stop_condition, tags, metric_definitions, enable_network_isolation, image, algorithm_arn, encrypt_inter_container_traffic, train_use_spot_instances, checkpoint_s3_uri, checkpoint_local_path, experiment_config, debugger_rule_configs, debugger_hook_config, tensorboard_output_config, enable_sagemaker_metrics)
    567         LOGGER.info("Creating training-job with name: %s", job_name)
    568         LOGGER.debug("train request: %s", json.dumps(train_request, indent=4))
--> 569         self.sagemaker_client.create_training_job(**train_request)
    570 
    571     def process(

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    314                     "%s() only accepts keyword arguments." % py_operation_name)
    315             # The "self" in this scope is referring to the BaseClient.
--> 316             return self._make_api_call(operation_name, kwargs)
    317 
    318         _api_call.__name__ = str(py_operation_name)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    624             error_code = parsed_response.get("Error", {}).get("Code")
    625             error_class = self.exceptions.from_code(error_code)
--> 626             raise error_class(parsed_response, operation_name)
    627         else:
    628             return parsed_response

ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: 1 validation error detected: Value 'local' at 'resourceConfig.instanceType' failed to satisfy constraint: Member must satisfy enum value set: [ml.p2.xlarge, ml.m5.4xlarge, ml.m4.16xlarge, ml.p3.16xlarge, ml.m5.large, ml.p2.16xlarge, ml.c4.2xlarge, ml.c5.2xlarge, ml.c4.4xlarge, ml.c5.4xlarge, ml.g4dn.xlarge, ml.g4dn.12xlarge, ml.c4.8xlarge, ml.g4dn.2xlarge, ml.c5.9xlarge, ml.g4dn.4xlarge, ml.c5.xlarge, ml.g4dn.16xlarge, ml.c4.xlarge, ml.g4dn.8xlarge, ml.c5.18xlarge, ml.p3dn.24xlarge, ml.p3.2xlarge, ml.m5.xlarge, ml.m4.10xlarge, ml.m5.12xlarge, ml.m4.xlarge, ml.m5.24xlarge, ml.m4.2xlarge, ml.p2.8xlarge, ml.m5.2xlarge, ml.p3.8xlarge, ml.m4.4xlarge]

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 1.50.16
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): MXNet
  • Framework version: 1.4.1 and 1.6.0
  • Python version: py3
  • CPU or GPU: Both
  • Custom Docker image (Y/N):
@chuyang-deng
Contributor

Hi @ehsanmok, thanks for using SageMaker.

If you use 'local' or 'local_gpu' and also pass a 'sagemaker_session' to the estimator constructor (as in the code in your description), please make sure that the session is a LocalSession object, as here: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/estimator.py#L220

Otherwise the request goes to the SageMaker API through the regular session, and you'll hit the instance type validation error shown in your traceback.
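
A minimal sketch of that fix, assuming the v1 SDK used in the description. The helper names is_local_mode and pick_session are hypothetical and not part of the SDK; sagemaker.LocalSession and sagemaker.Session are the SDK's own classes. The sagemaker import is deferred so the pure helper can be read and checked without the SDK installed.

```python
LOCAL_INSTANCE_TYPES = ("local", "local_gpu")


def is_local_mode(instance_type):
    """Return True when the instance type requests SageMaker local mode."""
    return instance_type in LOCAL_INSTANCE_TYPES


def pick_session(instance_type):
    """Hypothetical helper: build the session that matches the instance type.

    Assumes `pip install "sagemaker[local]"` and a configured AWS region.
    """
    import sagemaker  # deferred: only needed when a session is actually built

    if is_local_mode(instance_type):
        # Local mode runs the training container on this machine via Docker.
        return sagemaker.LocalSession()
    # Hosted training goes through the CreateTrainingJob API.
    return sagemaker.Session()
```

With this, the estimator from the description would be built with sagemaker_session=pick_session('local') instead of a regular session. As far as the linked estimator.py shows, leaving sagemaker_session unset for a local instance type should also make the estimator create a LocalSession on its own.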

@ehsanmok
Author

ehsanmok commented Mar 12, 2020

Thanks @ChuyangDeng for your help! Switching to sagemaker.LocalSession() resolved the issue. Ideally there should be a warning or a more insightful error message when the user accidentally passes a regular Session in local mode.
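
To illustrate the kind of check meant here, a rough sketch of a pre-flight guard. It is an assumption-laden example, not the SDK's implementation: check_session_for_local_mode is a hypothetical name, and the check compares class names rather than using isinstance so that the sketch runs without the SDK installed.

```python
LOCAL_INSTANCE_TYPES = ("local", "local_gpu")


def check_session_for_local_mode(instance_type, session):
    """Hypothetical pre-flight check; not part of the SageMaker SDK.

    Raises ValueError when a local instance type is paired with a session
    whose class hierarchy contains no class named LocalSession.
    """
    if instance_type not in LOCAL_INSTANCE_TYPES:
        return  # hosted training: any session type is acceptable here
    class_names = {cls.__name__ for cls in type(session).__mro__}
    if "LocalSession" not in class_names:
        raise ValueError(
            "instance type %r requires a LocalSession, got %s"
            % (instance_type, type(session).__name__)
        )
```

Calling check_session_for_local_mode('local', session) before estimator.fit(...) would then fail early with a readable message instead of the ValidationException from CreateTrainingJob.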

@laurenyu
Contributor

Glad your issue got resolved!

Someone else has actually opened a PR to address the same concern about having a more informative error message: #1356
