Skip to content

[bug] broken custom image local mode training due to 1.11.1 #421

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
yifeim opened this issue Oct 7, 2018 · 13 comments
Closed

[bug] broken custom image local mode training due to 1.11.1 #421

yifeim opened this issue Oct 7, 2018 · 13 comments

Comments

@yifeim
Copy link
Contributor

yifeim commented Oct 7, 2018

Please fill out the form below.

System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): BYO scikit-learn
  • Framework Version: custom image
  • Python Version: 3.6.5
  • CPU or GPU: CPU
  • Python SDK Version: 1.11.1
  • Are you using a custom image: Yes

Describe the problem

The following codes were fine with sagemaker ==1.11.0, but not okay with sagemaker==1.11.1. Any guesses what might be missing? Thanks!

PS: I did downgrade sagemaker to 1.11.0 and tested out that the example is okay.

Minimal repro / logs

    m = Estimator(
        image_name='<some image>',
                  role=get_execution_role(),
                  train_instance_count=1,
                  train_instance_type='local_gpu')
    
    m.set_hyperparameters(**hyperparameters)
    m.fit({k:'file://'+v for k,v in inputs.items()})
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-34-104847d86296> in <module>()
     61     m.set_hyperparameters(**hyperparameters)
     62 
---> 63     m.fit({k:'file://'+v for k,v in inputs.items()})
     64 
     65     predictor = m.deploy(1, 'local_gpu')

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name)
    189         self._prepare_for_training(job_name=job_name)
    190 
--> 191         self.latest_training_job = _TrainingJob.start_new(self, inputs)
    192         if wait:
    193             self.latest_training_job.wait(logs=logs)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/estimator.py in start_new(cls, estimator, inputs)
    415                                           resource_config=config['resource_config'], vpc_config=config['vpc_config'],
    416                                           hyperparameters=hyperparameters, stop_condition=config['stop_condition'],
--> 417                                           tags=estimator.tags)
    418 
    419         return cls(estimator.sagemaker_session, estimator._current_job_name)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/session.py in train(self, image, input_mode, input_config, role, job_name, output_config, resource_config, vpc_config, hyperparameters, stop_condition, tags)
    276         LOGGER.info('Creating training-job with name: {}'.format(job_name))
    277         LOGGER.debug('train request: {}'.format(json.dumps(train_request, indent=4)))
--> 278         self.sagemaker_client.create_training_job(**train_request)
    279 
    280     def tune(self, job_name, strategy, objective_type, objective_metric_name,

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/local_session.py in create_training_job(self, TrainingJobName, AlgorithmSpecification, InputDataConfig, OutputDataConfig, ResourceConfig, **kwargs)
     73         training_job = _LocalTrainingJob(container)
     74         hyperparameters = kwargs['HyperParameters'] if 'HyperParameters' in kwargs else {}
---> 75         training_job.start(InputDataConfig, hyperparameters)
     76 
     77         LocalSagemakerClient._training_jobs[TrainingJobName] = training_job

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/entities.py in start(self, input_data_config, hyperparameters)
     58         self.state = self._TRAINING
     59 
---> 60         self.model_artifacts = self.container.train(input_data_config, hyperparameters)
     61         self.end = datetime.datetime.now()
     62         self.state = self._COMPLETED

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/image.py in train(self, input_data_config, hyperparameters)
    109         training_env_vars = {
    110             REGION_ENV_NAME: self.sagemaker_session.boto_region_name,
--> 111             TRAINING_JOB_NAME_ENV_NAME: json.loads(hyperparameters.get(sagemaker.model.JOB_NAME_PARAM_NAME)),
    112         }
    113         compose_data = self._generate_compose_file('train', additional_volumes=volumes,

~/anaconda3/envs/mxnet_p36/lib/python3.6/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    346         if not isinstance(s, (bytes, bytearray)):
    347             raise TypeError('the JSON object must be str, bytes or bytearray, '
--> 348                             'not {!r}'.format(s.__class__.__name__))
    349         s = s.decode(detect_encoding(s), 'surrogatepass')
    350 

TypeError: the JSON object must be str, bytes or bytearray, not 'NoneType'
@yifeim yifeim changed the title [bug] broken custom image training due to 1.11.1 [bug] broken custom image local mode training due to 1.11.1 Oct 7, 2018
@yifeim
Copy link
Contributor Author

yifeim commented Oct 8, 2018

(Update: the custom setting is unrelated, but I remember it lead to some other issues that is out of scope of this discussion.)

```
mkdir -p .sagemaker
printf "local:\n   container_root: /home/ec2-user/tmp\n" > .sagemaker/config.yaml
```
~~~Not sure if it is related to the bug.~~~

@ChoiByungWook
Copy link
Contributor

Hello,

Thank you for bringing this bug to our attention.

This is related to a change made here #411

Specifically here: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/local/image.py#L112

That value is provided through hyperparameters, which is normally provided if you use a deep learning framework estimator such as TensorFlow. This is a bug and a fix will be released soon.

@yifeim
Copy link
Contributor Author

yifeim commented Oct 10, 2018

Thanks for the attention and glad that you found it useful. Next release will be totally fine with me. Thanks.

Will report when things change.

@ChoiByungWook
Copy link
Contributor

@yifeim

Has been released! Please update and let me know if the change addresses your problem!

@yifeim
Copy link
Contributor Author

yifeim commented Oct 10, 2018

Thanks for the fast response. Still have problems. Not sure if this is related?

AttributeError                            Traceback (most recent call last)
<ipython-input-35-d98be4cea25e> in <module>()
     64     m.fit({k:'file://'+v for k,v in inputs.items()})
     65 
---> 66     predictor = m.deploy(1, 'local_gpu')
     67     predictor.serializer = json_serializer
     68     predictor.deserializer = json_deserializer

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, endpoint_name, **kwargs)
    274             instance_type=instance_type,
    275             initial_instance_count=initial_instance_count,
--> 276             endpoint_name=endpoint_name)
    277 
    278     @property

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, endpoint_name, tags)
    106         production_variant = sagemaker.production_variant(self.name, instance_type, initial_instance_count)
    107         self.endpoint_name = endpoint_name or self.name
--> 108         self.sagemaker_session.endpoint_from_production_variants(self.endpoint_name, [production_variant], tags)
    109         if self.predictor_cls:
    110             return self.predictor_cls(self.endpoint_name, self.sagemaker_session)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/session.py in endpoint_from_production_variants(self, name, production_variants, tags, wait)
    779 
    780             self.sagemaker_client.create_endpoint_config(**config_options)
--> 781         return self.create_endpoint(endpoint_name=name, config_name=name, wait=wait)
    782 
    783     def expand_role(self, role):

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/session.py in create_endpoint(self, endpoint_name, config_name, wait)
    557         """
    558         LOGGER.info('Creating endpoint with name {}'.format(endpoint_name))
--> 559         self.sagemaker_client.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=config_name)
    560         if wait:
    561             self.wait_for_endpoint(endpoint_name)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/local_session.py in create_endpoint(self, EndpointName, EndpointConfigName)
    130         endpoint = _LocalEndpoint(EndpointName, EndpointConfigName, self.sagemaker_session)
    131         LocalSagemakerClient._endpoints[EndpointName] = endpoint
--> 132         endpoint.serve()
    133 
    134     def delete_endpoint(self, EndpointName):

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/entities.py in serve(self)
    142         self.create_time = datetime.datetime.now()
    143         self.container = _SageMakerContainer(instance_type, instance_count, image, self.local_session)
--> 144         self.container.serve(self.primary_container['ModelDataUrl'], self.primary_container['Environment'])
    145 
    146         i = 0

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/image.py in serve(self, model_dir, environment)
    173         self._generate_compose_file('serve',
    174                                     additional_env_vars=environment,
--> 175                                     additional_volumes=volumes)
    176         compose_command = self._compose()
    177         self.container = _HostingContainer(compose_command)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/image.py in _generate_compose_file(self, command, additional_volumes, additional_env_vars)
    395             environment.extend(aws_creds)
    396 
--> 397         additional_env_var_list = ['{}={}'.format(k, v) for k, v in additional_env_vars.items()]
    398         environment.extend(additional_env_var_list)
    399 

AttributeError: 'list' object has no attribute 'items'

@icywang86rui
Copy link
Contributor

@yifeim - It looks like this time it failed at creating an endpoint. Does the same code work with 1.11 for the hosting part? Have you tried creating a model first like this:

model = m.create_model()
predictor = model.deploy(1, 'local')

I have walked through the code path that failed and the addtional_env_var is suppose to be an empty dict since you did not pass in any additional arguments to the deploy() call. I am trying to reproduce this error now.

@yifeim
Copy link
Contributor Author

yifeim commented Oct 11, 2018

I could create a model, but could not deploy it. I had the exact same errors.

@yifeim
Copy link
Contributor Author

yifeim commented Oct 11, 2018

@yifeim
Copy link
Contributor Author

yifeim commented Oct 11, 2018

It seemed that the codes detected that it is an empty dict and replaced it with empty list. Not sure why a list is the expected format.

@icywang86rui
Copy link
Contributor

@yifeim Your are right and we have a bug there. It suppose to be:

additional_env_vars = additional_env_vars or {}
additional_volumes = additional_volumes or []

Thanks for pointing it out. I will get a fix out as soon as possible.

@icywang86rui icywang86rui mentioned this issue Oct 12, 2018
4 tasks
@icywang86rui
Copy link
Contributor

PR to fix this #429

@icywang86rui
Copy link
Contributor

PR is merged and this change is scheduled to be released tomorrow.

@yifeim
Copy link
Contributor Author

yifeim commented Oct 21, 2018

I can verify that the new release solved both problems. Thanks a lot everyone!

@yifeim yifeim closed this as completed Oct 21, 2018
apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this issue Nov 15, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants