[bug] broken custom image local mode training due to 1.11.1 #421

yifeim · 2018-10-07T07:49:11Z

System Information

Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): BYO scikit-learn
Framework Version: custom image
Python Version: 3.6.5
CPU or GPU: CPU
Python SDK Version: 1.11.1
Are you using a custom image: Yes

Describe the problem

The following code worked with sagemaker==1.11.0 but fails with sagemaker==1.11.1. Any guesses as to what might be missing? Thanks!

PS: I downgraded sagemaker to 1.11.0 and confirmed that the example works there.

Minimal repro / logs

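    from sagemaker import get_execution_role
    from sagemaker.estimator import Estimator

    # `hyperparameters` (dict) and `inputs` (channel name -> local path) are defined earlier.
    m = Estimator(
        image_name='<some image>',
        role=get_execution_role(),
        train_instance_count=1,
        train_instance_type='local_gpu')

    m.set_hyperparameters(**hyperparameters)
    # Local mode reads each channel from a file:// URI.
    m.fit({k: 'file://' + v for k, v in inputs.items()})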

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-34-104847d86296> in <module>()
     61     m.set_hyperparameters(**hyperparameters)
     62 
---> 63     m.fit({k:'file://'+v for k,v in inputs.items()})
     64 
     65     predictor = m.deploy(1, 'local_gpu')

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name)
    189         self._prepare_for_training(job_name=job_name)
    190 
--> 191         self.latest_training_job = _TrainingJob.start_new(self, inputs)
    192         if wait:
    193             self.latest_training_job.wait(logs=logs)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/estimator.py in start_new(cls, estimator, inputs)
    415                                           resource_config=config['resource_config'], vpc_config=config['vpc_config'],
    416                                           hyperparameters=hyperparameters, stop_condition=config['stop_condition'],
--> 417                                           tags=estimator.tags)
    418 
    419         return cls(estimator.sagemaker_session, estimator._current_job_name)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/session.py in train(self, image, input_mode, input_config, role, job_name, output_config, resource_config, vpc_config, hyperparameters, stop_condition, tags)
    276         LOGGER.info('Creating training-job with name: {}'.format(job_name))
    277         LOGGER.debug('train request: {}'.format(json.dumps(train_request, indent=4)))
--> 278         self.sagemaker_client.create_training_job(**train_request)
    279 
    280     def tune(self, job_name, strategy, objective_type, objective_metric_name,

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/local_session.py in create_training_job(self, TrainingJobName, AlgorithmSpecification, InputDataConfig, OutputDataConfig, ResourceConfig, **kwargs)
     73         training_job = _LocalTrainingJob(container)
     74         hyperparameters = kwargs['HyperParameters'] if 'HyperParameters' in kwargs else {}
---> 75         training_job.start(InputDataConfig, hyperparameters)
     76 
     77         LocalSagemakerClient._training_jobs[TrainingJobName] = training_job

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/entities.py in start(self, input_data_config, hyperparameters)
     58         self.state = self._TRAINING
     59 
---> 60         self.model_artifacts = self.container.train(input_data_config, hyperparameters)
     61         self.end = datetime.datetime.now()
     62         self.state = self._COMPLETED

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/image.py in train(self, input_data_config, hyperparameters)
    109         training_env_vars = {
    110             REGION_ENV_NAME: self.sagemaker_session.boto_region_name,
--> 111             TRAINING_JOB_NAME_ENV_NAME: json.loads(hyperparameters.get(sagemaker.model.JOB_NAME_PARAM_NAME)),
    112         }
    113         compose_data = self._generate_compose_file('train', additional_volumes=volumes,

~/anaconda3/envs/mxnet_p36/lib/python3.6/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    346         if not isinstance(s, (bytes, bytearray)):
    347             raise TypeError('the JSON object must be str, bytes or bytearray, '
--> 348                             'not {!r}'.format(s.__class__.__name__))
    349         s = s.decode(detect_encoding(s), 'surrogatepass')
    350 

TypeError: the JSON object must be str, bytes or bytearray, not 'NoneType'

yifeim · 2018-10-08T03:42:42Z

(Update: the custom setting is unrelated, but I remember it led to some other issues that are out of scope for this discussion.)

```
mkdir -p .sagemaker
printf "local:\n   container_root: /home/ec2-user/tmp\n" > .sagemaker/config.yaml
```
~~Not sure if it is related to the bug.~~

ChoiByungWook · 2018-10-09T17:51:02Z

Hello,

Thank you for bringing this bug to our attention.

This is related to a change made in #411.

Specifically here: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/local/image.py#L112

That value is read from the hyperparameters, where it is normally set when you use a deep learning framework estimator such as TensorFlow. With a generic Estimator and a custom image it is missing, so the lookup returns None and json.loads raises the TypeError shown in the traceback. This is a bug and a fix will be released soon.
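For readers following along, here is a minimal sketch of the failure mode in the traceback, using plain Python with no SageMaker dependency. The key name `sagemaker_job_name` is an assumed stand-in for `sagemaker.model.JOB_NAME_PARAM_NAME`, and both helper functions are hypothetical names invented for this example. The guarded lookup only illustrates the pattern; it is not necessarily how the SDK fixed the bug.

```python
import json

# Assumption: stand-in for sagemaker.model.JOB_NAME_PARAM_NAME; the real value may differ.
JOB_NAME_PARAM_NAME = "sagemaker_job_name"


def job_name_unguarded(hyperparameters):
    """Mirrors the failing pattern: json.loads on a value that may be missing."""
    return json.loads(hyperparameters.get(JOB_NAME_PARAM_NAME))


def job_name_from_hyperparameters(hyperparameters, default=None):
    """Hypothetical guarded lookup: decode the value only when it is present."""
    raw = hyperparameters.get(JOB_NAME_PARAM_NAME)
    if raw is None:
        return default
    return json.loads(raw)


# A framework estimator would add a JSON-encoded job name to the hyperparameters.
framework_style = {JOB_NAME_PARAM_NAME: json.dumps("my-training-job")}
# A generic Estimator with a custom image only carries the user's own keys.
custom_image_style = {"max_depth": "5"}

print(job_name_from_hyperparameters(framework_style))
print(job_name_from_hyperparameters(custom_image_style, default="local-job"))

try:
    job_name_unguarded(custom_image_style)
except TypeError as err:
    # Same class of error as in the traceback: json.loads received None.
    print("TypeError:", err)
```

Until a fixed release is installed, pinning sagemaker to 1.11.0 (as the reporter did) avoids this code path.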

yifeim · 2018-10-10T03:22:46Z

Thanks for the attention, and glad that you found the report useful. The next release will be totally fine with me. Thanks.

Will report when things change.

ChoiByungWook · 2018-10-10T23:24:22Z

@yifeim

The fix has been released! Please update and let me know if the change addresses your problem!

yifeim · 2018-10-10T23:36:01Z

Thanks for the fast response. Still have problems. Not sure if this is related?

AttributeError                            Traceback (most recent call last)
<ipython-input-35-d98be4cea25e> in <module>()
     64     m.fit({k:'file://'+v for k,v in inputs.items()})
     65 
---> 66     predictor = m.deploy(1, 'local_gpu')
     67     predictor.serializer = json_serializer
     68     predictor.deserializer = json_deserializer

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/estimator.py in deploy(self, initial_instance_count, instance_type, endpoint_name, **kwargs)
    274             instance_type=instance_type,
    275             initial_instance_count=initial_instance_count,
--> 276             endpoint_name=endpoint_name)
    277 
    278     @property

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, endpoint_name, tags)
    106         production_variant = sagemaker.production_variant(self.name, instance_type, initial_instance_count)
    107         self.endpoint_name = endpoint_name or self.name
--> 108         self.sagemaker_session.endpoint_from_production_variants(self.endpoint_name, [production_variant], tags)
    109         if self.predictor_cls:
    110             return self.predictor_cls(self.endpoint_name, self.sagemaker_session)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/session.py in endpoint_from_production_variants(self, name, production_variants, tags, wait)
    779 
    780             self.sagemaker_client.create_endpoint_config(**config_options)
--> 781         return self.create_endpoint(endpoint_name=name, config_name=name, wait=wait)
    782 
    783     def expand_role(self, role):

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/session.py in create_endpoint(self, endpoint_name, config_name, wait)
    557         """
    558         LOGGER.info('Creating endpoint with name {}'.format(endpoint_name))
--> 559         self.sagemaker_client.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=config_name)
    560         if wait:
    561             self.wait_for_endpoint(endpoint_name)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/local_session.py in create_endpoint(self, EndpointName, EndpointConfigName)
    130         endpoint = _LocalEndpoint(EndpointName, EndpointConfigName, self.sagemaker_session)
    131         LocalSagemakerClient._endpoints[EndpointName] = endpoint
--> 132         endpoint.serve()
    133 
    134     def delete_endpoint(self, EndpointName):

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/entities.py in serve(self)
    142         self.create_time = datetime.datetime.now()
    143         self.container = _SageMakerContainer(instance_type, instance_count, image, self.local_session)
--> 144         self.container.serve(self.primary_container['ModelDataUrl'], self.primary_container['Environment'])
    145 
    146         i = 0

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/image.py in serve(self, model_dir, environment)
    173         self._generate_compose_file('serve',
    174                                     additional_env_vars=environment,
--> 175                                     additional_volumes=volumes)
    176         compose_command = self._compose()
    177         self.container = _HostingContainer(compose_command)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/local/image.py in _generate_compose_file(self, command, additional_volumes, additional_env_vars)
    395             environment.extend(aws_creds)
    396 
--> 397         additional_env_var_list = ['{}={}'.format(k, v) for k, v in additional_env_vars.items()]
    398         environment.extend(additional_env_var_list)
    399 

AttributeError: 'list' object has no attribute 'items'

icywang86rui · 2018-10-11T22:12:32Z

@yifeim - It looks like this time it failed at creating an endpoint. Does the same code work with 1.11 for the hosting part? Have you tried creating a model first like this:

    model = m.create_model()
    predictor = model.deploy(1, 'local')

I have walked through the code path that failed, and additional_env_vars is supposed to be an empty dict since you did not pass any additional arguments to the deploy() call. I am trying to reproduce this error now.

yifeim · 2018-10-11T22:29:08Z

I could create a model, but could not deploy it. I had the exact same errors.

yifeim · 2018-10-11T22:32:34Z

Here?

https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/local/image.py#L344

yifeim · 2018-10-11T22:34:37Z

It seems that the code detects that it is an empty dict and replaces it with an empty list. Not sure why a list is the expected format.

icywang86rui · 2018-10-12T18:09:14Z

@yifeim You are right, and we have a bug there. It is supposed to be:

    additional_env_vars = additional_env_vars or {}
    additional_volumes = additional_volumes or []

Thanks for pointing it out. I will get a fix out as soon as possible.
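A small standalone sketch of why the default type matters here. The two function names are hypothetical, and the "buggy" variant is an assumed simplification of the pattern described in this thread rather than a copy of the SDK source. An empty dict is falsy, so `x or []` silently swaps it for a list, and the later `.items()` call fails.

```python
def env_list_buggy(additional_env_vars=None):
    # Assumed simplification: a falsy value (None or {}) is replaced by a list.
    additional_env_vars = additional_env_vars or []
    return ['{}={}'.format(k, v) for k, v in additional_env_vars.items()]


def env_list_fixed(additional_env_vars=None):
    # The default keeps the dict type, so .items() always exists.
    additional_env_vars = additional_env_vars or {}
    return ['{}={}'.format(k, v) for k, v in additional_env_vars.items()]


# A non-empty dict works in both variants, which is why the bug is easy to miss.
print(env_list_buggy({'FOO': 'bar'}))
print(env_list_fixed({'FOO': 'bar'}))

# An empty dict (no extra environment passed to deploy()) only works in the fixed one.
print(env_list_fixed({}))
try:
    env_list_buggy({})
except AttributeError as err:
    print("AttributeError:", err)
```

This would explain why deploys that pass extra environment variables were unaffected, while a plain `deploy()` call hit the error.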

icywang86rui · 2018-10-12T22:20:45Z

PR to fix this: #429

icywang86rui · 2018-10-15T21:40:46Z

PR is merged and this change is scheduled to be released tomorrow.

yifeim · 2018-10-21T20:50:02Z

I can verify that the new release solved both problems. Thanks a lot everyone!

yifeim changed the title ~~[bug] broken custom image training due to 1.11.1~~ [bug] broken custom image local mode training due to 1.11.1 Oct 7, 2018

ChoiByungWook added the type: bug label Oct 9, 2018

ChoiByungWook mentioned this issue Oct 10, 2018: Add job_name parameter for local mode #424 (merged)

ChoiByungWook mentioned this issue Oct 10, 2018: Bump version to 1.11.2 #425 (merged)

icywang86rui mentioned this issue Oct 12, 2018: Fix bug in local mode #429 (merged)

yifeim closed this as completed Oct 21, 2018

apacker pushed a commit to apacker/sagemaker-python-sdk that referenced this issue Nov 15, 2018: Add regional S3 endpoint to PySpark notebooks (aws#421), a4d231d
