[BUG] Error thrown when trying to enable instance metrics for MXNet #276

Closed
humanzz opened this issue Jul 5, 2018 · 5 comments
Labels
status: pending release The fix has been merged but not yet released to PyPI type: bug

Comments

@humanzz
Contributor

humanzz commented Jul 5, 2018

Please fill out the form below.

System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): MXNet
  • Framework Version: 1.1
  • Python Version: 3
  • CPU or GPU: CPU
  • Python SDK Version: 1.5.1
  • Are you using a custom image: No

Describe the problem

The MXNet container does not emit instance metrics by default. When trying to enable them by setting enable_cloudwatch_metrics=True, the training job fails with the exception pasted below.

Minimal repro / logs

Command to train (note it uses some of my own Python code, as it's a custom MXNet model):

    m = MXNet("kamel_estimator.py",
              source_dir='src',
              role=role,
              train_instance_count=1,
              train_instance_type="ml.m4.xlarge",
              sagemaker_session=sagemaker_session,
              py_version="py3",
              base_job_name="kamel-test-20180705",
              enable_cloudwatch_metrics=True,
              output_path=get_config(current_env, 'artifacts_s3_bucket'),
              code_location=get_config(current_env, 'artifacts_s3_bucket'),
              hyperparameters={'some_param': 1})
    m.fit('s3://kamel-data/training/data')

Training job CloudWatch log showing the error

2018-07-05 11:30:31,569 INFO - root - running container entrypoint
2018-07-05 11:30:31,570 INFO - root - starting train task
2018-07-05 11:30:31,573 INFO - container_support.training - Training starting
2018-07-05 11:30:31,575 INFO - container_support.environment - starting metrics service
2018-07-05 11:30:31,578 ERROR - container_support.training - uncaught exception during training: [Errno 2] No such file or directory: 'telegraf'
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/container_support/training.py", line 32, in start
env.start_metrics_if_enabled()
File "/usr/local/lib/python3.5/dist-packages/container_support/environment.py", line 124, in start_metrics_if_enabled
subprocess.Popen(['telegraf', '--config', telegraf_conf])
File "/usr/lib/python3.5/subprocess.py", line 947, in __init__
restore_signals, start_new_session)
File "/usr/lib/python3.5/subprocess.py", line 1551, in _execute_child
raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'telegraf'
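
For context, the container code in the traceback boils down to launching telegraf via subprocess. The sketch below reproduces that pattern and adds a guard for a missing binary; the guard and the config path are illustrative assumptions, not the container's actual code.

    import shutil
    import subprocess

    # The container's environment.py effectively runs:
    #     subprocess.Popen(['telegraf', '--config', telegraf_conf])
    # which raises FileNotFoundError when the telegraf binary is not installed.
    telegraf_conf = '/tmp/telegraf.conf'  # hypothetical config path, for illustration only

    if shutil.which('telegraf') is None:
        # A defensive variant would skip metrics instead of crashing the training job.
        print('telegraf not found on PATH; skipping instance metrics')
    else:
        subprocess.Popen(['telegraf', '--config', telegraf_conf])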

Notes

  • I suspect the issue is not actually with the Python SDK, but with the container itself not having the necessary library
  • I am opening the issue here as this is the entry point that exposes the problem
  • All over the SDK, enable_cloudwatch_metrics is always set to False. This might be because the containers are known not to support it. Otherwise, tests covering the enable_cloudwatch_metrics=True case sound like something that should be added (a sketch follows this list).
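
A minimal sketch of the kind of unit test meant in the last note, assuming the flag is stored on the estimator as enable_cloudwatch_metrics and that constructing the estimator with a mocked session makes no service calls:

    from unittest.mock import Mock

    from sagemaker.mxnet import MXNet


    def test_enable_cloudwatch_metrics_flag():
        # Construct the estimator with the flag enabled and a mocked session,
        # then check the flag is retained on the estimator object.
        estimator = MXNet("kamel_estimator.py",
                          role="arn:aws:iam::123456789012:role/DummyRole",  # hypothetical role ARN
                          train_instance_count=1,
                          train_instance_type="ml.m4.xlarge",
                          sagemaker_session=Mock(),
                          py_version="py3",
                          enable_cloudwatch_metrics=True)

        assert estimator.enable_cloudwatch_metrics is True
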
@iquintero
Contributor

Hi @humanzz, thanks for the report! We will look into this.

@humanzz
Contributor Author

humanzz commented Jul 24, 2018

Hi @iquintero,
I'm seeing some progress as evidenced by your #292. I saw in your change that the estimator now warns that this parameter is deprecated and that metrics are emitted by default for training jobs. How about for endpoints?
I already have a running endpoint and I can see it's not emitting anything. That was the main reason I tried to enable the metrics: to monitor the endpoint.
Also, related to that, I couldn't find any documentation about SAGEMAKER_CONTAINER_LOG_LEVEL, what values it accepts, or whether it affects instance metrics at all.

@iquintero
Contributor

iquintero commented Jul 31, 2018

Hi @humanzz

I just checked one of my endpoints and it does post metrics. In the AWS Console, under View Instance Metrics, it does seem like there are no metrics:

[screenshot: SageMaker console "View Instance Metrics" view showing no metrics]

However if you click under /aws/sagemaker/Endpoints and keep the same endpoint name filter, you will see something similar to:

[screenshot: CloudWatch metrics under /aws/sagemaker/Endpoints with the endpoint name filter]

We can see there are 5 metrics for this endpoint; clicking on it will show all the actual instance metrics:

[screenshot: the instance metrics for the endpoint]

I will check with our console team to see why the link from the SageMaker console is not correct. However, this should hopefully unblock you.
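
For reference, a minimal boto3 sketch of listing those metrics programmatically under the /aws/sagemaker/Endpoints namespace; the EndpointName dimension and the endpoint name value are assumptions for illustration:

    import boto3

    cloudwatch = boto3.client('cloudwatch')

    # List the instance metrics published for one endpoint. The dimension name
    # and the endpoint name below are illustrative assumptions.
    response = cloudwatch.list_metrics(
        Namespace='/aws/sagemaker/Endpoints',
        Dimensions=[{'Name': 'EndpointName', 'Value': 'kamel-test-20180705'}],
    )

    for metric in response['Metrics']:
        print(metric['MetricName'])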

Regarding SAGEMAKER_CONTAINER_LOG_LEVEL, the valid values are whatever the Python logger accepts: https://docs.python.org/2/library/logging.html#logging-levels
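
For example, a minimal sketch of setting it through the estimator, assuming the container_log_level parameter is what populates SAGEMAKER_CONTAINER_LOG_LEVEL in the container:

    import logging

    from sagemaker.mxnet import MXNet

    # Any standard Python logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL) works.
    m = MXNet("kamel_estimator.py",
              role="arn:aws:iam::123456789012:role/DummyRole",  # hypothetical role ARN
              train_instance_count=1,
              train_instance_type="ml.m4.xlarge",
              py_version="py3",
              container_log_level=logging.DEBUG)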

Hope this helps!

@humanzz
Contributor Author

humanzz commented Aug 5, 2018

Thanks @iquintero. I confirm I found the metrics for my endpoints as per your instructions and this definitely unblocks me.

I hope the console team fixes this soon, as this is tbh quite confusing.

Thanks!

@ChoiByungWook ChoiByungWook added the status: pending release The fix have been merged but not yet released to PyPI label Aug 14, 2018
@laurenyu
Contributor

Seems like the change has been released for a while now (and I can see the instance metrics when I look at the AWS console by simply clicking "View instance metrics"), so closing this issue.

knakad added a commit to knakad/sagemaker-python-sdk that referenced this issue Nov 27, 2019
* fix: update api
* fix: pass None experiment_config to AutoML
* fix: making endpoint_name unique in AutoML test
knakad added a commit to knakad/sagemaker-python-sdk that referenced this issue Dec 4, 2019
* fix: update api
* fix: pass None experiment_config to AutoML
* fix: making endpoint_name unique in AutoML test
knakad added a commit that referenced this issue Dec 4, 2019
* fix: update api
* fix: pass None experiment_config to AutoML
* fix: making endpoint_name unique in AutoML test