[BUG] Error thrown when trying to enable instance metrics for MXNet #276

Closed
humanzz opened this issue Jul 5, 2018 · 5 comments
Labels
status: pending release The fix has been merged but not yet released to PyPI type: bug

Comments

@humanzz
Contributor

humanzz commented Jul 5, 2018

Please fill out the form below.

System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): MXNet
  • Framework Version: 1.1
  • Python Version: 3
  • CPU or GPU: CPU
  • Python SDK Version: 1.5.1
  • Are you using a custom image: No

Describe the problem

The MXNet container does not emit instance metrics by default. When trying to enable them by setting enable_cloudwatch_metrics=True, the training job fails with the exception pasted below.

Minimal repro / logs

Command to train (note it uses some of my own Python code, as it's a custom MXNet model):

    m = MXNet("kamel_estimator.py",
              source_dir='src',
              role=role,
              train_instance_count=1,
              train_instance_type="ml.m4.xlarge",
              sagemaker_session=sagemaker_session,
              py_version="py3",
              base_job_name="kamel-test-20180705",
              enable_cloudwatch_metrics=True,
              output_path=get_config(current_env, 'artifacts_s3_bucket'),
              code_location=get_config(current_env, 'artifacts_s3_bucket'),
              hyperparameters={'some_param': 1})
    m.fit('s3://kamel-data/training/data')

Training job CloudWatch log showing the error

2018-07-05 11:30:31,569 INFO - root - running container entrypoint
2018-07-05 11:30:31,570 INFO - root - starting train task
2018-07-05 11:30:31,573 INFO - container_support.training - Training starting
2018-07-05 11:30:31,575 INFO - container_support.environment - starting metrics service
2018-07-05 11:30:31,578 ERROR - container_support.training - uncaught exception during training: [Errno 2] No such file or directory: 'telegraf'
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/container_support/training.py", line 32, in start
env.start_metrics_if_enabled()
File "/usr/local/lib/python3.5/dist-packages/container_support/environment.py", line 124, in start_metrics_if_enabled
subprocess.Popen(['telegraf', '--config', telegraf_conf])
File "/usr/lib/python3.5/subprocess.py", line 947, in __init__
restore_signals, start_new_session)
File "/usr/lib/python3.5/subprocess.py", line 1551, in _execute_child
raise child_exception_type(errno_num, err_msg)
FileNotFoundError: [Errno 2] No such file or directory: 'telegraf'
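
For context, the container code in the traceback boils down to launching telegraf via subprocess. The sketch below reproduces that pattern and adds a guard for a missing binary; the guard and the config path are illustrative assumptions, not the container's actual code.

    import shutil
    import subprocess

    # The container's environment.py effectively runs:
    #     subprocess.Popen(['telegraf', '--config', telegraf_conf])
    # which raises FileNotFoundError when the telegraf binary is not installed.
    telegraf_conf = '/tmp/telegraf.conf'  # hypothetical config path, for illustration only

    if shutil.which('telegraf') is None:
        # A defensive variant would skip metrics instead of crashing the training job.
        print('telegraf not found on PATH; skipping instance metrics')
    else:
        subprocess.Popen(['telegraf', '--config', telegraf_conf])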

Notes

  • I suspect the issue is not actually with the Python SDK, but with the container itself not having the necessary library
  • I am opening the issue here as this is the entry point that exposes the problem
  • All over the SDK, enable_cloudwatch_metrics is always set to False. This might be because the containers are known not to support it. Otherwise, tests covering the enable_cloudwatch_metrics=True case sound like something that should be added (a sketch follows this list).
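
A minimal sketch of the kind of unit test meant in the last note, assuming the flag is stored on the estimator as enable_cloudwatch_metrics and that constructing the estimator with a mocked session makes no service calls:

    from unittest.mock import Mock

    from sagemaker.mxnet import MXNet


    def test_enable_cloudwatch_metrics_flag():
        # Construct the estimator with the flag enabled and a mocked session,
        # then check the flag is retained on the estimator object.
        estimator = MXNet("kamel_estimator.py",
                          role="arn:aws:iam::123456789012:role/DummyRole",  # hypothetical role ARN
                          train_instance_count=1,
                          train_instance_type="ml.m4.xlarge",
                          sagemaker_session=Mock(),
                          py_version="py3",
                          enable_cloudwatch_metrics=True)

        assert estimator.enable_cloudwatch_metrics is True
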
@iquintero
Contributor

Hi @humanzz, thanks for the report! We will look into this.

@humanzz
Contributor Author

humanzz commented Jul 24, 2018

Hi @iquintero,
I'm seeing some progress as evidenced by your #292. I saw in your change that the estimator now warns that this parameter is deprecated and that metrics are emitted by default for training jobs. How about for endpoints?
I already have a running endpoint and I can see it's not emitting anything. That was the main reason I tried to enable the metrics: to monitor the endpoint.
Also, related to that, I couldn't find any documentation about SAGEMAKER_CONTAINER_LOG_LEVEL, what values it accepts, or whether it affects instance metrics at all.

@iquintero
Contributor

iquintero commented Jul 31, 2018

Hi @humanzz

I just checked one of my endpoints and it does post metrics. In the AWS Console, under View Instance Metrics, it does seem like there are no metrics:

[screenshot: SageMaker console "View Instance Metrics" view showing no metrics]

However if you click under /aws/sagemaker/Endpoints and keep the same endpoint name filter, you will see something similar to:

[screenshot: CloudWatch metrics under /aws/sagemaker/Endpoints with the endpoint name filter]

We can see there are 5 metrics for this endpoint; clicking on it will show all the actual instance metrics:

[screenshot: the instance metrics for the endpoint]

I will check with our console team to see why the link from the SageMaker console is not correct. However, this should hopefully unblock you.
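
For reference, a minimal boto3 sketch of listing those metrics programmatically under the /aws/sagemaker/Endpoints namespace; the EndpointName dimension and the endpoint name value are assumptions for illustration:

    import boto3

    cloudwatch = boto3.client('cloudwatch')

    # List the instance metrics published for one endpoint. The dimension name
    # and the endpoint name below are illustrative assumptions.
    response = cloudwatch.list_metrics(
        Namespace='/aws/sagemaker/Endpoints',
        Dimensions=[{'Name': 'EndpointName', 'Value': 'kamel-test-20180705'}],
    )

    for metric in response['Metrics']:
        print(metric['MetricName'])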

Regarding SAGEMAKER_CONTAINER_LOG_LEVEL, the valid values are whatever the Python logger accepts: https://docs.python.org/2/library/logging.html#logging-levels
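
For example, a minimal sketch of setting it through the estimator, assuming the container_log_level parameter is what populates SAGEMAKER_CONTAINER_LOG_LEVEL in the container:

    import logging

    from sagemaker.mxnet import MXNet

    # Any standard Python logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL) works.
    m = MXNet("kamel_estimator.py",
              role="arn:aws:iam::123456789012:role/DummyRole",  # hypothetical role ARN
              train_instance_count=1,
              train_instance_type="ml.m4.xlarge",
              py_version="py3",
              container_log_level=logging.DEBUG)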

Hope this helps!

@humanzz
Contributor Author

humanzz commented Aug 5, 2018

Thanks @iquintero. I confirm I found the metrics for my endpoints as per your instructions and this definitely unblocks me.

I hope the console team fixes this soon, as this is tbh quite confusing.

Thanks!

@ChoiByungWook ChoiByungWook added the status: pending release The fix have been merged but not yet released to PyPI label Aug 14, 2018
@laurenyu
Contributor

Seems like the change has been released for a while now (and I can see the instance metrics when I look at the AWS console by simply clicking "View instance metrics"), so closing this issue.

knakad added a commit to knakad/sagemaker-python-sdk that referenced this issue Nov 27, 2019
* fix: update api
* fix: pass None experiment_config to AutoML
* fix: making endpoint_name unique in AutoML test
knakad added a commit to knakad/sagemaker-python-sdk that referenced this issue Dec 4, 2019
* fix: update api
* fix: pass None experiment_config to AutoML
* fix: making endpoint_name unique in AutoML test
knakad added a commit that referenced this issue Dec 4, 2019
* fix: update api
* fix: pass None experiment_config to AutoML
* fix: making endpoint_name unique in AutoML test