[BUG] Error thrown when trying to enable instance metrics for MXNet #276
Comments
Hi @humanzz thanks for the report! We will look into this.
Hi @iquintero,
Hi @humanzz I just checked in one of my endpoints and it does post metrics. On the AWS Console under View Instance Metrics it does seem like there are no metrics: However if you click under We can see there are 5 metrics for this endpoint, clicking on it will show all the actual instance metrics: I will check with our console team to see why the link from the SageMaker console is not correct. However this should hopefully unblock you. Regarding SAGEMAKER_CONTAINER_LOG_LEVEL, the valid values are whatever the python logger accepts: https://docs.python.org/2/library/logging.html#logging-levels Hope this helps!
Thanks @iquintero. I confirm I found the metrics for my endpoints as per your instructions and this definitely unblocks me. I hope the console team fixes this soon as this is tbh quite confusing. Thanks!
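For reference, the levels that the Python logger accepts can be listed with a short sketch like the one below. It uses only the standard `logging` module; the helper name `level_value` and the `STANDARD_LEVELS` list are made up for illustration, and this thread does not say whether SAGEMAKER_CONTAINER_LOG_LEVEL expects the level name or its numeric value, so treat that part as an open assumption.

```python
import logging

# Names of the standard levels defined by the Python logging module.
STANDARD_LEVELS = ["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"]


def level_value(name):
    """Return the numeric value of a standard logging level name.

    Hypothetical helper for illustration; it is not part of any SageMaker API.
    """
    upper = name.upper()
    if upper not in STANDARD_LEVELS:
        raise ValueError("not a standard logging level: %r" % name)
    return getattr(logging, upper)


# Map each level name to its numeric value, e.g. for choosing a value
# to pass as SAGEMAKER_CONTAINER_LOG_LEVEL (name vs. number is an assumption).
level_table = {name: level_value(name) for name in STANDARD_LEVELS}

for name, value in level_table.items():
    print("%-8s = %d" % (name, value))
```

Lower numbers are more verbose, so DEBUG logs the most and CRITICAL the least.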
Seems like the change has been released for a while now (and I can see the instance metrics when I look at the AWS console by simply clicking "View instance metrics"), so closing this issue.
* fix: update api * fix: pass None experiment_config to AutoML * fix: making endpoint_name unique in AutoML test
Please fill out the form below.
System Information
Describe the problem
The MXNet container does not emit instance metrics by default. When trying to enable them by setting enable_cloudwatch_metrics=True, the training job fails with the exception pasted below.
Minimal repro / logs
Command to train (note it uses some of my own python code as it's a custom MXNet model)
Training job CloudWatch log showing the error
Notes
enable_cloudwatch_metrics is always set to False. This might be because the containers are known not to support it. Otherwise, testing for cases with enable_cloudwatch_metrics=True sounds like something that should be added.