TrainingJobAnalytics hard codes the period and time range #701
Comments
Hi @bbalaji-ucsd, thanks for using SageMaker and reporting this bug. Making the period, start time, and end time configurable would give callers a way to get around this limit: https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html I will keep you updated.
Yes, exactly 👍 Thanks!
…ngJobAnalytics
Creating a TrainingJobAnalytics object fails if the training job has too many data points in the specified metrics. Make start time, end time and period configurable so the caller can get around this limit: https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html
Original issue: aws#701
…r.analytics.TrainingJobAnalytics (#730)
* Make start time, end time and period configurable in analytics.TrainingJobAnalytics
  Creating a TrainingJobAnalytics object fails if the training job has too many data points in the specified metrics. Make start time, end time and period configurable so the caller can get around this limit: https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html
  Original issue: #701
* Add analytics integ test to the TensorFlow script mode MNIST test
* Minor changes due to PR comments and to make flake8 happy
* More minor changes
* One more minor change
The PR is merged. There will be a new release tomorrow. Closing the issue. Feel free to reopen if you have any further questions.
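For callers who want to keep a one-minute period, one way to use configurable start and end times is to query a long job in several shorter windows. The sketch below is only an illustration: `split_time_range` is a hypothetical helper, not part of the SDK, and it assumes that the 1,440-datapoint limit applies per request and that the requested datapoint count is roughly the window length divided by the period.

```python
from datetime import datetime, timedelta


def split_time_range(start, end, period_seconds=60, max_datapoints=1440):
    """Split [start, end) into consecutive windows.

    Hypothetical helper (not part of the SageMaker SDK). Each window spans at
    most period_seconds * max_datapoints seconds, so a query over one window
    should stay within the per-request datapoint limit, assuming the count is
    roughly window length / period.
    """
    if end <= start:
        raise ValueError("end must be after start")
    window = timedelta(seconds=period_seconds * max_datapoints)
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + window, end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows


# Example: a job that ran 1,445 minutes, as in the reported error.
job_start = datetime(2019, 3, 1, 0, 0, 0)
job_end = job_start + timedelta(minutes=1445)
for window_start, window_end in split_time_range(job_start, job_end):
    print(window_start, "->", window_end)
```

Each window could then be supplied as the start and end time of a separate `TrainingJobAnalytics` query and the resulting dataframes concatenated. The exact keyword names should be checked against the released SDK.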
Describe the problem
When I use TrainingJobAnalytics on a long-running job, I get this error:
----> 5 df = TrainingJobAnalytics(job_name, ['episode_reward_mean']).dataframe()
6 # df = TrainingJobAnalytics(job_name, ['episode_len_mean']).dataframe()
7 num_metrics = len(df)
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/analytics.py in dataframe(self, force_refresh)
55 self.clear_cache()
56 if self._dataframe is None:
---> 57 self._dataframe = self._fetch_dataframe()
58 return self._dataframe
59
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/analytics.py in _fetch_dataframe(self)
260 def _fetch_dataframe(self):
261 for metric_name in self._metric_names:
--> 262 self._fetch_metric(metric_name)
263 return pd.DataFrame(self._data)
264
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/analytics.py in _fetch_metric(self, metric_name)
280 'Statistics': ['Average'],
281 }
--> 282 raw_cwm_data = self._cloudwatch.get_metric_statistics(**request)['Datapoints']
283 if len(raw_cwm_data) == 0:
284 logging.warning("Warning: No metrics called %s found" % metric_name)
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
355 "%s() only accepts keyword arguments." % py_operation_name)
356 # The "self" in this scope is referring to the BaseClient.
--> 357 return self._make_api_call(operation_name, kwargs)
358
359 _api_call.name = str(py_operation_name)
~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
659 error_code = parsed_response.get("Error", {}).get("Code")
660 error_class = self.exceptions.from_code(error_code)
--> 661 raise error_class(parsed_response, operation_name)
662 else:
663 return parsed_response
InvalidParameterCombinationException: An error occurred (InvalidParameterCombination) when calling the GetMetricStatistics operation: You have requested up to 1,445 datapoints, which exceeds the limit of 1,440. You may reduce the datapoints requested by increasing Period, or decreasing the time range.
The period and time range are hard coded into the SDK:
sagemaker-python-sdk/src/sagemaker/analytics.py
Line 265 in 6a42335
Minimal repro / logs
I cannot share the exact file, but the above error should be reproducible on any metric that has more than 1,440 datapoints.
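The arithmetic behind the limit can be sketched as follows. This is a rough illustration, not the SDK's or CloudWatch's actual logic: it assumes the requested datapoint count is approximately the time range divided by the period, takes the 1,440 limit from the error message above, and assumes periods are multiples of 60 seconds. `min_period_seconds` is a hypothetical helper name.

```python
import math
from datetime import datetime, timedelta

MAX_DATAPOINTS = 1440  # limit quoted in the error message above


def min_period_seconds(start, end, granularity=60):
    """Return the smallest period, in seconds, that keeps the request in bounds.

    Hypothetical helper. Assumes datapoints ~= (end - start) / period and that
    the period must be a multiple of `granularity` seconds.
    """
    span = (end - start).total_seconds()
    if span <= 0:
        raise ValueError("end must be after start")
    raw_period = span / MAX_DATAPOINTS
    # Round up to the next multiple of the granularity, never below it.
    return max(granularity, math.ceil(raw_period / granularity) * granularity)


# A job of 1,445 minutes at a 60-second period asks for about 1,445 datapoints,
# which is over the limit; doubling the period roughly halves the count.
job_start = datetime(2019, 3, 1, 0, 0, 0)
job_end = job_start + timedelta(minutes=1445)
period = min_period_seconds(job_start, job_end)
print(period, (job_end - job_start).total_seconds() / period)
```

Under these assumptions, any job longer than about 24 hours would need a period larger than one minute, or a shorter time range per request.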