TrainingJobAnalytics hard codes the period and time range #701


Closed
bbalaji-ucsd opened this issue Mar 15, 2019 · 3 comments

@bbalaji-ucsd


System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): Tensorflow with Ray
  • Framework Version: Ray version - 0.5.3
  • Python Version: 3.6.5
  • CPU or GPU: CPU
  • Python SDK Version: 1.18.4
  • Are you using a custom image: Yes, upgraded Ray to 0.6.4. But that shouldn't affect this bug

Describe the problem


When I tried to use TrainingJobAnalytics on a long-running job, I got this error:
----> 5 df = TrainingJobAnalytics(job_name, ['episode_reward_mean']).dataframe()
6 # df = TrainingJobAnalytics(job_name, ['episode_len_mean']).dataframe()
7 num_metrics = len(df)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/analytics.py in dataframe(self, force_refresh)
55 self.clear_cache()
56 if self._dataframe is None:
---> 57 self._dataframe = self._fetch_dataframe()
58 return self._dataframe
59

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/analytics.py in _fetch_dataframe(self)
260 def _fetch_dataframe(self):
261 for metric_name in self._metric_names:
--> 262 self._fetch_metric(metric_name)
263 return pd.DataFrame(self._data)
264

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/analytics.py in _fetch_metric(self, metric_name)
280 'Statistics': ['Average'],
281 }
--> 282 raw_cwm_data = self._cloudwatch.get_metric_statistics(**request)['Datapoints']
283 if len(raw_cwm_data) == 0:
284 logging.warning("Warning: No metrics called %s found" % metric_name)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
355 "%s() only accepts keyword arguments." % py_operation_name)
356 # The "self" in this scope is referring to the BaseClient.
--> 357 return self._make_api_call(operation_name, kwargs)
358
359 _api_call.__name__ = str(py_operation_name)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
659 error_code = parsed_response.get("Error", {}).get("Code")
660 error_class = self.exceptions.from_code(error_code)
--> 661 raise error_class(parsed_response, operation_name)
662 else:
663 return parsed_response

InvalidParameterCombinationException: An error occurred (InvalidParameterCombination) when calling the GetMetricStatistics operation: You have requested up to 1,445 datapoints, which exceeds the limit of 1,440. You may reduce the datapoints requested by increasing Period, or decreasing the time range.

The period and time range are hard-coded in the SDK:

def _fetch_metric(self, metric_name):
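The limit comes down to simple arithmetic: GetMetricStatistics returns at most 1,440 datapoints per call, and the number requested is the time range divided by the period. With a fixed one-minute period, any job longer than 24 hours trips the error above. A minimal sketch (the helper name is hypothetical) that computes the smallest period that fits:

```python
import math

MAX_DATAPOINTS = 1440  # CloudWatch cap per GetMetricStatistics request


def min_valid_period(duration_seconds: int) -> int:
    """Smallest period, rounded up to a multiple of 60 s (the granularity
    of standard CloudWatch metrics), that keeps the request under the cap."""
    raw = math.ceil(duration_seconds / MAX_DATAPOINTS)
    return max(60, math.ceil(raw / 60) * 60)


# A 25-hour job at the hardcoded 60 s period requests 1,500 datapoints,
# which exceeds the 1,440 limit; a 120 s period brings it back under.
duration = 25 * 3600
assert duration // 60 == 1500
assert min_valid_period(duration) == 120
```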

Minimal repro / logs

I cannot share the exact file, but the above error should be reproducible for any metric with more than 1,440 datapoints.

@icywang86rui
Contributor

Hi @bbalaji-ucsd,

Thanks for using SageMaker and reporting this bug. Making the period, start time, and end time configurable would give callers a way to work around this limit: https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html

I will keep you updated.

@bbalaji-ucsd
Author

Yes, exactly 👍 Thanks!

icywang86rui added a commit to icywang86rui/sagemaker-python-sdk that referenced this issue Mar 29, 2019
…ngJobAnalytics

Creating a TrainingJobAnalytics object fails if the training job has too many
data points in the specified metrics. Make start time, end time and period
configurable so the caller can get around this limit -
https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html

Original issue:
aws#701
icywang86rui added a commit that referenced this issue Apr 2, 2019
…r.analytics.TrainingJobAnalytics (#730)

* Make start time, end time and period configurable in analytics.TrainingJobAnalytics

Creating a TrainingJobAnalytics object fails if the training job has too many
data points in the specified metrics. Make start time, end time and period
configurable so the caller can get around this limit -
https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html

Original issue:
#701

* Add analytics integ test to the TensorFlow script mode MNIST test

* Minor changes due to PR comments and to make flake8 happy

* More minor changes

* One more minor change
@icywang86rui
Contributor

The PR is merged. There will be a new release tomorrow. Closing the issue. Feel free to reopen if you have any further questions.
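With the fix merged, callers can size the request themselves. A hedged sketch of how the new parameters might be used, assuming the keyword names start_time, end_time, and period from the PR title (the job name and helper names are hypothetical):

```python
from datetime import datetime

MAX_DATAPOINTS = 1440  # CloudWatch cap per GetMetricStatistics request


def analytics_kwargs(job_start: datetime, job_end: datetime) -> dict:
    """Choose a period (a multiple of 60 s) wide enough that the whole
    training-job time range fits in a single request."""
    seconds = int((job_end - job_start).total_seconds())
    period = max(60, -(-seconds // MAX_DATAPOINTS))  # ceiling division
    period = -(-period // 60) * 60                   # round up to a 60 s multiple
    return {"start_time": job_start, "end_time": job_end, "period": period}


def reward_dataframe(job_name, job_start, job_end):
    # Requires the sagemaker SDK and AWS credentials, so not executed here.
    from sagemaker.analytics import TrainingJobAnalytics
    analytics = TrainingJobAnalytics(job_name, ["episode_reward_mean"],
                                     **analytics_kwargs(job_start, job_end))
    return analytics.dataframe()


# A 25-hour job needs a 120 s period to stay under 1,440 datapoints:
kwargs = analytics_kwargs(datetime(2019, 3, 14, 0, 0),
                          datetime(2019, 3, 15, 1, 0))
assert kwargs["period"] == 120
```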
