TrainingJobAnalytics hard codes the period and time range #701


Closed
bbalaji-ucsd opened this issue Mar 15, 2019 · 3 comments

@bbalaji-ucsd


System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): Tensorflow with Ray
  • Framework Version: Ray version - 0.5.3
  • Python Version: 3.6.5
  • CPU or GPU: CPU
  • Python SDK Version: 1.18.4
  • Are you using a custom image: Yes, upgraded Ray to 0.6.4. But that shouldn't affect this bug

Describe the problem


When I tried to use TrainingJobAnalytics on a long-running job, I got this error:
----> 5 df = TrainingJobAnalytics(job_name, ['episode_reward_mean']).dataframe()
6 # df = TrainingJobAnalytics(job_name, ['episode_len_mean']).dataframe()
7 num_metrics = len(df)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/analytics.py in dataframe(self, force_refresh)
55 self.clear_cache()
56 if self._dataframe is None:
---> 57 self._dataframe = self._fetch_dataframe()
58 return self._dataframe
59

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/analytics.py in _fetch_dataframe(self)
260 def _fetch_dataframe(self):
261 for metric_name in self._metric_names:
--> 262 self._fetch_metric(metric_name)
263 return pd.DataFrame(self._data)
264

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/analytics.py in _fetch_metric(self, metric_name)
280 'Statistics': ['Average'],
281 }
--> 282 raw_cwm_data = self._cloudwatch.get_metric_statistics(**request)['Datapoints']
283 if len(raw_cwm_data) == 0:
284 logging.warning("Warning: No metrics called %s found" % metric_name)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
355 "%s() only accepts keyword arguments." % py_operation_name)
356 # The "self" in this scope is referring to the BaseClient.
--> 357 return self._make_api_call(operation_name, kwargs)
358
359 _api_call.__name__ = str(py_operation_name)

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
659 error_code = parsed_response.get("Error", {}).get("Code")
660 error_class = self.exceptions.from_code(error_code)
--> 661 raise error_class(parsed_response, operation_name)
662 else:
663 return parsed_response

InvalidParameterCombinationException: An error occurred (InvalidParameterCombination) when calling the GetMetricStatistics operation: You have requested up to 1,445 datapoints, which exceeds the limit of 1,440. You may reduce the datapoints requested by increasing Period, or decreasing the time range.

The period and time range are hard-coded in the SDK:

def _fetch_metric(self, metric_name):
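The limit comes down to simple arithmetic: GetMetricStatistics returns at most 1,440 datapoints per call, and the number requested is the time range divided by the period. With a fixed one-minute period, any job longer than 24 hours trips the error above. A minimal sketch (the helper name is hypothetical) that computes the smallest period that fits:

```python
import math

MAX_DATAPOINTS = 1440  # CloudWatch cap per GetMetricStatistics request


def min_valid_period(duration_seconds: int) -> int:
    """Smallest period, rounded up to a multiple of 60 s (the granularity
    of standard CloudWatch metrics), that keeps the request under the cap."""
    raw = math.ceil(duration_seconds / MAX_DATAPOINTS)
    return max(60, math.ceil(raw / 60) * 60)


# A 25-hour job at the hardcoded 60 s period requests 1,500 datapoints,
# which exceeds the 1,440 limit; a 120 s period brings it back under.
duration = 25 * 3600
assert duration // 60 == 1500
assert min_valid_period(duration) == 120
```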

Minimal repro / logs

I cannot share the exact file, but the above error should be reproducible for any metric with more than 1,440 datapoints.

@icywang86rui
Contributor

Hi @bbalaji-ucsd,

Thanks for using SageMaker and reporting this bug. Making the period, start time, and end time configurable would give callers a way to work around this limit: https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html

I will keep you updated.

@bbalaji-ucsd
Author

Yes, exactly 👍 Thanks!

icywang86rui added a commit to icywang86rui/sagemaker-python-sdk that referenced this issue Mar 29, 2019
…ngJobAnalytics

Creating a TrainingJobAnalytics object fails if the training job has too many
data points in the specified metrics. Make start time, end time and period
configurable so the caller can get around this limit -
https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html

Original issue:
aws#701
icywang86rui added a commit that referenced this issue Apr 2, 2019
…r.analytics.TrainingJobAnalytics (#730)

* Make start time, end time and period configurable in analytics.TrainingJobAnalytics

Creating a TrainingJobAnalytics object fails if the training job has too many
data points in the specified metrics. Make start time, end time and period
configurable so the caller can get around this limit -
https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html

Original issue:
#701

* Add analytics integ test to the TensorFlow script mode MNIST test

* Minor changes due to PR comments and to make flake8 happy

* More minor changes

* One more minor change
@icywang86rui
Contributor

The PR is merged. There will be a new release tomorrow. Closing the issue. Feel free to reopen if you have any further questions.
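With the fix merged, callers can size the request themselves. A hedged sketch of how the new parameters might be used, assuming the keyword names start_time, end_time, and period from the PR title (the job name and helper names are hypothetical):

```python
from datetime import datetime

MAX_DATAPOINTS = 1440  # CloudWatch cap per GetMetricStatistics request


def analytics_kwargs(job_start: datetime, job_end: datetime) -> dict:
    """Choose a period (a multiple of 60 s) wide enough that the whole
    training-job time range fits in a single request."""
    seconds = int((job_end - job_start).total_seconds())
    period = max(60, -(-seconds // MAX_DATAPOINTS))  # ceiling division
    period = -(-period // 60) * 60                   # round up to a 60 s multiple
    return {"start_time": job_start, "end_time": job_end, "period": period}


def reward_dataframe(job_name, job_start, job_end):
    # Requires the sagemaker SDK and AWS credentials, so not executed here.
    from sagemaker.analytics import TrainingJobAnalytics
    analytics = TrainingJobAnalytics(job_name, ["episode_reward_mean"],
                                     **analytics_kwargs(job_start, job_end))
    return analytics.dataframe()


# A 25-hour job needs a 120 s period to stay under 1,440 datapoints:
kwargs = analytics_kwargs(datetime(2019, 3, 14, 0, 0),
                          datetime(2019, 3, 15, 1, 0))
assert kwargs["period"] == 120
```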
