SDK crashes when getting hyperparameters from completed TF script mode job #717

Closed
akorobko opened this issue Mar 22, 2019 · 1 comment

System Information

  • Framework: TensorFlow
  • Framework Version: 1.12
  • Python Version: 3.6
  • CPU or GPU: N/A
  • Python SDK Version: 1.18.7
  • Are you using a custom image: No

Describe the problem

Minimal repro / logs

The crash is reproducible on any completed TensorFlow job that was run in script mode.
After attaching to such a job, requesting its hyperparameters crashes the SDK.

from sagemaker.tensorflow import TensorFlow

# Attach to a completed script mode training job.
job = TensorFlow.attach(training_job_name='JOB NAME')
# The SDK crashes on this call.
hp = job.hyperparameters()

Analysis

When submitting a TensorFlow job in script mode, it is no longer possible to specify checkpoint_path (if it is specified, the SDK raises an exception stating that the parameter is not supported in script mode).

When attaching to a completed job, the SDK does not set a value for checkpoint_path (for obvious reasons), and it also does not populate the private variable _current_job_name, which is supposed to hold the job name.

When hyperparameters() is called, it first tries to get checkpoint_path, even though the value is not used further down for script mode (there is an if). Because checkpoint_path is not defined, it calls the _default_s3_path() method to recreate one. That method simply concatenates a series of sub-paths, including the _current_job_name variable, which is set to None. This is where the SDK crashes.

To fix it, the line that queries the checkpoint path should be moved into the else section, where it is actually used.
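
The analysis above can be illustrated with a minimal, self-contained sketch. This is not the SDK's actual code: the class SketchEstimator, the example bucket, and the returned dictionary keys are all hypothetical, and only the names checkpoint_path, _current_job_name, _default_s3_path() and hyperparameters() come from the report. It assumes the path is built by joining strings, which fails with a TypeError when one part is None.

```python
class SketchEstimator:
    """Hypothetical stand-in for the estimator; NOT the SDK's actual code."""

    def __init__(self, script_mode, checkpoint_path=None, current_job_name=None):
        self.script_mode = script_mode
        self.checkpoint_path = checkpoint_path
        # Stays None after attaching to a completed job, as described above.
        self._current_job_name = current_job_name
        # Assumed placeholder location, for illustration only.
        self.output_path = "s3://example-bucket/output"

    def _default_s3_path(self, directory):
        # Joins the sub-paths; raises TypeError if _current_job_name is None.
        return "/".join([self.output_path, self._current_job_name, directory])

    def hyperparameters_before(self):
        # Problem: the checkpoint path is resolved up front,
        # even though the script mode branch never uses it.
        checkpoint_path = self.checkpoint_path or self._default_s3_path("checkpoints")
        if self.script_mode:
            return {"script_mode": True}
        else:
            return {"script_mode": False, "checkpoint_path": checkpoint_path}

    def hyperparameters_after(self):
        # Proposed fix: resolve the checkpoint path only in the branch that uses it.
        if self.script_mode:
            return {"script_mode": True}
        else:
            checkpoint_path = self.checkpoint_path or self._default_s3_path("checkpoints")
            return {"script_mode": False, "checkpoint_path": checkpoint_path}


# Simulates an attached script mode job: no checkpoint path, no job name.
attached = SketchEstimator(script_mode=True)
print(attached.hyperparameters_after())
try:
    attached.hyperparameters_before()
except TypeError as err:
    print("crash:", err)
```

With the lookup moved into the else section, the script mode path never touches _current_job_name, so an attached job no longer triggers the failure.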

akorobko commented Mar 28, 2019

The fix is part of 1.18.9 now.
