Skip to content

[Request] Give warning/error when job ends in 'Stopped' rather than 'Completed' #1937

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
athewsey opened this issue Oct 2, 2020 · 3 comments

Comments

@athewsey
Copy link
Collaborator

athewsey commented Oct 2, 2020

Describe the feature you'd like

Stopped jobs (which could have been Completed) should show some kind of warning or even an error: Not just silence as they do today.

How would this feature be used? Please describe.

Common reasons a job might be Stopped rather than Completed include:

  • The job timed out (training may or may not still have exported a checkpoint or final model)
  • A custom detective control in the environment specifically terminated the job via a Stop*Job call (e.g. out of budget, security policy violation, etc)

In many such cases the job termination was not healthy, and in the case where the job was healthy, the developer must have taken explicit steps to achieve that (e.g. implementing checkpointing, etc).

Therefore the current pattern in the SDK of treating Stopped as a success is misleading to inexperienced users ("The .fit() cell ran with no errors right? Everything must be fine") or experienced users who might not realise they're working in an environment with detective controls implemented ("Why does it keep not saving the model!? I do it right there in the script!").

Describe alternatives you've considered

  1. Current behaviour (Stopped == Completed)
    • Not ideal for the reasons described above, but backward-compatible
  2. print() a warning message on 'Stopped'
    • Still easy to ignore, particularly if the job generated a lot of logs already before stopping and the warning is just added on below. Still doesn't interrupt code execution.
    • ...but simple and not breaking
  3. Raise a Python warning on 'Stopped'
    • Nice and visible in display: IPython will render in a red box much like uncaught errors. Doesn't break existing code flows.
    • ...but default warnings settings are a bit weird in notebook kernels: Easy for users to have the warning set to "once", in which case it will only display the first time it's triggered - which could be even more confusing. Still doesn't interrupt code execution.
  4. Raise a specific error on 'Stopped'
    • Breaking change in the (unusual?) case of code flows that use job timeout as standard (rather than other stopping conditions)
    • ...but sets a nice intuitive behaviour that your notebook cell will terminate nicely if your model/processing runs successfully, and error otherwise.
    • Would also not pollute logs/warnings in the event that the condition is explicitly expected and handled, which it could be easily for users who expect the condition.

(4) seems like a nice solution, so long as the logic to catch that specific error (and not Failed) is reasonably intuitive.

Additional context

It seems like the relevant implementation is in Session._check_job_status().

@metrizable
Copy link
Contributor

Hello @athewsey

Thank you for using Amazon SageMaker.

Option 3, together with, say, logging I think would satisfy most of the requirements that you mention while still being compatible with the standards of semantic versioning for the feature to be included in sagemaker==2.x. Option 4 is an appropriate request to be tagged with a v3.0.0 milestone. I think we'd be open to either path.

We are always re-evaluating our backlog of features based on customer requests, so we appreciate the feedback on this feature.

@athewsey
Copy link
Collaborator Author

Thanks for getting back! I think for me, ideal might be to try and implement option 3 as you describe within the current sagemaker 2.x, and aim towards raising the severity to an error in sagemaker 3.0. Would be interested to hear thoughts from other users too!

@ajaykarpur
Copy link
Contributor

Merged in #2000

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants