[Request] Give warning/error when job ends in 'Stopped' rather than 'Completed' #1937

athewsey · 2020-10-02T09:31:12Z

Describe the feature you'd like

Jobs that end as Stopped (when they could have ended as Completed) should raise some kind of warning, or even an error, rather than passing silently as they do today.

How would this feature be used? Please describe.

Common reasons a job might be Stopped rather than Completed include:

- The job timed out (training may or may not still have exported a checkpoint or final model)
- A custom detective control in the environment specifically terminated the job via a Stop*Job call (e.g. out of budget, security policy violation, etc.)

In many such cases the job termination was not healthy, and in the case where the job was healthy, the developer must have taken explicit steps to achieve that (e.g. implementing checkpointing, etc).

Therefore the current pattern in the SDK of treating Stopped as a success is misleading both to inexperienced users ("The .fit() cell ran with no errors, right? Everything must be fine") and to experienced users who might not realise they're working in an environment with detective controls implemented ("Why does it keep not saving the model!? I do it right there in the script!").

Describe alternatives you've considered

1. Current behaviour (Stopped == Completed)
   - Not ideal for the reasons described above, but backward-compatible
2. print() a warning message on 'Stopped'
   - Still easy to ignore, particularly if the job already generated a lot of logs before stopping and the warning is just appended below them. Still doesn't interrupt code execution.
   - ...but simple and not breaking
3. Raise a Python warning on 'Stopped'
   - Nice and visible in display: IPython will render it in a red box, much like uncaught errors. Doesn't break existing code flows.
   - ...but default warnings settings are a bit weird in notebook kernels: it's easy for users to have the warning filter set to "once", in which case it will only display the first time it's triggered, which could be even more confusing. Still doesn't interrupt code execution.
4. Raise a specific error on 'Stopped'
   - Breaking change in the (unusual?) case of code flows that use job timeout as standard (rather than other stopping conditions)
   - ...but sets a nice intuitive behaviour: your notebook cell finishes cleanly if your model/processing runs successfully, and errors otherwise.
   - Would also not pollute logs/warnings when the condition is explicitly expected and handled, which is easy to do for users who anticipate it.

(4) seems like a nice solution, so long as the logic to catch that specific error (and not Failed) is reasonably intuitive.
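To illustrate what catching "that specific error (and not Failed)" might look like, here is a minimal sketch. It does not use the SageMaker SDK: `JobStoppedError`, `JobFailedError`, `check_job_status` and `fit_expecting_timeout` are hypothetical names invented for this example, and the `desc` dict only assumes a status key and an optional `FailureReason` field.

```python
class JobFailedError(RuntimeError):
    """Hypothetical error for a job that ended in 'Failed'."""


class JobStoppedError(RuntimeError):
    """Hypothetical error for a job that ended in 'Stopped'.

    Deliberately a sibling of JobFailedError (not a subclass), so that
    `except JobStoppedError` never swallows a genuine failure.
    """


def check_job_status(job_name, desc, status_key):
    """Raise unless the job description reports 'Completed'.

    `desc` is assumed to be a plain dict holding the job's final status
    under `status_key` (an assumption made for this sketch).
    """
    status = desc[status_key]
    if status == "Completed":
        return
    if status == "Stopped":
        raise JobStoppedError(
            f"Job {job_name} ended with status 'Stopped' rather than 'Completed'."
        )
    reason = desc.get("FailureReason", "(no reason given)")
    raise JobFailedError(f"Job {job_name} ended with status {status!r}: {reason}")


def fit_expecting_timeout(job_name, desc):
    """Example caller that treats 'Stopped' as expected but lets 'Failed' propagate."""
    try:
        check_job_status(job_name, desc, "TrainingJobStatus")
    except JobStoppedError:
        # Expected here, e.g. the job uses its timeout as the stopping rule.
        return "stopped"
    return "completed"
```

With this shape, a user who relies on timeouts wraps the call in a single `except` clause, while everyone else gets a cell that errors on anything other than 'Completed'.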

Additional context

It seems like the relevant implementation is in Session._check_job_status().


metrizable · 2020-10-06T04:43:48Z

Hello @athewsey

Thank you for using Amazon SageMaker.

Option 3, together with, say, logging I think would satisfy most of the requirements that you mention while still being compatible with the standards of semantic versioning for the feature to be included in sagemaker==2.x. Option 4 is an appropriate request to be tagged with a v3.0.0 milestone. I think we'd be open to either path.
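As a rough sketch of how a warning combined with logging could look, the snippet below uses only the standard library. `JobStoppedWarning`, `warn_if_stopped` and the logger name are hypothetical and are not taken from the SDK or from the eventual implementation.

```python
import logging
import warnings

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("job_status_sketch")


class JobStoppedWarning(UserWarning):
    """Hypothetical warning category for jobs that end in 'Stopped'."""


def warn_if_stopped(job_name, status):
    """Log and warn when a job ended in 'Stopped'; do nothing otherwise."""
    if status != "Stopped":
        return
    message = (
        f"Job {job_name} ended with status 'Stopped' rather than 'Completed'; "
        "its outputs may be incomplete."
    )
    # The log record is emitted every time, regardless of warning filters.
    logger.warning(message)
    # stacklevel=2 attributes the warning to the caller of warn_if_stopped.
    warnings.warn(message, JobStoppedWarning, stacklevel=2)
```

Because the warning has its own category, users can adjust it with the standard filters: `warnings.simplefilter("always", JobStoppedWarning)` shows every occurrence even in a kernel that would otherwise show it only once, and `warnings.simplefilter("error", JobStoppedWarning)` turns it into an exception for those who want behaviour closer to option 4 without a breaking change.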

We are always re-evaluating our backlog of features based on customer requests, so we appreciate the feedback on this feature.

athewsey · 2020-10-10T11:49:18Z

Thanks for getting back! I think for me, ideal might be to try and implement option 3 as you describe within the current sagemaker 2.x, and aim towards raising the severity to an error in sagemaker 3.0. Would be interested to hear thoughts from other users too!

ajaykarpur · 2020-12-03T17:34:21Z

Merged in #2000

metrizable added the type: feature request label Oct 6, 2020

metrizable added the contributions welcome label Oct 6, 2020

athewsey mentioned this issue Nov 24, 2020

feature: warn on 'Stopped' (non-Completed) jobs #2000

Merged

7 tasks

ajaykarpur closed this as completed Dec 3, 2020
