You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Stopped jobs (which could have been Completed) should show some kind of warning or even an error: Not just silence as they do today.
How would this feature be used? Please describe.
Common reasons a job might be Stopped rather than Completed include:
The job timed out (training may or may not still have exported a checkpoint or final model)
A custom detective control in the environment specifically terminated the job via a Stop*Job call (e.g. out of budget, security policy violation, etc)
In many such cases the job termination was not healthy, and in the case where the job was healthy, the developer must have taken explicit steps to achieve that (e.g. implementing checkpointing, etc).
Therefore the current pattern in the SDK of treating Stopped as a success is misleading to inexperienced users ("The .fit() cell ran with no errors right? Everything must be fine") or experienced users who might not realise they're working in an environment with detective controls implemented ("Why does it keep not saving the model!? I do it right there in the script!").
Describe alternatives you've considered
Current behaviour (Stopped == Completed)
Not ideal for the reasons described above, but backward-compatible
print() a warning message on 'Stopped'
Still easy to ignore, particularly if the job generated a lot of logs already before stopping and the warning is just added on below. Still doesn't interrupt code execution.
...but simple and not breaking
Raise a Python warning on 'Stopped'
Nice and visible in display: IPython will render in a red box much like uncaught errors. Doesn't break existing code flows.
...but default warnings settings are a bit weird in notebook kernels: Easy for users to have the warning set to "once", in which case it will only display the first time it's triggered - which could be even more confusing. Still doesn't interrupt code execution.
Raise a specific error on 'Stopped'
Breaking change in the (unusual?) case of code flows that use job timeout as standard (rather than other stopping conditions)
...but sets a nice intuitive behaviour that your notebook cell will terminate nicely if your model/processing runs successfully, and error otherwise.
Would also not pollute logs/warnings in the event that the condition is explicitly expected and handled, which it could be easily for users who expect the condition.
(4) seems like a nice solution, so long as the logic to catch that specific error (and not Failed) is reasonably intuitive.
Option 3, together with, say, logging I think would satisfy most of the requirements that you mention while still being compatible with the standards of semantic versioning for the feature to be included in sagemaker==2.x. Option 4 is an appropriate request to be tagged with a v3.0.0 milestone. I think we'd be open to either path.
We are always re-evaluating our backlog of features based on customer requests, so we appreciate the feedback on this feature.
Thanks for getting back! I think for me, ideal might be to try and implement option 3 as you describe within the current sagemaker 2.x, and aim towards raising the severity to an error in sagemaker 3.0. Would be interested to hear thoughts from other users too!
Describe the feature you'd like
Stopped
jobs (which could have beenCompleted
) should show some kind of warning or even an error: Not just silence as they do today.How would this feature be used? Please describe.
Common reasons a job might be
Stopped
rather thanCompleted
include:In many such cases the job termination was not healthy, and in the case where the job was healthy, the developer must have taken explicit steps to achieve that (e.g. implementing checkpointing, etc).
Therefore the current pattern in the SDK of treating
Stopped
as a success is misleading to inexperienced users ("The.fit()
cell ran with no errors right? Everything must be fine") or experienced users who might not realise they're working in an environment with detective controls implemented ("Why does it keep not saving the model!? I do it right there in the script!").Describe alternatives you've considered
warnings
settings are a bit weird in notebook kernels: Easy for users to have the warning set to "once", in which case it will only display the first time it's triggered - which could be even more confusing. Still doesn't interrupt code execution.(4) seems like a nice solution, so long as the logic to catch that specific error (and not
Failed
) is reasonably intuitive.Additional context
It seems like the relevant implementation is in Session._check_job_status().
The text was updated successfully, but these errors were encountered: