Describe the bug
During get pipeline definition and pipeline upsert, a sagemaker_job_name for the Tuning step is generated from the base job name plus the current timestamp and added into "TrainingJobDefinition": {"StaticHyperParameters": xxx }. Because this value is regenerated on every get pipeline definition or upsert call, it defeats the cache configuration even when nothing else has changed.
To reproduce
Create a pipeline that contains a TuningStep with a cache configuration for that step.
Create a function that can run get pipeline definition and pipeline upsert.
Run pipeline.definition() several times to get the pipeline JSON output without making any changes.
Compare the pipeline definition JSON across runs; sagemaker_job_name changes each time because the current timestamp is appended.
Alternatively, run pipeline upsert multiple times; each execution starts tuning again even though the first run succeeded. (A minimal reproduction sketch is shown below.)
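For illustration, a minimal sketch of such a reproduction. The image URI, role, S3 paths, and script names below are hypothetical placeholders; the only point is that two back-to-back definition() calls on an unchanged pipeline yield different static hyperparameters.

```python
import json

from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.parameter import ContinuousParameter
from sagemaker.tuner import HyperparameterTuner
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import CacheConfig, TuningStep

# Custom-image, script-mode estimator (image, role, and script are placeholders).
estimator = Estimator(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/my-xgboost:latest",
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    entry_point="train.py",
    source_dir="src",
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:rmse",
    hyperparameter_ranges={"eta": ContinuousParameter(0.01, 0.3)},
    objective_type="Minimize",
    max_jobs=4,
    max_parallel_jobs=2,
)

step_tune = TuningStep(
    name="TuneModel",
    tuner=tuner,
    inputs={"train": TrainingInput("s3://<bucket>/train/")},
    cache_config=CacheConfig(enable_caching=True, expire_after="P30D"),
)

pipeline = Pipeline(name="tuning-cache-repro", steps=[step_tune])

def static_hyperparameters(p):
    """Pull StaticHyperParameters out of the generated Tuning step arguments."""
    step = json.loads(p.definition())["Steps"][0]
    return step["Arguments"]["TrainingJobDefinition"]["StaticHyperParameters"]

# Two definitions of the same, unchanged pipeline: sagemaker_job_name differs
# because a fresh timestamp is appended each time, which defeats the cache.
print(static_hyperparameters(pipeline)["sagemaker_job_name"])
print(static_hyperparameters(pipeline)["sagemaker_job_name"])
```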
Expected behavior
This sagemaker_job_name should not change on upsert, and it should not invalidate the cache.
Screenshots or logs
No
System information
A description of your system. Please provide:
SageMaker Python SDK version: 2.130.0
Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): xgboost (I am using my own image and scripts)
Framework version: xgboost-1.5-1
Python version: 3.8
CPU or GPU: CPU
Custom Docker image (Y/N): Y
Additional context
A similar bug existed for training jobs and was fixed last year: #2940
See the job name definition in sagemaker-python-sdk/src/sagemaker/tuner.py (Line 456 in e2f3888) and the fix for training jobs: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/workflow/steps.py#L493
Thanks for pointing this out!
The sagemaker_job_name added into TrainingJobDefinition-StaticHyperParameters is actually generated by the estimator object which is supplied to the tuner.
In this previous fix, https://github.com/aws/sagemaker-python-sdk/pull/2950/files, we removed sagemaker_job_name from the hyperparameters in the TrainingStep; we may need to do something similar to clean it up in the TuningStep as well.
Will open a PR to fix the issue
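For illustration only, a rough sketch of what that cleanup could look like. This is not the actual SDK change; the helper name is hypothetical, and the key names come from the generated request dict described in this issue.

```python
# Hypothetical helper: strip the timestamped sagemaker_job_name from the static
# hyperparameters of a HyperParameterTuningJob request dict, mirroring the
# earlier TrainingStep fix. Handles both single- and multi-definition tuning jobs.
def drop_timestamped_job_name(request_dict):
    definitions = []
    if "TrainingJobDefinition" in request_dict:
        definitions.append(request_dict["TrainingJobDefinition"])
    definitions.extend(request_dict.get("TrainingJobDefinitions", []))
    for definition in definitions:
        definition.get("StaticHyperParameters", {}).pop("sagemaker_job_name", None)
    return request_dict
```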
Update:
Even if we make a similar change and remove sagemaker_job_name from the TuningStep request dict, it still ends up as a cache miss under script mode (i.e. a user-supplied script in the estimator object), because the sagemaker_submit_directory in TrainingJobDefinition-StaticHyperParameters still changes each time with a timestamp suffix.
Looking into the current cache improvement logic, it seems to support only TrainingStep and ProcessingStep (see here).
Thus this issue is actually a feature request for cache improvement on the TuningStep.
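To make the remaining drift concrete, a check along these lines (reusing the hypothetical pipeline and static_hyperparameters helper from the reproduction sketch above) shows which static hyperparameters differ between two definitions; under script mode the changing keys are expected to include sagemaker_submit_directory.

```python
# Sketch: even with sagemaker_job_name stripped, the script-mode submit directory
# still carries a timestamp suffix, so two definitions of the same pipeline differ.
first = static_hyperparameters(pipeline)
second = static_hyperparameters(pipeline)
changed = sorted(k for k in first if first[k] != second.get(k))
print(changed)  # expected to include "sagemaker_submit_directory"
```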
cc the feature expert @brockwade633
Thanks for using SageMaker and taking the time to suggest ways to improve the SageMaker Python SDK. We have added your feature request to our backlog and may consider it for future SDK versions. I will go ahead and close the issue now; please let me know if you have any more feedback or questions.