Describe the bug
During get pipeline definition and pipeline upsert, a sagemaker_job_name for the Tuning step is generated from the base job name plus the current timestamp and added into "TrainingJobDefinition": {"StaticHyperParameters": xxx }. Because this value is regenerated on every get pipeline definition or upsert call, it defeats the cache configuration even when nothing else has changed.
To reproduce
Create a pipeline that contains a TuningStep with a cache configuration for that step.
Create a function that can run get pipeline definition and pipeline upsert.
Run pipeline.definition() several times to get the pipeline JSON output without making any changes.
Compare the pipeline definition JSON across runs; sagemaker_job_name changes each time because the current timestamp is appended.
Alternatively, run pipeline upsert multiple times; each execution starts tuning again even though the first run succeeded. (A minimal reproduction sketch is shown below.)
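For illustration, a minimal sketch of such a reproduction. The image URI, role, S3 paths, and script names below are hypothetical placeholders; the only point is that two back-to-back definition() calls on an unchanged pipeline yield different static hyperparameters.

```python
import json

from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.parameter import ContinuousParameter
from sagemaker.tuner import HyperparameterTuner
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import CacheConfig, TuningStep

# Custom-image, script-mode estimator (image, role, and script are placeholders).
estimator = Estimator(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/my-xgboost:latest",
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    entry_point="train.py",
    source_dir="src",
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:rmse",
    hyperparameter_ranges={"eta": ContinuousParameter(0.01, 0.3)},
    objective_type="Minimize",
    max_jobs=4,
    max_parallel_jobs=2,
)

step_tune = TuningStep(
    name="TuneModel",
    tuner=tuner,
    inputs={"train": TrainingInput("s3://<bucket>/train/")},
    cache_config=CacheConfig(enable_caching=True, expire_after="P30D"),
)

pipeline = Pipeline(name="tuning-cache-repro", steps=[step_tune])

def static_hyperparameters(p):
    """Pull StaticHyperParameters out of the generated Tuning step arguments."""
    step = json.loads(p.definition())["Steps"][0]
    return step["Arguments"]["TrainingJobDefinition"]["StaticHyperParameters"]

# Two definitions of the same, unchanged pipeline: sagemaker_job_name differs
# because a fresh timestamp is appended each time, which defeats the cache.
print(static_hyperparameters(pipeline)["sagemaker_job_name"])
print(static_hyperparameters(pipeline)["sagemaker_job_name"])
```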
Expected behavior
This sagemaker_job_name should not change on upsert, and it should not invalidate the cache.
Screenshots or logs
No
System information
A description of your system. Please provide:
SageMaker Python SDK version: 2.130.0
Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): xgboost (I am using my own image and scripts)
Framework version: xgboost-1.5-1
Python version: 3.8
CPU or GPU: CPU
Custom Docker image (Y/N): Y
Additional context
A similar bug existed for training jobs and was fixed last year: #2940
See the job name definition in sagemaker-python-sdk/src/sagemaker/tuner.py (Line 456 in e2f3888) and the fix for training jobs: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/workflow/steps.py#L493
Thanks for pointing this out!
The sagemaker_job_name added into TrainingJobDefinition-StaticHyperParameters is actually generated by the estimator object which is supplied to the tuner.
In this previous fix, https://github.com/aws/sagemaker-python-sdk/pull/2950/files, we removed sagemaker_job_name from the hyperparameters in the TrainingStep; we may need to do something similar to clean it up in the TuningStep as well.
Will open a PR to fix the issue
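For illustration only, a rough sketch of what that cleanup could look like. This is not the actual SDK change; the helper name is hypothetical, and the key names come from the generated request dict described in this issue.

```python
# Hypothetical helper: strip the timestamped sagemaker_job_name from the static
# hyperparameters of a HyperParameterTuningJob request dict, mirroring the
# earlier TrainingStep fix. Handles both single- and multi-definition tuning jobs.
def drop_timestamped_job_name(request_dict):
    definitions = []
    if "TrainingJobDefinition" in request_dict:
        definitions.append(request_dict["TrainingJobDefinition"])
    definitions.extend(request_dict.get("TrainingJobDefinitions", []))
    for definition in definitions:
        definition.get("StaticHyperParameters", {}).pop("sagemaker_job_name", None)
    return request_dict
```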
Update:
Even if we make a similar change and remove sagemaker_job_name from the TuningStep request dict, it still ends up as a cache miss under script mode (i.e. a user-supplied script in the estimator object), because the sagemaker_submit_directory in TrainingJobDefinition-StaticHyperParameters still changes each time with a timestamp suffix.
Looking into the current cache improvement logic, it seems to support only TrainingStep and ProcessingStep (see here).
Thus this issue is actually a feature request for cache improvement on the TuningStep.
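To make the remaining drift concrete, a check along these lines (reusing the hypothetical pipeline and static_hyperparameters helper from the reproduction sketch above) shows which static hyperparameters differ between two definitions; under script mode the changing keys are expected to include sagemaker_submit_directory.

```python
# Sketch: even with sagemaker_job_name stripped, the script-mode submit directory
# still carries a timestamp suffix, so two definitions of the same pipeline differ.
first = static_hyperparameters(pipeline)
second = static_hyperparameters(pipeline)
changed = sorted(k for k in first if first[k] != second.get(k))
print(changed)  # expected to include "sagemaker_submit_directory"
```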
cc the feature expert @brockwade633
Thanks for using SageMaker and taking the time to suggest ways to improve the SageMaker Python SDK. We have added your feature request to our backlog and may consider it for future SDK versions. I will go ahead and close the issue now; please let me know if you have any more feedback or questions.