
Commit 8dc17fb

stacicho and vikas0203 authored
feature: Pipelines cache keys update (#3428)
Co-authored-by: vikas0203 <[email protected]>
1 parent 55f2365 commit 8dc17fb

1 file changed (+163, -7 lines)

doc/amazon_sagemaker_model_building_pipeline.rst

@@ -322,29 +322,34 @@ Example:
 
 .. code-block:: python
 
+    bucket = "my-bucket"
+    model_prefix = "my-model"
+
     step_tune = TuningStep(...)
     # tuning step can launch multiple training jobs, thus producing multiple model artifacts
     # we can create a model with the best performance
     best_model = Model(
         model_data=Join(
             on="/",
             values=[
-                "s3://my-bucket",
+                f"s3://{bucket}/{model_prefix}",
                 # from DescribeHyperParameterTuningJob
                 step_tune.properties.BestTrainingJob.TrainingJobName,
                 "output/model.tar.gz",
             ],
+        )
     )
     # we can also access any top-k best as we wish
     second_best_model = Model(
         model_data=Join(
             on="/",
             values=[
-                "s3://my-bucket",
+                f"s3://{bucket}/{model_prefix}",
                 # from ListTrainingJobsForHyperParameterTuningJob
                 step_tune.properties.TrainingJobSummaries[1].TrainingJobName,
                 "output/model.tar.gz",
             ],
+        )
     )
 
 :class:`sagemaker.workflow.steps.TuningStep` also has a helper function to generate any :code:`top-k` model data URI easily:
@@ -353,7 +358,8 @@ Example:
 
     model_data = step_tune.get_top_model_s3_uri(
         top_k=0, # best model
-        s3_bucket="s3://my-bucekt",
+        s3_bucket=bucket,
+        prefix=model_prefix
     )
 
 CreateModelStep
@@ -833,9 +839,9 @@ The following example uses :class:`sagemaker.workflow.parallelism_config.Paralle
 
 Caching Configuration
 ==============================
-Executing the step without changing its configurations, inputs, or outputs can be a waste. Thus, we can enable caching for pipeline steps. When caching is enabled, an expiration time (in `ISO8601 duration string format`_) needs to be supplied. The expiration time indicates how old a previous execution can be to be considered for reuse.
+Executing the step without changing its configurations, inputs, or outputs can be a waste. Thus, we can enable caching for pipeline steps. When you use step signature caching, SageMaker Pipelines tries to use a previous run of your current pipeline step instead of running the step again. When previous runs are considered for reuse, certain arguments from the step are evaluated to see if any have changed. If any of these arguments have been updated, the step will execute again with the new configuration.
 
-.. _ISO8601 duration string format: https://en.wikipedia.org/wiki/ISO_8601#Durations
+When you turn on caching, you supply an expiration time (in `ISO8601 duration string format <https://en.wikipedia.org/wiki/ISO_8601#Durations>`__). The expiration time indicates how old a previous execution can be to be considered for reuse.
 
 .. code-block:: python
 
@@ -844,13 +850,13 @@ Executing the step without changing its configurations, inputs, or outputs can b
         expire_after="P30d" # 30-day
     )
 
-Here are few sample ISO8601 duration strings:
+You can format your ISO8601 duration strings like the following examples:
 
 - :code:`p30d`: 30 days
 - :code:`P4DT12H`: 4 days and 12 hours
 - :code:`T12H`: 12 hours
 
-Caching is supported for the following step type:
+Caching is supported for the following step types:
 
 - :class:`sagemaker.workflow.steps.TrainingStep`
 - :class:`sagemaker.workflow.steps.ProcessingStep`
@@ -860,6 +866,156 @@ Caching is supported for the following step type:

- :class:`sagemaker.workflow.clarify_check_step.ClarifyCheckStep`
- :class:`sagemaker.workflow.emr_step.EMRStep`

In order to create pipeline steps and eventually construct a SageMaker pipeline, you provide parameters within a Python script or notebook. The SageMaker Python SDK creates a pipeline definition by translating these parameters into SageMaker job attributes. Some of these attributes, when changed, cause the step to re-run (see `Caching Pipeline Steps <https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-caching.html>`__ for a detailed list). Therefore, if you update an SDK parameter that is used to create such an attribute, the step will rerun. See the following discussion for examples of this in processing and training steps, which are commonly used steps in Pipelines.

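If you want to see exactly which job attributes your SDK parameters translate into, you can render the pipeline definition yourself. The following is a minimal illustrative sketch, not part of the change above; it assumes the :code:`processing_step` and :code:`training_step` objects created in the examples that follow, and the pipeline name is a placeholder.

.. code-block:: python

    import json

    from sagemaker.workflow.pipeline import Pipeline

    # Hypothetical pipeline assembled from the steps defined below. Inspecting the
    # rendered definition shows the AppSpecification, HyperParameters, and input
    # S3Uri values that caching compares between executions.
    pipeline = Pipeline(
        name="CachingExamplePipeline",  # placeholder name
        steps=[processing_step, training_step],
        sagemaker_session=pipeline_session,
    )
    print(json.dumps(json.loads(pipeline.definition()), indent=2))
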
The following example creates a processing step:

.. code-block:: python

    from sagemaker.workflow.pipeline_context import PipelineSession
    from sagemaker.sklearn.processing import SKLearnProcessor
    from sagemaker.workflow.steps import ProcessingStep
    from sagemaker.dataset_definition.inputs import S3Input
    from sagemaker.processing import ProcessingInput, ProcessingOutput

    pipeline_session = PipelineSession()

    framework_version = "0.23-1"

    # role, processing_instance_count, and cache_config are assumed to be defined earlier
    sklearn_processor = SKLearnProcessor(
        framework_version=framework_version,
        instance_type="ml.m5.xlarge",
        instance_count=processing_instance_count,
        role=role,
        sagemaker_session=pipeline_session
    )

    processor_args = sklearn_processor.run(
        inputs=[
            ProcessingInput(
                source="artifacts/data/abalone-dataset.csv",
                input_name="abalone-dataset",
                s3_input=S3Input(
                    local_path="/opt/ml/processing/input",
                    s3_uri="artifacts/data/abalone-dataset.csv",
                    s3_data_type="S3Prefix",
                    s3_input_mode="File",
                    s3_data_distribution_type="FullyReplicated",
                    s3_compression_type="None",
                )
            )
        ],
        outputs=[
            ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
            ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
            ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
        ],
        code="artifacts/code/process/preprocessing.py",
    )

    processing_step = ProcessingStep(
        name="Process",
        step_args=processor_args,
        cache_config=cache_config
    )

The following parameters from the example cause additional processing step iterations when you change them:

- :code:`framework_version`: This parameter is used to construct the :code:`image_uri` for the `AppSpecification <https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AppSpecification.html>`__ attribute of the processing job.
- :code:`inputs`: Any :class:`ProcessingInput` objects are passed through directly as job `ProcessingInputs <https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProcessingInput.html>`__. Input :code:`source` files that exist in the container's local file system are uploaded to S3 and given a new :code:`S3Uri`. If the S3 path changes, a new processing job is initiated. For examples of S3 paths, see the **S3 Artifact Folder Structure** section, and see the sketch after this list for how an explicit S3 URI interacts with caching.
- :code:`code`: The code parameter is also packaged as a job `ProcessingInput <https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProcessingInput.html>`__. For local files, a unique hash is created from the file. The file is then uploaded to S3 with the hash included in the path. When a different local file is used, a new hash is created and the S3 path for that `ProcessingInput <https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProcessingInput.html>`__ changes, initiating a new step run. For examples of S3 paths, see the **S3 Artifact Folder Structure** section.

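The caching behavior of :code:`inputs` is easier to reason about when you stage the data yourself and pass an explicit S3 URI instead of a local path. The following is a minimal sketch, not part of the change above; the bucket and prefix are placeholders.

.. code-block:: python

    from sagemaker.s3 import S3Uploader

    # Placeholder bucket/prefix. Because only the S3 path participates in the cache
    # key, re-uploading to the same key does not invalidate the cached step, while
    # uploading to a new key (for example, a new version prefix) does.
    dataset_uri = S3Uploader.upload(
        local_path="artifacts/data/abalone-dataset.csv",
        desired_s3_uri="s3://my-bucket/abalone/inputs/v1",
    )
    # dataset_uri can then be used as the ProcessingInput source in place of the local path.
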
The following example creates a training step:

.. code-block:: python

    import sagemaker
    from sagemaker.inputs import TrainingInput
    from sagemaker.sklearn.estimator import SKLearn
    from sagemaker.workflow.steps import TrainingStep

    pipeline_session = PipelineSession()

    # region, role_arn, training_instance_type, and cache_config are assumed to be defined earlier
    image_uri = sagemaker.image_uris.retrieve(
        framework="xgboost",
        region=region,
        version="1.0-1",
        py_version="py3",
        instance_type="ml.m5.xlarge",
    )

    hyperparameters = {
        "dataset_frequency": "H",
        "timestamp_format": "yyyy-MM-dd hh:mm:ss",
        "number_of_backtest_windows": "1",
        "role_arn": role_arn,
        "region": region,
    }

    sklearn_estimator = SKLearn(
        entry_point="train.py",
        role=role_arn,
        image_uri=image_uri,
        instance_type=training_instance_type,
        sagemaker_session=pipeline_session,
        base_job_name="training_job",
        hyperparameters=hyperparameters,
        enable_sagemaker_metrics=True,
    )

    train_args = sklearn_estimator.fit(
        inputs={
            "train": TrainingInput(
                s3_data=processing_step.properties.ProcessingOutputConfig.Outputs[
                    "train"
                ].S3Output.S3Uri,
                content_type="text/csv",
            ),
            "validation": TrainingInput(
                s3_data=processing_step.properties.ProcessingOutputConfig.Outputs[
                    "validation"
                ].S3Output.S3Uri,
                content_type="text/csv",
            ),
        }
    )

    training_step = TrainingStep(
        name="Train",
        step_args=train_args,
        cache_config=cache_config
    )

The following parameters from the example cause additional training step iterations when you change them:

- :code:`image_uri`: The :code:`image_uri` parameter defines the image used for training, and is used directly in the `AlgorithmSpecification <https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AlgorithmSpecification.html>`__ attribute of the training job (see the sketch after this list).
- :code:`hyperparameters`: All of the hyperparameters are used directly in the `HyperParameters <https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeTrainingJob.html#API_DescribeTrainingJob_ResponseSyntax>`__ attribute for the training job.
- :code:`entry_point`: The entry point file is included in the training job's `InputDataConfig Channel <https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Channel.html>`__ array. A unique hash is created from the file (and any other dependencies), and then the file is uploaded to S3 with the hash included in the path. When a different entry point file is used, a new hash is created and the S3 path for that `InputDataConfig Channel <https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Channel.html>`__ object changes, initiating a new step run. For examples of what the S3 paths look like, see the **S3 Artifact Folder Structure** section.
- :code:`inputs`: The inputs are also included in the training job's `InputDataConfig <https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Channel.html>`__. Local inputs are uploaded to S3. If the S3 path changes, a new training job is initiated. For examples of S3 paths, see the **S3 Artifact Folder Structure** section.

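To make the :code:`image_uri` point concrete, the sketch below (not part of the change above) retrieves two different container versions the same way the training example does; because the resulting URIs differ, a step that switches between them no longer matches its cached execution. The region value is a placeholder.

.. code-block:: python

    import sagemaker

    # Same framework, two different versions: the resolved image URIs differ, so the
    # AlgorithmSpecification changes and the cached step execution is not reused.
    image_v1 = sagemaker.image_uris.retrieve(
        framework="xgboost",
        region="us-west-2",  # placeholder region
        version="1.0-1",
        py_version="py3",
        instance_type="ml.m5.xlarge",
    )
    image_v2 = sagemaker.image_uris.retrieve(
        framework="xgboost",
        region="us-west-2",  # placeholder region
        version="1.2-1",
        py_version="py3",
        instance_type="ml.m5.xlarge",
    )
    assert image_v1 != image_v2
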
S3 Artifact Folder Structure
----------------------------

You use the following S3 paths when uploading local input and code artifacts, and when saving output artifacts.

*Processing*

- Code: :code:`s3://bucket_name/pipeline_name/code/<code_hash>/file.py`. The file could also be a tar.gz of source_dir and dependencies. (An illustrative sketch of this path follows these lists.)
- Input Data: :code:`s3://bucket_name/pipeline_name/step_name/input/input_name/file.csv`
- Configuration: :code:`s3://bucket_name/pipeline_name/step_name/input/conf/<configuration_hash>/configuration.json`
- Output: :code:`s3://bucket_name/pipeline_name/<execution_id>/step_name/output/output_name`

*Training*

- Code: :code:`s3://bucket_name/code_location/pipeline_name/code/<code_hash>/code.tar.gz`
- Output: The output paths for training jobs can vary. The default output path is the root of the S3 bucket: :code:`s3://bucket_name`. For training jobs created from a tuning job, the default path includes the training job name created by the training platform, formatted as :code:`s3://bucket_name/<training_job_name>/output/model.tar.gz`.

*Transform*

- Output: :code:`s3://bucket_name/pipeline_name/<execution_id>/step_name`

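As an illustration only (not part of the change above), the processing code path could resolve to something like the following; every value here is hypothetical, and the :code:`<code_hash>` component is what changes when the local script changes.

.. code-block:: python

    # All values are hypothetical; the hash is computed by the SDK from the local file.
    bucket_name = "my-bucket"
    pipeline_name = "MyPipeline"
    code_hash = "ab12cd34ef56"
    code_s3_uri = f"s3://{bucket_name}/{pipeline_name}/code/{code_hash}/preprocessing.py"
    # A new preprocessing.py produces a new hash, a new code_s3_uri, and therefore a fresh step run.
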
.. warning::
    For input artifacts such as data or code files, the actual content of the artifacts is not tracked, only the S3 path. This means that if a file in S3 is updated and re-uploaded directly with an identical name and path, then the step does NOT run again.

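If a step must pick up content that was re-uploaded to an identical S3 path, one option is to turn caching off for that step. This is a minimal sketch, not part of the change above:

.. code-block:: python

    from sagemaker.workflow.steps import CacheConfig

    # With caching disabled, the step runs on every execution regardless of whether
    # its arguments (including S3 paths) have changed.
    no_cache_config = CacheConfig(enable_caching=False)
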
Retry Policy
===============

