Cache does not work with the PySparkProcessor class #3384

Closed
HarryPommier opened this issue Sep 29, 2022 · 3 comments
Labels: component: pipelines (Relates to the SageMaker Pipeline Platform), type: bug

Comments

HarryPommier commented Sep 29, 2022

Describe the bug
It seems that the cache mechanism does not work with the PySparkProcessor.

To reproduce

pyspark_processor = PySparkProcessor(
    base_job_name="sm-spark",
    framework_version="3.1",
    role=role_arn,
    instance_type="ml.m5.xlarge",
    instance_count=8,
    sagemaker_session=pipeline_session,
    max_runtime_in_seconds=2400,
)

step_process_args = pyspark_processor.run(
    submit_app="steps/preprocess.py",
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/output", destination=f"s3://{static_bucket}/{static_prefix}")],
)

step_process = ProcessingStep(
    name="PySparkPreprocessing",
    step_args=step_process_args,
    cache_config=cache_config,
)
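
For context, the snippet above assumes imports and helper objects roughly like the following (role_arn, static_bucket and static_prefix are placeholders defined elsewhere; the CacheConfig values are only an example):

from sagemaker.processing import ProcessingOutput
from sagemaker.spark.processing import PySparkProcessor
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import CacheConfig, ProcessingStep

# Session used to defer the processing job to a pipeline step.
pipeline_session = PipelineSession()

# Enable step caching; "P30D" (30 days, ISO 8601 duration) is an arbitrary example value.
cache_config = CacheConfig(enable_caching=True, expire_after="P30D")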

System information

  • SageMaker Python SDK version: 2.109.0

Additional context
I don't think #2790 solves the caching problem when using the PySparkProcessor class.
As far as I understand, this piece of code (src/sagemaker/workflow/steps.py):

    if code:
        code_url = urlparse(code)
        if code_url.scheme == "" or code_url.scheme == "file":
            # By default, Processor will upload the local code to an S3 path
            # containing a timestamp. This causes cache misses whenever a
            # pipeline is updated, even if the underlying script hasn't changed.
            # To avoid this, hash the contents of the script and include it
            # in the job_name passed to the Processor, which will be used
            # instead of the timestamped path.
            self.job_name = self._generate_code_upload_path()

is only executed when the "code" argument is provided, which is not the case when using PySparkProcessor, since we only pass a "submit_app" argument.

I found a temporary workaround: provide a static S3 path for the "submit_app" parameter (otherwise a timestamped path is always generated, which makes the cache fail). A sketch is shown below.
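
A minimal sketch of that workaround, reusing the names from the snippet above (the "/code" suffix on the S3 prefix is an arbitrary choice):

from sagemaker.s3 import S3Uploader

# Upload the Spark script once to a fixed, non-timestamped S3 location,
# then pass that static URI as submit_app so the step's inputs stay stable.
submit_app_uri = S3Uploader.upload(
    local_path="steps/preprocess.py",
    desired_s3_uri=f"s3://{static_bucket}/{static_prefix}/code",
    sagemaker_session=pipeline_session,
)

step_process_args = pyspark_processor.run(
    submit_app=submit_app_uri,
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/output", destination=f"s3://{static_bucket}/{static_prefix}")],
)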

@rohangujarathi rohangujarathi added the component: pipelines Relates to the SageMaker Pipeline Platform label Nov 12, 2022
brockwade633 (Contributor) commented

Hi @HarryPommier, thanks for the info here. There have been some caching improvements released in the SageMaker Python SDK recently, including the introduction of new S3 upload path structures. The new paths are based on a hash of the local artifacts, avoiding the problem you mentioned of timestamped paths causing cache misses. The new path structures are summarized here.

Are you able to re-try your PySparkProcessor pipeline example with the latest SDK version, 2.125.0, and confirm the cache is working as expected?
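
A quick way to confirm which version a pipeline environment is actually running (a minimal check, not from the original thread):

import sagemaker

# The hash-based upload paths require a sufficiently recent SDK release.
print(sagemaker.__version__)  # expect >= 2.125.0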

HarryPommier (Author) commented

Hi @brockwade633, I re-ran my pipeline with SDK version 2.125.0 and it fixes the cache problem mentioned above. Thanks for the fix!

brockwade633 (Contributor) commented

No problem! Closing the issue.
