Cache does not work with the PySparkProcessor class #3384

Closed
HarryPommier opened this issue Sep 29, 2022 · 3 comments
Labels: component: pipelines (Relates to the SageMaker Pipeline Platform), type: bug

Comments

HarryPommier commented Sep 29, 2022

Describe the bug
It seems that the cache mechanism does not work with the PySparkProcessor.

To reproduce

pyspark_processor = PySparkProcessor(
    base_job_name="sm-spark",
    framework_version="3.1",
    role=role_arn,
    instance_type="ml.m5.xlarge",
    instance_count=8,
    sagemaker_session=pipeline_session,
    max_runtime_in_seconds=2400,
)

step_process_args = pyspark_processor.run(
    submit_app="steps/preprocess.py",
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/output", destination=f"s3://{static_bucket}/{static_prefix}")],
)

step_process = ProcessingStep(
    name="PySparkPreprocessing",
    step_args=step_process_args,
    cache_config=cache_config,
)
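
For context, the snippet above assumes imports and helper objects roughly like the following (role_arn, static_bucket and static_prefix are placeholders defined elsewhere; the CacheConfig values are only an example):

from sagemaker.processing import ProcessingOutput
from sagemaker.spark.processing import PySparkProcessor
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import CacheConfig, ProcessingStep

# Session used to defer the processing job to a pipeline step.
pipeline_session = PipelineSession()

# Enable step caching; "P30D" (30 days, ISO 8601 duration) is an arbitrary example value.
cache_config = CacheConfig(enable_caching=True, expire_after="P30D")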

System information

  • SageMaker Python SDK version: 2.109.0

Additional context
I don't think #2790 solves the caching problem when using the PySparkProcessor class.
As far as I understand, this piece of code (src/sagemaker/workflow/steps.py):

    if code:
        code_url = urlparse(code)
        if code_url.scheme == "" or code_url.scheme == "file":
            # By default, Processor will upload the local code to an S3 path
            # containing a timestamp. This causes cache misses whenever a
            # pipeline is updated, even if the underlying script hasn't changed.
            # To avoid this, hash the contents of the script and include it
            # in the job_name passed to the Processor, which will be used
            # instead of the timestamped path.
            self.job_name = self._generate_code_upload_path()

is only executed when the "code" argument is provided, which is not the case when using PySparkProcessor, since we only pass a "submit_app" argument.

I found a temporary workaround: provide a static S3 path for the "submit_app" parameter (otherwise a timestamped path is always generated, which makes the cache fail). A sketch is shown below.
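
A minimal sketch of that workaround, reusing the names from the snippet above (the "/code" suffix on the S3 prefix is an arbitrary choice):

from sagemaker.s3 import S3Uploader

# Upload the Spark script once to a fixed, non-timestamped S3 location,
# then pass that static URI as submit_app so the step's inputs stay stable.
submit_app_uri = S3Uploader.upload(
    local_path="steps/preprocess.py",
    desired_s3_uri=f"s3://{static_bucket}/{static_prefix}/code",
    sagemaker_session=pipeline_session,
)

step_process_args = pyspark_processor.run(
    submit_app=submit_app_uri,
    outputs=[ProcessingOutput(output_name="train", source="/opt/ml/processing/output", destination=f"s3://{static_bucket}/{static_prefix}")],
)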

@rohangujarathi rohangujarathi added the component: pipelines Relates to the SageMaker Pipeline Platform label Nov 12, 2022
brockwade633 (Contributor) commented

Hi @HarryPommier, thanks for the info here. There have been some caching improvements released in the SageMaker Python SDK recently, including the introduction of new S3 upload path structures. The new paths are based on a hash of the local artifacts, avoiding the problem you mentioned of timestamped paths causing cache misses. The new path structures are summarized here.

Are you able to re-try your PySparkProcessor pipeline example with the latest SDK version, 2.125.0, and confirm the cache is working as expected?
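
A quick way to confirm which version a pipeline environment is actually running (a minimal check, not from the original thread):

import sagemaker

# The hash-based upload paths require a sufficiently recent SDK release.
print(sagemaker.__version__)  # expect >= 2.125.0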

HarryPommier (Author) commented

Hi @brockwade633, I re-ran my pipeline with SDK version 2.125.0 and it fixes the cache problem mentioned above. Thanks for the fix!

brockwade633 (Contributor) commented

No problem! Closing the issue.
