Describe the bug
It seems that the cache mechanism does not work with the PySparkProcessor.
To reproduce
System information
A description of your system. Please provide:
SageMaker Python SDK version: 2.109.0
Additional context
I think #2790 does not solve the caching problem when using the "PySparkProcessor" class.
As far as I understand, this piece of code (from src/sagemaker/workflow/steps.py):
if code:
    code_url = urlparse(code)
    if code_url.scheme == "" or code_url.scheme == "file":
        # By default, Processor will upload the local code to an S3 path
        # containing a timestamp. This causes cache misses whenever a
        # pipeline is updated, even if the underlying script hasn't changed.
        # To avoid this, hash the contents of the script and include it
        # in the job_name passed to the Processor, which will be used
        # instead of the timestamped path.
        self.job_name = self._generate_code_upload_path()
is only executed when the "code" argument is provided. This is not the case with PySparkProcessor, since we only pass a "submit_app" argument.
I found a temporary workaround: provide a static S3 path for the "submit_app" parameter (otherwise a timestamped upload path is generated on every pipeline update, which makes the cache fail).
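For reference, a minimal sketch of that workaround (the bucket, role ARN, S3 URI, instance settings, and step name below are placeholders, not taken from the original report; it assumes the PipelineSession / step_args style of defining a ProcessingStep):

from sagemaker.spark.processing import PySparkProcessor
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import CacheConfig, ProcessingStep

pipeline_session = PipelineSession()

spark_processor = PySparkProcessor(
    base_job_name="spark-preprocess",
    framework_version="3.1",
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=2,
    instance_type="ml.m5.xlarge",
    sagemaker_session=pipeline_session,
)

# Workaround: point submit_app at a static S3 URI that was uploaded separately,
# so the step arguments do not embed a freshly timestamped upload path.
step_args = spark_processor.run(
    submit_app="s3://my-bucket/code/preprocess.py",  # placeholder, pre-uploaded script
)

step = ProcessingStep(
    name="SparkPreprocess",
    step_args=step_args,
    cache_config=CacheConfig(enable_caching=True, expire_after="P30D"),
)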
Hi @HarryPommier, thanks for the info here. There have been some caching improvements released in the SageMaker Python SDK recently, including the introduction of new S3 upload path structures. The new paths are based on the hash of the local artifacts, which avoids the problem you mentioned of timestamped paths causing cache misses. The new path structures are summarized here.
Are you able to re-try your PySparkProcessor pipeline example with the latest SDK version, 2.125.0, and confirm the cache is working as expected?
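For example, before re-running the pipeline you could confirm that the upgraded SDK is the one actually loaded (a small sketch; the version check uses the third-party packaging library, which is an assumption about your environment):

from packaging import version
import sagemaker

# Make sure the environment picked up an SDK release that includes the hash-based upload paths.
assert version.parse(sagemaker.__version__) >= version.parse("2.125.0")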