
Processing job running in a pipeline can't load default experiment #4114


Closed
lorenzwalthert opened this issue Sep 11, 2023 · 7 comments
Labels
component: experiments Relates to the SageMaker Experiements Platform component: pipelines Relates to the SageMaker Pipeline Platform type: feature request

Comments

@lorenzwalthert

lorenzwalthert commented Sep 11, 2023

Describe the bug

When I try to load the currently active run with sagemaker.experiments.load_run() in a processing job that is part of a pipeline, without specifying the experiment config anywhere, I get

RuntimeError: Not able to fetch RunName in ExperimentConfig of the sagemaker job. Please make sure the ExperimentConfig is correctly set.

from my processing script. When I pass an experiment name and run name to sagemaker.experiments.load_run(), the problem disappears. However, this is impractical, as I want to log to the current execution ID, which I can't easily retrieve in the processing job.

The intro blog post says:

With the new automatic context passing capabilities, we do not need to specifically share the experiment configuration with the SageMaker job, as it will be automatically captured.

According to the doc page Experiment + Pipeline integration, I don't need to specify an experiment when creating a pipeline, start a run myself, or wrap the code in a run context.

To reproduce

I wanted to create a minimal reproducible example, but your latest scikit-learn container does not even contain the SageMaker SDK, so it's a bit difficult...

But here is my abbreviated pipeline script:

processor_define_contracts = sagemaker.processing.ScriptProcessor(
    sagemaker_session=sagemaker.workflow.pipeline_context.PipelineSession(),
    ...
)

step = sagemaker.workflow.steps.ProcessingStep(
    step_args=processor_define_contracts.run(code='script.py'),
    ...
)

pipeline = sagemaker.workflow.pipeline.Pipeline(
    ...
    steps=[step],
)
pipeline.start()

And my processing script script.py contains:

import sagemaker.experiments

with sagemaker.experiments.load_run() as run:
    run.log_parameter("param1", "value1") 

Expected behavior

I expect that, in accordance with the docs, an experiment config does not have to be specified, neither when creating the pipeline nor in the processing scripts, in order to log to the current execution ID of the pipeline.

Screenshots or logs

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 2.184
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): Custom docker
  • Framework version: None
  • Python version: 3.11
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): Y

Additional context

This bug basically means that SageMaker Experiments can't be used in processing steps in pipelines.

Everything seems to work fine if I try to run the same processing job outside of a pipeline using the syntax introduced in the intro blog post:

with Run(...):
    processor_define_contracts = sagemaker.processing.ScriptProcessor(
        sagemaker_session=sagemaker.workflow.pipeline_context.PipelineSession(),
        ...
    )
    
    step = sagemaker.workflow.steps.ProcessingStep(
        step_args=processor_define_contracts.run(code='script.py'),
        ...
    )
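
For comparison, a minimal sketch of the blog-post pattern for a standalone processing job; the experiment/run names, role, image URI, and instance type are hypothetical:

import sagemaker
from sagemaker.experiments import Run
from sagemaker.processing import ScriptProcessor

session = sagemaker.Session()

with Run(experiment_name="my-experiment", run_name="my-run", sagemaker_session=session):
    processor = ScriptProcessor(
        image_uri="<my-image-uri>",        # hypothetical
        command=["python3"],
        role="<my-sagemaker-role-arn>",    # hypothetical
        instance_count=1,
        instance_type="ml.m5.xlarge",
        sagemaker_session=session,
    )
    # Outside of a pipeline, the experiment config of the active Run is attached
    # to the job, so load_run() in script.py resolves it without arguments.
    processor.run(code="script.py")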
@svia3 svia3 assigned svia3 and unassigned svia3 Sep 11, 2023
@svia3 svia3 added component: pipelines Relates to the SageMaker Pipeline Platform type: feature request and removed type: bug labels Sep 11, 2023
@martinRenou martinRenou added the component: experiments Relates to the SageMaker Experiements Platform label Sep 21, 2023
@lorenzwalthert
Author

lorenzwalthert commented Oct 17, 2023

I came up with the following workaround to obtain run_name and experiment_name in the script that runs in the container:

from sagemaker.utils import retry_with_backoff
import sagemaker.experiments

# Resolve the processing job name from the container's run environment.
environment = sagemaker.experiments._environment._RunEnvironment.load()
print(f"{environment.source_arn=}")
job_name = environment.source_arn.split("/")[-1]
print(f"job_name: {job_name}")

# Look up the ExperimentConfig that the pipeline attached to the job.
experiment_config = retry_with_backoff(
    lambda: sagemaker.Session().describe_processing_job(job_name).get("ExperimentConfig"),
    num_attempts=4,
)
run_name = experiment_config['TrialName']
experiment_name = experiment_config['ExperimentName']

Then, I loaded the run.

with sagemaker.experiments.load_run(experiment_name=experiment_name, run_name=run_name) as run:
    products = [models.Product[product_str.upper()] for product_str in products]
    run.log_parameter('testparam', 'schnell')
    run.log_metric('testmetric', 89)

The Run Group was set to the default run group, and for that reason I ended up with two entities: one from the default logging and one from my manual logging. Suboptimal. So I decided to set the run_group and run_name myself to match the automatic logging.

from sagemaker.utils import retry_with_backoff
import sagemaker.experiments

environment = sagemaker.experiments._environment._RunEnvironment.load()
print(f"{environment.source_arn=}")
job_name = environment.source_arn.split("/")[-1]
print(f"job_name: {job_name}")
response = retry_with_backoff(
    lambda: sagemaker.Session().describe_processing_job(job_name), num_attempts=4
)
run_name = job_name + "-aws-processing-job"
experiment_config = response.get("ExperimentConfig")
run_group_name = experiment_config["TrialName"]
experiment_name = experiment_config["ExperimentName"]

And further down

sagemaker.experiments.run.TRIAL_NAME_TEMPLATE = run_group_name

However, this gives me the following error:

ValueError: The run_name (length: 69) must have length less than or equal to 59

This can also be circumvented: the SDK seems, for some reason, to restrict the length in _generate_trial_component_name to half of the backend requirement, so we can undo that:

sagemaker.experiments.run.MAX_NAME_LEN_IN_BACKEND = sagemaker.experiments.run.MAX_NAME_LEN_IN_BACKEND * 2

However, it still won't work:

botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the CreateTrialComponent operation: The token -aws- is reserved.

Hence, it seems impossible to log to the run name that was already created with the processing job, even when trying manually. The docs promise to do all of that automatically.

@jadhosn

jadhosn commented Oct 26, 2023

@lorenzwalthert I'm facing the exact same issue. I'm creating a TrainingStep for an estimator, and similar to your experience above, I wasn't able to access the trial_name or any of the specified ExperimentConfig inside my run. Were you able to solve this issue any differently?

@jadhosn

jadhosn commented Oct 26, 2023

@lorenzwalthert how about treating the step (be it a TrainingStep or a ProcessingStep) as if it were a job: use load_run() inside your custom script to add metrics to an already existing experiment, under any run_name you choose (you can even create a randomly generated run_name at pipeline execution and feed it to your estimator as an ENV variable), and simply turn off the experiments and trial runs that are dynamically created by Pipelines (by setting ExperimentConfig=None in your pipeline definition). It's definitely hacky, but it works for metric tracking inside a step of any pipeline, at least until AWS comes back with an answer.
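
A rough, unverified sketch of the pipeline side of this workaround; the names, role, image URI, and environment variable names are hypothetical, and pipeline_experiment_config=None is the Pipeline argument that disables the auto-created experiment/run:

from sagemaker.processing import ScriptProcessor
from sagemaker.utils import unique_name_from_base
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import ProcessingStep

# Generate a run name at pipeline definition time and pass it to the job as env vars.
run_name = unique_name_from_base("my-run")  # hypothetical base name

processor = ScriptProcessor(
    image_uri="<my-image-uri>",          # hypothetical
    command=["python3"],
    role="<my-sagemaker-role-arn>",      # hypothetical
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=PipelineSession(),
    env={"MY_EXPERIMENT_NAME": "my-experiment", "MY_RUN_NAME": run_name},
)

step = ProcessingStep(name="my-step", step_args=processor.run(code="script.py"))

pipeline = Pipeline(
    name="my-pipeline",
    steps=[step],
    # Turn off the experiment/run that Pipelines would otherwise create automatically.
    pipeline_experiment_config=None,
)

Inside script.py, the run can then be resolved explicitly with load_run(experiment_name=os.environ["MY_EXPERIMENT_NAME"], run_name=os.environ["MY_RUN_NAME"]), as in the explicit variant shown earlier in the thread.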

@lorenzwalthert
Author

lorenzwalthert commented Nov 2, 2023

feed it to your estimator as an ENV variable

That may require a container rebuild or complex passing of job names and creation of unique names. My example shows how you can extract all relevant information manually in the container; it's just impossible to log to exactly the same run name that was already created.

@ardhe-qb

Are there any updates on this issue? This is an important problem for me.

@lorenzwalthert
Author

I think upvoting the initial post and (if you have it) forwarding a link to this issue to AWS Premium Support is the way to go...

@ShwetaSingh801
Collaborator

Hi @lorenzwalthert,

Thanks for using SageMaker and taking the time to suggest ways to improve the SageMaker Python SDK. We have added your feature request to our backlog and may consider it for a future SDK version. I will go ahead and close the issue now; please let me know if you have any more feedback or questions.

Best,
Shweta
