
Processing job running in a pipeline can't load default experiment #4114


Closed
lorenzwalthert opened this issue Sep 11, 2023 · 7 comments
Labels
component: experiments Relates to the SageMaker Experiements Platform component: pipelines Relates to the SageMaker Pipeline Platform type: feature request

Comments

@lorenzwalthert

lorenzwalthert commented Sep 11, 2023

Describe the bug

When I try to load the currently active run with sagemaker.experiments.load_run() in a processing job that is part of a pipeline, without specifying the experiment config anywhere, I get

RuntimeError: Not able to fetch RunName in ExperimentConfig of the sagemaker job. Please make sure the ExperimentConfig is correctly set.

from my processing script. When I pass an experiment name and run name to sagemaker.experiments.load_run(), the problem disappears. However, this is impractical, as I want to log to the current execution ID, which I can't easily retrieve in the processing job.

The intro blog post says:

With the new automatic context passing capabilities, we do not need to specifically share the experiment configuration with the SageMaker job, as it will be automatically captured.

According to the doc page Experiment + Pipeline integration, I don't need to specify an experiment when creating a pipeline, start a run myself, or wrap the code in a run context.

To reproduce

I wanted to create a minimal reproducible example, but your latest scikit-learn container does not even contain the SageMaker SDK, so it's a bit difficult...

But here is my abbreviated pipeline script:

processor_define_contracts = sagemaker.processing.ScriptProcessor(
    sagemaker_session=sagemaker.workflow.pipeline_context.PipelineSession(),
    ...
)

step = sagemaker.workflow.steps.ProcessingStep(
    step_args=processor_define_contracts.run(code='script.py'),
    ...
)

pipeline = sagemaker.workflow.pipeline.Pipeline(
    ...
    steps=[step],
)
pipeline.start()

And my processing script script.py contains:

import sagemaker.experiments

with sagemaker.experiments.load_run() as run:
    run.log_parameter("param1", "value1") 

Expected behavior

I expect that, in accordance with the docs, an experiment config does not have to be specified, neither when creating the pipeline nor in the processing scripts, in order to log to the current execution ID of the pipeline.

Screenshots or logs

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 2.184
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): Custom docker
  • Framework version: None
  • Python version: 3.11
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): Y

Additional context

This bug basically means that SageMaker Experiments can't be used in processing steps in pipelines.

Everything seems to work fine if I try to run the same processing job outside of a pipeline using the syntax introduced in the intro blog post:

with Run(...):
    processor_define_contracts = sagemaker.processing.ScriptProcessor(
        sagemaker_session=sagemaker.workflow.pipeline_context.PipelineSession(),
        ...
    )
    
    step = sagemaker.workflow.steps.ProcessingStep(
        step_args=processor_define_contracts.run(code='script.py'),
        ...
    )
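
For comparison, a minimal sketch of the blog-post pattern for a standalone processing job; the experiment/run names, role, image URI, and instance type are hypothetical:

import sagemaker
from sagemaker.experiments import Run
from sagemaker.processing import ScriptProcessor

session = sagemaker.Session()

with Run(experiment_name="my-experiment", run_name="my-run", sagemaker_session=session):
    processor = ScriptProcessor(
        image_uri="<my-image-uri>",        # hypothetical
        command=["python3"],
        role="<my-sagemaker-role-arn>",    # hypothetical
        instance_count=1,
        instance_type="ml.m5.xlarge",
        sagemaker_session=session,
    )
    # Outside of a pipeline, the experiment config of the active Run is attached
    # to the job, so load_run() in script.py resolves it without arguments.
    processor.run(code="script.py")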
@svia3 svia3 assigned svia3 and unassigned svia3 Sep 11, 2023
@svia3 svia3 added component: pipelines Relates to the SageMaker Pipeline Platform type: feature request and removed type: bug labels Sep 11, 2023
@martinRenou martinRenou added the component: experiments Relates to the SageMaker Experiements Platform label Sep 21, 2023
@lorenzwalthert
Author

lorenzwalthert commented Oct 17, 2023

I came up with the following workaround to obtain run_name and experiment_name in the script that runs in the container:

from sagemaker.utils import retry_with_backoff
import sagemaker.experiments

# Resolve the processing job name from the container's run environment.
environment = sagemaker.experiments._environment._RunEnvironment.load()
print(f"{environment.source_arn=}")
job_name = environment.source_arn.split("/")[-1]
print(f"job_name: {job_name}")

# Look up the ExperimentConfig that the pipeline attached to the job.
experiment_config = retry_with_backoff(
    lambda: sagemaker.Session().describe_processing_job(job_name).get("ExperimentConfig"),
    num_attempts=4,
)
run_name = experiment_config['TrialName']
experiment_name = experiment_config['ExperimentName']

Then, I loaded the run.

with sagemaker.experiments.load_run(experiment_name=experiment_name, run_name=run_name) as run:
    products = [models.Product[product_str.upper()] for product_str in products]
    run.log_parameter('testparam', 'schnell')
    run.log_metric('testmetric', 89)

The Run Group was set to the default run group, and for that reason I ended up with two entities: one from the default logging and one from my manual logging. Suboptimal. So I decided to set the run_group and run_name myself to match the automatic logging.

from sagemaker.utils import retry_with_backoff
import sagemaker.experiments

environment = sagemaker.experiments._environment._RunEnvironment.load()
print(f"{environment.source_arn=}")
job_name = environment.source_arn.split("/")[-1]
print(f"job_name: {job_name}")
response = retry_with_backoff(
    lambda: sagemaker.Session().describe_processing_job(job_name), num_attempts=4
)
run_name = job_name + "-aws-processing-job"
experiment_config = response.get("ExperimentConfig")
run_group_name = experiment_config["TrialName"]
experiment_name = experiment_config["ExperimentName"]

And further down

sagemaker.experiments.run.TRIAL_NAME_TEMPLATE = run_group_name

However, this gives me the following error:

ValueError: The run_name (length: 69) must have length less than or equal to 59

This can also be circumvented: the SDK seems, for some reason, to restrict the length in _generate_trial_component_name to half of the backend requirement, so we can undo that:

sagemaker.experiments.run.MAX_NAME_LEN_IN_BACKEND = sagemaker.experiments.run.MAX_NAME_LEN_IN_BACKEND * 2

However, it still won't work:

botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the CreateTrialComponent operation: The token -aws- is reserved.

Hence, it seems impossible to log to the run name that was already created with the processing job, even when trying manually. The docs promise to do all of that automatically.

@jadhosn

jadhosn commented Oct 26, 2023

@lorenzwalthert I'm facing the exact same issue. I'm creating a TrainingStep for an estimator, and similar to your experience above, I wasn't able to access the trial_name or any of the specified ExperimentConfig inside my run. Were you able to solve this issue any differently?

@jadhosn

jadhosn commented Oct 26, 2023

@lorenzwalthert how about treating the step (be it a TrainingStep or a ProcessingStep) as if it were a job: use load_run() inside your custom script to add metrics to an already existing experiment, under any run_name you choose (you can even create a randomly generated run_name at pipeline execution and feed it to your estimator as an ENV variable), and simply turn off the experiments and trial runs that are dynamically created by Pipelines (by setting ExperimentConfig=None in your pipeline definition). It's definitely hacky, but it works for metric tracking inside a step of any pipeline, at least until AWS comes back with an answer.
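
A rough, unverified sketch of the pipeline side of this workaround; the names, role, image URI, and environment variable names are hypothetical, and pipeline_experiment_config=None is the Pipeline argument that disables the auto-created experiment/run:

from sagemaker.processing import ScriptProcessor
from sagemaker.utils import unique_name_from_base
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import PipelineSession
from sagemaker.workflow.steps import ProcessingStep

# Generate a run name at pipeline definition time and pass it to the job as env vars.
run_name = unique_name_from_base("my-run")  # hypothetical base name

processor = ScriptProcessor(
    image_uri="<my-image-uri>",          # hypothetical
    command=["python3"],
    role="<my-sagemaker-role-arn>",      # hypothetical
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=PipelineSession(),
    env={"MY_EXPERIMENT_NAME": "my-experiment", "MY_RUN_NAME": run_name},
)

step = ProcessingStep(name="my-step", step_args=processor.run(code="script.py"))

pipeline = Pipeline(
    name="my-pipeline",
    steps=[step],
    # Turn off the experiment/run that Pipelines would otherwise create automatically.
    pipeline_experiment_config=None,
)

Inside script.py, the run can then be resolved explicitly with load_run(experiment_name=os.environ["MY_EXPERIMENT_NAME"], run_name=os.environ["MY_RUN_NAME"]), as in the explicit variant shown earlier in the thread.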

@lorenzwalthert
Author

lorenzwalthert commented Nov 2, 2023

feed it to your estimator as an ENV variable

That may require a container rebuild or complex passing of job names and creation of unique names. My example shows how you can extract all relevant information manually in the container; it's just impossible to log to exactly the same run name that was already created.

@ardhe-qb

Are there any updates on this issue? This is an important problem for me.

@lorenzwalthert
Author

I think upvoting the initial post and (if you have it) forwarding a link to this issue to AWS Premium Support is the way to go...

@ShwetaSingh801
Collaborator

Hi @lorenzwalthert,

Thanks for using SageMaker and taking the time to suggest ways to improve the SageMaker Python SDK. We have added your feature request to our backlog and may consider it for a future SDK version. I will go ahead and close the issue now; please let me know if you have any more feedback or questions.

Best,
Shweta
