SageMaker experiments not being created within training job when mandated to have tags #3989

inchara1990 · 2023-07-11T17:29:28Z

Describe the bug
We have multiple data science teams onboarded on SageMaker Studio. We mandate that all users tag SageMaker resources to enable tag-based access control. However, when using SageMaker Experiments with a PyTorch training job, the job fails with an 'Access Denied' error in the load_run() method, apparently because the required tags are not passed on when the experiment object is created. Even if an experiment with the right tags exists, the Experiment._load_or_create method in the SDK first tries to create the experiment and only attempts to load it if it encounters a 'resource already exists' exception. In my case the create call fails with 'Access Denied', which is not caught. I can confirm that this happens only at the training job level; I am able to create both the experiment and the trial using their specific create methods.

To reproduce
The SageMaker execution role should have the policy below in addition to the sagemaker:AddTags permission on all resources. Tag the execution role itself with the key 'team' and the value 'test' to allow for the PrincipalTag comparison.

[
   {
    "Condition": {
      "StringEquals": {
        "aws:RequestTag/team": "${aws:PrincipalTag/team}"
      }
    },
    "Action": "sagemaker:Create*",
    "Resource": "*",
    "Effect": "Allow"
  },
  {
    "Condition": {
      "StringEquals": {
        "aws:ResourceTag/key": "${aws:PrincipalTag/team}"
      }
    },
    "Effect": "Allow",
    "Action": "sagemaker:CreateTrial",
    "Resource": ["arn:aws:sagemaker:*:*:experiment/*"]
  }
]

The following works

 from smexperiments.experiment import Experiment
 from smexperiments.trial import Trial

 # The tag value matches the 'team' tag on the execution role.
 default_tags = [{'Key': 'team', 'Value': 'test'}]
 experiment = Experiment.create(experiment_name='MNIST', tags=default_tags)
 trial = Trial.create(trial_name="linear-learner2", experiment_name="MNIST", tags=default_tags)

The below throws an 'Access Denied' exception

   from sagemaker.experiments.run import Run
   from sagemaker.pytorch import PyTorch

   with Run(experiment_name='MNIST', run_name=run_name, sagemaker_session=sess, tags=default_tags) as run:
       pytorch_estimator = PyTorch(entry_point='train_script_expt.py',
                                   ....
                                   tags=default_tags)
       pytorch_estimator.fit({'train': trainpath,
                              'test': testpath})

Training job script which throws the error

     with load_run(sagemaker_session=session) as run:
         run.log_parameters(
             {"device": device, "epochs": args.epochs}
         )

Expected behavior
I was expecting the tags to be passed on to the experiment and trial creation within the SageMaker job. Alternatively, if the guidance is to make sure that the experiment and trial already exist, then the SDK should load the experiment regardless of the error thrown by create in the try block.

Screenshots or logs
Error stack trace:

Traceback (most recent call last):
  File "train_script_expt.py", line 190, in <module>
    with load_run(sagemaker_session=session) as run:
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/experiments/run.py", line 847, in load_run
    run_instance = Run(
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/experiments/run.py", line 177, in __init__
    self._experiment = Experiment._load_or_create(
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/experiments/experiment.py", line 171, in _load_or_create
    raise ce
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/experiments/experiment.py", line 160, in _load_or_create
    experiment = Experiment.create(
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/experiments/experiment.py", line 120, in create
    return cls._construct(
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/apiutils/_base_types.py", line 190, in _construct
    return instance._invoke_api(boto_method_name, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/apiutils/_base_types.py", line 226, in _invoke_api
    api_boto_response = api_method(**api_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/botocore/client.py", line 534, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/botocore/client.py", line 976, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDeniedException) when calling the CreateExperiment operation: User: arn:aws:sts::XXXX:assumed-role/aimlSagemakerStudio-xyzteam-LTWZ4NF3WWVM/SageMaker is not authorized to perform: sagemaker:CreateExperiment on resource: arn:aws:sagemaker:us-east-2:XXXexperiment/xxxx because no identity-based policy allows the sagemaker:CreateExperiment action

System information
A description of your system. Please provide:

SageMaker Python SDK version: 2.171.0
Framework name (eg. PyTorch) or algorithm (eg. KMeans): PyTorch
Framework version: 1.12
Python version: 3.8
CPU or GPU: CPU
Custom Docker image (Y/N): N

leoroy · 2023-07-18T18:07:01Z

This is a bug in the SDK. The SDK tries to create the experiment before loading it, and loads the experiment only if it encounters a 'resource already exists' exception. However, because of the tag condition set in the IAM policy, the create call fails with an access-denied error instead of a 'resource already exists' exception, so load_run() fails.
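
To make the failure mode concrete, here is a minimal sketch of the create-first flow described above. It does not use the real SDK: StubClientError and StubClient are hypothetical stand-ins for botocore's ClientError and the SageMaker client, and load_or_create only mirrors the logic as described in this thread, so treat it as an illustration rather than the SDK's actual code.

```python
class StubClientError(Exception):
    """Hypothetical stand-in for botocore.exceptions.ClientError."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.response = {"Error": {"Code": code, "Message": message}}


class StubClient:
    """Hypothetical client; allow_create=False simulates the tag-based denial."""

    def __init__(self, existing, allow_create):
        self.existing = set(existing)
        self.allow_create = allow_create

    def create_experiment(self, ExperimentName, Tags=None):
        # As observed in this thread, the access check fails even when the
        # experiment already exists, so it is modelled first here.
        if not self.allow_create:
            raise StubClientError("AccessDeniedException", "not authorized")
        if ExperimentName in self.existing:
            raise StubClientError("ValidationException", "Experiment already exists")
        self.existing.add(ExperimentName)
        return {"ExperimentName": ExperimentName}

    def describe_experiment(self, ExperimentName):
        return {"ExperimentName": ExperimentName}


def load_or_create(client, name, tags=None):
    """Create first; load only when the error says the experiment already exists."""
    try:
        return client.create_experiment(ExperimentName=name, Tags=tags or [])
    except StubClientError as ce:
        error = ce.response["Error"]
        if not (error["Code"] == "ValidationException" and "already exists" in error["Message"]):
            raise  # AccessDeniedException ends up here and is never turned into a load
        return client.describe_experiment(ExperimentName=name)
```

With StubClient(existing={"MNIST"}, allow_create=False), the call load_or_create(client, "MNIST") raises the access-denied error even though the experiment exists, which matches the traceback in the report.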

leoroy · 2023-07-18T18:10:08Z

We need to update the logic in load_run() to separate it from the with Run(...) experience. load_run() should load first and then fall back to create if the run does not exist.
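
A rough sketch of that load-first ordering, under stated assumptions: StubClientError and StubClient are hypothetical stand-ins for botocore's ClientError and the SageMaker client, load_then_create is an illustrative name, and "ResourceNotFound" is assumed to be the error code returned when the experiment does not exist. This is not the code from the eventual fix.

```python
class StubClientError(Exception):
    """Hypothetical stand-in for botocore.exceptions.ClientError."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.response = {"Error": {"Code": code, "Message": message}}


class StubClient:
    """Hypothetical client that records which API calls were made."""

    def __init__(self, existing, allow_create):
        self.existing = set(existing)
        self.allow_create = allow_create
        self.calls = []

    def describe_experiment(self, ExperimentName):
        self.calls.append("describe_experiment")
        if ExperimentName not in self.existing:
            # Assumed error code for a missing experiment.
            raise StubClientError("ResourceNotFound", "Experiment not found")
        return {"ExperimentName": ExperimentName}

    def create_experiment(self, ExperimentName, Tags=None):
        self.calls.append("create_experiment")
        if not self.allow_create:
            raise StubClientError("AccessDeniedException", "not authorized")
        self.existing.add(ExperimentName)
        return {"ExperimentName": ExperimentName}


def load_then_create(client, name, tags=None):
    """Load first; fall back to create only when the experiment is missing."""
    try:
        return client.describe_experiment(ExperimentName=name)
    except StubClientError as ce:
        if ce.response["Error"]["Code"] != "ResourceNotFound":
            raise  # any other failure (e.g. access denied on describe) is surfaced
    return client.create_experiment(ExperimentName=name, Tags=tags or [])
```

With this ordering, a role that may describe but not create an existing experiment never reaches the create call, so the tag condition on Create* is not exercised at load time.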

danabens · 2023-07-23T18:49:42Z

"the create will fail due to access deny instead of resource existed exception...."
"need to update the logic in load_run"

If the tag is passed on create as expected, the response should be a ValidationException ('already exists'), not AccessDenied. It looks like the root cause is one of:

logic is not passing tags as expected to CreateExperiment
customer is passing tag that does not match policy (the tag passed in code does not match the tag on the execution role)
something else (some other policy in effect not shown here)

danabens · 2023-07-23T20:07:19Z

I went through to reproduce this, and here are some of the probable root causes:

Policy changes take some time (~minutes) to apply and may temporarily revert (i.e. works, works, doesn't work, works again).
There is some other policy on the Studio role that is denying access.
The provided tag does not match the tag on the principal.

Tags are propagated as expected from the SDK to the API calls, so that's not the issue.

See attached policy and notebook which may help in troubleshooting:

For more debugging help provide:

Complete policy detail for the Principal in use.
Example notebook with tag value provided to SDK
The tags on the Principal.

inchara1990 · 2023-07-23T21:13:19Z

Hi Dana,
In your notebook, both of the following code snippets (cells 97 & 98) result in an AccessDenied error, even though you pass the tags to the run context. Is that expected?

with Run(experiment_name=experiment_name, run_name=run_name, sagemaker_session=sm_sess) as run:
    run.log_parameter('optimizer', 'adam')

with Run(experiment_name=experiment_name, run_name=run_name, sagemaker_session=sm_sess, tags=tag_list) as run:
    run.log_parameter('optimizer', 'adam')

It seems that the _load_or_create method in experiment.py expects the tags to be passed to Experiment.create(). Without them, the create error is not an 'already exists' ValidationException, so the check below re-raises it instead of loading the experiment:
if not (error_code == "ValidationException" and "already exists" in error_message)
Hence it never tries to load the existing experiment; it raises the error as soon as this check sees any other failure.
If the tags are being passed, why else is Experiment.create() failing with AccessDenied?

danabens · 2023-07-25T01:41:04Z

there are 3 conditions in the notebook:

CreateExperiment with no tags
CreateExperiment with a team tag that does not match the team tag on the Principal
CreateExperiment with a team tag that does match the team tag on the Principal
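
The three conditions can be sketched as a plain-Python check that mimics the StringEquals comparison between aws:RequestTag/team and aws:PrincipalTag/team from the policy in the report. The function name request_tag_matches_principal and the 'other' tag value are made up for illustration; real evaluation happens inside IAM and also depends on any other policies attached to the role.

```python
def request_tag_matches_principal(request_tags, principal_tags, key="team"):
    """Mimic StringEquals on aws:RequestTag/<key> vs ${aws:PrincipalTag/<key>}."""
    request = {t["Key"]: t["Value"] for t in request_tags or []}
    principal = {t["Key"]: t["Value"] for t in principal_tags or []}
    if key not in request or key not in principal:
        return False  # a missing tag cannot satisfy the StringEquals condition
    return request[key] == principal[key]  # exact, case-sensitive comparison


# Execution role tagged as described in the report: team=test.
principal_tags = [{"Key": "team", "Value": "test"}]

cases = {
    "no tags": None,
    "mismatched team tag": [{"Key": "team", "Value": "other"}],
    "matching team tag": [{"Key": "team", "Value": "test"}],
}

results = {
    label: request_tag_matches_principal(tags, principal_tags)
    for label, tags in cases.items()
}
```

Only the third case satisfies the condition, so only that CreateExperiment call would be allowed by the tag policy.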

Ya, we can add support for this scenario "If a caller doesn't have authorization to call Create* but does have authorization to Describe* then they should still be able to successfully call with Run...", however in your issue it seemed like you were expecting to have Create* authorization via the tag policy.

should load first and then fall back to create if the run does not exist

This solution makes sense, since read APIs have higher TPS than mutating APIs, compared with the alternative of skipping AccessDenied on create in load_or_create.

In the case where the requester does have Create* authorization (via tags), the current SDK passes tags as expected.

inchara1990 · 2023-07-25T14:51:45Z

Hi @danabens
Apologies for misunderstanding the right & wrong tags in your notebook. I see that

with Run(experiment_name=experiment_name, run_name=run_name, sagemaker_session=sess, tags=default_tags) as run:
    run.log_parameter('optimizer', 'adam')

works for me as well. It is only when I use the load_run() method in the training job that I get the access denied error.

lorenzwalthert · 2023-09-25T13:07:40Z

I also need sagemaker:CreateExperiment to load a run from an existing experiment without tags, and I think that's a problem, since I don't want to give roles permission to create experiments when they are only supposed to use an experiment that already exists.

with sagemaker.experiments.load_run(experiment_name=experiment_name, run_name=run_name) as run:
   pass

I get

botocore.exceptions.ClientError: An error occurred (AccessDeniedException) when calling the CreateExperiment operation: User: arn:aws:sts:::assumed-role/[...]/SageMaker is not authorized to perform: sagemaker:CreateExperiment on resource: arn:aws:sagemaker:eu-central-1::experiment/dev-inference because no identity-based policy allows the sagemaker:CreateExperiment action

inchara1990 added the type: bug label Jul 11, 2023

danabens self-assigned this Jul 18, 2023

danabens removed their assignment Jul 25, 2023

ananth102 mentioned this issue Aug 1, 2023

fix: Supporting tbac in load_run #4039

Merged

9 tasks

knikure linked a pull request Aug 23, 2023 that will close this issue


martinRenou added the component: experiments Relates to the SageMaker Experiements Platform label Sep 21, 2023

akrishna1995 closed this as completed in #4039 Dec 26, 2023
