Skip to content

SageMaker experiments not being created within training job when mandated to have tags #3989

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
inchara1990 opened this issue Jul 11, 2023 · 8 comments · Fixed by #4039
Closed
Labels
component: experiments Relates to the SageMaker Experiements Platform type: bug

Comments

@inchara1990
Copy link

inchara1990 commented Jul 11, 2023

Describe the bug
We have multiple data science teams onboarded on sagemaker studio. We mandate all users to tag sagemaker resources in order to enable tag based access control. However, when using sagemaker experiments with a pytorch training job, the job fails because of 'Access Denied Error' at load_run() method as it seems that the required tags are not being passed on to the experiment object creation. Even if the experiment with the right tags exist it seems that the Experiment._load_or_create method in the SDK first tries to create the experiment object and then attempts to load if it only encounters a 'resource already exists' exception. But in my case it fails on 'Access Denied' error and hence is not caught. I confirm that this happening only at the training job level and I am able to create both experiment and trial using their specific create methods.

To reproduce
Sagemaker execution role should have the below policy in addition to sagemaker:AddTags permission on all resources. Tag the execution role itself with key as 'team' and a 'test' value to allow for principalTag comparison.

[
   {
    "Condition": {
      "StringEquals": {
        "aws:RequestTag/team": "${aws:PrincipalTag/team}"
      }
    },
    "Action": "sagemaker:Create*",
    "Resource": "*",
    "Effect": "Allow"
  },
  {
    "Condition": {
      "StringEquals": {
        "aws:ResourceTag/key": "${aws:PrincipalTag/team}"
      }
    },
    "Effect": "Allow",
    "Action": "sagemaker:CreateTrial",
    "Resource": ["arn:aws:sagemaker:*:*:experiment/*"]
  }
]

The following works

 from smexperiments import experiment
 from smexperiments.trial import Trial
 default_tags = [{'Key': 'team', 'Value': 'test'}]
 experiment = experiment.Experiment.create(experiment_name='MNIST',tags=default_tags) 
 trial = Trial.create(trial_name="linear-learner2",experiment_name="MNIST",tags=default_tags)

The below throws an 'Access Denied' exception

   with Run(experiment_name='MNIST', run_name=run_name, sagemaker_session=sess,tags=default_tags) as run:
        pytorch_estimator = PyTorch(entry_point ='train_script_expt.py.py',
                                                         ....
                                                         tags = default_tags)
        pytorch_estimator.fit({ 'train': trainpath,
                                              'test': testpath })

Training job script which throws the error

     with load_run(sagemaker_session=session) as run:
        run.log_parameters(
            { "device":device, "epochs":args.epochs}
        )

Expected behavior
I was expecting that the tags to be passed on to the experiment and trial creation within the sagemaker job. Or if the guidance is to make sure that experiment and trial already exists - then the SDK should load the experiment regardless of the error thrown upon create in the try block.

Screenshots or logs
If applicable, add screenshots or logs to help explain your problem.
error stacktrack

Traceback (most recent call last):
  File "train_script_expt.py", line 190, in <module>
with load_run(sagemaker_session=session) as run:
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/experiments/run.py", line 847, in load_run
run_instance = Run(
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/experiments/run.py", line 177, in __init__
self._experiment = Experiment._load_or_create(
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/experiments/experiment.py", line 171, in _load_or_create
raise ce
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/experiments/experiment.py", line 160, in _load_or_create
    experiment = Experiment.create(
File "/opt/conda/lib/python3.8/site-packages/sagemaker/experiments/experiment.py", line 120, in create
return cls._construct(
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/apiutils/_base_types.py", line 190, in _construct
    return instance._invoke_api(boto_method_name, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/sagemaker/apiutils/_base_types.py", line 226, in _invoke_api
api_boto_response = api_method(**api_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/botocore/client.py", line 534, in _api_call
return self._make_api_call(operation_name, kwargs)
  File "/opt/conda/lib/python3.8/site-packages/botocore/client.py", line 976, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDeniedException) when calling the CreateExperiment operation: User: arn:aws:sts::XXXX:assumed-role/aimlSagemakerStudio-xyzteam-LTWZ4NF3WWVM/SageMaker is not authorized to perform: sagemaker:CreateExperiment on resource: arn:aws:sagemaker:us-east-2:XXXexperiment/xxxx because no identity-based policy allows the sagemaker:CreateExperiment action

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 2.171.0
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): Pytorch
  • Framework version: 1.12
  • Python version: 3.8
  • CPU or GPU: CPU
  • Custom Docker image (Y/N): N
@leoroy
Copy link
Contributor

leoroy commented Jul 18, 2023

This is a bug in the SDK. The SDK tries to create the experiment before loading it. If it encounters resource existed exception, it will load the experiment. However, because of the tag condition set on the IAM, the create will fail due to access deny instead of resource existed exception. Therefore, load_run() failed.

@leoroy
Copy link
Contributor

leoroy commented Jul 18, 2023

We need to update the logic in load_run() to separate it from with Run(...) experience. load_run() should load first and then fall back to create if the run does not exist

@danabens danabens self-assigned this Jul 18, 2023
@danabens
Copy link
Member

the create will fail due to access deny instead of resource existed exception....
need to update the logic in load_run

If the tag is passed on create as expected the response should be ValidationException-already exists, not AccessDenied. it looks like the root cause is either:

  1. logic is not passing tags as expected to CreateExperiment
  2. customer is passing tag that does not match policy (the tag passed in code does not match the tag on the execution role)
  3. something else (some other policy in effect not shown here)

@danabens
Copy link
Member

I went through to reproduce this and here are some are the probably root causes:

  • Policy changes take a some time (~minutes) and may temporarily revert (i.e. works, works, doesn't work, works again).
  • There is some other policy on the studio role that is denying access.
  • The provided tag does not match the tag on the principal.

Tags are propagated as expected from SDK to API calls, so thats not the issue.

See attached policy and notebook which may help in troubleshooting:

For more debugging help provide:

  • Complete policy detail for the Principal in use.
  • Example notebook with tag value provided to SDK
  • The tags on the Principal.

@inchara1990
Copy link
Author

inchara1990 commented Jul 23, 2023

Hi Dana,
In your notebook, both the following code snippets(cell 97 & 98) result in AccessDenied error even though you pass the tags to the run context. Is that expected?

with Run(experiment_name=experiment_name, run_name=run_name, sagemaker_session=sm_sess) as run: run.log_parameter('optimizer', 'adam')

with Run(experiment_name=experiment_name, run_name=run_name, sagemaker_session=sm_sess, tags=tag_list) as run: run.log_parameter('optimizer', 'adam')

It seems like the _load_or_create method in experiment.py is expecting the tags to be passed to Experiment.create() without which it is not fulfilling the below condition which is necessary to load an experiment-
if not (error_code == "ValidationException" and "already exists" in error_message)
Hence it is not even trying to load the existing experiment but raising an error soon after the above if condition is not fulfilled.
If the tags are being passed why else is Experiment.create() failing with AccessDenied ?

@danabens
Copy link
Member

danabens commented Jul 25, 2023

there are 3 conditions in the notebook:

  1. CreateExperiment with no tags
  2. CreateExperiment with a team tag that does not match the team tag on the Principal
  3. CreateExperiment with a team tag that does match the team tag on the Principal

Ya, we can add support for this scenario "If a caller doesn't have authorization to call Create* but does have authorization to Describe* then they should still be able to successfully call with Run...", however in your issue it seemed like you were expecting to have Create* authorization via the tag policy.

should load first and then fall back to create if the run does not exist

this solution makes sense since read APIs have higher TPS than mutate APIs vs the alternative solution to skip-AccessDenied-on-create in load_or_create solution.

in the case where the requestor does have Create* authorization (via tags) the current sdk passes tags as expected.

@danabens danabens removed their assignment Jul 25, 2023
@inchara1990
Copy link
Author

Hi @danabens
Apologies for misunderstanding the right & wrong tags in your notebook. I see that with Run(experiment_name=experiment_name, run_name=run_name, sagemaker_session=sess, tags=default_tags) as run: run.log_parameter('optimizer', 'adam')
works for me as well. It is only when I use the load_run() method in the training job I get the access denied error.

@knikure knikure linked a pull request Aug 23, 2023 that will close this issue
9 tasks
@martinRenou martinRenou added the component: experiments Relates to the SageMaker Experiements Platform label Sep 21, 2023
@lorenzwalthert
Copy link

I also need sagemaker:CreateExperiment for loading a run from an existing experiment without tags and I think that's a problem, since i don't want to give roles the permission to create experiments when they are supposed to use an experiment that already exists.

with sagemaker.experiments.load_run(experiment_name=experiment_name, run_name=run_name) as run:
   pass 

I get

botocore.exceptions.ClientError: An error occurred (AccessDeniedException) when calling the CreateExperiment operation: User: arn:aws:sts:::assumed-role/[...]/SageMaker is not authorized to perform: sagemaker:CreateExperiment on resource: arn:aws:sagemaker:eu-central-1::experiment/dev-inference because no identity-based policy allows the sagemaker:CreateExperiment action

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component: experiments Relates to the SageMaker Experiements Platform type: bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants