SageMaker experiments not being created within training job when mandated to have tags #3989
Comments
This is a bug in the SDK. The SDK tries to create the experiment before loading it. If it encounters a 'resource already exists' exception, it loads the experiment. However, because of the tag condition set in the IAM policy, the create call fails with an access-denied error instead of a 'resource already exists' exception. Therefore, |
We need to update the logic in |
If the tag is passed on create as expected the response should be
|
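The create-then-load behavior described in this comment can be sketched roughly as follows. This is a minimal, self-contained illustration and not the SDK's actual code: `FakeClientError`, `create_then_load`, and the stub functions are hypothetical names, and the error codes (`ValidationException` with an "already exists" message, `AccessDeniedException`) are assumptions about what the service returns.

```python
class FakeClientError(Exception):
    """Hypothetical stand-in for a boto-style client error with a response dict."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.response = {"Error": {"Code": code, "Message": message}}


def create_then_load(create, load, name):
    """Create first; fall back to load only on an 'already exists' error."""
    try:
        return create(name)
    except FakeClientError as err:
        error = err.response["Error"]
        # Assumed shape of the 'resource already exists' error.
        already_exists = (
            error["Code"] == "ValidationException"
            and "already exists" in error["Message"]
        )
        if not already_exists:
            raise  # an access-denied error takes this path and is never handled
        return load(name)


def duplicate_create(name):
    # Simulates a create call for an experiment that already exists.
    raise FakeClientError("ValidationException", f"Experiment {name} already exists")


def denied_create(name):
    # Simulates IAM rejecting the create call because of the tag condition.
    raise FakeClientError("AccessDeniedException", "not authorized to create")


def load(name):
    # Simulates a successful load of an existing experiment.
    return {"ExperimentName": name}
```

With `duplicate_create` the fallback to `load` succeeds, while with `denied_create` the error propagates even though the experiment could have been loaded, which matches the failure reported in this issue.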
I went through to reproduce this and here are some of the probable root causes:
Tags are propagated as expected from the SDK to API calls, so that's not the issue. See the attached policy and notebook, which may help in troubleshooting. For more debugging help, provide:
|
Hi Dana,
It seems like the |
There are 3 conditions in the notebook:
Ya, we can add support for this scenario "If a caller doesn't have authorization to call Create* but does have authorization to Describe* then they should still be able to successfully call
This solution makes sense since read APIs have higher TPS than mutate APIs, versus the alternative solution of skip-AccessDenied-on-create.
In the case where the requestor does have Create* authorization (via tags), the current SDK passes tags as expected. |
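A describe-first flow like the one proposed here might look like the sketch below. It is an illustration under stated assumptions rather than the SDK's implementation: `FakeClientError`, `load_then_create`, the in-memory `store`, and the stub functions are hypothetical, and `ResourceNotFound` is assumed to be the error code returned when the experiment does not exist.

```python
class FakeClientError(Exception):
    """Hypothetical stand-in for a boto-style client error with a response dict."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.response = {"Error": {"Code": code, "Message": message}}


def load_then_create(load, create, name, tags=None):
    """Try the read call first; create only when the resource is missing."""
    try:
        return load(name)
    except FakeClientError as err:
        # Assumed error code for a missing experiment; anything else is re-raised.
        if err.response["Error"]["Code"] != "ResourceNotFound":
            raise
    return create(name, tags or [])


# In-memory stand-in for the service, holding one pre-created, tagged experiment.
store = {
    "exp-a": {"ExperimentName": "exp-a", "Tags": [{"Key": "team", "Value": "test"}]}
}


def load(name):
    if name not in store:
        raise FakeClientError("ResourceNotFound", f"{name} not found")
    return store[name]


def create(name, tags):
    store[name] = {"ExperimentName": name, "Tags": tags}
    return store[name]


def denied_create(name, tags):
    # Simulates a caller with Describe* but without Create* authorization.
    raise FakeClientError("AccessDeniedException", "not authorized to create")
```

Because the read call runs first, a caller who can only describe still gets the existing experiment (`load_then_create(load, denied_create, "exp-a")`), and a caller who can create still has the tags forwarded on the create path.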
Hi @danabens |
I also need
I get
|
Describe the bug
We have multiple data science teams onboarded on SageMaker Studio. We mandate that all users tag SageMaker resources in order to enable tag-based access control. However, when using SageMaker Experiments with a PyTorch training job, the job fails with an 'Access Denied' error at the load_run() method, as it seems that the required tags are not being passed on to the experiment object creation. Even if an experiment with the right tags exists, the Experiment._load_or_create method in the SDK first tries to create the experiment object and attempts to load it only if it encounters a 'resource already exists' exception. In my case it fails with an 'Access Denied' error, which is not caught. I can confirm that this happens only at the training job level; I am able to create both the experiment and the trial using their specific create methods.
To reproduce
The SageMaker execution role should have the policy below in addition to the sagemaker:AddTags permission on all resources. Tag the execution role itself with the key 'team' and the value 'test' to allow for the principalTag comparison.
The following works
The below throws an 'Access Denied' exception
Training job script which throws the error
Expected behavior
I was expecting the tags to be passed on to the experiment and trial creation within the SageMaker job. Or, if the guidance is to make sure that the experiment and trial already exist, then the SDK should load the experiment regardless of the error thrown by create in the try block.
Screenshots or logs
If applicable, add screenshots or logs to help explain your problem.
Error stack trace
System information
A description of your system. Please provide: