Jumpstart with Cross-Account Role Assumption Fails with GetObject Access Denial #4043

Closed
mencarellic opened this issue Aug 3, 2023 · 1 comment · Fixed by #4051

mencarellic commented Aug 3, 2023

Describe the bug
When I use JumpStartModel (from sagemaker.jumpstart.model) with a cross-account assumed role obtained via Boto3, the JumpStartModel constructor fails with an AccessDenied error on a GetObject call. The same code works fine without the AssumeRole step. However, due to some compliance requirements, I need to assume the role before running the code.

I can also create a model, endpoint, etc. using pure Boto3, but I lose some of the abstraction, so I'd prefer to use the SageMaker SDK.
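For reference, the pure Boto3 route mentioned above might look roughly like the sketch below. It is only an illustration: `deploy_with_boto3` is a hypothetical helper, and `image_uri` and `model_data_url` are placeholders that JumpStart would normally resolve for you (along with container environment variables, which are omitted here). The client is passed in so it can come from the assumed session.

```python
def deploy_with_boto3(sm_client, name, image_uri, model_data_url, role_arn,
                      instance_type, vpc_config=None):
    """Create a model, endpoint config, and endpoint with the low-level client.

    sm_client is expected to be a Boto3 SageMaker client, for example
    assumed_session.client("sagemaker"). The same name is reused for all
    three resources to keep the sketch short.
    """
    model_kwargs = {
        "ModelName": name,
        "PrimaryContainer": {"Image": image_uri, "ModelDataUrl": model_data_url},
        "ExecutionRoleArn": role_arn,
        "EnableNetworkIsolation": True,
    }
    if vpc_config:
        # Same shape as the SDK's vpc_config: SecurityGroupIds and Subnets.
        model_kwargs["VpcConfig"] = vpc_config
    sm_client.create_model(**model_kwargs)

    sm_client.create_endpoint_config(
        EndpointConfigName=name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
        }],
    )
    return sm_client.create_endpoint(EndpointName=name, EndpointConfigName=name)
```

This is the abstraction the SDK hides: the image URI, model artifact location, and endpoint configuration all have to be supplied by hand.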

To reproduce
Construct a JumpStartModel while passing a sagemaker.Session built from an assumed-role Boto3 session. I am using this script:

import boto3
import sagemaker
from sagemaker.jumpstart.model import JumpStartModel

sagemaker_role = "arn:aws:iam::123456789012:role/sagemaker-role"
session_name = "AssumedRoleSession"
region = "us-west-2"
model_id = "huggingface-text2text-flan-t5-xxl-fp16"
model_version = "*"

sts_client = boto3.client("sts")
response = sts_client.assume_role(RoleArn=sagemaker_role, RoleSessionName=session_name)
assumed_session = boto3.Session(
    aws_access_key_id=response["Credentials"]["AccessKeyId"],
    aws_secret_access_key=response["Credentials"]["SecretAccessKey"],
    aws_session_token=response["Credentials"]["SessionToken"],
    region_name=region
)

sagemaker_session = sagemaker.Session(boto_session=assumed_session)
model = JumpStartModel(
    model_id=model_id,
    model_version=model_version,
    region=region,
    role=sagemaker_role,
    vpc_config={
        "SecurityGroupIds": [
            "sg-00000000000000000",
        ],
        "Subnets": [
            "subnet-00000000000000000",
            "subnet-11111111111111111",
        ],
    },
    sagemaker_session=sagemaker_session,
    enable_network_isolation=True,
)

model.deploy(
    initial_instance_count=1,
    instance_type='ml.g5.12xlarge',
    endpoint_name='summary-v1-2023-08-03T00-00-00Z',
)

Expected behavior
I expect a model, endpoint configuration, and endpoint to be created based on the parameters specified.

Screenshots or logs
Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 7, in create_jumpstart_model
  File "/usr/local/lib/python3.11/site-packages/sagemaker/jumpstart/model.py", line 266, in __init__
    if not _is_valid_model_id_hook():
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sagemaker/jumpstart/model.py", line 259, in _is_valid_model_id_hook
    return is_valid_model_id(
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sagemaker/jumpstart/utils.py", line 592, in is_valid_model_id
    models_manifest_list = accessors.JumpStartModelsAccessor._get_manifest(region=region)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sagemaker/jumpstart/accessors.py", line 97, in _get_manifest
    return JumpStartModelsAccessor._cache.get_manifest()  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sagemaker/jumpstart/cache.py", line 342, in get_manifest
    manifest_dict = self._s3_cache.get(
                    ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sagemaker/utilities/cache.py", line 103, in get
    self.put(key)
  File "/usr/local/lib/python3.11/site-packages/sagemaker/utilities/cache.py", line 126, in put
    value = self._retrieval_function(  # type: ignore
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sagemaker/jumpstart/cache.py", line 323, in _retrieval_function
    formatted_body, etag = self._get_json_file(s3_key, file_type)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sagemaker/jumpstart/cache.py", line 266, in _get_json_file
    file_content, etag = self._get_json_file_and_etag_from_s3(key)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/sagemaker/jumpstart/cache.py", line 243, in _get_json_file_and_etag_from_s3
    response = self._s3_client.get_object(Bucket=self.s3_bucket_name, Key=key)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/botocore/client.py", line 535, in _api_call
    return self._make_api_call(operation_name, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/botocore/client.py", line 980, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied
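The failing GetObject call is made through `self._s3_client` inside the JumpStart cache, not through a client created in the script. The traceback does not show which credentials that client uses. If it were built from the default credential chain rather than the assumed session (an assumption, not something the traceback confirms), the two identities would differ. A minimal sketch for comparing them, using the hypothetical helpers `caller_arn` and `same_identity`:

```python
def caller_arn(boto_session):
    """Return the ARN that this Boto3 session's credentials resolve to."""
    return boto_session.client("sts").get_caller_identity()["Arn"]


def same_identity(session_a, session_b):
    """True if both sessions resolve to the same caller ARN."""
    return caller_arn(session_a) == caller_arn(session_b)


# Usage, with the names from the script above:
#   same_identity(boto3.Session(), assumed_session)
# False means the default credentials and the assumed role are different
# principals, so a client built from either one may see different S3 access.
```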

If I create an S3 client from the same assumed session and fetch what I assume is the manifest object (s3://jumpstart-cache-prod-us-west-2/models_manifest.json, though it may not be the exact one), the download succeeds:

>>> s3_client = assumed_session.client("s3", region_name=region)
>>> s3_bucket = 'jumpstart-cache-prod-us-west-2'
>>> s3_key = 'models_manifest.json'
>>> local_file_path = 'models_manifest.json'
>>> s3_client.download_file(s3_bucket, s3_key, local_file_path)
>>> with open(local_file_path, 'r') as file:
...     first_10_lines = ''.join([next(file) for _ in range(10)])
...
>>> print(first_10_lines)
[
    {
        "model_id": "autogluon-classification-ensemble",
        "version": "1.1.1",
        "min_version": "2.103.0",
        "spec_key": "community_models/autogluon-classification-ensemble/specs_v1.1.1.json"
    },
    {
        "model_id": "autogluon-classification-ensemble",
        "version": "1.1.0",
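The check above uses `download_file`, while the traceback fails in `get_object(Bucket=..., Key=...)`. A sketch that mirrors the failing call more closely is shown below. `read_manifest_entries` is a hypothetical helper, and the bucket and key are the ones assumed above, not confirmed to be what the SDK requests.

```python
import json


def read_manifest_entries(s3_client, bucket, key, limit=2):
    """Fetch a JSON list from S3 with get_object and return its first entries."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    entries = json.loads(response["Body"].read())
    return entries[:limit]


# Usage, with the names from the session above:
#   s3_client = assumed_session.client("s3", region_name=region)
#   read_manifest_entries(
#       s3_client, "jumpstart-cache-prod-us-west-2", "models_manifest.json"
#   )
```

If this succeeds with the assumed session's client while the SDK call still fails, the difference is more likely in which client the SDK uses than in the role's S3 permissions.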

System information

  • SageMaker Python SDK version: 2.174.0
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans):
  • Framework version:
  • Python version: 3.11
  • CPU or GPU: GPU
  • Custom Docker image (Y/N): N

evakravi (Member) commented Aug 7, 2023

A PR has been raised with a fix. In the meantime, to get unblocked, you can run pip install git+https://github.com/evakravi/sagemaker-python-sdk@fix/js-cache-s3-client to use a working version of sagemaker.

knikure linked a pull request on Aug 8, 2023 that will close this issue