Using boto3 in local mode? #403


Closed
derekhh opened this issue Sep 25, 2018 · 7 comments

Comments

@derekhh
Contributor

derekhh commented Sep 25, 2018


System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): Own Container
  • Framework Version: N/A
  • Python Version: 3.5.2
  • CPU or GPU: CPU
  • Python SDK Version: 1.10.1
  • Are you using a custom image: Yes

Describe the problem


I'm building my own container, which needs to use some Boto3 clients, e.g. to sync some TensorFlow Summary data to S3 and to get a KMS client to decrypt some credentials. The code runs fine in SageMaker, but if I run the same code like:

session = boto3.session.Session(region_name=region_name)
s3 = session.client('s3')

in local mode I always see errors like this:

algo-1-K7O7M_1  | botocore.exceptions.NoCredentialsError: Unable to locate credentials
algo-1-K7O7M_1  |
algo-1-K7O7M_1  | Unable to locate credentials

Is this something that can be fixed in local mode or am I using it wrong?

@mvsusp
Contributor

mvsusp commented Sep 28, 2018

Hello @derekhh,

When local mode starts, it uses boto3.session.get_credentials() to fetch the user credentials and makes them accessible to the container: https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/local/image.py#L630.

Please make sure that boto3.session.get_credentials() is returning the credentials that you need inside the container.
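A quick way to do that check is with a small helper like the one below (a sketch of my own, not part of the SDK; the function name is made up). It reports what the default provider chain resolves, and whether the credentials carry a session token:

```python
def summarize_credentials(session):
    """Describe the credentials a boto session resolves, or return None
    if the provider chain finds nothing. Hypothetical helper for debugging."""
    creds = session.get_credentials()
    if creds is None:
        return None
    # A non-None token means temporary (STS) credentials, e.g. an assumed role.
    kind = "temporary" if creds.token else "long-lived"
    return "%s credentials via %s" % (kind, creds.method)
```

Running `print(summarize_credentials(boto3.session.Session()))` on the host should report long-lived credentials; if it reports temporary credentials or prints `None`, the container will not receive usable keys.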

One more thing: SageMaker also has KMS support: https://aws.amazon.com/about-aws/whats-new/2018/01/aws-kms-based-encryption-is-now-available-in-amazon-sagemaker-training-and-hosting/

Thanks for using SageMaker!

@humanzz
Contributor

humanzz commented Oct 3, 2018

Hi @mvsusp,

I'm afraid I am running into the same issue though I have a different use case where I am using MXNet.

Here are a few code snippets to give you an idea what I am doing:

# This creates the boto session using custom AWS credentials (neither in environment variables nor in ~/.aws/credentials)
boto_session = _boto_session(environment)

# These work
print(boto_session)
print(boto_session.get_credentials())

# Setting up the sagemaker session and role
sagemaker_session = LocalSession(boto_session=boto_session)
role = get_execution_role(sagemaker_session=sagemaker_session)

I then start the training as follows:

m = MXNet("my_file.py",
          source_dir='src',
          role=role,
          train_instance_count=1,
          train_instance_type="local",
          sagemaker_session=sagemaker_session,
          py_version="py3",
          output_path='s3://<some path>',
          code_location='s3://<some path>',
          hyperparameters=hyperparameters)

# Start the training with a dataset in S3
m.fit(training_set)

It fails (I'll paste the trace below), but one thing worth noting is that it succeeded in zipping and uploading the source directory to S3.

The trace is:

WARNING:sagemaker:Couldn't call 'get_role' to get Role ARN from role name <role-name-omitted> to get Role path.
arn:aws:iam::114897356451:role/SageMakerModelTrainingAndExecutionRole
INFO:sagemaker:Creating training-job with name: <job-name-omitted>
Creating tmp2ejz30e_algo-1-57SXP_1 ... done
Attaching to tmp2ejz30e_algo-1-57SXP_1
algo-1-57SXP_1  | 2018-10-03 10:56:43,587 INFO - root - running container entrypoint
algo-1-57SXP_1  | 2018-10-03 10:56:43,587 INFO - root - starting train task
algo-1-57SXP_1  | 2018-10-03 10:56:43,591 INFO - container_support.training - Training starting
algo-1-57SXP_1  | 2018-10-03 10:56:44,160 WARNING - mxnet_container.train - This required structure for training scripts will be deprecated with the next major release of MXNet images. The train() function will no longer be required; instead the training script must be able to be run as a standalone script. For more information, see https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/mxnet#updating-your-mxnet-training-script.
algo-1-57SXP_1  | 2018-10-03 10:56:44,164 INFO - mxnet_container.train - MXNetTrainingEnvironment: {..<values-omitted>..}
algo-1-57SXP_1  | Downloading s3://<some path> to /tmp/script.tar.gz
algo-1-57SXP_1  | 2018-10-03 10:56:44,228 ERROR - container_support.training - uncaught exception during training: Unable to locate credentials
algo-1-57SXP_1  | Traceback (most recent call last):
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/container_support/training.py", line 36, in start
algo-1-57SXP_1  |     fw.train()
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/mxnet_container/train.py", line 181, in train
algo-1-57SXP_1  |     mxnet_env.download_user_module()
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/container_support/environment.py", line 89, in download_user_module
algo-1-57SXP_1  |     cs.download_s3_resource(self.user_script_archive, tmp)
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/container_support/utils.py", line 41, in download_s3_resource
algo-1-57SXP_1  |     script_bucket.download_file(script_key_name, target)
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/boto3/s3/inject.py", line 246, in bucket_download_file
algo-1-57SXP_1  |     ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/boto3/s3/inject.py", line 172, in download_file
algo-1-57SXP_1  |     extra_args=ExtraArgs, callback=Callback)
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/boto3/s3/transfer.py", line 307, in download_file
algo-1-57SXP_1  |     future.result()
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/s3transfer/futures.py", line 73, in result
algo-1-57SXP_1  |     return self._coordinator.result()
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/s3transfer/futures.py", line 233, in result
algo-1-57SXP_1  |     raise self._exception
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/s3transfer/tasks.py", line 255, in _main
algo-1-57SXP_1  |     self._submit(transfer_future=transfer_future, **kwargs)
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/s3transfer/download.py", line 353, in _submit
algo-1-57SXP_1  |     **transfer_future.meta.call_args.extra_args
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/botocore/client.py", line 320, in _api_call
algo-1-57SXP_1  |     return self._make_api_call(operation_name, kwargs)
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/botocore/client.py", line 610, in _make_api_call
algo-1-57SXP_1  |     operation_model, request_dict)
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/botocore/endpoint.py", line 102, in make_request
algo-1-57SXP_1  |     return self._send_request(request_dict, operation_model)
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/botocore/endpoint.py", line 132, in _send_request
algo-1-57SXP_1  |     request = self.create_request(request_dict, operation_model)
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/botocore/endpoint.py", line 116, in create_request
algo-1-57SXP_1  |     operation_name=operation_model.name)
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/botocore/hooks.py", line 356, in emit
algo-1-57SXP_1  |     return self._emitter.emit(aliased_event_name, **kwargs)
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/botocore/hooks.py", line 228, in emit
algo-1-57SXP_1  |     return self._emit(event_name, kwargs)
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/botocore/hooks.py", line 211, in _emit
algo-1-57SXP_1  |     response = handler(**kwargs)
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/botocore/signers.py", line 90, in handler
algo-1-57SXP_1  |     return self.sign(operation_name, request)
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/botocore/signers.py", line 157, in sign
algo-1-57SXP_1  |     auth.add_auth(request)
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/botocore/auth.py", line 424, in add_auth
algo-1-57SXP_1  |     super(S3SigV4Auth, self).add_auth(request)
algo-1-57SXP_1  |   File "/usr/local/lib/python3.5/dist-packages/botocore/auth.py", line 356, in add_auth
algo-1-57SXP_1  |     raise NoCredentialsError
algo-1-57SXP_1  | botocore.exceptions.NoCredentialsError: Unable to locate credentials
algo-1-57SXP_1  | 
algo-1-57SXP_1  | 
tmp2ejz30e_algo-1-57SXP_1 exited with code 1
Aborting on container exit...
Traceback (most recent call last):
  File "<path-omitted>/lib/python3.4/site-packages/sagemaker/local/image.py", line 112, in train
    _stream_output(process)
  File "<path-omitted>/lib/python3.4/site-packages/sagemaker/local/image.py", line 575, in _stream_output
    raise RuntimeError("Process exited with code: %s" % exit_code)
RuntimeError: Process exited with code: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  <path-omitted>
    m.fit(training_set)
  File "<path-omitted>/lib/python3.4/site-packages/sagemaker/estimator.py", line 191, in fit
    self.latest_training_job = _TrainingJob.start_new(self, inputs)
  File "<path-omitted>/lib/python3.4/site-packages/sagemaker/estimator.py", line 417, in start_new
    tags=estimator.tags)
  File "<path-omitted>/lib/python3.4/site-packages/sagemaker/session.py", line 278, in train
    self.sagemaker_client.create_training_job(**train_request)
  File "<path-omitted>/lib/python3.4/site-packages/sagemaker/local/local_session.py", line 75, in create_training_job
    training_job.start(InputDataConfig, hyperparameters)
  File "<path-omitted>/lib/python3.4/site-packages/sagemaker/local/entities.py", line 60, in start
    self.model_artifacts = self.container.train(input_data_config, hyperparameters)
  File "<path-omitted>/lib/python3.4/site-packages/sagemaker/local/image.py", line 117, in train
    raise RuntimeError(msg)
RuntimeError: Failed to run: ['docker-compose', '-f', '/tmp/tmp2ejz_30e/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 1

@iquintero
Contributor

Hi @derekhh and @humanzz

What kind of credentials are you using? Static credentials (access key and secret key), or do you also have a session token? If there is a session token, local mode will NOT pass the credentials to the container. The reason is that session tokens mark short-lived credentials that will expire at some point; if you run a long training job, they can become invalid partway through. On EC2 instances or SageMaker Notebook instances that use an AWS role, the container can instead access the EC2 Metadata Service to fetch its own credentials dynamically.

I think this is most likely what is happening.
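To make that distinction concrete, here is a small sketch (my own illustration, not the SDK's actual code) of the rule described above: static keys are forwarded as environment variables, while credentials carrying a session token are withheld so the container falls back to the EC2 Metadata Service:

```python
def credential_env_vars(access_key, secret_key, token=None):
    """Return the environment variables local mode would inject into the
    container, per the behaviour described above (illustrative only)."""
    if token is not None:
        # A session token marks short-lived credentials that may expire
        # mid-training, so they are not forwarded; the container is expected
        # to fetch its own credentials from the EC2 Metadata Service.
        return []
    return [
        "AWS_ACCESS_KEY_ID=%s" % access_key,
        "AWS_SECRET_ACCESS_KEY=%s" % secret_key,
    ]
```

On a laptop with an assumed role (token present) this yields an empty list, the container gets no keys, and there is no metadata service to fall back to, which matches the NoCredentialsError reported above.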

@humanzz
Contributor

humanzz commented Oct 3, 2018

Hi @iquintero
I've traced this issue to the fact that I am using temporary credentials (assuming a role).
My use case is using local mode on my own machine - neither an EC2 instance nor a SageMaker notebook - so the credentials will not be made available through the provider chain.

I was considering sending a PR changing https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/local/image.py#L638 to enable passing the AWS_SESSION_TOKEN environment variable as well.

I am now trying to think how best to change this such that the sensible default of falling back to the EC2 Metadata Service is respected while also enabling my use case.

Maybe the checks should be something like

if no token
  set the credentials environment variables
else if credentials available through default provider chain
  return none
else
  set the credentials environment variables including the session token

What do you think? If that sounds right to you, I can proceed with that, but I still have to figure out how to check for the presence of credentials via the provider chain.

@humanzz
Contributor

humanzz commented Oct 3, 2018

@iquintero Based on boto/boto3#222 and checking botocore's code, here's a more concrete suggestion

def _aws_credentials(session):
    try:
        creds = session.get_credentials()
        access_key = creds.access_key
        secret_key = creds.secret_key
        token = creds.token

        # The presence of a token indicates short-lived credentials, which are
        # risky to use because they might expire while the job is running.
        # Long-lived credentials are available either through
        # 1. the boto session
        # 2. the EC2 Metadata Service (SageMaker Notebook instances or EC2
        #    instances with roles attached to them)
        # Short-lived credentials from the boto session are permitted, to
        # support running on machines without the EC2 Metadata Service, but a
        # warning is logged about the risk of expiry.
        if token is None:
            logger.info("Using the long-lived AWS credentials found in session")
            return [
                'AWS_ACCESS_KEY_ID=%s' % (str(access_key)),
                'AWS_SECRET_ACCESS_KEY=%s' % (str(secret_key))
            ]
        elif not _aws_credentials_available_in_metadata_service():
            logger.warning("Using the short-lived AWS credentials found in session. They might expire while running.")
            return [
                'AWS_ACCESS_KEY_ID=%s' % (str(access_key)),
                'AWS_SECRET_ACCESS_KEY=%s' % (str(secret_key)),
                'AWS_SESSION_TOKEN=%s' % (str(token))
            ]
        else:
            logger.info("Only short-lived credentials found in session; relying on the EC2 Metadata Service instead.")
            return None
    except Exception as e:
        logger.info('Could not get AWS credentials: %s' % e)

    return None


def _aws_credentials_available_in_metadata_service():
    import botocore.session
    from botocore.credentials import InstanceMetadataProvider
    from botocore.utils import InstanceMetadataFetcher

    session = botocore.session.Session()
    instance_metadata_provider = InstanceMetadataProvider(
        iam_role_fetcher=InstanceMetadataFetcher(
            timeout=session.get_config_variable('metadata_service_timeout'),
            num_attempts=session.get_config_variable('metadata_service_num_attempts'),
            user_agent=session.user_agent())
    )
    return instance_metadata_provider.load() is not None

What do you think?

@iquintero
Contributor

That seems like a great solution, @humanzz: it still works for the regular EC2 Metadata case while allowing your use case to work.

@ChoiByungWook
Contributor

Merged!
