Add support for additional files #494

Closed
wants to merge 9 commits into from
Changes from 7 commits
1 change: 1 addition & 0 deletions CHANGELOG.rst
@@ -14,6 +14,7 @@ CHANGELOG
* feature: HyperparameterTuner: Make input channels optional
* feature: Add support for Chainer 5.0
* feature: Estimator: add support for MetricDefinitions
* feature: source_dir accepts a list of directories
Contributor

I still haven't heard an explanation for why this is useful. Why can't users just stage their files properly in a single source dir before creating an estimator?

Contributor

From someone on the outside looking in, this looks confusing and unneeded. I understand there's probably a (pressing) reason why we want to do this, but it feels like a hack that we're adding to the Python SDK to overcome a problem in another system.

Contributor

I am also against this change, for the same reason Jonathan mentioned.

Contributor

Users could stage the files before calling the estimator. They could also copy their own code to S3 and not use the estimator. Or they could build their own Docker container and make boto calls instead of using the SDK at all. The point of this project is to make things easier.
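
For comparison, the manual staging discussed in this thread could be sketched roughly as follows. `stage_source_dirs` is a hypothetical helper, not part of the SDK; it assumes Python 3.8+ (for `dirs_exist_ok`) and assumes that, on a name collision, a later directory may overwrite files from an earlier one.

```python
import os
import shutil
import tempfile


def stage_source_dirs(source_dirs, staging_dir=None):
    """Copy the contents of several directories into one staging directory.

    Hypothetical helper for illustration only. On a name collision, files
    from later directories overwrite files from earlier ones.
    """
    staging_dir = staging_dir or tempfile.mkdtemp(prefix='sourcedir-')
    for source in source_dirs:
        for name in os.listdir(source):
            src = os.path.join(source, name)
            dst = os.path.join(staging_dir, name)
            if os.path.isdir(src):
                # dirs_exist_ok requires Python 3.8 or newer.
                shutil.copytree(src, dst, dirs_exist_ok=True)
            else:
                shutil.copy2(src, dst)
    return staging_dir
```

The returned path could then be passed as a single `source_dir` when creating the estimator.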


1.14.2
======
8 changes: 4 additions & 4 deletions src/sagemaker/chainer/README.rst
@@ -145,10 +145,10 @@ Optional arguments

The following are optional arguments. When you create a ``Chainer`` object, you can specify these as keyword arguments.

- ``source_dir`` Path (absolute or relative) to a directory with any
other training source code dependencies including the entry point
file. Structure within this directory will be preserved when training
on SageMaker.
- ``source_dir`` Single path (absolute or relative) or a list of paths
to directories with any other training source code dependencies
aside from the entry point file (default: None). The structures
within this directories are preserved when training on Amazon SageMaker.
- ``hyperparameters`` Hyperparameters that will be used for training.
Will be made accessible as a dict[str, str] to the training code on
SageMaker. For convenience, accepts other types besides str, but
6 changes: 3 additions & 3 deletions src/sagemaker/chainer/estimator.py
@@ -64,9 +64,9 @@ def __init__(self, entry_point, use_mpi=None, num_processes=None, process_slots_
set to the number of GPUs on the instance (on GPU instances), or one (on CPU instances).
additional_mpi_options (str): String of options to the 'mpirun' command used to run the entry point.
For example, '-X NCCL_DEBUG=WARN' will pass that option string to the mpirun command.
source_dir (str): Path (absolute or relative) to a directory with any other training
source code dependencies aside from tne entry point file (default: None). Structure within this
directory are preserved when training on Amazon SageMaker.
source_dir (str or [str]): Single path (absolute or relative) or a list of paths to directories with
Contributor

Let's be more explicit and use list[str] for lists. Also, I'd change it to "A single path".

any other training source code dependencies aside from the entry point file (default: None).
Contributor

I'd change this line to "any source code (other than the entry point file) needed for training"

The structures within this directories are preserved when training on Amazon SageMaker.
Contributor

Should we include an explanation of how the directories are laid out and how to access each of them? I assume (but am still reading the PR) that it'll be something like:

| base dir from unpacking the tar
| - source dir 1
| - source dir 2
| - etc.

That's my reading of the docstring, but it'd be good to be explicit about it.

Contributor Author

Yeah, more documentation and examples would be helpful.
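
One way to check the layout question is a small local experiment. The sketch below is not the SDK's code: it assumes the archive step adds every listed path under its basename (an assumption about `create_tar_file`, whose body is not shown in this diff) and that the listed paths are the children of each source directory, as `_list_root_files` in this PR yields them. `tar_root_entries` and `children` are hypothetical names.

```python
import os
import tarfile
import tempfile


def children(directory):
    """Return the full paths of the immediate children of a directory."""
    return [os.path.join(directory, name) for name in os.listdir(directory)]


def tar_root_entries(source_files):
    """Archive each path under its basename and return the top-level names."""
    fd, tar_path = tempfile.mkstemp(suffix='.tar.gz')
    os.close(fd)
    try:
        with tarfile.open(tar_path, mode='w:gz') as tar:
            for path in source_files:
                # Assumption: only the basename is kept, so the parent
                # directory's own name never appears in the archive.
                tar.add(path, arcname=os.path.basename(path))
        with tarfile.open(tar_path, mode='r:gz') as tar:
            return sorted(name for name in tar.getnames() if '/' not in name)
    finally:
        os.remove(tar_path)
```

Under those assumptions, the children of every source directory end up side by side at the archive root rather than under one folder per source directory, and entries with the same name from different directories would collide.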

hyperparameters (dict): Hyperparameters that will be used for training (default: None).
The hyperparameters are made accessible as a dict[str, str] to the training code on SageMaker.
For convenience, this accepts other types for keys and values, but ``str()`` will be called
17 changes: 13 additions & 4 deletions src/sagemaker/estimator.py
@@ -632,9 +632,9 @@ def __init__(self, entry_point, source_dir=None, hyperparameters=None, enable_cl
Args:
entry_point (str): Path (absolute or relative) to the Python source file which should be executed
as the entry point to training. This should be compatible with either Python 2.7 or Python 3.5.
source_dir (str): Path (absolute or relative) to a directory with any other training
source code dependencies aside from tne entry point file (default: None). Structure within this
directory are preserved when training on Amazon SageMaker.
source_dir (str or [str]): Single path (absolute or relative) or a list of paths to directories with
any other training source code dependencies aside from the entry point file (default: None).
The structures within this directories are preserved when training on Amazon SageMaker.
hyperparameters (dict): Hyperparameters that will be used for training (default: None).
The hyperparameters are made accessible as a dict[str, str] to the training code on SageMaker.
For convenience, this accepts other types for keys and values, but ``str()`` will be called
@@ -651,6 +651,14 @@ def __init__(self, entry_point, source_dir=None, hyperparameters=None, enable_cl
**kwargs: Additional kwargs passed to the ``EstimatorBase`` constructor.
"""
super(Framework, self).__init__(**kwargs)

if isinstance(source_dir, list):
self.source_dir = source_dir[0]
self._additional_files = source_dir[1:]
else:
self.source_dir = source_dir
self._additional_files = []

self.source_dir = source_dir
self.entry_point = entry_point
if enable_cloudwatch_metrics:
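
The list handling in the hunk above can be isolated into a small function for discussion. This is a sketch only; `split_source_dir` is a hypothetical name, and the empty-list and tuple handling are additions that the PR does not contain. As rendered here, the unconditional `self.source_dir = source_dir` after the new branch still appears in the hunk; if it is indeed retained, it would replace the first list element with the whole list again.

```python
def split_source_dir(source_dir):
    """Split source_dir into (primary_dir, additional_dirs).

    Accepts None, a single path, or a list/tuple of paths. Hypothetical
    helper; the PR performs this split inline in __init__.
    """
    if isinstance(source_dir, (list, tuple)):
        if not source_dir:
            # The inline version would raise IndexError on source_dir[0].
            return None, []
        return source_dir[0], list(source_dir[1:])
    return source_dir, []
```

Keeping the split in one place would also let `Framework` and `FrameworkModel` share it instead of repeating the branch.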
@@ -718,7 +726,8 @@ def _stage_user_code_in_s3(self):
bucket=code_bucket,
s3_key_prefix=code_s3_prefix,
script=self.entry_point,
directory=self.source_dir)
directory=self.source_dir,
additional_files=self._additional_files)

def _model_source_dir(self):
"""Get the appropriate value to pass as source_dir to model constructor on deploying
46 changes: 29 additions & 17 deletions src/sagemaker/fw_utils.py
@@ -107,7 +107,7 @@ def validate_source_dir(script, directory):
return True


def tar_and_upload_dir(session, bucket, s3_key_prefix, script, directory):
def tar_and_upload_dir(session, bucket, s3_key_prefix, script, directory, additional_files=None):
"""Pack and upload source files to S3 only if directory is empty or local.

Note:
@@ -118,31 +118,43 @@ def tar_and_upload_dir(session, bucket, s3_key_prefix, script, directory):
bucket (str): S3 bucket to which the compressed file is uploaded.
s3_key_prefix (str): Prefix for the S3 key.
script (str): Script filename.
directory (str): Directory containing the source file. If it starts with "s3://", no action is taken.
directory (str or None): Directory containing the source file. If it starts with "s3://", no action is taken.

Contributor

Add additional_files to the docstring.

Returns:
sagemaker.fw_utils.UserCode: An object with the S3 bucket and key (S3 prefix) and script name.
"""
if directory:
if directory.lower().startswith("s3://"):
return UploadedCode(s3_prefix=directory, script_name=os.path.basename(script))
else:
script_name = script
source_files = [os.path.join(directory, name) for name in os.listdir(directory)]
key = '%s/sourcedir.tar.gz' % s3_key_prefix

if directory and directory.lower().startswith("s3://"):
Contributor

Use single quotes for the string.

return UploadedCode(s3_prefix=directory, script_name=os.path.basename(script))
else:
# If no directory is specified, the script parameter needs to be a valid relative path.
os.path.exists(script)
script_name = os.path.basename(script)
source_files = [script]
source_files = _list_root_files(script, directory, additional_files)
_upload_code(session, bucket, key, source_files)

script_name = script if directory else os.path.basename(script)
return UploadedCode(s3_prefix='s3://%s/%s' % (bucket, key), script_name=script_name)

s3 = session.resource('s3')
key = '{}/{}'.format(s3_key_prefix, 'sourcedir.tar.gz')

def _upload_code(session, bucket, key, source_files):
tar_file = sagemaker.utils.create_tar_file(source_files)
s3.Object(bucket, key).upload_file(tar_file)
os.remove(tar_file)

return UploadedCode(s3_prefix='s3://{}/{}'.format(bucket, key), script_name=script_name)
try:
session.resource('s3').Object(bucket, key).upload_file(tar_file)
finally:
os.remove(tar_file)
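
The `try`/`finally` in `_upload_code` matters when the upload fails: without it, an exception from the upload call would skip the cleanup and leave the temporary archive on disk. A minimal sketch of the same pattern, with the S3 call replaced by an arbitrary callable (`upload_and_cleanup` and its `upload` parameter are hypothetical names, not SDK API):

```python
import os


def upload_and_cleanup(tar_file, upload):
    """Call upload(tar_file), then delete the local file even if it raised."""
    try:
        upload(tar_file)
    finally:
        # Runs on success and on failure; the exception still propagates.
        os.remove(tar_file)
```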


def _list_root_files(script, directory, additional_files):
Contributor

This is not independently unit tested. This logic is confusing (I'm not sure I completely understand it). I suggest unit testing this and providing some developer documentation describing the contract.

Contributor Author

I am personally against testing private methods: it adds unnecessary coupling to non-public signatures and does not give a good picture of the functionality.

I do agree that testing should be extensive and cover any edge cases. The tests that I wrote are here: https://github.com/aws/sagemaker-python-sdk/pull/494/files#diff-3108f99e19f25f4c77ad4f63d486b174R147

Let me know if you have any suggestions for improving these methods.

additional_files = additional_files or []
basedir = directory if directory else os.path.dirname(script)
files = [basedir] + additional_files

for file in files:
if os.path.isfile(file):
yield file
else:
for name in os.listdir(file):
yield os.path.join(file, name)
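
To make the contract of `_list_root_files` easier to discuss, here is a standalone sketch that can be run against temporary directories. It follows the logic shown above with two labeled deviations: the loop variable is renamed so it does not shadow a builtin name, and directory listings are sorted for deterministic output. `list_root_files` is a hypothetical public name used only for this illustration.

```python
import os


def list_root_files(script, directory=None, additional_files=None):
    """Yield the paths that would be placed at the root of the archive.

    - The base is `directory` if given, otherwise the script's parent dir.
    - Each entry that is a file is yielded as is.
    - Each entry that is a directory is expanded to its immediate children.
    """
    additional_files = additional_files or []
    basedir = directory if directory else os.path.dirname(script)
    for path in [basedir] + list(additional_files):
        if os.path.isfile(path):
            yield path
        else:
            # Sorting is an addition for reproducibility, not in the PR.
            for name in sorted(os.listdir(path)):
                yield os.path.join(path, name)
```

Two consequences seem worth documenting if this reading is right: with no `directory`, every sibling of the entry point is included rather than the script alone, and a bare file name such as `train.py` makes `os.path.dirname` return an empty string, which `os.listdir` rejects.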


def framework_name_from_image(image_name):
13 changes: 9 additions & 4 deletions src/sagemaker/model.py
@@ -136,9 +136,8 @@ def __init__(self, model_data, image, role, entry_point, source_dir=None, predic
role (str): An IAM role name or ARN for SageMaker to access AWS resources on your behalf.
entry_point (str): Path (absolute or relative) to the Python source file which should be executed
as the entry point to model hosting. This should be compatible with either Python 2.7 or Python 3.5.
source_dir (str): Path (absolute or relative) to a directory with any other training
source code dependencies aside from tne entry point file (default: None). Structure within this
directory will be preserved when training on SageMaker.
source_dir (str or [str]): Single path (absolute or relative) or a list of paths to directories with
any other training source code dependencies aside from the entry point file (default: None).
If the directory points to S3, no code will be uploaded and the S3 location will be used instead.
predictor_cls (callable[string, sagemaker.session.Session]): A function to call to create
a predictor (default: None). If not None, ``deploy`` will return the result of invoking
@@ -158,8 +157,14 @@ def __init__(self, model_data, image, role, entry_point, source_dir=None, predic
"""
super(FrameworkModel, self).__init__(model_data, image, role, predictor_cls=predictor_cls, env=env, name=name,
sagemaker_session=sagemaker_session, **kwargs)
if isinstance(source_dir, list):
self.source_dir = source_dir[0]
self._additional_files = source_dir[1:]
else:
self.source_dir = source_dir
self._additional_files = []

self.entry_point = entry_point
self.source_dir = source_dir
self.enable_cloudwatch_metrics = enable_cloudwatch_metrics
self.container_log_level = container_log_level
if code_location:
8 changes: 4 additions & 4 deletions src/sagemaker/mxnet/README.rst
@@ -267,10 +267,10 @@ Optional arguments

The following are optional arguments. When you create an ``MXNet`` object, you can specify these as keyword arguments.

- ``source_dir`` Path (absolute or relative) to a directory with any
other training source code dependencies including the entry point
file. Structure within this directory will be preserved when training
on SageMaker.
- ``source_dir`` Single path (absolute or relative) or a list of paths
to directories with any other training source code dependencies
aside from the entry point file (default: None). The structures
within this directories are preserved when training on Amazon SageMaker.
- ``hyperparameters`` Hyperparameters that will be used for training.
Will be made accessible as a dict[str, str] to the training code on
SageMaker. For convenience, accepts other types besides str, but
6 changes: 3 additions & 3 deletions src/sagemaker/mxnet/estimator.py
@@ -50,9 +50,9 @@ def __init__(self, entry_point, source_dir=None, hyperparameters=None, py_versio
Args:
entry_point (str): Path (absolute or relative) to the Python source file which should be executed
as the entry point to training. This should be compatible with either Python 2.7 or Python 3.5.
source_dir (str): Path (absolute or relative) to a directory with any other training
source code dependencies aside from tne entry point file (default: None). Structure within this
directory are preserved when training on Amazon SageMaker.
source_dir (str or [str]): Single path (absolute or relative) or a list of paths to directories with
any other training source code dependencies aside from the entry point file (default: None).
The structures within this directories are preserved when training on Amazon SageMaker.
hyperparameters (dict): Hyperparameters that will be used for training (default: None).
The hyperparameters are made accessible as a dict[str, str] to the training code on SageMaker.
For convenience, this accepts other types for keys and values, but ``str()`` will be called
8 changes: 4 additions & 4 deletions src/sagemaker/pytorch/README.rst
@@ -171,10 +171,10 @@ Optional arguments

The following are optional arguments. When you create a ``PyTorch`` object, you can specify these as keyword arguments.

- ``source_dir`` Path (absolute or relative) to a directory with any
other training source code dependencies including the entry point
file. Structure within this directory will be preserved when training
on SageMaker.
- ``source_dir`` Single path (absolute or relative) or a list of paths
to directories with any other training source code dependencies
aside from the entry point file (default: None). The structures
within this directories are preserved when training on Amazon SageMaker.
- ``hyperparameters`` Hyperparameters that will be used for training.
Will be made accessible as a dict[str, str] to the training code on
SageMaker. For convenience, accepts other types besides strings, but
6 changes: 3 additions & 3 deletions src/sagemaker/pytorch/estimator.py
@@ -47,9 +47,9 @@ def __init__(self, entry_point, source_dir=None, hyperparameters=None, py_versio
Args:
entry_point (str): Path (absolute or relative) to the Python source file which should be executed
as the entry point to training. This should be compatible with either Python 2.7 or Python 3.5.
source_dir (str): Path (absolute or relative) to a directory with any other training
source code dependencies aside from tne entry point file (default: None). Structure within this
directory are preserved when training on Amazon SageMaker.
source_dir (str or [str]): Single path (absolute or relative) or a list of paths to directories with
any other training source code dependencies aside from the entry point file (default: None).
The structures within this directories are preserved when training on Amazon SageMaker.
Contributor

these directories

hyperparameters (dict): Hyperparameters that will be used for training (default: None).
The hyperparameters are made accessible as a dict[str, str] to the training code on SageMaker.
For convenience, this accepts other types for keys and values, but ``str()`` will be called
8 changes: 4 additions & 4 deletions src/sagemaker/tensorflow/README.rst
@@ -405,10 +405,10 @@ Optional Arguments
The following are optional arguments. When you create a ``TensorFlow`` object,
you can specify these as keyword arguments.

- ``source_dir (str)`` Path (absolute or relative) to a directory with any
other training source code dependencies including the entry point
file. Structure within this directory will be preserved when training
on SageMaker.
- ``source_dir (str)`` Single path (absolute or relative) or a list of paths
to directories with any other training source code dependencies
aside from the entry point file (default: None). The structures
within this directories are preserved when training on Amazon SageMaker.
- ``requirements_file (str)`` Path to a ``requirements.txt`` file. The path should
be within and relative to ``source_dir``. This is a file containing a list of items to be
installed using pip install. Details on the format can be found in the