
Commit 37e954a

Merge branch 'master' into loc-config-file
2 parents a90ac58 + 5dc9e58 commit 37e954a


76 files changed: +669 −415 lines changed


CHANGELOG.md

Lines changed: 55 additions & 0 deletions
@@ -1,5 +1,60 @@
 # Changelog

+## v2.25.1 (2021-02-20)
+
+### Bug Fixes and Other Changes
+
+ * Add tests for visualizer to improve test coverage
+
+### Documentation Changes
+
+ * specify correct return type
+
+### Testing and Release Infrastructure
+
+ * rename canary_quick pytest mark to release
+
+## v2.25.0 (2021-02-19)
+
+### Features
+
+ * Enable step caching
+ * Add other Neo supported regions for Inferentia inference images
+
+### Bug Fixes and Other Changes
+
+ * remove FailStep from pipelines
+ * use sagemaker_session in workflow tests
+ * use ECR public for multidatamodel tests
+ * add the mapping from py3 to cuda11 images
+ * Add 30s cap time for tag tests
+ * add build spec for slow tests
+ * mark top 10 slow tests
+ * remove slow test_run_xxx_monitor_baseline tests
+ * pin astroid to 2.4.2
+
+### Testing and Release Infrastructure
+
+ * unmark more flaky integ tests
+ * remove canary_quick pytest mark from flaky/unnecessary tests
+ * remove python3.8 from buildspec
+ * remove py38 tox env
+ * fix release buildspec typo
+ * unblock regional release builds
+ * lower test TPS for experiment analytics
+ * move package preparation and publishing to the deploy step
+
+## v2.24.5 (2021-02-12)
+
+### Bug Fixes and Other Changes
+
+ * test_tag/test_tags method assert fix in association tests
+
+### Documentation Changes
+
+ * removing mention of TF 2.4 from SM distributed model parallel docs
+ * adding details about mpi options, other small updates
+
 ## v2.24.4 (2021-02-09)

 ### Bug Fixes and Other Changes

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-2.24.5.dev0
+2.25.2.dev0

buildspec-deploy.yml

Lines changed: 10 additions & 1 deletion
@@ -3,7 +3,16 @@ version: 0.2
 phases:
   build:
     commands:
-      - PACKAGE_FILE="$CODEBUILD_SRC_DIR_ARTIFACT_1/sagemaker-*.tar.gz"
+      # prepare the release (update versions, changelog etc.)
+      - git-release --prepare
+
+      # generate the distribution package
+      - python3 setup.py sdist
+
+      # publish the release to github
+      - git-release --publish
+
+      - PACKAGE_FILE="dist/sagemaker-*.tar.gz"
       - PYPI_USER=$(aws secretsmanager get-secret-value --secret-id /codebuild/pypi/user --query SecretString --output text)
       - PYPI_PASSWORD=$(aws secretsmanager get-secret-value --secret-id /codebuild/pypi/password --query SecretString --output text)
       - GPG_PRIVATE_KEY=$(aws secretsmanager get-secret-value --secret-id /codebuild/gpg/private_key --query SecretString --output text)

buildspec-localmodetests.yml

Lines changed: 2 additions & 2 deletions
@@ -11,5 +11,5 @@ phases:

       # local mode tests
       - start_time=`date +%s`
-      - execute-command-if-has-matching-changes "tox -e py38 -- tests/integ -m local_mode --durations 50" "tests/integ" "tests/data" "tests/conftest.py" "tests/__init__.py" "src/*.py" "setup.py" "setup.cfg" "buildspec-localmodetests.yml"
-      - ./ci-scripts/displaytime.sh 'py38 local mode' $start_time
+      - execute-command-if-has-matching-changes "tox -e py37 -- tests/integ -m local_mode --durations 50" "tests/integ" "tests/data" "tests/conftest.py" "tests/__init__.py" "src/*.py" "setup.py" "setup.cfg" "buildspec-localmodetests.yml"
+      - ./ci-scripts/displaytime.sh 'py37 local mode' $start_time

buildspec-release.yml

Lines changed: 2 additions & 17 deletions
@@ -3,9 +3,6 @@ version: 0.2
 phases:
   build:
     commands:
-      # prepare the release (update versions, changelog etc.)
-      - git-release --prepare
-
       # run linters
       - tox -e flake8,pylint

@@ -18,19 +15,7 @@ phases:
       # run unit tests
       - AWS_ACCESS_KEY_ID= AWS_SECRET_ACCESS_KEY= AWS_SESSION_TOKEN=
         AWS_CONTAINER_CREDENTIALS_RELATIVE_URI= AWS_DEFAULT_REGION=
-        tox -e py36,py37,py38 -- tests/unit
+        tox -e py36,py37 -- tests/unit

       # run a subset of the integration tests
-      - IGNORE_COVERAGE=- tox -e py36 -- tests/integ -m canary_quick -n 64 --boxed --reruns 2
-
-      # generate the distribution package
-      - python3 setup.py sdist
-
-      # publish the release to github
-      - git-release --publish
-
-artifacts:
-  files:
-    - dist/sagemaker-*.tar.gz
-  name: ARTIFACT_1
-  discard-paths: yes
+      - IGNORE_COVERAGE=- tox -e py36 -- tests/integ -m "not (local_mode or slow_test)" -n 32 --boxed --reruns 2

buildspec-slowtests.yml

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+version: 0.2
+
+phases:
+  pre_build:
+    commands:
+      - start-dockerd
+
+  build:
+    commands:
+      - IGNORE_COVERAGE=-
+
+      # slow tests
+      - start_time=`date +%s`
+      - execute-command-if-has-matching-changes "tox -e py37 -- tests/integ -m slow_test -n 16 --durations 0" "tests/integ" "tests/data" "tests/conftest.py" "tests/__init__.py" "src/*.py" "setup.py" "setup.cfg" "buildspec-slowtests.yml"
+      - ./ci-scripts/displaytime.sh 'py37 slow tests' $start_time

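The new buildspec above selects integration tests by the `slow_test` pytest mark, which buildspec-release.yml and buildspec.yml now exclude from their runs. As a hedged illustration of how a test acquires that mark (the test name below is hypothetical, not taken from this commit):

    import time

    import pytest


    @pytest.mark.slow_test  # custom mark; CI selects these tests with `-m slow_test`
    def test_hypothetical_end_to_end_training_flow():
        """Stand-in for an integration test that takes several minutes."""
        time.sleep(1)  # placeholder for the slow work
        assert True

To keep pytest from warning about an unknown mark, the `slow_test` mark would normally also be registered, for example under `markers` in setup.cfg or pytest.ini.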
buildspec-unittests.yml

Lines changed: 2 additions & 2 deletions
@@ -18,5 +18,5 @@ phases:
       - start_time=`date +%s`
       - AWS_ACCESS_KEY_ID= AWS_SECRET_ACCESS_KEY= AWS_SESSION_TOKEN=
         AWS_CONTAINER_CREDENTIALS_RELATIVE_URI= AWS_DEFAULT_REGION=
-        tox -e py36,py37,py38 --parallel all -- tests/unit
-      - ./ci-scripts/displaytime.sh 'py36,py37,py38 unit' $start_time
+        tox -e py36,py37 --parallel all -- tests/unit
+      - ./ci-scripts/displaytime.sh 'py36,py37 unit' $start_time

buildspec.yml

Lines changed: 3 additions & 3 deletions
@@ -11,13 +11,13 @@ phases:

       # run integration tests
       - start_time=`date +%s`
-      - execute-command-if-has-matching-changes "python3.8 -u ci-scripts/queue_build.py" "tests/integ" "tests/scripts" "tests/data" "tests/conftest.py" "tests/__init__.py" "src/*.py" "setup.py" "setup.cfg" "buildspec.yml"
+      - execute-command-if-has-matching-changes "python3.7 -u ci-scripts/queue_build.py" "tests/integ" "tests/scripts" "tests/data" "tests/conftest.py" "tests/__init__.py" "src/*.py" "setup.py" "setup.cfg" "buildspec.yml"
       - ./ci-scripts/displaytime.sh 'build queue' $start_time

       - start_time=`date +%s`
       - |
-        execute-command-if-has-matching-changes "env -u AWS_DEFAULT_REGION tox -e py38 -- tests/integ -m \"not local_mode and not cron\" -n 384 --reruns 3 --reruns-delay 15 --durations 50 --boto-config '{\"region_name\": \"us-east-2\"}'" "tests/integ" "tests/scripts" "tests/data" "tests/conftest.py" "tests/__init__.py" "src/*.py" "src/sagemaker/image_uri_config/*.json" "setup.py" "setup.cfg" "buildspec.yml"
-      - ./ci-scripts/displaytime.sh 'py38 tests/integ' $start_time
+        execute-command-if-has-matching-changes "env -u AWS_DEFAULT_REGION tox -e py37 -- tests/integ -m \"not local_mode and not cron and not slow_test\" -n 384 --reruns 3 --reruns-delay 15 --durations 50 --boto-config '{\"region_name\": \"us-east-2\"}'" "tests/integ" "tests/scripts" "tests/data" "tests/conftest.py" "tests/__init__.py" "src/*.py" "src/sagemaker/image_uri_config/*.json" "setup.py" "setup.cfg" "buildspec.yml"
+      - ./ci-scripts/displaytime.sh 'py37 tests/integ' $start_time

   post_build:
     finally:

doc/api/training/smd_model_parallel_general.rst

Lines changed: 41 additions & 3 deletions
@@ -5,13 +5,13 @@

 .. _sm-sdk-modelparallel-params:

-SageMaker Python SDK ``modelparallel`` parameters
-=================================================
+Required SageMaker Python SDK parameters
+========================================

 The TensorFlow and PyTorch ``Estimator`` objects contains a ``distribution`` parameter,
 which is used to enable and specify parameters for the
 initialization of the SageMaker distributed model parallel library. The library internally uses MPI,
-so in order to use model parallelism, MPI must be enabled using the ``distribution`` parameter.
+so in order to use model parallelism, MPI must also be enabled using the ``distribution`` parameter.

 The following is an example of how you can launch a new PyTorch training job with the library.

@@ -55,6 +55,9 @@ The following is an example of how you can launch a new PyTorch training job wit

     smd_mp_estimator.fit('s3://my_bucket/my_training_data/')

+``smdistributed`` Parameters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
 You can use the following parameters to initialize the library using the ``parameters``
 in the ``smdistributed`` of ``distribution``.

@@ -302,6 +305,41 @@ table are optional.
 |                   |                         |                 | SageMaker.                        |
 +-------------------+-------------------------+-----------------+-----------------------------------+

+``mpi`` Parameters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+For the ``"mpi"`` key, a dict must be passed which contains:
+
+* ``"enabled"``: Set to ``True`` to launch the training job with MPI.
+
+* ``"processes_per_host"``: Specifies the number of processes MPI should launch on each host.
+  In SageMaker a host is a single Amazon EC2 ml instance. The SageMaker distributed model parallel library maintains
+  a one-to-one mapping between processes and GPUs across model and data parallelism.
+  This means that SageMaker schedules each process on a single, separate GPU and no GPU contains more than one process.
+  If you are using PyTorch, you must restrict each process to its own device using
+  ``torch.cuda.set_device(smp.local_rank())``. To learn more, see
+  `Modify a PyTorch Training Script
+  <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-training-script.html#model-parallel-customize-training-script-pt-16>`_.
+
+  .. important::
+    ``process_per_host`` must be less than or equal to the number of GPUs per instance, and typically will be equal to
+    the number of GPUs per instance.
+
+  For example, if you use one instance with 4-way model parallelism and 2-way data parallelism,
+  then processes_per_host should be 2 x 4 = 8. Therefore, you must choose an instance that has at least 8 GPUs,
+  such as an ml.p3.16xlarge.
+
+  The following image illustrates how 2-way data parallelism and 4-way model parallelism is distributed across 8 GPUs:
+  the model is partitioned across 4 GPUs, and each partition is added to 2 GPUs.
+
+  .. image:: smp_versions/model-data-parallel.png
+     :width: 650
+     :alt: 2-way data parallelism and 4-way model parallelism distributed across 8 GPUs
+
+
+* ``"custom_mpi_options"``: Use this key to pass any custom MPI options you might need.
+  To avoid Docker warnings from contaminating your training logs, we recommend the following flag.
+  ```--mca btl_vader_single_copy_mechanism none```
+

 .. _ranking-basics:


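The documentation added above describes the ``smdistributed`` and ``mpi`` keys of the estimator's ``distribution`` argument. As a minimal sketch of the configuration it describes (the script, role, versions, and parameter values below are placeholders, not taken from this commit):

    from sagemaker.pytorch import PyTorch

    # Hypothetical values matching the doc's example of 4-way model parallelism
    # and 2-way data parallelism: 8 processes on one ml.p3.16xlarge (8 GPUs).
    smd_mp_estimator = PyTorch(
        entry_point="train.py",  # placeholder training script
        role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder role
        instance_type="ml.p3.16xlarge",
        instance_count=1,
        framework_version="1.6.0",
        py_version="py3",
        distribution={
            "smdistributed": {
                "modelparallel": {
                    "enabled": True,
                    "parameters": {"partitions": 4, "ddp": True},
                }
            },
            "mpi": {
                "enabled": True,
                "processes_per_host": 8,  # 4-way model x 2-way data parallelism
                "custom_mpi_options": "--mca btl_vader_single_copy_mechanism none",
            },
        },
    )

    smd_mp_estimator.fit("s3://my_bucket/my_training_data/")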
doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.md

Lines changed: 0 additions & 4 deletions
@@ -17,10 +17,6 @@

 - Adds support for `_register_comm_hook` (PyTorch 1.7 only) which will register the callable as a communication hook for DDP. NOTE: Like in DDP, this is an experimental API and subject to change.

-### Tensorflow
-
-- Adds support for Tensorflow 2.4
-
 ## Bug Fixes

 ### PyTorch

doc/api/training/smp_versions/v1.2.0/smd_model_parallel_common_api.rst

Lines changed: 3 additions & 0 deletions
@@ -118,6 +118,9 @@ The following SageMaker distribute model parallel APIs are common across all fra
   - https://www.tensorflow.org/api_docs/python/tf/function\
   - https://www.tensorflow.org/guide/function\

+  Each ``smp.step`` decorated function must have a return value that depends on the
+  output of ``smp.DistributedModel``.
+
   **Common parameters**

   - ``non_split_inputs`` (``list``): The list of arguments to the decorated function

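The added sentence above requires the return value of every ``smp.step``-decorated function to depend on the output of ``smp.DistributedModel``. A minimal PyTorch-flavored sketch of such a function (the names are illustrative, not code from this commit):

    import smdistributed.modelparallel.torch as smp
    import torch.nn.functional as F


    @smp.step
    def train_step(model, data, target):
        # `model` is assumed to be an smp.DistributedModel; the returned values
        # depend on its output, as the added note requires.
        output = model(data)
        loss = F.nll_loss(output, target, reduction="mean")
        model.backward(loss)  # backward is called on the model inside smp.step
        return output, loss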
doc/api/training/smp_versions/v1.2.0/smd_model_parallel_pytorch.rst

Lines changed: 16 additions & 3 deletions
@@ -31,7 +31,6 @@ This API document assumes you use the following import statements in your traini
    model in the training script can be wrapped with
    ``smp.DistributedModel``.

-
    **Example:**

    .. code:: python
@@ -89,6 +88,17 @@ This API document assumes you use the following import statements in your traini
    the model objects (``model(inputs)`` and ``model.backward(loss)``) must be made inside
    a ``smp.step``-decorated function.

+   **Using DDP**
+
+   If DDP is enabled, do not not place a PyTorch
+   ``DistributedDataParallel`` wrapper around the ``DistributedModel`` because
+   the ``DistributedModel`` wrapper will also handle data parallelism.
+
+   Unlike the original DDP wrapper, when you use ``DistributedModel``,
+   model parameters and buffers are not immediately broadcast across
+   processes when the wrapper is called. Instead, the broadcast is deferred to the first call of the
+   ``smp.step``-decorated function when the partition is done.
+
    **Parameters**

    - ``module`` (``torch.nn.Module``): Module to be distributed (data parallelism and model parallelism).
@@ -248,11 +258,14 @@ This API document assumes you use the following import statements in your traini
 .. function:: join( )

    **Available for PyTorch 1.7 only**
+
    A context manager to be used in conjunction with an instance of
-   ``smp.DistributedModel``to be able to train with uneven inputs across
+   ``smp.DistributedModel`` to be able to train with uneven inputs across
    participating processes. This is only supported when ``ddp=True`` for
    ``smp.DistributedModel``. This will use the join with the wrapped
-   ``DistributedDataParallel`` instance. Please see: `join <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.join>`__.
+   ``DistributedDataParallel`` instance. For more information, see:
+   `join <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.join>`__
+   in the PyTorch documentation.


 .. class:: smp.DistributedOptimizer

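The new "Using DDP" note above says that ``smp.DistributedModel`` already covers data parallelism when DDP is enabled, so the PyTorch ``DistributedDataParallel`` wrapper must not be added on top. A short sketch under that assumption (the model and hyperparameters are placeholders):

    import torch
    import torch.nn as nn
    import smdistributed.modelparallel.torch as smp

    smp.init()
    torch.cuda.set_device(smp.local_rank())  # one process per GPU

    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

    # Wrap with smp.DistributedModel only; do not also wrap it in
    # torch.nn.parallel.DistributedDataParallel. Parameter/buffer broadcast is
    # deferred to the first call of an smp.step-decorated function.
    model = smp.DistributedModel(model)
    optimizer = smp.DistributedOptimizer(torch.optim.SGD(model.parameters(), lr=0.01))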
doc/api/training/smp_versions/v1.2.0/smd_model_parallel_tensorflow.rst

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 TensorFlow API
 ==============

-**Supported version: 2.4, 2.3**
+**Supported version: 2.3**

 **Important**: This API document assumes you use the following import statement in your training scripts.

doc/frameworks/mxnet/using_mxnet.rst

Lines changed: 5 additions & 4 deletions
@@ -377,7 +377,7 @@ It loads the model parameters from a ``model.params`` file in the SageMaker mode
         return net

 MXNet on Amazon SageMaker has support for `Elastic Inference <https://docs.aws.amazon.com/sagemaker/latest/dg/ei.html>`__, which allows for inference acceleration to a hosted endpoint for a fraction of the cost of using a full GPU instance.
-In order to load and serve your MXNet model through Amazon Elastic Inference, the MXNet context passed to your MXNet Symbol or Module object within your ``model_fn`` needs to be set to ``eia``, as shown `here <https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-mxnet-elastic-inference.html#ei-mxnet>`__.
+In order to load and serve your MXNet model through Amazon Elastic Inference, import the ``eimx`` Python package and make one change in the code to partition your model and optimize it for the ``EIA`` back end, as shown `here <https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-mxnet-elastic-inference.html#ei-mxnet>`__.

 Based on the example above, the following code-snippet shows an example custom ``model_fn`` implementation, which enables loading and serving our MXNet model through Amazon Elastic Inference.

@@ -392,11 +392,12 @@ Based on the example above, the following code-snippet shows an example custom `
         Returns:
             mxnet.gluon.nn.Block: a Gluon network (for this example)
         """
-        net = models.get_model('resnet34_v2', ctx=mx.eia(), pretrained=False, classes=10)
-        net.load_params('%s/model.params' % model_dir, ctx=mx.eia())
+        net = models.get_model('resnet34_v2', ctx=mx.cpu(), pretrained=False, classes=10)
+        net.load_params('%s/model.params' % model_dir, ctx=mx.cpu())
+        net.hybridize(backend='EIA', static_alloc=True, static_shape=True)
         return net

-The `default_model_fn <https://github.com/aws/sagemaker-mxnet-container/pull/55/files#diff-aabf018d906ed282a3c738377d19a8deR71>`__ loads and serve your model through Elastic Inference, if applicable, within the Amazon SageMaker MXNet containers.
+If you are using MXNet 1.5.1 and earlier, the `default_model_fn <https://github.com/aws/sagemaker-mxnet-container/pull/55/files#diff-aabf018d906ed282a3c738377d19a8deR71>`__ loads and serve your model through Elastic Inference, if applicable, within the Amazon SageMaker MXNet containers.

 For more information on how to enable MXNet to interact with Amazon Elastic Inference, see `Use Elastic Inference with MXNet <https://docs.aws.amazon.com/dlami/latest/devguide/tutorial-mxnet-elastic-inference.html>`__.

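For context on the hunk above, a self-contained version of the updated ``model_fn`` might look as follows; the ``eimx`` import and surrounding imports are assumptions based on the linked Elastic Inference tutorial rather than code shown in this commit:

    import mxnet as mx
    from mxnet.gluon.model_zoo import vision as models

    import eimx  # noqa: F401  # assumed: registers the 'EIA' hybridize backend


    def model_fn(model_dir):
        """Load a Gluon model and optimize it for the Elastic Inference (EIA) back end."""
        net = models.get_model('resnet34_v2', ctx=mx.cpu(), pretrained=False, classes=10)
        net.load_params('%s/model.params' % model_dir, ctx=mx.cpu())
        # Partition the graph and optimize it for the EIA back end.
        net.hybridize(backend='EIA', static_alloc=True, static_shape=True)
        return net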
doc/workflows/pipelines/sagemaker.workflow.pipelines.rst

Lines changed: 0 additions & 2 deletions
@@ -110,8 +110,6 @@ Steps

 .. autoclass:: sagemaker.workflow.steps.ProcessingStep

-.. autoclass:: sagemaker.workflow.steps.FailStep
-
 Utilities
 ---------

setup.py

Lines changed: 1 addition & 0 deletions
@@ -66,6 +66,7 @@ def read_version():
     "pytest<6.1.0",
     "pytest-cov",
     "pytest-rerunfailures",
+    "pytest-timeout",
     "pytest-xdist",
     "mock",
     "contextlib2",

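``pytest-timeout`` is added here as a test dependency; per the changelog entry about a 30s cap for tag tests, it presumably bounds individual test run time. A hedged illustration with a hypothetical test name:

    import pytest


    @pytest.mark.timeout(30)  # fail the test if it runs longer than 30 seconds
    def test_hypothetical_tag_listing():
        assert True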
src/sagemaker/analytics.py

Lines changed: 1 addition & 0 deletions
@@ -43,6 +43,7 @@ class AnalyticsMetricsBase(with_metaclass(ABCMeta, object)):
     """

     def __init__(self):
+        """Initializes ``AnalyticsMetricsBase`` instance."""
         self._dataframe = None

     def export_csv(self, filename):

src/sagemaker/image_uri_config/inferentia-mxnet.json

Lines changed: 22 additions & 1 deletion
@@ -5,8 +5,29 @@
     "1.5.1": {
       "py_versions": ["py3"],
       "registries": {
+        "af-south-1": "774647643957",
+        "ap-east-1": "110948597952",
+        "ap-northeast-1": "941853720454",
+        "ap-northeast-2": "151534178276",
+        "ap-south-1": "763008648453",
+        "ap-southeast-1": "324986816169",
+        "ap-southeast-2": "355873309152",
+        "ca-central-1": "464438896020",
+        "cn-north-1": "472730292857",
+        "cn-northwest-1": "474822919863",
+        "eu-central-1": "746233611703",
+        "eu-north-1": "601324751636",
+        "eu-south-1": "966458181534",
+        "eu-west-1": "802834080501",
+        "eu-west-2": "205493899709",
+        "eu-west-3": "254080097072",
+        "me-south-1": "836785723513",
+        "sa-east-1": "756306329178",
         "us-east-1": "785573368785",
-        "us-west-2": "301217895009"
+        "us-east-2": "007439368137",
+        "us-gov-west-1": "263933020539",
+        "us-west-1": "710691900526",
+        "us-west-2": "301217895009"
       },
       "repository": "sagemaker-neo-mxnet"
     }

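These registry additions line up with the changelog entry about more Neo supported regions for Inferentia inference images. A hedged sketch of how such a config file is typically consumed through the SDK (the framework string and region here are assumptions, not verified against this commit):

    from sagemaker import image_uris

    # Look up the Inferentia MXNet inference image in one of the newly added regions.
    uri = image_uris.retrieve(
        framework="inferentia-mxnet",  # assumed to map to inferentia-mxnet.json
        region="eu-west-1",
        version="1.5.1",
        py_version="py3",
        instance_type="ml.inf1.xlarge",
    )
    print(uri)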
0 commit comments
