feature: Allow custom output for RepackModelStep #2804

Closed
wants to merge 24 commits into dev from repack_output
Changes from 23 commits

Commits
24 commits
dde8d00
fix: Set ProcessingStep upload locations deterministically to avoid c…
staubhp Dec 8, 2021
0f72907
fix: Prevent repack_model script from referencing nonexistent directo…
staubhp Dec 9, 2021
0bae071
fix: S3Input - add support for instance attributes (#2754)
mufaddal-rohawala Dec 15, 2021
17fe93e
fix: typos and broken link (#2765)
mohamed-ali Dec 16, 2021
f0efd27
feature: Add output path parameter for _RepackModelStep
tuliocasagrande Dec 17, 2021
7a1f4f8
fix: Fix role parameter for _RepackModelStep
tuliocasagrande Dec 17, 2021
ee6afcf
fix: Remove entry_point before calling Model on EstimatorTransformer
tuliocasagrande Jan 4, 2022
faf4ad5
feature: Add tests for RegisterModel with repack output
tuliocasagrande Jan 4, 2022
8210375
fix: fixes unnecessary session call while generating pipeline definit…
xchen909 Jan 10, 2022
972a6d2
feature: Add models_v2 under lineage context (#2800)
yzhu0 Jan 10, 2022
7206b9e
feature: enable python 3.9 (#2802)
mufaddal-rohawala Jan 10, 2022
127c964
change: Update CHANGELOG.md (#2842)
shreyapandit Jan 11, 2022
554d735
fix: update pricing link (#2805)
ahsan-z-khan Jan 11, 2022
88e4d68
doc: Document the available ExecutionVariables (#2807)
tuliocasagrande Jan 12, 2022
b3c19d8
fix: Remove duplicate vertex/edge in query lineage (#2784)
yzhu0 Jan 12, 2022
fd7a335
feature: Support model pipelines in CreateModelStep (#2845)
staubhp Jan 12, 2022
ccfcbe7
feature: support JsonGet/Join parameterization in tuning step Hyperpa…
jerrypeng7773 Jan 13, 2022
71c5617
doc: Enhance smddp 1.2.2 doc (#2852)
mchoi8739 Jan 13, 2022
975e031
feature: support checkpoint to be passed from estimator (#2849)
marckarp Jan 13, 2022
b377b52
fix: allow kms_key to be passed for processing step (#2779)
jayatalr Jan 13, 2022
9d259b3
feature: Adds support for Serverless inference (#2831)
bhaoz Jan 14, 2022
b82fb8a
feature: Add support for SageMaker lineage queries in action (#2853)
yzhu0 Jan 14, 2022
0489b59
Merge branch 'dev' into repack_output
shreyapandit Jan 14, 2022
ed9131b
Merge remote-tracking branch 'upstream/dev' into repack_output
May 13, 2022
21 changes: 20 additions & 1 deletion CHANGELOG.md
@@ -2,9 +2,28 @@

## v2.72.3 (2022-01-10)

### Features

* default repack encryption
* support large pipeline
* add support for pytorch 1.10.0

### Documentation Changes

* SageMaker model parallel library 1.6.0 API doc

### Bug Fixes and Other Changes

* update master from dev
* Model Registration with BYO scripts
* Add ContentType in test_auto_ml_describe
* Re-deploy static integ test endpoint if it is not found
* fix kmeans test deletion sequence, increment lineage statics
* Increment static lineage pipeline
* Fix lineage query integ tests
* Add label_headers option for Clarify ModelExplainabilityMonitor
* Add action type to lineage object
* Collapse cross-account artifacts in query lineage response
* Update CHANGELOG.md to remove defaulting dot characters

## v2.72.2 (2022-01-06)

2 changes: 1 addition & 1 deletion doc/Makefile
@@ -3,7 +3,7 @@

# You can set these variables from the command line.
SPHINXOPTS = -W
SPHINXBUILD = python -msphinx
SPHINXBUILD = python -msphinx
SPHINXPROJ = sagemaker
SOURCEDIR = .
BUILDDIR = _build
232 changes: 132 additions & 100 deletions doc/api/training/sdp_versions/latest/smd_data_parallel_pytorch.rst
@@ -2,10 +2,12 @@
PyTorch Guide to SageMaker's distributed data parallel library
##############################################################

.. admonition:: Contents
Use this guide to learn about the SageMaker distributed
data parallel library API for PyTorch.

- :ref:`pytorch-sdp-modify`
- :ref:`pytorch-sdp-api`
.. contents:: Topics
:depth: 3
:local:

.. _pytorch-sdp-modify:

@@ -55,7 +57,7 @@ API offered for PyTorch.


- Modify the ``torch.utils.data.distributed.DistributedSampler`` to
include the cluster’s information. Set``num_replicas`` to the
include the cluster’s information. Set ``num_replicas`` to the
total number of GPUs participating in training across all the nodes
in the cluster. This is called ``world_size``. You can get
``world_size`` with
@@ -110,7 +112,7 @@ you will have for distributed training with the distributed data parallel librar
def main():

    # Scale batch size by world size
    batch_size //= dist.get_world_size() // 8
    batch_size //= dist.get_world_size()
    batch_size = max(batch_size, 1)

    # Prepare dataset
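
A minimal sketch of the sampler setup described above, with ``num_replicas`` set to
``dist.get_world_size()``, the shard selected by ``dist.get_rank()``, and
``train_dataset`` / ``global_batch_size`` as placeholder names:

.. code:: python

    import smdistributed.dataparallel.torch.distributed as dist
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler

    dist.init_process_group()

    # One replica per GPU across all nodes; each process reads a distinct shard.
    train_sampler = DistributedSampler(
        train_dataset,                       # placeholder dataset
        num_replicas=dist.get_world_size(),
        rank=dist.get_rank(),
    )

    # Scale the batch size by the world size, as in the main() snippet above.
    batch_size = max(global_batch_size // dist.get_world_size(), 1)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, sampler=train_sampler)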
@@ -153,9 +155,132 @@ you will have for distributed training with the distributed data parallel librar
PyTorch API
===========

.. rubric:: Supported versions
.. class:: smdistributed.dataparallel.torch.parallel.DistributedDataParallel(module, device_ids=None, output_device=None, broadcast_buffers=True, process_group=None, bucket_cap_mb=None)

``smdistributed.dataparallel``'s implementation of distributed data
parallelism for PyTorch. In most cases, wrapping your PyTorch Module
with ``smdistributed.dataparallel``'s ``DistributedDataParallel`` (DDP) is
all you need to do to use ``smdistributed.dataparallel``.

Creation of this DDP class requires ``smdistributed.dataparallel``
already initialized
with ``smdistributed.dataparallel.torch.distributed.init_process_group()``.

This container parallelizes the application of the given module by
splitting the input across the specified devices by chunking in the
batch dimension. The module is replicated on each machine and each
device, and each such replica handles a portion of the input. During the
backwards pass, gradients from each node are averaged.

The batch size should be larger than the number of GPUs used locally.
Example usage
of ``smdistributed.dataparallel.torch.parallel.DistributedDataParallel``:

.. code:: python

    import torch
    import smdistributed.dataparallel.torch.distributed as dist
    from smdistributed.dataparallel.torch.parallel import DistributedDataParallel as DDP

    dist.init_process_group()

    # Pin GPU to be used to process local rank (one GPU per process)
    torch.cuda.set_device(dist.get_local_rank())

    # Build model and optimizer
    model = ...
    optimizer = torch.optim.SGD(model.parameters(),
                                lr=1e-3 * dist.get_world_size())
    # Wrap model with smdistributed.dataparallel's DistributedDataParallel
    model = DDP(model)

**Parameters:**

- ``module (torch.nn.Module)(required):`` PyTorch NN Module to be
parallelized
- ``device_ids (list[int])(optional):`` CUDA devices. This should only
be provided when the input module resides on a single CUDA device.
For single-device modules,
the ``ith module replica is placed on device_ids[i]``. For
multi-device modules and CPU modules, device_ids must be None or an
empty list, and input data for the forward pass must be placed on the
correct device. Defaults to ``None``.
- ``output_device (int)(optional):`` Device location of output for
single-device CUDA modules. For multi-device modules and CPU modules,
it must be None, and the module itself dictates the output location.
(default: device_ids[0] for single-device modules).  Defaults
to ``None``.
- ``broadcast_buffers (bool)(optional):`` Flag that enables syncing
(broadcasting) buffers of the module at beginning of the forward
function. ``smdistributed.dataparallel`` does not support broadcast
buffer yet. Please set this to ``False``.
- ``process_group(smdistributed.dataparallel.torch.distributed.group)(optional):`` Process
group is not supported in ``smdistributed.dataparallel``. This
parameter exists for API parity with torch.distributed only. Only
supported value is
``smdistributed.dataparallel.torch.distributed.group.WORLD.`` Defaults
to ``None.``
- ``bucket_cap_mb (int)(optional):`` DistributedDataParallel will
bucket parameters into multiple buckets so that gradient reduction of
each bucket can potentially overlap with backward
computation. ``bucket_cap_mb`` controls the bucket size in
MegaBytes (MB) (default: 25).

.. note::

This module assumes all parameters are registered in the model by the
time it is created. No parameters should be added nor removed later.

.. note::

This module assumes that the parameters registered in the model of
each distributed process are in the same order. The module itself
will conduct gradient all-reduction following the reverse order of
the registered parameters of the model. In other words, it is users’
responsibility to ensure that each distributed process has the exact
same model and thus the exact same parameter registration order.

.. note::

You should never change the set of your model’s parameters after
wrapping up your model with DistributedDataParallel. In other words,
when wrapping up your model with DistributedDataParallel, the
constructor of DistributedDataParallel will register the additional
gradient reduction functions on all the parameters of the model
itself at the time of construction. If you change the model’s
parameters after the DistributedDataParallel construction, this is
not supported and unexpected behaviors can happen, since some
parameters’ gradient reduction functions might not get called.

.. method:: no_sync()

``smdistributed.dataparallel`` supports the `PyTorch DDP no_sync() <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.no_sync>`_
context manager. It enables gradient accumulation by skipping AllReduce
during training iterations inside the context.

.. note::

The ``no_sync()`` context manager is available from smdistributed-dataparallel v1.2.2.
To find the release note, see :ref:`sdp_1.2.2_release_note`.

**PyTorch 1.7.1, 1.8.1**
**Example:**

.. code:: python

    # Gradients are accumulated while inside no_sync context
    with model.no_sync():
        ...
        loss.backward()

    # First iteration upon exiting context
    # Incoming gradients are added to the accumulated gradients and then synchronized via AllReduce
    ...
    loss.backward()

    # Update weights and reset gradients to zero after accumulation is finished
    optimizer.step()
    optimizer.zero_grad()
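
For a fuller picture, a sketch of an accumulation loop built on the pattern above,
assuming a hypothetical ``accumulation_steps`` value and placeholder ``train_loader``,
``model``, ``loss_fn``, and ``optimizer`` objects:

.. code:: python

    accumulation_steps = 4  # hypothetical number of micro-batches per update

    for step, (data, target) in enumerate(train_loader):
        if (step + 1) % accumulation_steps != 0:
            # Accumulate locally: forward and backward both run inside no_sync,
            # so the backward pass skips AllReduce.
            with model.no_sync():
                loss = loss_fn(model(data), target)
                loss.backward()
        else:
            # Last micro-batch of the group: gradients are synchronized via AllReduce.
            loss = loss_fn(model(data), target)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()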


.. function:: smdistributed.dataparallel.torch.distributed.is_available()
@@ -409,99 +534,6 @@ PyTorch API
otherwise.


.. class:: smdistributed.dataparallel.torch.parallel.DistributedDataParallel(module, device_ids=None, output_device=None, broadcast_buffers=True, process_group=None, bucket_cap_mb=None)

``smdistributed.dataparallel's`` implementation of distributed data
parallelism for PyTorch. In most cases, wrapping your PyTorch Module
with ``smdistributed.dataparallel's`` ``DistributedDataParallel (DDP)`` is
all you need to do to use ``smdistributed.dataparallel``.

Creation of this DDP class requires ``smdistributed.dataparallel``
already initialized
with ``smdistributed.dataparallel.torch.distributed.init_process_group()``.

This container parallelizes the application of the given module by
splitting the input across the specified devices by chunking in the
batch dimension. The module is replicated on each machine and each
device, and each such replica handles a portion of the input. During the
backwards pass, gradients from each node are averaged.

The batch size should be larger than the number of GPUs used locally.
Example usage
of ``smdistributed.dataparallel.torch.parallel.DistributedDataParallel``:

.. code:: python

    import torch
    import smdistributed.dataparallel.torch.distributed as dist
    from smdistributed.dataparallel.torch.parallel import DistributedDataParallel as DDP

    dist.init_process_group()

    # Pin GPU to be used to process local rank (one GPU per process)
    torch.cuda.set_device(dist.get_local_rank())

    # Build model and optimizer
    model = ...
    optimizer = torch.optim.SGD(model.parameters(),
                                lr=1e-3 * dist.get_world_size())
    # Wrap model with smdistributed.dataparallel's DistributedDataParallel
    model = DDP(model)

**Parameters:**

- ``module (torch.nn.Module)(required):`` PyTorch NN Module to be
parallelized
- ``device_ids (list[int])(optional):`` CUDA devices. This should only
be provided when the input module resides on a single CUDA device.
For single-device modules,
the ``ith module replica is placed on device_ids[i]``. For
multi-device modules and CPU modules, device_ids must be None or an
empty list, and input data for the forward pass must be placed on the
correct device. Defaults to ``None``.
- ``output_device (int)(optional):`` Device location of output for
single-device CUDA modules. For multi-device modules and CPU modules,
it must be None, and the module itself dictates the output location.
(default: device_ids[0] for single-device modules).  Defaults
to ``None``.
- ``broadcast_buffers (bool)(optional):`` Flag that enables syncing
(broadcasting) buffers of the module at beginning of the forward
function. ``smdistributed.dataparallel`` does not support broadcast
buffer yet. Please set this to ``False``.
- ``process_group(smdistributed.dataparallel.torch.distributed.group)(optional):`` Process
group is not supported in ``smdistributed.dataparallel``. This
parameter exists for API parity with torch.distributed only. Only
supported value is
``smdistributed.dataparallel.torch.distributed.group.WORLD.`` Defaults
to ``None.``
- ``bucket_cap_mb (int)(optional):`` DistributedDataParallel will
bucket parameters into multiple buckets so that gradient reduction of
each bucket can potentially overlap with backward
computation. ``bucket_cap_mb`` controls the bucket size in
MegaBytes (MB) (default: 25).

.. rubric:: Notes

- This module assumes all parameters are registered in the model by the
time it is created. No parameters should be added nor removed later.
- This module assumes that the parameters registered in the model of
each distributed process are in the same order. The module itself
will conduct gradient all-reduction following the reverse order of
the registered parameters of the model. In other words, it is users’
responsibility to ensure that each distributed process has the exact
same model and thus the exact same parameter registration order.
- You should never change the set of your model’s parameters after
wrapping up your model with DistributedDataParallel. In other words,
when wrapping up your model with DistributedDataParallel, the
constructor of DistributedDataParallel will register the additional
gradient reduction functions on all the parameters of the model
itself at the time of construction. If you change the model’s
parameters after the DistributedDataParallel construction, this is
not supported and unexpected behaviors can happen, since some
parameters’ gradient reduction functions might not get called.


.. class:: smdistributed.dataparallel.torch.distributed.ReduceOp

An enum-like class for supported reduction operations
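
As a brief illustration, a sketch of passing a reduction op to a collective call,
assuming ``ReduceOp.SUM`` is among the supported values and that ``all_reduce``
follows the ``torch.distributed``-style signature:

.. code:: python

    import torch
    import smdistributed.dataparallel.torch.distributed as dist

    dist.init_process_group()
    torch.cuda.set_device(dist.get_local_rank())

    # Sum a tensor across all workers; the result is identical on every rank.
    t = torch.ones(1).cuda()
    dist.all_reduce(t, op=dist.ReduceOp.SUM)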

doc/api/training/sdp_versions/latest/smd_data_parallel_tensorflow.rst
@@ -155,10 +155,6 @@ script you will have for distributed training with the library.
TensorFlow API
==============

.. rubric:: Supported versions

**TensorFlow 2.3.1, 2.4.1, 2.5.0**

.. function:: smdistributed.dataparallel.tensorflow.init()

Initialize ``smdistributed.dataparallel``. Must be called at the
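
A minimal sketch of the typical initialization sequence; pinning each process to a
single GPU via ``local_rank()`` is assumed from the common pattern rather than taken
from the collapsed text above:

.. code:: python

    import tensorflow as tf
    import smdistributed.dataparallel.tensorflow as sdp

    # Initialize the library before any other smdistributed.dataparallel call.
    sdp.init()

    # Pin this process to one GPU, selected by its local rank.
    gpus = tf.config.experimental.list_physical_devices("GPU")
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], "GPU")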