Commit 20a3aa4

NihalHarish, tejaschumbalkar, and saimidu authored
[tensorflow] [build] Upgrade Smdebug to 1.0.4 for TF 2.4 (#835)
* upgrade smdebug
* revert buildspec
* bump smdebug version
* add inference context
* modify tf buildspec
* remove cu102 containers from test
* revert yaml to commit 3fc3f0f
* Unskip smmodelparallel multinode test

Co-authored-by: Tejas Chumbalkar <[email protected]>
Co-authored-by: tejaschumbalkar <[email protected]>
Co-authored-by: Sai Parthasarathy Miduthuri <[email protected]>
1 parent e528a56 commit 20a3aa4

4 files changed, +81 -137 lines changed

tensorflow/buildspec.yml

Lines changed: 79 additions & 134 deletions
@@ -1,137 +1,82 @@
-account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
-region: &REGION <set-$REGION-in-environment>
-framework: &FRAMEWORK tensorflow
-version: &VERSION 2.3.1
-short_version: &SHORT_VERSION 2.3
+account_id: &ACCOUNT_ID <set-$ACCOUNT_ID-in-environment>
+region: &REGION <set-$REGION-in-environment>
+framework: &FRAMEWORK tensorflow
+version: &VERSION 2.4.1
+short_version: &SHORT_VERSION 2.4
 
-repository_info:
-  training_repository: &TRAINING_REPOSITORY
-    image_type: &TRAINING_IMAGE_TYPE training
-    root: !join [ *FRAMEWORK, "/", *TRAINING_IMAGE_TYPE ]
-    repository_name: &REPOSITORY_NAME !join [pr, "-", *FRAMEWORK, "-", *TRAINING_IMAGE_TYPE]
-    repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *REPOSITORY_NAME ]
-  inference_repository: &INFERENCE_REPOSITORY
-    image_type: &INFERENCE_IMAGE_TYPE inference
-    root: !join [ *FRAMEWORK, "/", *INFERENCE_IMAGE_TYPE ]
-    repository_name: &REPOSITORY_NAME !join [pr, "-", *FRAMEWORK, "-", *INFERENCE_IMAGE_TYPE]
-    repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/, *REPOSITORY_NAME ]
+repository_info:
+  training_repository: &TRAINING_REPOSITORY
+    image_type: &TRAINING_IMAGE_TYPE training
+    root: !join [ *FRAMEWORK, "/", *TRAINING_IMAGE_TYPE ]
+    repository_name: &REPOSITORY_NAME !join [pr, "-", *FRAMEWORK, "-", *TRAINING_IMAGE_TYPE]
+    repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/,
+      *REPOSITORY_NAME ]
+  inference_repository:
+    image_type: &INFERENCE_IMAGE_TYPE inference
+    root: !join [ *FRAMEWORK, "/", *INFERENCE_IMAGE_TYPE ]
+    repository_name: &REPOSITORY_NAME !join [pr, "-", *FRAMEWORK, "-", *INFERENCE_IMAGE_TYPE]
+    repository: &REPOSITORY !join [ *ACCOUNT_ID, .dkr.ecr., *REGION, .amazonaws.com/,
+      *REPOSITORY_NAME ]
 
-context:
-  training_context: &TRAINING_CONTEXT
-    dockerd-entrypoint:
-      source: docker/build_artifacts/dockerd-entrypoint.py
-      target: dockerd-entrypoint.py
-  inference_context: &INFERENCE_CONTEXT
-    sagemaker_package_name:
-      source: docker/build_artifacts/sagemaker
-      target: sagemaker
-    init:
-      source: docker/build_artifacts/__init__.py
-      target: __init__.py
-    dockerd-entrypoint:
-      source: docker/build_artifacts/dockerd-entrypoint.py
-      target: dockerd-entrypoint.py
+context:
+  training_context: &TRAINING_CONTEXT
+    dockerd-entrypoint:
+      source: docker/build_artifacts/dockerd-entrypoint.py
+      target: dockerd-entrypoint.py
+  inference_context:
+    sagemaker_package_name:
+      source: docker/build_artifacts/sagemaker
+      target: sagemaker
+    init:
+      source: docker/build_artifacts/__init__.py
+      target: __init__.py
+    dockerd-entrypoint:
+      source: docker/build_artifacts/dockerd-entrypoint.py
+      target: dockerd-entrypoint.py
 
-images:
-  BuildTensorflowCpuPy37TrainingDockerImage:
-    <<: *TRAINING_REPOSITORY
-    build: &TENSORFLOW_CPU_TRAINING_PY3 false
-    image_size_baseline: &IMAGE_SIZE_BASELINE 4489
-    device_type: &DEVICE_TYPE cpu
-    python_version: &DOCKER_PYTHON_VERSION py3
-    tag_python_version: &TAG_PYTHON_VERSION py37
-    os_version: &OS_VERSION ubuntu18.04
-    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *OS_VERSION
-      ]
-    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /Dockerfile.,
-      *DEVICE_TYPE ]
-    context:
-      <<: *TRAINING_CONTEXT
-  BuildTensorflowGpuPy37Cu102TrainingDockerImage:
-    <<: *TRAINING_REPOSITORY
-    build: &TENSORFLOW_GPU_TRAINING_PY3 false
-    image_size_baseline: &IMAGE_SIZE_BASELINE 7738
-    device_type: &DEVICE_TYPE gpu
-    python_version: &DOCKER_PYTHON_VERSION py3
-    tag_python_version: &TAG_PYTHON_VERSION py37
-    cuda_version: &CUDA_VERSION cu102
-    os_version: &OS_VERSION ubuntu18.04
-    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION,
-      "-", *OS_VERSION ]
-    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /, *CUDA_VERSION,
-      /Dockerfile., *DEVICE_TYPE ]
-    context:
-      <<: *TRAINING_CONTEXT
-  BuildTensorflowExampleGpuPy37Cu102TrainingDockerImage:
-    <<: *TRAINING_REPOSITORY
-    build: &TENSORFLOW_GPU_TRAINING_PY3 false
-    image_size_baseline: &IMAGE_SIZE_BASELINE 7738
-    base_image_name: BuildTensorflowGpuPy37Cu102TrainingDockerImage
-    device_type: &DEVICE_TYPE gpu
-    python_version: &DOCKER_PYTHON_VERSION py3
-    tag_python_version: &TAG_PYTHON_VERSION py37
-    cuda_version: &CUDA_VERSION cu102
-    os_version: &OS_VERSION ubuntu18.04
-    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION,
-      "-", *OS_VERSION, "-example" ]
-    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /example,
-      /Dockerfile., *DEVICE_TYPE ]
-    context:
-      <<: *TRAINING_CONTEXT
-  BuildTensorflowGpuPy37Cu110TrainingDockerImage:
-    <<: *TRAINING_REPOSITORY
-    build: &TENSORFLOW_GPU_TRAINING_PY3 false
-    image_size_baseline: &IMAGE_SIZE_BASELINE 7738
-    device_type: &DEVICE_TYPE gpu
-    python_version: &DOCKER_PYTHON_VERSION py3
-    tag_python_version: &TAG_PYTHON_VERSION py37
-    cuda_version: &CUDA_VERSION cu110
-    os_version: &OS_VERSION ubuntu18.04
-    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION,
-      "-", *OS_VERSION ]
-    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /, *CUDA_VERSION,
-      /Dockerfile., *DEVICE_TYPE ]
-    context:
-      <<: *TRAINING_CONTEXT
-  BuildTensorflowExampleGpuPy37Cu110TrainingDockerImage:
-    <<: *TRAINING_REPOSITORY
-    build: &TENSORFLOW_GPU_TRAINING_PY3 false
-    image_size_baseline: &IMAGE_SIZE_BASELINE 7738
-    base_image_name: BuildTensorflowGpuPy37Cu110TrainingDockerImage
-    device_type: &DEVICE_TYPE gpu
-    python_version: &DOCKER_PYTHON_VERSION py3
-    tag_python_version: &TAG_PYTHON_VERSION py37
-    cuda_version: &CUDA_VERSION cu110
-    os_version: &OS_VERSION ubuntu18.04
-    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION,
-      "-", *OS_VERSION, "-example" ]
-    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /example,
-      /Dockerfile., *DEVICE_TYPE ]
-    context:
-      <<: *TRAINING_CONTEXT
-  BuildTensorflowCPUInferencePy3DockerImage:
-    <<: *INFERENCE_REPOSITORY
-    build: &TENSORFLOW_CPU_INFERENCE_PY3 false
-    image_size_baseline: 4899
-    device_type: &DEVICE_TYPE cpu
-    python_version: &DOCKER_PYTHON_VERSION py3
-    tag_python_version: &TAG_PYTHON_VERSION py37
-    cuda_version: &CUDA_VERSION cu102
-    os_version: &OS_VERSION ubuntu18.04
-    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *OS_VERSION ]
-    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /Dockerfile., *DEVICE_TYPE ]
-    context:
-      <<: *INFERENCE_CONTEXT
-  BuildTensorflowGPUInferencePy3DockerImage:
-    <<: *INFERENCE_REPOSITORY
-    build: &TENSORFLOW_GPU_INFERENCE_PY3 false
-    image_size_baseline: 7738
-    device_type: &DEVICE_TYPE gpu
-    python_version: &DOCKER_PYTHON_VERSION py3
-    tag_python_version: &TAG_PYTHON_VERSION py37
-    cuda_version: &CUDA_VERSION cu102
-    os_version: &OS_VERSION ubuntu18.04
-    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION, "-", *OS_VERSION ]
-    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /, *CUDA_VERSION, /Dockerfile., *DEVICE_TYPE ]
-    context:
-      <<: *INFERENCE_CONTEXT
+images:
+  BuildTensorflowCpuPy37TrainingDockerImage:
+    <<: *TRAINING_REPOSITORY
+    build: &TENSORFLOW_CPU_TRAINING_PY3 false
+    image_size_baseline: &IMAGE_SIZE_BASELINE 4489
+    device_type: &DEVICE_TYPE cpu
+    python_version: &DOCKER_PYTHON_VERSION py3
+    tag_python_version: &TAG_PYTHON_VERSION py37
+    os_version: &OS_VERSION ubuntu18.04
+    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *OS_VERSION
+      ]
+    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /Dockerfile.,
+      *DEVICE_TYPE ]
+    context:
+      <<: *TRAINING_CONTEXT
+  BuildTensorflowGpuPy37Cu110TrainingDockerImage:
+    <<: *TRAINING_REPOSITORY
+    build: &TENSORFLOW_GPU_TRAINING_PY3 false
+    image_size_baseline: &IMAGE_SIZE_BASELINE 7738
+    device_type: &DEVICE_TYPE gpu
+    python_version: &DOCKER_PYTHON_VERSION py3
+    tag_python_version: &TAG_PYTHON_VERSION py37
+    cuda_version: &CUDA_VERSION cu110
+    os_version: &OS_VERSION ubuntu18.04
+    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION,
+      "-", *OS_VERSION ]
+    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /, *CUDA_VERSION,
+      /Dockerfile., *DEVICE_TYPE ]
+    context:
+      <<: *TRAINING_CONTEXT
+  BuildTensorflowExampleGpuPy37Cu110TrainingDockerImage:
+    <<: *TRAINING_REPOSITORY
+    build: &TENSORFLOW_GPU_TRAINING_PY3 false
+    image_size_baseline: &IMAGE_SIZE_BASELINE 7738
+    base_image_name: BuildTensorflowGpuPy37Cu110TrainingDockerImage
+    device_type: &DEVICE_TYPE gpu
+    python_version: &DOCKER_PYTHON_VERSION py3
+    tag_python_version: &TAG_PYTHON_VERSION py37
+    cuda_version: &CUDA_VERSION cu110
+    os_version: &OS_VERSION ubuntu18.04
+    tag: !join [ *VERSION, "-", *DEVICE_TYPE, "-", *TAG_PYTHON_VERSION, "-", *CUDA_VERSION,
+      "-", *OS_VERSION, "-example" ]
+    docker_file: !join [ docker/, *SHORT_VERSION, /, *DOCKER_PYTHON_VERSION, /example,
+      /Dockerfile., *DEVICE_TYPE ]
+    context:
+      <<: *TRAINING_CONTEXT
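
For readers unfamiliar with the buildspec, here is a rough sketch of what the `!join` entries above appear to produce for the cu110 GPU training image. It assumes that `!join` concatenates its list items with no separator, which the explicit `"-"` and `/` items suggest but which this commit does not confirm. The `join` helper below is a hypothetical stand-in, not the repository's implementation.

```python
def join(parts):
    """Hypothetical stand-in for the buildspec's !join tag.

    Assumption: the tag concatenates its items with no separator.
    """
    return "".join(str(part) for part in parts)


# Values copied from the new buildspec.yml (cu110 GPU training image).
version = "2.4.1"
short_version = "2.4"
device_type = "gpu"
docker_python_version = "py3"
tag_python_version = "py37"
cuda_version = "cu110"
os_version = "ubuntu18.04"

tag = join([version, "-", device_type, "-", tag_python_version, "-",
            cuda_version, "-", os_version])
docker_file = join(["docker/", short_version, "/", docker_python_version,
                    "/", cuda_version, "/Dockerfile.", device_type])

print(tag)          # 2.4.1-gpu-py37-cu110-ubuntu18.04
print(docker_file)  # docker/2.4/py3/cu110/Dockerfile.gpu
```

Under that assumption, the resulting `docker_file` matches the `docker/2.4/py3/cu110/Dockerfile.gpu` path touched by this commit, relative to `tensorflow/training`.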

tensorflow/training/docker/2.4/py3/Dockerfile.cpu

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ ARG ESTIMATOR_URL=https://aws-tensorflow-binaries.s3-us-west-2.amazonaws.com/est
 
 # The smdebug pipeline relies for following format to perform string replace and trigger DLC pipeline for validating
 # the nightly builds. Therefore, while updating the smdebug version, please ensure that the format is not disturbed.
-ARG SMDEBUG_VERSION=1.0.2
+ARG SMDEBUG_VERSION=1.0.4
 
 RUN apt-get update && apt-get install -y --no-install-recommends \
     build-essential \
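
The Dockerfile comment in the hunk above says the smdebug pipeline performs a string replace on the `ARG SMDEBUG_VERSION=...` line. The actual pipeline code is not part of this commit, so the following is only a minimal sketch of why the exact format matters. The function name `bump_smdebug_version`, the regular expression, and the sample Dockerfile text are all assumptions made for illustration.

```python
import re

# Assumption: the pipeline looks for a line of exactly this shape.
SMDEBUG_ARG_PATTERN = re.compile(r"^ARG SMDEBUG_VERSION=\S+$", flags=re.MULTILINE)


def bump_smdebug_version(dockerfile_text: str, new_version: str) -> str:
    """Replace the pinned smdebug version, failing loudly if the line is missing."""
    updated, count = SMDEBUG_ARG_PATTERN.subn(
        lambda _match: f"ARG SMDEBUG_VERSION={new_version}", dockerfile_text
    )
    if count != 1:
        raise ValueError(f"expected exactly one ARG SMDEBUG_VERSION line, found {count}")
    return updated


# Hypothetical Dockerfile excerpt; the URL is a placeholder, not the real one.
before = (
    "ARG ESTIMATOR_URL=https://example.invalid/estimator.whl\n"
    "\n"
    "ARG SMDEBUG_VERSION=1.0.2\n"
    "\n"
    "RUN apt-get update\n"
)
after = bump_smdebug_version(before, "1.0.4")
print(after)
```

In this sketch, a reformatted line such as `ARG SMDEBUG_VERSION = 1.0.2` would no longer match and the helper would raise, which is the kind of breakage the comment warns about.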

tensorflow/training/docker/2.4/py3/cu110/Dockerfile.gpu

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@ ARG ESTIMATOR_URL=https://aws-tensorflow-binaries.s3-us-west-2.amazonaws.com/est
 
 # The smdebug pipeline relies for following format to perform string replace and trigger DLC pipeline for validating
 # the nightly builds. Therefore, while updating the smdebug version, please ensure that the format is not disturbed.
-ARG SMDEBUG_VERSION=1.0.2
+ARG SMDEBUG_VERSION=1.0.4
 
 ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/tensorflow/2.4.1/cu110/2021-01-28/smdistributed_dataparallel-1.0.0-cp37-cp37m-linux_x86_64.whl

test/sagemaker_tests/tensorflow/tensorflow2_training/integration/sagemaker/test_smmodelparallel.py

Lines changed: 0 additions & 1 deletion
@@ -68,7 +68,6 @@ def test_smmodelparallel(sagemaker_session, instance_type, ecr_image, tmpdir, fr
 @pytest.mark.skip_cpu
 @pytest.mark.skip_py2_containers
 @pytest.mark.parametrize("test_script, num_processes", [("smmodelparallel_hvd2_conv_multinode.py", 2)])
-@pytest.mark.skip("Skipping the test due to known issue in TF2.3. Updated binary for SM Model Parallel: https://github.com/aws/deep-learning-containers/pull/837")
 def test_smmodelparallel_multinode(sagemaker_session, instance_type, ecr_image, tmpdir, framework_version, test_script, num_processes):
     """
     Tests SM Modelparallel in sagemaker
