
Commit 85af929

Merge branch 'master' into feature/new_fg_utils

2 parents b31a0fe + 58fe72a, commit 85af929
25 files changed: +554 -368 lines

CHANGELOG.md (+17)

@@ -1,5 +1,22 @@
 # Changelog

+## v2.131.0 (2023-01-31)
+
+### Features
+
+ * Display file diff on black-check
+ * Support for environment variables in the HPO
+ * Support role as PipelineParameter in Processor class
+ * Add TrainingImageConfig support for SageMaker training jobs
+
+### Bug Fixes and Other Changes
+
+ * Use FeatureGroup's Session in non-concurrent ingestion
+ * Update feature_group.py ingest() description
+ * Do not use the print function; use a logger instead
+ * Add batch_get_record and search API for FeatureStore
+ * Fix hashing problem for framework processors with identical source dirs
+
 ## v2.130.0 (2023-01-26)

 ### Features

VERSION (+1 -1)

@@ -1 +1 @@
-2.130.1.dev0
+2.131.1.dev0

doc/api/training/smd_model_parallel_release_notes/smd_model_parallel_change_log.rst (+41 -7)

@@ -6,9 +6,47 @@ New features, bug fixes, and improvements are regularly made to the SageMaker
 distributed model parallel library.

-SageMaker Distributed Model Parallel 1.13.0 Release Notes
+SageMaker Distributed Model Parallel 1.14.0 Release Notes
 =========================================================

+*Date: Jan. 30. 2023*
+
+**Currency Updates**
+
+* Added support for PyTorch v1.13.1
+
+**Improvements**
+
+* Upgraded the flash-attention (https://github.com/HazyResearch/flash-attention) library to v0.2.6.post1
+
+**Migration to AWS Deep Learning Containers**
+
+This version passed benchmark testing and is migrated to the following AWS Deep Learning Containers (DLC):
+
+- SageMaker training container for PyTorch v1.13.1
+
+  .. code::
+
+    763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker
+
+Binary file of this version of the library for `custom container
+<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-sm-sdk.html#model-parallel-bring-your-own-container>`_ users:
+
+- For PyTorch 1.13.1
+
+  .. code::
+
+    https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.13.1/build-artifacts/2023-01-19-18-35/smdistributed_modelparallel-1.14.0-cp39-cp39-linux_x86_64.whl
+
+----
+
+Release History
+===============
+
+SageMaker Distributed Model Parallel 1.13.0 Release Notes
+---------------------------------------------------------
+
 *Date: Dec. 15. 2022*

 **New Features**

@@ -46,16 +84,12 @@ This version passed benchmark testing and is migrated to the following AWS Deep
 Binary file of this version of the library for `custom container
 <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-sm-sdk.html#model-parallel-bring-your-own-container>`_ users:

-- For PyTorch 1.12.0
+- For PyTorch 1.12.1

 .. code::

   https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.1/build-artifacts/2022-12-08-21-34/smdistributed_modelparallel-1.13.0-cp38-cp38-linux_x86_64.whl

-----
-
-Release History
-===============

 SageMaker Distributed Model Parallel 1.11.0 Release Notes
 ---------------------------------------------------------

@@ -92,7 +126,7 @@ Binary file of this version of the library for `custom container

 .. code::

-  https://sagemaker-distribu
+  https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.12.0/build-artifacts/2022-08-12-16-58/smdistributed_modelparallel-1.11.0-cp38-cp38-linux_x86_64.whl

 SageMaker Distributed Model Parallel 1.10.1 Release Notes
 ---------------------------------------------------------

doc/api/training/smp_versions/latest.rst (+2 -2)

@@ -10,8 +10,8 @@ depending on which version of the library you need to use.
 To use the library, reference the
 **Common API** documentation alongside the framework specific API documentation.

-Version 1.11.0, 1.13.0 (Latest)
-===============================
+Version 1.11.0, 1.13.0, 1.14.0 (Latest)
+=======================================

 To use the library, reference the Common API documentation alongside the framework specific API documentation.

src/sagemaker/debugger/framework_profile.py (+6)

@@ -143,6 +143,12 @@ def __init__(
             profiling. Configure it using the
             :class:`~sagemaker.debugger.metrics_config.DetailedProfilingConfig` class.
             Pass ``DetailedProfilingConfig()`` to use the default configuration.
+
+            .. warning::
+                The detailed framework profiling feature does not support TensorFlow v2.11
+                and later. To use detailed profiling, use a TensorFlow version between
+                v2.3.1 and v2.10.0.
+
         dataloader_profiling_config (DataloaderProfilingConfig): The configuration for
             dataloader metrics profiling. Configure it using the
             :class:`~sagemaker.debugger.metrics_config.DataloaderProfilingConfig` class.
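
The snippet below is a minimal sketch (not part of this commit) of how the
``detailed_profiling_config`` parameter documented above is typically wired into a
training job. The step values are illustrative, and it assumes a TensorFlow version
in the supported range between v2.3.1 and v2.10.0:

    # Sketch: enable detailed framework profiling for steps 5-6.
    from sagemaker.debugger import FrameworkProfile, ProfilerConfig
    from sagemaker.debugger.metrics_config import DetailedProfilingConfig

    profiler_config = ProfilerConfig(
        framework_profile_params=FrameworkProfile(
            detailed_profiling_config=DetailedProfilingConfig(start_step=5, num_steps=2)
        )
    )
    # profiler_config can then be passed to an estimator,
    # e.g. TensorFlow(..., profiler_config=profiler_config).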

src/sagemaker/debugger/metrics_config.py (+6 -2)

@@ -203,8 +203,7 @@ def __init__(
         ):
             """Specify target steps or a target duration to profile.

-            By default, it profiles step 5
-            of training.
+            By default, it profiles step 5 of the training job.

             If **profile_default_steps** is set to `True` and none of the other
             range parameters is specified,

@@ -224,6 +223,11 @@ def __init__(
             if one of the two pairs is used. If both pairs are specified, a
             conflict error occurs.

+            .. warning::
+                The detailed framework profiling feature does not support TensorFlow v2.11
+                and later. To use detailed profiling, use a TensorFlow version between
+                v2.3.1 and v2.10.0.
+
             """
             assert isinstance(
                 profile_default_steps, bool
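
For context, a small sketch (not part of this commit) of the two mutually exclusive
range pairs the docstring describes; the values are illustrative:

    # Sketch: DetailedProfilingConfig accepts a step range or a time range, not both.
    import time

    from sagemaker.debugger.metrics_config import DetailedProfilingConfig

    step_based = DetailedProfilingConfig(start_step=5, num_steps=3)  # steps 5-7
    time_based = DetailedProfilingConfig(
        start_unix_time=int(time.time()), duration=600  # ~10 minutes from now
    )
    # Specifying both pairs at once raises a conflict error, per the docstring above.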

src/sagemaker/experiments/_environment.py (+12 -20)

@@ -18,12 +18,13 @@
 import logging
 import os

+from sagemaker import Session
 from sagemaker.experiments import trial_component
 from sagemaker.utils import retry_with_backoff

 TRAINING_JOB_ARN_ENV = "TRAINING_JOB_ARN"
 PROCESSING_JOB_CONFIG_PATH = "/opt/ml/config/processingjobconfig.json"
-TRANSFORM_JOB_ENV_BATCH_VAR = "SAGEMAKER_BATCH"
+TRANSFORM_JOB_ARN_ENV = "TRANSFORM_JOB_ARN"
 MAX_RETRY_ATTEMPTS = 7

 logger = logging.getLogger(__name__)

@@ -40,7 +41,7 @@ class _EnvironmentType(enum.Enum):
 class _RunEnvironment(object):
     """Retrieves job specific data from the environment."""

-    def __init__(self, environment_type, source_arn):
+    def __init__(self, environment_type: _EnvironmentType, source_arn: str):
         """Init for _RunEnvironment.

         Args:

@@ -53,9 +54,9 @@ def __init__(self, environment_type, source_arn):
     @classmethod
     def load(
         cls,
-        training_job_arn_env=TRAINING_JOB_ARN_ENV,
-        processing_job_config_path=PROCESSING_JOB_CONFIG_PATH,
-        transform_job_batch_var=TRANSFORM_JOB_ENV_BATCH_VAR,
+        training_job_arn_env: str = TRAINING_JOB_ARN_ENV,
+        processing_job_config_path: str = PROCESSING_JOB_CONFIG_PATH,
+        transform_job_arn_env: str = TRANSFORM_JOB_ARN_ENV,
     ):
         """Loads source arn of current job from environment.

@@ -64,8 +65,8 @@ def load(
             (default: `TRAINING_JOB_ARN`).
             processing_job_config_path (str): The processing job config path
             (default: `/opt/ml/config/processingjobconfig.json`).
-            transform_job_batch_var (str): The environment variable indicating if
-            it is a transform job (default: `SAGEMAKER_BATCH`).
+            transform_job_arn_env (str): The environment key for the transform job ARN
+            (default: `TRANSFORM_JOB_ARN`).

         Returns:
             _RunEnvironment: Job data loaded from the environment. None if config does not exist.

@@ -78,16 +79,15 @@ def load(
             environment_type = _EnvironmentType.SageMakerProcessingJob
             source_arn = json.loads(open(processing_job_config_path).read())["ProcessingJobArn"]
             return _RunEnvironment(environment_type, source_arn)
-        if transform_job_batch_var in os.environ and os.environ[transform_job_batch_var] == "true":
+        if transform_job_arn_env in os.environ:
             environment_type = _EnvironmentType.SageMakerTransformJob
-            # TODO: need to figure out how to get source_arn from job env
-            # with Transform team's help.
-            source_arn = ""
+            # TODO: update to get source_arn from a config file once the Transform side is ready
+            source_arn = os.environ.get(transform_job_arn_env)
             return _RunEnvironment(environment_type, source_arn)

         return None

-    def get_trial_component(self, sagemaker_session):
+    def get_trial_component(self, sagemaker_session: Session):
         """Retrieves the trial component from the job in the environment.

         Args:

@@ -99,14 +99,6 @@ def get_trial_component(self, sagemaker_session):
         Returns:
             _TrialComponent: The trial component created from the job. None if not found.
         """
-        # TODO: Remove this condition check once we have a way to retrieve source ARN
-        # from transform job env
-        if self.environment_type == _EnvironmentType.SageMakerTransformJob:
-            logger.error(
-                "Currently getting the job trial component from the transform job environment "
-                "is not supported. Returning None."
-            )
-            return None

         def _get_trial_component():
             summaries = list(
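
A hedged sketch (not part of this commit) of the new detection behavior; the ARN
value is illustrative:

    # Sketch: a transform job is now recognized by the presence of the
    # TRANSFORM_JOB_ARN environment variable rather than SAGEMAKER_BATCH == "true".
    import os

    from sagemaker.experiments._environment import _EnvironmentType, _RunEnvironment

    os.environ["TRANSFORM_JOB_ARN"] = (
        "arn:aws:sagemaker:us-west-2:111122223333:transform-job/my-job"  # illustrative
    )
    env = _RunEnvironment.load()
    assert env.environment_type == _EnvironmentType.SageMakerTransformJob
    print(env.source_arn)  # prints the ARN set above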

src/sagemaker/experiments/_metrics.py (-80)

@@ -14,7 +14,6 @@
 from __future__ import absolute_import

 import datetime
-import json
 import logging
 import os
 import time

@@ -35,85 +34,6 @@
 logger = logging.getLogger(__name__)


-# TODO: remove this _SageMakerFileMetricsWriter class
-# when _MetricsManager is fully ready
-class _SageMakerFileMetricsWriter(object):
-    """Write metric data to file."""
-
-    def __init__(self, metrics_file_path=None):
-        """Construct a `_SageMakerFileMetricsWriter` object"""
-        self._metrics_file_path = metrics_file_path
-        self._file = None
-        self._closed = False
-
-    def log_metric(self, metric_name, value, timestamp=None, step=None):
-        """Write a metric to file.
-
-        Args:
-            metric_name (str): The name of the metric.
-            value (float): The value of the metric.
-            timestamp (datetime.datetime): Timestamp of the metric.
-                If not specified, the current UTC time will be used.
-            step (int): Iteration number of the metric (default: None).
-
-        Raises:
-            SageMakerMetricsWriterException: If the metrics file is closed.
-            AttributeError: If file has been initialized and the writer hasn't been closed.
-        """
-        raw_metric_data = _RawMetricData(
-            metric_name=metric_name, value=value, timestamp=timestamp, step=step
-        )
-        try:
-            logger.debug("Writing metric: %s", raw_metric_data)
-            self._file.write(json.dumps(raw_metric_data.to_record()))
-            self._file.write("\n")
-        except AttributeError as attr_err:
-            if self._closed:
-                raise SageMakerMetricsWriterException("log_metric called on a closed writer")
-            if not self._file:
-                self._file = open(self._get_metrics_file_path(), "a", buffering=1)
-                self._file.write(json.dumps(raw_metric_data.to_record()))
-                self._file.write("\n")
-            else:
-                raise attr_err
-
-    def close(self):
-        """Closes the metric file."""
-        if not self._closed and self._file:
-            self._file.close()
-            self._file = None  # invalidate reference, causing subsequent log_metric to fail.
-            self._closed = True
-
-    def __enter__(self):
-        """Return self"""
-        return self
-
-    def __exit__(self, exc_type, exc_value, exc_traceback):
-        """Execute self.close()"""
-        self.close()
-
-    def __del__(self):
-        """Execute self.close()"""
-        self.close()
-
-    def _get_metrics_file_path(self):
-        """Get file path to store metrics"""
-        pid_filename = "{}.json".format(str(os.getpid()))
-        metrics_file_path = self._metrics_file_path or os.path.join(METRICS_DIR, pid_filename)
-        logger.debug("metrics_file_path = %s", metrics_file_path)
-        return metrics_file_path
-
-
-class SageMakerMetricsWriterException(Exception):
-    """SageMakerMetricsWriterException"""
-
-    def __init__(self, message, errors=None):
-        """Construct a `SageMakerMetricsWriterException` instance"""
-        super().__init__(message)
-        if errors:
-            self.errors = errors
-
-
 class _RawMetricData(object):
     """A Raw Metric Data Object"""

src/sagemaker/experiments/_utils.py (+3 -5)

@@ -127,11 +127,9 @@ def get_tc_and_exp_config_from_job_env(
             num_attempts=4,
         )
     else:  # environment.environment_type == _EnvironmentType.SageMakerTransformJob
-        raise RuntimeError(
-            "Failed to load the Run as loading experiment config "
-            "from transform job environment is not currently supported. "
-            "As a workaround, please explicitly pass in "
-            "the experiment_name and run_name in load_run."
+        job_response = retry_with_backoff(
+            callable_func=lambda: sagemaker_session.describe_transform_job(job_name),
+            num_attempts=4,
         )

     job_exp_config = job_response.get("ExperimentConfig", dict())
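
For reference, a small sketch (not part of this commit) of the retry pattern used
above; ``describe_flaky`` is a hypothetical stand-in for
``sagemaker_session.describe_transform_job(job_name)``:

    # Sketch: retry_with_backoff retries a callable with exponential backoff.
    from sagemaker.utils import retry_with_backoff

    def describe_flaky():
        # Stand-in for a service call that may fail transiently.
        return {"ExperimentConfig": {"RunName": "my-run"}}  # illustrative response

    job_response = retry_with_backoff(callable_func=describe_flaky, num_attempts=4)
    job_exp_config = job_response.get("ExperimentConfig", dict())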

src/sagemaker/experiments/run.py (+7 -7)

@@ -120,19 +120,18 @@ def __init__(
             estimator.fit(job_name="my-job")  # Create a training job

         In order to reuse an existing run to log extra data, ``load_run`` is recommended.
+        For example, instead of the ``Run`` constructor, ``load_run`` is recommended
+        in a job script to load the existing run created before the job launch.
+        Otherwise, a new run may be created each time you launch a job.
+
         The code snippet below displays how to load the run initialized above
         in a custom training job script, where no ``run_name`` or ``experiment_name``
         is presented as they are automatically retrieved from the experiment config
         in the job environment.

-        Note:
-            Instead of the ``Run`` constructor, the ``load_run`` is recommended to use
-            in a job script to load the existing run created before the job launch.
-            Otherwise, a new run may be created each time you launch a job.
-
         .. code:: python

-            with load_run() as run:
+            with load_run(sagemaker_session=sagemaker_session) as run:
                 run.log_metric(...)
                 ...

@@ -648,7 +647,8 @@ def _append_run_tc_label_to_tags(tags: Optional[List[Dict[str, str]]] = None) ->
         """
         if not tags:
             tags = []
-        tags.append(RUN_TC_TAG)
+        if RUN_TC_TAG not in tags:
+            tags.append(RUN_TC_TAG)
         return tags

     def __enter__(self):
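
To round out the docstring change above, a hedged sketch (not part of this commit)
of the in-job pattern it recommends; the metric and parameter names are illustrative:

    # Sketch: inside a job script, load_run() picks up run_name and
    # experiment_name from the job's experiment config automatically.
    from sagemaker.experiments import load_run
    from sagemaker.session import Session

    sagemaker_session = Session()
    with load_run(sagemaker_session=sagemaker_session) as run:
        run.log_parameter("batch_size", 64)
        run.log_metric(name="train:loss", value=0.25, step=1)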
