feat: Add detail profiler V2 options and tests #4078

Merged
merged 2 commits into from
Aug 23, 2023
Changes from 1 commit
46 changes: 1 addition & 45 deletions doc/api/training/debugger.rst
@@ -20,7 +20,7 @@ Debugger Rule APIs
 .. autoclass:: get_rule_container_image_uri
     :show-inheritance:

-.. autoclass:: get_default_profiler_rule
+.. autoclass:: get_default_profiler_processing_job
     :show-inheritance:

 .. class:: sagemaker.debugger.rule_configs
@@ -45,10 +45,6 @@ Debugger Rule APIs
     :show-inheritance:
     :inherited-members:

-.. autoclass:: ProfilerRule
-    :show-inheritance:
-    :inherited-members:
-
 Debugger Configuration APIs
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -60,43 +56,3 @@ Debugger Configuration APIs

 .. autoclass:: TensorBoardOutputConfig
     :show-inheritance:
-
-.. autoclass:: ProfilerConfig
-    :show-inheritance:
-
-Debugger Configuration APIs for Framework Profiling (Deprecated)
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. warning::
-
-    SageMaker Debugger deprecates the framework profiling feature starting from TensorFlow 2.11 and PyTorch 2.0. You can still use the feature in the previous versions of the frameworks and SDKs as follows.
-
-    * SageMaker Python SDK <= v2.130.0
-    * PyTorch >= v1.6.0, < v2.0
-    * TensorFlow >= v2.3.1, < v2.11
-
-    With the deprecation, SageMaker Debugger discontinues support for the APIs below this note.
-
-    See also `Amazon SageMaker Debugger Release Notes: March 16, 2023 <https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-release-notes.html#debugger-release-notes-20230315>`_.
-
-.. autoclass:: FrameworkProfile
-    :show-inheritance:
-
-.. autoclass:: DetailedProfilingConfig
-    :show-inheritance:
-
-.. autoclass:: DataloaderProfilingConfig
-    :show-inheritance:
-
-.. autoclass:: PythonProfilingConfig
-    :show-inheritance:
-
-.. autoclass:: PythonProfiler
-    :show-inheritance:
-
-.. autoclass:: cProfileTimer
-    :show-inheritance:
-
-.. automodule:: sagemaker.debugger.metrics_config
-    :members: StepRange, TimeRange
-    :undoc-members:
3 changes: 2 additions & 1 deletion doc/api/training/index.rst
@@ -5,11 +5,12 @@ Training APIs
.. toctree::
:maxdepth: 4

algorithm
analytics
automl
debugger
estimators
algorithm
tuner
parameter
processing
profiler
102 changes: 102 additions & 0 deletions doc/api/training/profiler.rst
@@ -0,0 +1,102 @@
Profiler
--------

Amazon SageMaker Profiler provides full visibility
into provisioned compute resources for training
state-of-the-art deep learning models.
Use the following SageMaker Profiler classes
to activate SageMaker Profiler when you create
an estimator object of :class:`sagemaker.pytorch.estimator.PyTorch`
or :class:`sagemaker.tensorflow.estimator.TensorFlow`.

.. contents::

.. currentmodule:: sagemaker.debugger

Profiler configuration modules
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. class:: sagemaker.Profiler(cpu_profiling_duration=3600)

    A configuration class to activate
    `Amazon SageMaker Profiler <https://docs.aws.amazon.com/sagemaker/latest/dg/train-profile-computational-performance.html>`_.

    To adjust the Profiler configuration instead of using the default configuration, use the following parameters.

    **Parameters:**

    - **cpu_profiling_duration** (*str*): Specify the time duration in seconds for
      profiling CPU activities. The default value is 3600 seconds.

    **Example usage:**

    .. code:: python

        import sagemaker
        from sagemaker.pytorch import PyTorch
        from sagemaker import ProfilerConfig, Profiler

        profiler_config = ProfilerConfig(
            profiler_params=Profiler(cpu_profiling_duration=3600)
        )

        estimator = PyTorch(
            framework_version="2.0.0",
            # ... set up the other essential parameters for the estimator class
            profiler_config=profiler_config,
        )

    For complete instructions on activating and using SageMaker Profiler, see
    `Use Amazon SageMaker Profiler to profile activities on AWS compute resources
    <https://docs.aws.amazon.com/sagemaker/latest/dg/train-profile-computational-performance.html>`_.

.. autoclass:: sagemaker.ProfilerConfig


Profiler Rule APIs
~~~~~~~~~~~~~~~~~~

The following API is for setting up SageMaker Debugger's profiler rules
to detect computational performance issues from training jobs.

.. autoclass:: ProfilerRule
    :inherited-members:


Debugger Configuration APIs for Framework Profiling (Deprecated)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::

    In favor of `Amazon SageMaker Profiler <https://docs.aws.amazon.com/sagemaker/latest/dg/train-profile-computational-performance.html>`_,
    SageMaker Debugger deprecates the framework profiling feature starting from TensorFlow 2.11 and PyTorch 2.0. You can still use the feature in the previous versions of the frameworks and SDKs as follows.

    * SageMaker Python SDK <= v2.130.0
    * PyTorch >= v1.6.0, < v2.0
    * TensorFlow >= v2.3.1, < v2.11

    With the deprecation, SageMaker Debugger discontinues support for the APIs below this note.

    See also `Amazon SageMaker Debugger Release Notes: March 16, 2023 <https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-release-notes.html#debugger-release-notes-20230315>`_.

.. autoclass:: FrameworkProfile
    :show-inheritance:

.. autoclass:: DetailedProfilingConfig
    :show-inheritance:

.. autoclass:: DataloaderProfilingConfig
    :show-inheritance:

.. autoclass:: PythonProfilingConfig
    :show-inheritance:

.. autoclass:: PythonProfiler
    :show-inheritance:

.. autoclass:: cProfileTimer
    :show-inheritance:

.. automodule:: sagemaker.debugger.metrics_config
    :members: StepRange, TimeRange
    :undoc-members:
2 changes: 2 additions & 0 deletions src/sagemaker/__init__.py
@@ -62,4 +62,6 @@
 from sagemaker.automl.automl import AutoML, AutoMLJob, AutoMLInput  # noqa: F401
 from sagemaker.automl.candidate_estimator import CandidateEstimator, CandidateStep  # noqa: F401

+from sagemaker.debugger import ProfilerConfig, Profiler  # noqa: F401
+
 __version__ = importlib_metadata.version("sagemaker")
3 changes: 2 additions & 1 deletion src/sagemaker/debugger/__init__.py
@@ -18,7 +18,7 @@
     DEBUGGER_FLAG,
     DebuggerHookConfig,
     framework_name,
-    get_default_profiler_rule,
+    get_default_profiler_processing_job,
     get_rule_container_image_uri,
     ProfilerRule,
     Rule,
@@ -27,6 +27,7 @@
     TensorBoardOutputConfig,
 )
 from sagemaker.debugger.framework_profile import FrameworkProfile  # noqa: F401
+from sagemaker.debugger.profiler import Profiler  # noqa: F401
 from sagemaker.debugger.metrics_config import (  # noqa: F401
     DataloaderProfilingConfig,
     DetailedProfilingConfig,
56 changes: 44 additions & 12 deletions src/sagemaker/debugger/debugger.py
@@ -20,8 +20,6 @@
 """
 from __future__ import absolute_import

-import time
-
 from abc import ABC

 from typing import Union, Optional, List, Dict
@@ -31,14 +29,31 @@
 import smdebug_rulesconfig as rule_configs

 from sagemaker import image_uris
-from sagemaker.utils import build_dict
+from sagemaker.utils import build_dict, name_from_base
 from sagemaker.workflow.entities import PipelineVariable
+from sagemaker.debugger.profiler_constants import (
+    DETAIL_PROF_PROCESSING_DEFAULT_INSTANCE_TYPE,
+    DETAIL_PROF_PROCESSING_DEFAULT_VOLUME_SIZE,
+)

 framework_name = "debugger"
+detailed_framework_name = "detailed-profiler"
 DEBUGGER_FLAG = "USE_SMDEBUG"


-def get_rule_container_image_uri(region):
+class DetailedProfilerProcessingJobConfig:
+    """ProfilerRule like class.
+
+    Serves as a vehicle to pass info through to the processing instance.
+
+    """
+
+    def __init__(self):
+        self.rule_name = self.__class__.__name__
+        self.rule_parameters = {"rule_to_invoke": "DetailedProfilerProcessing"}
+
+
+def get_rule_container_image_uri(name, region):
     """Return the Debugger rule image URI for the given AWS Region.

     For a full list of rule image URIs,
@@ -52,19 +67,28 @@ def get_rule_container_image_uri(region):
         str: Formatted image URI for the given AWS Region and the rule container type.

     """
+    if name is not None and name.startswith("DetailedProfilerProcessingJobConfig"):
+        # should have the format like "123456789012.dkr.ecr.us-west-2.amazonaws.com/detailed-profiler-processing:latest"
+        return image_uris.retrieve(detailed_framework_name, region)
+
     return image_uris.retrieve(framework_name, region)


-def get_default_profiler_rule():
-    """Return the default built-in profiler rule with a unique name.
+def get_default_profiler_processing_job(instance_type=None, volume_size_in_gb=None):
+    """Return the default profiler processing job (a rule) with a unique name.

     Returns:
         sagemaker.debugger.ProfilerRule: The instance of the built-in ProfilerRule.

     """
-    default_rule = rule_configs.ProfilerReport()
-    custom_name = f"{default_rule.rule_name}-{int(time.time())}"
-    return ProfilerRule.sagemaker(default_rule, name=custom_name)
+    default_rule = DetailedProfilerProcessingJobConfig()
+    custom_name = name_from_base(default_rule.rule_name)
+    return ProfilerRule.sagemaker(
+        default_rule,
+        name=custom_name,
+        instance_type=instance_type,
+        volume_size_in_gb=volume_size_in_gb,
+    )


 @attr.s
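The name check added to `get_rule_container_image_uri` can be illustrated without the SDK. The following is a minimal sketch, not the library's implementation: `select_image_framework` and the sample rule names are hypothetical, and the function returns only the framework key that the diff passes to `image_uris.retrieve`, not a real image URI.

```python
from typing import Optional

# Framework keys as they appear in the diff above.
FRAMEWORK_NAME = "debugger"
DETAILED_FRAMEWORK_NAME = "detailed-profiler"

# Rule-name prefix that marks a detailed profiler processing job.
DETAILED_PREFIX = "DetailedProfilerProcessingJobConfig"


def select_image_framework(name: Optional[str]) -> str:
    """Hypothetical helper: choose the framework key for the image lookup."""
    if name is not None and name.startswith(DETAILED_PREFIX):
        return DETAILED_FRAMEWORK_NAME
    return FRAMEWORK_NAME


# Sample names are made up; only the prefix matters for the check.
detailed = select_image_framework("DetailedProfilerProcessingJobConfig-example-suffix")
builtin = select_image_framework("ProfilerReport-example-suffix")
unnamed = select_image_framework(None)
print(detailed, builtin, unnamed)
```

Matching on a prefix rather than the exact class name appears intended to keep the check working after a unique suffix is appended to the rule name.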
@@ -482,6 +506,8 @@ def sagemaker(
         name=None,
         container_local_output_path=None,
         s3_output_path=None,
+        instance_type=None,
+        volume_size_in_gb=None,
     ):
         """Initialize a ``ProfilerRule`` object for a *built-in* profiling rule.

@@ -510,13 +536,19 @@ def sagemaker(
             The instance of the built-in ProfilerRule.

         """
+        used_name = name or base_config.rule_name
+        if used_name.startswith("DetailedProfilerProcessingJobConfig"):
+            if volume_size_in_gb is None:
+                volume_size_in_gb = DETAIL_PROF_PROCESSING_DEFAULT_VOLUME_SIZE
+            if instance_type is None:
+                instance_type = DETAIL_PROF_PROCESSING_DEFAULT_INSTANCE_TYPE
         return cls(
-            name=name or base_config.rule_name,
+            name=used_name,
             image_uri="DEFAULT_RULE_EVALUATOR_IMAGE",
-            instance_type=None,
+            instance_type=instance_type,
             container_local_output_path=container_local_output_path,
             s3_output_path=s3_output_path,
-            volume_size_in_gb=None,
+            volume_size_in_gb=volume_size_in_gb,
             rule_parameters=base_config.rule_parameters,
         )

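The fallback logic added to `ProfilerRule.sagemaker` applies defaults only to detailed profiler processing jobs and leaves other rules unchanged. The sketch below isolates that behavior under stated assumptions: `resolve_resources` is a hypothetical helper, and the two default values are placeholders, because the real constants in `sagemaker.debugger.profiler_constants` are not shown in this diff.

```python
from typing import Optional, Tuple

# Placeholder values (assumptions); the real defaults are defined in
# sagemaker.debugger.profiler_constants and do not appear in this diff.
DEFAULT_INSTANCE_TYPE = "ml.example.xlarge"
DEFAULT_VOLUME_SIZE_GB = 100


def resolve_resources(
    used_name: str,
    instance_type: Optional[str] = None,
    volume_size_in_gb: Optional[int] = None,
) -> Tuple[Optional[str], Optional[int]]:
    """Hypothetical helper mirroring the fallback added in the diff."""
    if used_name.startswith("DetailedProfilerProcessingJobConfig"):
        # Fill in defaults only for values the caller left unset.
        if volume_size_in_gb is None:
            volume_size_in_gb = DEFAULT_VOLUME_SIZE_GB
        if instance_type is None:
            instance_type = DEFAULT_INSTANCE_TYPE
    return instance_type, volume_size_in_gb


detailed_defaults = resolve_resources("DetailedProfilerProcessingJobConfig-example")
detailed_custom = resolve_resources(
    "DetailedProfilerProcessingJobConfig-example", volume_size_in_gb=500
)
other_rule = resolve_resources("ProfilerReport-example")
print(detailed_defaults, detailed_custom, other_rule)
```

For any other rule name, unset values stay `None`, which matches the pre-change behavior of always passing `None` for both arguments.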
42 changes: 42 additions & 0 deletions src/sagemaker/debugger/profiler.py
@@ -0,0 +1,42 @@
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You
# may not use this file except in compliance with the License. A copy of
# the License is located at
#
# http://aws.amazon.com/apache2.0/
#
# or in the "license" file accompanying this file. This file is
# distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
# ANY KIND, either express or implied. See the License for the specific
# language governing permissions and limitations under the License.

"""Configuration for collecting profiler v2 metrics in SageMaker training jobs."""
from __future__ import absolute_import

from sagemaker.debugger.profiler_constants import (
    FILE_ROTATION_INTERVAL_DEFAULT,
    CPU_PROFILING_DURATION,
    DETAIL_PROF_PROCESSING_DEFAULT_INSTANCE_TYPE,
    DETAIL_PROF_PROCESSING_DEFAULT_VOLUME_SIZE,
)


class Profiler:
    """A configuration class to activate SageMaker Profiler."""

    def __init__(
        self,
        cpu_profiling_duration: str = str(CPU_PROFILING_DURATION),
        file_rotation_interval: str = str(FILE_ROTATION_INTERVAL_DEFAULT),
    ):
        """To specify values to adjust the Profiler configuration, use the following parameters.

        :param cpu_profiling_duration: Specify the time duration in seconds for
            profiling CPU activities. The default value is 3600 seconds.
        """
        self.profiling_parameters = {}
        self.profiling_parameters["CPUProfilingDuration"] = str(cpu_profiling_duration)
        self.profiling_parameters["SMPFileRotationSecs"] = str(file_rotation_interval)
        self.instanceType = DETAIL_PROF_PROCESSING_DEFAULT_INSTANCE_TYPE
        self.volumeSizeInGB = DETAIL_PROF_PROCESSING_DEFAULT_VOLUME_SIZE
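To see what the constructor produces, here is a standalone sketch of the parameter dictionary it builds. `ProfilerSketch` is a hypothetical stand-in, not the SDK class. The CPU profiling default of 3600 follows the docstring above, while the file rotation default of 600 is an assumed placeholder, since `FILE_ROTATION_INTERVAL_DEFAULT` is not shown in this diff.

```python
# 3600 matches the documented default; 600 is an assumed placeholder value.
CPU_PROFILING_DURATION = 3600
FILE_ROTATION_INTERVAL_DEFAULT = 600


class ProfilerSketch:
    """Hypothetical stand-in that mirrors how Profiler stores its parameters."""

    def __init__(
        self,
        cpu_profiling_duration=str(CPU_PROFILING_DURATION),
        file_rotation_interval=str(FILE_ROTATION_INTERVAL_DEFAULT),
    ):
        # Both values are coerced to strings, so callers may pass int or str.
        self.profiling_parameters = {
            "CPUProfilingDuration": str(cpu_profiling_duration),
            "SMPFileRotationSecs": str(file_rotation_interval),
        }


default_params = ProfilerSketch().profiling_parameters
custom_params = ProfilerSketch(cpu_profiling_duration=1800).profiling_parameters
print(default_params)
print(custom_params)
```

Because of the `str()` coercion, passing the integer `3600` as in the documentation example and passing the string `"3600"` yield the same stored value.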