Estimator modifies input InstanceGroup configurations in-place, preventing them from being reused #4097

saimidu · 2023-08-28T18:42:35Z

Describe the bug
Framework Estimators such as PyTorch take the distribution parameter, where instance_groups is expected as a key-value map that tells the Estimator which instance-groups are available for it to use from the overall pool of instance-groups made available in the instance_group EstimatorBase parameter.

The Framework Estimator validates the distribution parameter as a part of the Estimator initialization. At the end of the validation function, the distribution["instance_groups"] value gets reassigned, from a list of InstanceGroup objects to a list of instance-group names.

Python passes non-primitive data structures such as dict and list by reference, which means that this in-place modification of the contents of the distribution["instance_groups"] dictionary value implicitly changes the contents of the input parameter without telling the user.

When I reuse the same distribution dictionary across multiple sequential SM Training Job attempts (such as this code where we retry the same job in a different region if we encounter CapacityErrors), it causes a ValueError in the validation step in the next attempt.

To reproduce
This is a simplified version of test_hc_smdataparallel_mnist. Run the following with a folder named mnist, an empty python script mnist/smdataparallel_mnist.py, and the role value changed to any SageMaker Execution IAM Role name:

mnist_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "mnist")
region = "us-west-2"
region_2 = "us-east-1"
instance_count = 2
training_group = InstanceGroup("train_group", "ml.p4d.24xlarge", instance_count)
distribution = {
    "smdistributed": {"dataparallel": {"enabled": True}},
    "instance_groups": [training_group],
}
estimator_parameter = {
    "entry_point": "smdataparallel_mnist.py",
    "role": "SageMakerRole",
    "source_dir": mnist_path,
    "instance_groups": [training_group],
    "distribution": distribution,
}

sagemaker_session = sagemaker.Session(boto_session=boto3.Session(region_name=region))
pytorch = PyTorch(
    sagemaker_session=sagemaker_session,
    **estimator_parameter,
)

sagemaker_session_2 = sagemaker.Session(boto_session=boto3.Session(region_name=region_2))
pytorch = PyTorch(
    sagemaker_session=sagemaker_session_2,
    **estimator_parameter,
)

Expected behavior
The creation of Estimator objects should not modify input parameters, such as the distribution dict, in-place. In this specific scenario, the distribution["instance_groups"] list could continue to be a list of InstanceGroup objects, and any further usage of that dictionary key-value pair in the rest of the SDK code should just use <InstanceGroupObject>.instance_group_name.

Screenshots or logs

                if train_instance_group not in instance_groups:
                    # check if train instance groups belongs to what user defined in estimator set up
                    raise ValueError(
>                       f"Invalid training instance group {train_instance_group.instance_group_name} !"
                    )
E                   AttributeError: 'str' object has no attribute 'instance_group_name'

test_venv/lib/python3.8/site-packages/sagemaker/fw_utils.py:898: AttributeError

System information
A description of your system. Please provide:

SageMaker Python SDK version: v2.155.0, but code hasn't changed in latest version v2.181.0
Framework name (eg. PyTorch) or algorithm (eg. KMeans): all Frameworks, but this issue was reproduced on PyTorch
Framework version: N/A
Python version: N/A
CPU or GPU: N/A
Custom Docker image (Y/N): N

Additional context
Add any other context about the problem here.

The text was updated successfully, but these errors were encountered:

saimidu added the type: bug label Aug 28, 2023

trungleduc mentioned this issue Oct 17, 2023

fix: clone distribution in validate_distribution #4205

Merged

9 tasks

trungleduc self-assigned this Oct 17, 2023

knikure closed this as completed in #4205 Oct 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Estimator modifies input InstanceGroup configurations in-place, preventing them from being reused #4097

Estimator modifies input InstanceGroup configurations in-place, preventing them from being reused #4097

saimidu commented Aug 28, 2023 •

edited

Loading

Estimator modifies input InstanceGroup configurations in-place, preventing them from being reused #4097

Estimator modifies input InstanceGroup configurations in-place, preventing them from being reused #4097

Comments

saimidu commented Aug 28, 2023 • edited Loading

saimidu commented Aug 28, 2023 •

edited

Loading