Skip to content

Estimator modifies input InstanceGroup configurations in-place, preventing them from being reused #4097

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
saimidu opened this issue Aug 28, 2023 · 0 comments · Fixed by #4205
Assignees

Comments

@saimidu
Copy link
Contributor

saimidu commented Aug 28, 2023

Describe the bug
Framework Estimators such as PyTorch take the distribution parameter, where instance_groups is expected as a key-value map that tells the Estimator which instance-groups are available for it to use from the overall pool of instance-groups made available in the instance_group EstimatorBase parameter.

The Framework Estimator validates the distribution parameter as a part of the Estimator initialization. At the end of the validation function, the distribution["instance_groups"] value gets reassigned, from a list of InstanceGroup objects to a list of instance-group names.

Python passes non-primitive data structures such as dict and list by reference, which means that this in-place modification of the contents of the distribution["instance_groups"] dictionary value implicitly changes the contents of the input parameter without telling the user.

When I reuse the same distribution dictionary across multiple sequential SM Training Job attempts (such as this code where we retry the same job in a different region if we encounter CapacityErrors), it causes a ValueError in the validation step in the next attempt.

To reproduce
This is a simplified version of test_hc_smdataparallel_mnist. Run the following with a folder named mnist, an empty python script mnist/smdataparallel_mnist.py, and the role value changed to any SageMaker Execution IAM Role name:

mnist_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "mnist")
region = "us-west-2"
region_2 = "us-east-1"
instance_count = 2
training_group = InstanceGroup("train_group", "ml.p4d.24xlarge", instance_count)
distribution = {
    "smdistributed": {"dataparallel": {"enabled": True}},
    "instance_groups": [training_group],
}
estimator_parameter = {
    "entry_point": "smdataparallel_mnist.py",
    "role": "SageMakerRole",
    "source_dir": mnist_path,
    "instance_groups": [training_group],
    "distribution": distribution,
}

sagemaker_session = sagemaker.Session(boto_session=boto3.Session(region_name=region))
pytorch = PyTorch(
    sagemaker_session=sagemaker_session,
    **estimator_parameter,
)

sagemaker_session_2 = sagemaker.Session(boto_session=boto3.Session(region_name=region_2))
pytorch = PyTorch(
    sagemaker_session=sagemaker_session_2,
    **estimator_parameter,
)

Expected behavior
The creation of Estimator objects should not modify input parameters, such as the distribution dict, in-place. In this specific scenario, the distribution["instance_groups"] list could continue to be a list of InstanceGroup objects, and any further usage of that dictionary key-value pair in the rest of the SDK code should just use <InstanceGroupObject>.instance_group_name.

Screenshots or logs

                if train_instance_group not in instance_groups:
                    # check if train instance groups belongs to what user defined in estimator set up
                    raise ValueError(
>                       f"Invalid training instance group {train_instance_group.instance_group_name} !"
                    )
E                   AttributeError: 'str' object has no attribute 'instance_group_name'

test_venv/lib/python3.8/site-packages/sagemaker/fw_utils.py:898: AttributeError

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: v2.155.0, but code hasn't changed in latest version v2.181.0
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): all Frameworks, but this issue was reproduced on PyTorch
  • Framework version: N/A
  • Python version: N/A
  • CPU or GPU: N/A
  • Custom Docker image (Y/N): N

Additional context
Add any other context about the problem here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants