You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Describe the bug
Framework Estimators such as PyTorch take the distribution parameter, where instance_groups is expected as a key-value map that tells the Estimator which instance-groups are available for it to use from the overall pool of instance-groups made available in the instance_group EstimatorBase parameter.
The Framework Estimator validates the distribution parameter as a part of the Estimator initialization. At the end of the validation function, the distribution["instance_groups"] value gets reassigned, from a list of InstanceGroup objects to a list of instance-group names.
Python passes non-primitive data structures such as dict and list by reference, which means that this in-place modification of the contents of the distribution["instance_groups"] dictionary value implicitly changes the contents of the input parameter without telling the user.
When I reuse the same distribution dictionary across multiple sequential SM Training Job attempts (such as this code where we retry the same job in a different region if we encounter CapacityErrors), it causes a ValueError in the validation step in the next attempt.
To reproduce
This is a simplified version of test_hc_smdataparallel_mnist. Run the following with a folder named mnist, an empty python script mnist/smdataparallel_mnist.py, and the role value changed to any SageMaker Execution IAM Role name:
Expected behavior
The creation of Estimator objects should not modify input parameters, such as the distribution dict, in-place. In this specific scenario, the distribution["instance_groups"] list could continue to be a list of InstanceGroup objects, and any further usage of that dictionary key-value pair in the rest of the SDK code should just use <InstanceGroupObject>.instance_group_name.
Screenshots or logs
if train_instance_group not in instance_groups:
# check if train instance groups belongs to what user defined in estimator set up
raise ValueError(
> f"Invalid training instance group {train_instance_group.instance_group_name} !"
)
E AttributeError: 'str' object has no attribute 'instance_group_name'
test_venv/lib/python3.8/site-packages/sagemaker/fw_utils.py:898: AttributeError
System information
A description of your system. Please provide:
SageMaker Python SDK version: v2.155.0, but code hasn't changed in latest version v2.181.0
Framework name (eg. PyTorch) or algorithm (eg. KMeans): all Frameworks, but this issue was reproduced on PyTorch
Framework version: N/A
Python version: N/A
CPU or GPU: N/A
Custom Docker image (Y/N): N
Additional context
Add any other context about the problem here.
The text was updated successfully, but these errors were encountered:
Describe the bug
Framework Estimators such as PyTorch take the
distribution
parameter, whereinstance_groups
is expected as a key-value map that tells the Estimator which instance-groups are available for it to use from the overall pool of instance-groups made available in the instance_group EstimatorBase parameter.The Framework Estimator validates the
distribution
parameter as a part of the Estimator initialization. At the end of the validation function, thedistribution["instance_groups"]
value gets reassigned, from a list ofInstanceGroup
objects to a list of instance-group names.Python passes non-primitive data structures such as
dict
andlist
by reference, which means that this in-place modification of the contents of thedistribution["instance_groups"]
dictionary value implicitly changes the contents of the input parameter without telling the user.When I reuse the same
distribution
dictionary across multiple sequential SM Training Job attempts (such as this code where we retry the same job in a different region if we encounter CapacityErrors), it causes a ValueError in the validation step in the next attempt.To reproduce
This is a simplified version of test_hc_smdataparallel_mnist. Run the following with a folder named
mnist
, an empty python scriptmnist/smdataparallel_mnist.py
, and therole
value changed to any SageMaker Execution IAM Role name:Expected behavior
The creation of
Estimator
objects should not modify input parameters, such as thedistribution
dict, in-place. In this specific scenario, thedistribution["instance_groups"]
list could continue to be a list ofInstanceGroup
objects, and any further usage of that dictionary key-value pair in the rest of the SDK code should just use <InstanceGroupObject>.instance_group_name.Screenshots or logs
System information
A description of your system. Please provide:
Additional context
Add any other context about the problem here.
The text was updated successfully, but these errors were encountered: