Inconsistent behavior of argument model_dir between single instance vs. distributed training using TensorFlow Estimator #1355

shashankprasanna · 2020-03-13T01:05:07Z

Describe the bug
The SageMaker Python SDK uses the argument model_dir inconsistently when executing single instance training vs. distributed training.

When running single instance training, SageMaker invokes the specified Python script with the argument:
python my_tf_script.py --model_dir s3://my_default_bucket/job/...

When running a distributed training job using Horovod/MPI (by passing distributions to the TF estimator), SageMaker invokes mpirun with:
mpirun --host algo-1,algo-2 -np 2 /usr/local/bin/python3.6 -m mpi4py my_hvd_script.py --model_dir /opt/ml/model

This is very confusing and inconsistent behavior that's very hard to debug.

The SDK docs don't explain these differences:
https://sagemaker.readthedocs.io/en/stable/sagemaker.tensorflow.html
"model_dir (str) – S3 location where the checkpoint data and models can be exported to during training (default: None). If not specified a default S3 URI will be generated. It will be passed in the training script as one of the command line arguments."

The above definition is only true for single instance training. For distributed training with Horovod/MPI, model_dir is set to /opt/ml/model.
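Until the behavior or docs change, a training script can guard against both cases. The following is a minimal sketch, not the SDK's own logic: the helper names `is_s3_uri`, `parse_args`, and `resolve_local_export_dir` are hypothetical, and it assumes the container sets the `SM_MODEL_DIR` environment variable to the local model directory.

```python
import argparse
import os


def is_s3_uri(path):
    """Return True when the value looks like an S3 URI rather than a local path."""
    return path.startswith("s3://")


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # In the runs reported in this issue, --model_dir was an S3 URI for single
    # instance training and /opt/ml/model for Horovod/MPI training.
    parser.add_argument("--model_dir", type=str, default="/opt/ml/model")
    # parse_known_args tolerates the other hyperparameters passed to the script.
    args, _unknown = parser.parse_known_args(argv)
    return args


def resolve_local_export_dir(model_dir, fallback=None):
    """Pick a local directory for the final model export (hypothetical helper).

    If model_dir is an S3 URI, fall back to a local directory; otherwise use
    model_dir as given.
    """
    if fallback is None:
        # Assumption: SM_MODEL_DIR is set inside the training container.
        fallback = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    return fallback if is_s3_uri(model_dir) else model_dir


if __name__ == "__main__":
    args = parse_args()
    print("model_dir:", args.model_dir)
    print("local export dir:", resolve_local_export_dir(args.model_dir))
```

With this, the script can still hand the S3 URI to TensorFlow for checkpoints when one is provided, while always writing the final export to a local path.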

To reproduce
To repro:
Run a single instance TensorFlow script mode training job and observe the logs. Here is an example where model_dir is set to an S3 location by the SDK.
https://github.com/shashankprasanna/reinvent-aim338/blob/master/2-dl-containers/cifar10_sagemaker_demo-with-results.ipynb
Output:
Invoking script with the following command:
/usr/local/bin/python3.6 cifar10-training-script-sagemaker.py --batch-size 256 --epochs 30 --learning-rate 0.01 --model_dir s3://reinvent-aim338/tensorflow-training-2019-12-05-19-13-24-229/model --momentum 0.9 --optimizer sgd --weight-decay 0.0002

Run a distributed TensorFlow Horovod/MPI script mode training job and observe the logs.
https://github.com/shashankprasanna/reinvent-aim338/blob/master/3-distributed-training/cifar10-sagemaker-distributed.ipynb
Output:
Command "mpirun --host algo-1,algo-2 -np 2 --allow-run-as-root <many_other_arguments> /usr/local/bin/python3.6 -m mpi4py cifar10-tf-horovod-sagemaker.py --batch-size 256 --epochs 100 --learning-rate 0.001 --model_dir /opt/ml/model --momentum 0.9 --optimizer adam --tensorboard_logs s3://sagemaker-jobs/jobstensorboard_logs/tf-horovod-2x1-workers-2020-03-13-00-25-10-073 --weight-decay 0.0002"
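To compare the two runs without reading the full command lines, the model_dir value can be pulled out of each logged command. This is only a rough sketch using the standard library; `extract_model_dir` is a hypothetical helper, and the sample strings are shortened versions of the log lines above.

```python
import shlex


def extract_model_dir(command):
    """Return the value passed to --model_dir in a logged command, or None."""
    tokens = shlex.split(command)
    for i, token in enumerate(tokens):
        # Handle the "--model_dir VALUE" form seen in the logs above.
        if token == "--model_dir" and i + 1 < len(tokens):
            return tokens[i + 1]
        # Also accept the "--model_dir=VALUE" form, in case it appears.
        if token.startswith("--model_dir="):
            return token.split("=", 1)[1]
    return None


if __name__ == "__main__":
    single = (
        "/usr/local/bin/python3.6 cifar10-training-script-sagemaker.py "
        "--epochs 30 --model_dir s3://reinvent-aim338/"
        "tensorflow-training-2019-12-05-19-13-24-229/model --momentum 0.9"
    )
    distributed = (
        "mpirun --host algo-1,algo-2 -np 2 /usr/local/bin/python3.6 -m mpi4py "
        "cifar10-tf-horovod-sagemaker.py --epochs 100 --model_dir /opt/ml/model"
    )
    for name, cmd in [("single", single), ("distributed", distributed)]:
        print(name, "->", extract_model_dir(cmd))
```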

Expected behavior
When the user doesn't override model_dir, the SageMaker SDK should pass the same type of location (S3 or local) for both single instance and distributed training. Today one is an S3 URI and the other is the local path /opt/ml/model.

Screenshots or logs
I've included logs/output in the repro section of this issue

System information
A description of your system. Please provide:

SageMaker Python SDK version: '1.50.16'
Framework name (eg. PyTorch) or algorithm (eg. KMeans): TensorFlow
Framework version: 1.14 (issue shows up on other versions 1.15, 1.13)
Python version: python 3
CPU or GPU: both
Custom Docker image (Y/N): no, DLC images on ECR

Additional context
Add any other context about the problem here.

laurenyu · 2020-03-17T22:07:50Z

@shashankprasanna sorry for the delay in response, and sorry for the confusion

this is the expected behavior when using MPI for distributed training, so we'll need to update the documentation. however, if you use parameter servers, it should use an S3 location for model_dir. can you share your code for starting the training jobs?

shashankprasanna · 2020-03-17T23:07:08Z

Hi @laurenyu the training scripts are in the same repo as the Jupyter notebook code in the issue.

(PS: I deleted my earlier comment, since I first misread your comment as this is expected behavior and no changes were required.)

laurenyu · 2020-03-20T00:38:45Z

I've opened a PR to fix the documentation: #1368

laurenyu · 2020-03-23T17:50:53Z

released in v1.51.4


laurenyu added type: documentation type: question labels Mar 17, 2020

laurenyu mentioned this issue Mar 20, 2020

doc: fix description of default model_dir for TF #1368

Merged

7 tasks

laurenyu added the status: pending release label (the fix has been merged but not yet released to PyPI) and removed the type: question label Mar 20, 2020

laurenyu closed this as completed Mar 23, 2020

athewsey mentioned this issue Apr 7, 2020

--model_dir is inconsistent, confusing, and unnecessary aws/sagemaker-tensorflow-training-toolkit#340

Open
