Local deploy failure for MXNet estimator due to non-uniform output uri #1354

ehsanmok · 2020-03-12T23:40:08Z

Describe the bug

This is related to the already-resolved issue #1349. After training in local mode via LocalSession(), deploying the estimator locally repeatedly fails with this error:

algo-1-zu1b2_1  | 2020-03-13 17:36:32,149 [WARN ] W-9001-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle -     in DEFAULT_MODEL_FILENAMES.items()]))
algo-1-zu1b2_1  | 2020-03-13 17:36:32,149 [WARN ] W-9001-model-stderr com.amazonaws.ml.mms.wlm.WorkerLifeCycle - ValueError: Failed to load model with default model_fn: missing file model-symbol.json.Expected files: ['model-symbol.json', 'model-0000.params', 'model-shapes.json']
algo-1-zu1b2_1  | 2020-03-13 17:36:32,149 [WARN ] W-9001-model com.amazonaws.ml.mms.wlm.BatchAggregator - Load model failed: model, error: Worker died.
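The log names the three files the default model_fn looks for. One quick way to see which of them a given model.tar.gz lacks is to list the archive locally. This is a minimal sketch, not SDK code: the helper name `missing_default_files` is hypothetical, and it assumes the expected files sit at the root of the archive.

```python
import tarfile

# File names copied from the error message in the log above.
EXPECTED_FILES = ["model-symbol.json", "model-0000.params", "model-shapes.json"]


def _strip_dot_slash(name):
    """Turn './name' into 'name' so both spellings compare equal."""
    return name[2:] if name.startswith("./") else name


def missing_default_files(archive_path):
    """Return the expected files absent from the archive root (hypothetical helper)."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = {_strip_dot_slash(member.name) for member in tar.getmembers()}
    return [name for name in EXPECTED_FILES if name not in names]
```

An empty list means all three files are present at the archive root.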

I investigated where the issue might be and manually created an MXNetModel. That works in non-local mode but fails in local mode with:

---------------------------------------------------------------------------
ClientError                               Traceback (most recent call last)
<timed exec> in <module>()

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/model.py in deploy(self, initial_instance_count, instance_type, accelerator_type, endpoint_name, update_endpoint, tags, kms_key, wait, data_capture_config)
    440             self.name = "{}{}".format(name_prefix, compiled_model_suffix)
    441 
--> 442         self._create_sagemaker_model(instance_type, accelerator_type, tags)
    443         production_variant = sagemaker.production_variant(
    444             self.name, instance_type, initial_instance_count, accelerator_type=accelerator_type

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/model.py in _create_sagemaker_model(self, instance_type, accelerator_type, tags)
    175                 /api/latest/reference/services/sagemaker.html#SageMaker.Client.add_tags
    176         """
--> 177         container_def = self.prepare_container_def(instance_type, accelerator_type=accelerator_type)
    178         self.name = self.name or utils.name_from_image(container_def["Image"])
    179         enable_network_isolation = self.enable_network_isolation()

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/mxnet/model.py in prepare_container_def(self, instance_type, accelerator_type)
    150 
    151         deploy_key_prefix = model_code_key_prefix(self.key_prefix, self.name, deploy_image)
--> 152         self._upload_code(deploy_key_prefix, self._is_mms_version())
    153         deploy_env = dict(self.env)
    154         deploy_env.update(self._framework_env_vars())

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/model.py in _upload_code(self, key_prefix, repack)
    823                 repacked_model_uri=repacked_model_data,
    824                 sagemaker_session=self.sagemaker_session,
--> 825                 kms_key=self.model_kms_key,
    826             )
    827 

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/utils.py in repack_model(inference_script, source_directory, dependencies, model_uri, repacked_model_uri, sagemaker_session, kms_key)
    481 
    482     with _tmpdir() as tmp:
--> 483         model_dir = _extract_model(model_uri, sagemaker_session, tmp)
    484 
    485         _create_or_update_code_dir(

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/utils.py in _extract_model(model_uri, sagemaker_session, tmp)
    571     if model_uri.lower().startswith("s3://"):
    572         local_model_path = os.path.join(tmp, "tar_file")
--> 573         download_file_from_url(model_uri, local_model_path, sagemaker_session)
    574     else:
    575         local_model_path = model_uri.replace("file://", "")

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/utils.py in download_file_from_url(url, dst, sagemaker_session)
    589     bucket, key = url.netloc, url.path.lstrip("/")
    590 
--> 591     download_file(bucket, key, dst, sagemaker_session)
    592 
    593 

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/sagemaker/utils.py in download_file(bucket_name, path, target, sagemaker_session)
    607     s3 = boto_session.resource("s3")
    608     bucket = s3.Bucket(bucket_name)
--> 609     bucket.download_file(path, target)
    610 
    611 

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/boto3/s3/inject.py in bucket_download_file(self, Key, Filename, ExtraArgs, Callback, Config)
    244     return self.meta.client.download_file(
    245         Bucket=self.name, Key=Key, Filename=Filename,
--> 246         ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
    247 
    248 

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/boto3/s3/inject.py in download_file(self, Bucket, Key, Filename, ExtraArgs, Callback, Config)
    170         return transfer.download_file(
    171             bucket=Bucket, key=Key, filename=Filename,
--> 172             extra_args=ExtraArgs, callback=Callback)
    173 
    174 

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/boto3/s3/transfer.py in download_file(self, bucket, key, filename, extra_args, callback)
    305             bucket, key, filename, extra_args, subscribers)
    306         try:
--> 307             future.result()
    308         # This is for backwards compatibility where when retries are
    309         # exceeded we need to throw the same error from boto3 instead of

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/s3transfer/futures.py in result(self)
    104             # however if a KeyboardInterrupt is raised we want want to exit
    105             # out of this and propogate the exception.
--> 106             return self._coordinator.result()
    107         except KeyboardInterrupt as e:
    108             self.cancel()

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/s3transfer/futures.py in result(self)
    263         # final result.
    264         if self._exception:
--> 265             raise self._exception
    266         return self._result
    267 

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/s3transfer/tasks.py in _main(self, transfer_future, **kwargs)
    253             # Call the submit method to start submitting tasks to execute the
    254             # transfer.
--> 255             self._submit(transfer_future=transfer_future, **kwargs)
    256         except BaseException as e:
    257             # If there was an exception raised during the submission of task

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/s3transfer/download.py in _submit(self, client, config, osutil, request_executor, io_executor, transfer_future, bandwidth_limiter)
    341                 Bucket=transfer_future.meta.call_args.bucket,
    342                 Key=transfer_future.meta.call_args.key,
--> 343                 **transfer_future.meta.call_args.extra_args
    344             )
    345             transfer_future.meta.provide_transfer_size(

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
    314                     "%s() only accepts keyword arguments." % py_operation_name)
    315             # The "self" in this scope is referring to the BaseClient.
--> 316             return self._make_api_call(operation_name, kwargs)
    317 
    318         _api_call.__name__ = str(py_operation_name)

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
    624             error_code = parsed_response.get("Error", {}).get("Code")
    625             error_class = self.exceptions.from_code(error_code)
--> 626             raise error_class(parsed_response, operation_name)
    627         else:
    628             return parsed_response

ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

Looking into it further, I found that in local mode the model.tar.gz URI is built differently from non-local mode:

Local model uri: s3://{bucket}/{prefix}/outputmxnet-training-2020-03-12-23-19-23-971/model.tar.gz
Non-local uri: s3://{bucket}/{prefix}/output/mxnet-training-2020-03-12-23-19-23-971/output/model.tar.gz.
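The missing slash in the local URI is what plain string concatenation produces when output_path has no trailing slash. The sketch below only illustrates that symptom. It is not the SDK's code, and the function names, bucket and prefix are made up for the example.

```python
import posixpath


def concat_uri(output_path, job_name):
    """Plain concatenation: no separator is added if output_path lacks a trailing slash."""
    return output_path + job_name + "/model.tar.gz"


def joined_uri(output_path, job_name):
    """posixpath.join adds a '/' between segments only where one is missing."""
    return posixpath.join(output_path, job_name, "model.tar.gz")


output_path = "s3://my-bucket/my-prefix/output"  # placeholder bucket and prefix
job_name = "mxnet-training-2020-03-12-23-19-23-971"

print(concat_uri(output_path, job_name))  # "output" and the job name run together
print(joined_uri(output_path, job_name))  # "output/" followed by the job name
```

With joined_uri, the result is the same whether or not output_path ends in a slash.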

System information

SageMaker Python SDK version: 1.51.3
Framework name (eg. PyTorch) or algorithm (eg. KMeans): MXNet
Framework version: 1.6.0
Python version: py3.6
CPU or GPU: Both
Custom Docker image (Y/N): N

laurenyu · 2020-03-19T23:46:01Z

What is output_path set to in your estimator?

ehsanmok · 2020-03-23T18:34:02Z

@laurenyu it's the same output_path='s3://{}/{}/output'.format(bucket, prefix) for both modes.

nadiaya · 2020-04-03T19:41:41Z

    Local model uri: s3://{bucket}/{prefix}/outputmxnet-training-2020-03-12-23-19-23-971/model.tar.gz

    Non-local uri: s3://{bucket}/{prefix}/output/mxnet-training-2020-03-12-23-19-23-971/output/model.tar.gz.

is a known issue, but I don't think it would cause the error you are experiencing.

nadiaya · 2020-04-03T19:42:25Z

Could you share your code that creates the estimator and deploys the model?

ehsanmok · 2020-04-03T20:40:20Z

@nadiaya I'm using MXNetModel(model_data=os.path.join(output_path, 'model.tar.gz'), ...) for my custom inference. The issue was that deploy could not find the model at that path, so I had to fix it manually by using estimator._current_job_name to build the correct path.

Here is the not-so-special inference code.

nadiaya · 2020-04-06T19:01:25Z

Could you give us the code of how you create the training job and then deploy the model?

ehsanmok · 2020-04-06T19:06:38Z

Take a look at the notebook. Because of this issue, I haven't included local session mode in the notebook, so you'd need to add it yourself. That's the best I can offer.

Also, if this is a known issue, please link it here for the record.

nadiaya · 2020-04-06T19:31:07Z

An alternative solution would be to get the model location from the training job, which would work regardless of the mode:

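# Name of the job the estimator ran most recently.
training_job_name = estimator.latest_training_job.name
# The job description records where the model artifacts were written.
desc = sagemaker_session.sagemaker_client.describe_training_job(TrainingJobName=training_job_name)
trained_model_location = desc['ModelArtifacts']['S3ModelArtifacts']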
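The lookup above can be wrapped in a small function so the same code path serves both modes. A minimal sketch: the function name `model_location_for_job` is hypothetical, and `sagemaker_client` is assumed to be any object exposing describe_training_job with the response shape used above (for example sagemaker_session.sagemaker_client).

```python
def model_location_for_job(sagemaker_client, training_job_name):
    """Return the model artifact URI recorded for a training job (hypothetical helper)."""
    desc = sagemaker_client.describe_training_job(TrainingJobName=training_job_name)
    return desc["ModelArtifacts"]["S3ModelArtifacts"]
```

The returned URI can then be passed as model_data when constructing the MXNetModel, instead of building the path from output_path by hand.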

laurenyu · 2020-04-24T17:56:09Z

I've submitted a PR to address this: #1439

laurenyu · 2020-04-27T20:25:18Z

#1439 has been released as part of v1.56.1. Does it resolve your issue?

ehsanmok · 2020-04-29T00:25:07Z

Thanks @laurenyu! I just upgraded and tested it, and it partially solved the earlier issue. Here's what's happening now:

I'm using output_path=train_output='s3://{}/{}/output'.format(bucket, prefix) in my estimator.

local uri:

s3://{bucket}/{prefix}/output/mxnet-training-2020-04-28-23-01-06-447/model.tar.gz
s3://{bucket}/{prefix}/output/mxnet-training-2020-04-28-23-01-06-447/output.tar.gz

non-local uri:

s3://{bucket}/{prefix}/output/mxnet-training-2020-04-28-23-42-45-505/output/model.tar.gz
s3://{bucket}/{prefix}/output/mxnet-training-2020-04-28-23-42-45-505/output/output.tar.gz

Notice the additional /output segment in the non-local uri. The two layouts need some more work to be completely unified.

I also examined the contents of output.tar.gz in both the local and non-local cases and they do not match: local mode adds a redundant data directory.

local:

output/
  data/
    a.csv

non-local:

output/
   a.csv
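One way to make this comparison repeatable is to list the regular files in each output.tar.gz and diff the two sets. A minimal sketch using only the standard library. The function names are hypothetical, and it assumes both archives have already been downloaded to local paths.

```python
import tarfile


def member_files(archive_path):
    """Return the set of regular-file paths stored in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return {member.name for member in tar.getmembers() if member.isfile()}


def layout_diff(local_archive, remote_archive):
    """Return (only_in_local, only_in_remote) as sorted lists of member paths."""
    local_files = member_files(local_archive)
    remote_files = member_files(remote_archive)
    return sorted(local_files - remote_files), sorted(remote_files - local_files)
```

For two archives laid out as listed above, the first list would hold the path that goes through data/ and the second the path that does not. Two empty lists mean the layouts agree.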

ehsanmok changed the title ~~Local deploy failure for MXNet estimator~~ Local deploy failure for MXNet estimator due to non-uniform output uri Mar 13, 2020

laurenyu added the type: question label Mar 19, 2020

laurenyu added type: bug and removed type: question labels Apr 23, 2020

laurenyu mentioned this issue Apr 24, 2020

fix: allow output_path without trailing slash in Local Mode training jobs #1439

Merged

7 tasks

laurenyu added the In progress label Apr 24, 2020

laurenyu removed the In progress label Apr 29, 2020

maqboolkhan mentioned this issue Nov 16, 2021

Pytorch_estimator.model_data is incorrect and also unable to deploy endpoint #2762

Closed

martinRenou added the component: local mode label Sep 28, 2023

martinRenou assigned trungleduc Nov 17, 2023

trungleduc mentioned this issue Dec 4, 2023

change: update model path in local mode #4296

Merged

9 tasks

akrishna1995 closed this as completed in #4296 Dec 22, 2023
