Pytorch deployment failing with unexpected errors #752
Comments
Got to the bottom of it, and fixed it with quite an unsavory hack:
Apparently the [...] I'll close this for now, but I'd really like to know if there is a more efficient way of testing deployments locally. Granted, Docker containers are made available for PyTorch and TensorFlow, but there is no clear documentation on the steps required to build the images locally with the required entry_point script, model_data, etc., and to run the inference container as it would run in the SageMaker environment.
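One low-tech way to shorten the feedback loop, short of reproducing the whole container, is to import the entry-point script on your own machine and call its model_fn on the extracted model directory, since the SageMaker PyTorch serving container loads the model by calling model_fn(model_dir). The following is only a sketch under that assumption, not tooling from the SDK: load_entry_point and smoke_test_model_fn are hypothetical helper names, and the check catches import and model-loading errors only, not container-specific ones.

```python
import importlib.util
import os


def load_entry_point(script_path):
    """Import an entry-point script from a file path as a module."""
    name = os.path.splitext(os.path.basename(script_path))[0]
    spec = importlib.util.spec_from_file_location(name, script_path)
    module = importlib.util.module_from_spec(spec)
    # Missing dependencies (for example "No module named ...") surface here.
    spec.loader.exec_module(module)
    return module


def smoke_test_model_fn(script_path, model_dir):
    """Call the script's model_fn(model_dir) as the serving container would."""
    module = load_entry_point(script_path)
    if not hasattr(module, "model_fn"):
        raise AttributeError("entry point defines no model_fn(model_dir)")
    return module.model_fn(model_dir)
```

Running this inside a virtual environment that matches the container's Python version and installed packages makes the check more representative.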
Hi Carlo, I found your post: #752. My issue is that I keep getting "No module named 'flair'". Your model.tar.gz file is no longer available. Did it have anything inside it other than the .pt or .pth file? Thanks in advance! I am completely stumped. Here is the output from my CloudWatch logs:
Hey @ShaneRyan1977, AFAIK there are a couple of ways, one a bit hackier than the other. The first is to create a custom Docker image:
You can then push that image to a private ECR repo with something like this:
Then, for training, you can specify your image as the base image to use: https://sagemaker.readthedocs.io/en/stable/sagemaker.pytorch.html#pytorch-estimator (see the [...]). Alternatively, you can add a couple of lines at the very top of your training code:
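The snippet itself is missing from this copy of the thread. A minimal sketch of what such lines might look like, assuming the goal is to install the missing package with pip before it is imported; pip_command and install are hypothetical helper names, and the package name in the trailing comment is only an example:

```python
import subprocess
import sys


def pip_command(package):
    """Build a pip invocation that targets the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", package]


def install(package):
    """Install a package at runtime; raises CalledProcessError if pip fails."""
    subprocess.check_call(pip_command(package))


# At the very top of the entry-point script, before importing the package:
# install("flair")  # example only; pin the version you trained with
# import flair
```

Using sys.executable rather than a bare pip command keeps the install tied to the same interpreter that later performs the import. The container needs outbound network access for this to work, and the install adds to startup time.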
Lastly, you can also use a bash file as your entry point. That bash script can run pip commands and, at the end, call [...]
Thanks for the incredibly quick reply. But my problem is with deploying a model to an endpoint rather than with training. I actually trained the model using flair in JupyterLab, then uploaded the model.tar.gz to S3. I am trying to do something almost identical to your original code, but deploying a trained model to an endpoint, and I keep getting the errors I pasted. I am completely stumped. I am also amazed that you didn't have to build a custom image for deploying your endpoint!
The reason Python 3.5 is an issue is that flair 0.4.2 doesn't support it (it needs Python 3.6), and 0.4.2 is the version I trained my model with.
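A version mismatch like this can be made to fail early and readably by checking the interpreter at the top of the inference script. The following is a sketch only: the 3.6 minimum comes from the comment above, python_is_supported and require_python are hypothetical names, and the code deliberately avoids f-strings so that it still parses on Python 3.5.

```python
import sys

MIN_PYTHON = (3, 6)  # minimum stated in the comment above for flair 0.4.2


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


def require_python(version_info=None, minimum=MIN_PYTHON):
    """Raise a readable error instead of a late, opaque import failure."""
    if version_info is None:
        version_info = sys.version_info
    if not python_is_supported(version_info, minimum):
        raise RuntimeError(
            "Python %d.%d or newer is required, found %d.%d"
            % (minimum[0], minimum[1], version_info[0], version_info[1])
        )
```

Calling require_python() before any other import puts the mismatch on the first line of the CloudWatch log rather than deep inside a traceback.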
The same applies: you can either define your own Docker image for deployment or add the pip install step to your main inference script. You can create a model instance from a pre-trained model using just this class: https://sagemaker.readthedocs.io/en/stable/sagemaker.pytorch.html#pytorch-model Namely:
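The original snippet is not preserved here. As a hedged sketch of how the PyTorchModel class is typically used: the helper names model_kwargs and deploy are hypothetical, every value in the usage comment is a placeholder, and the accepted framework_version and py_version values depend on the SDK release you have installed.

```python
def model_kwargs(model_data, role, entry_point, framework_version, py_version="py3"):
    """Collect constructor arguments for sagemaker.pytorch.PyTorchModel."""
    return {
        "model_data": model_data,          # location of model.tar.gz
        "role": role,                      # IAM role the endpoint runs under
        "entry_point": entry_point,        # script that defines model_fn
        "framework_version": framework_version,
        "py_version": py_version,
    }


def deploy(kwargs, instance_type, initial_instance_count=1):
    """Create the model and deploy it to an endpoint (needs AWS credentials)."""
    # Imported lazily so the helper above works without the SDK installed.
    from sagemaker.pytorch import PyTorchModel

    model = PyTorchModel(**kwargs)
    return model.deploy(
        initial_instance_count=initial_instance_count,
        instance_type=instance_type,
    )


# Placeholder values, not taken from the thread:
# kwargs = model_kwargs("s3://my-bucket/model.tar.gz", "my-sagemaker-role",
#                       "inference.py", "1.1.0")
# predictor = deploy(kwargs, "ml.m5.large")
```

Keeping the argument dictionary separate from the deploy call makes it possible to inspect or log the configuration without touching AWS.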
Then you can [...] I haven't tested this in some time; I actually solved the issue by deploying models with Terraform (which I'd advise you to look into, as it solves many problems as far as repeatable deployments go). See here: https://www.terraform.io/docs/providers/aws/r/sagemaker_model.html
System Information
Describe the problem
Model deployment fails with cryptic errors. See the logs below. The command issued to deploy the model is the following:
The model is publicly available here:
https://s3.us-east-2.amazonaws.com/sagemaker-us-east-2-971148336196/improved-ner-training-0-25-0/output/model.tar.gz
It contains a directory called flair, which contains final_model.pt.
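Because the archive is extracted into the model directory that model_fn receives, the path of the weights file inside the archive determines where the loading code has to look. A small sketch for checking the layout before deploying; list_archive and find_weights are hypothetical helper names, and the .pt/.pth suffixes are an assumption about how the weights are named:

```python
import tarfile


def list_archive(path):
    """Return the member names of a model.tar.gz without extracting it."""
    with tarfile.open(path, "r:gz") as tar:
        return sorted(tar.getnames())


def find_weights(path, suffixes=(".pt", ".pth")):
    """Return archive members that look like PyTorch weight files."""
    return [name for name in list_archive(path) if name.endswith(suffixes)]
```

For an archive laid out as described above, find_weights would report the weights under the flair/ directory, so the loading code would need to join model_dir with that relative path.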
The (relevant) part of the train_model.py script is the following:
Minimal repro / logs
The CloudWatch logs are very opaque. One of the errors is the following:
Then, much later, these errors pop up:
Any ideas of what is actually causing the error, or some other steps to take to make it easier to debug?