Internal server error when using OpenNMT-tf #1057

Closed
anasir-ureed opened this issue Sep 22, 2019 · 19 comments
@anasir-ureed

anasir-ureed commented Sep 22, 2019

Reference: 0411850995

System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): TensorFlow , OpenNMT-tf
  • Framework Version: 1.14
  • Python Version: py3
  • CPU or GPU: GPU
  • Python SDK Version: latest
  • Are you using a custom image: script training

Describe the problem

I am using OpenNMT-tf for training. When training reaches the export step, the model is not exported properly and the job fails with an "Internal server error".

Minimal repro / logs

Unfortunately, nothing in the logs indicates what went wrong!
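
For readers hitting the same symptom, here is a minimal sketch of one way to scan the job's CloudWatch log events for error-looking lines. It assumes the job writes to the standard /aws/sagemaker/TrainingJobs log group and that boto3 credentials are configured; the helper names and the keyword list are hypothetical and not part of the original report.

```python
# Hypothetical keyword list; adjust it to whatever your training script prints.
ERROR_MARKERS = ("Traceback", "Error", "Exception")


def suspicious_lines(events, markers=ERROR_MARKERS):
    """Return the messages of log events that contain any of the markers."""
    return [
        event.get("message", "")
        for event in events
        if any(marker in event.get("message", "") for marker in markers)
    ]


def fetch_job_log_events(job_name, region, limit=1000):
    """Fetch up to `limit` log events whose stream name starts with the job name."""
    # Imported lazily so suspicious_lines() can be used without AWS access.
    import boto3

    logs = boto3.client("logs", region_name=region)
    response = logs.filter_log_events(
        logGroupName="/aws/sagemaker/TrainingJobs",
        logStreamNamePrefix=job_name,
        limit=limit,
    )
    return response["events"]
```

Calling suspicious_lines(fetch_job_log_events(job_name, region)) would list candidate lines. An empty result is consistent with the failure happening outside the training container, but it does not prove it.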

  • Exact command to reproduce:
estimator = TensorFlow(entry_point=local_training_script_path,
                       dependencies=['model.py'],
                       train_instance_type='ml.p3.2xlarge',
                       train_instance_count=1,
                       checkpoint_s3_uri=checkpoint_path,
                       output_path=model_artifacts_location,
                       code_location=custom_code_upload_location,
                       role=sagemaker.get_execution_role(),
                       framework_version='1.14',
                       py_version='py3',
                       script_mode=True,
                       train_use_spot_instances=train_use_spot_instances,
                       train_max_run=train_max_run,
                       train_max_wait=train_max_wait,
                       train_volume_size=75)
estimator.fit(training_inputs_location, job_name=job_name_sagemaker, wait=True)

The shell script contains

pip install OpenNMT-tf
onmt-main ...

Please advise,

@mvsusp
Contributor

mvsusp commented Sep 29, 2019

Hello @anasir-ureed

Please share with us a minimal code example, allowing us to reproduce the issue.

Thanks for using SageMaker

@anasir-ureed
Author

Hello @mvsusp

Thanks for your reply

This is the script I am using as the entry point, "train.sh":

#!/bin/bash
pip install OpenNMT-tf
onmt-main train_and_eval --model_type NMTMedium --config ${SM_CHANNEL_TRAINING}/config/exp1.yml --data_dir ${SM_CHANNEL_TRAINING} --run_dir /opt/ml/checkpoints/
cp -R /opt/ml/checkpoints/* ${SM_MODEL_DIR}/

In the config file,

model_dir: .

data:
  train_features_file: data/train.sp.en
  train_labels_file: data/train.sp.ar
  eval_features_file: data/val.sp.en
  eval_labels_file: data/val.sp.ar
  source_words_vocabulary: data/t.vocab
  target_words_vocabulary: data/t.vocab

params:
  optimizer: GradientDescentOptimizer
  learning_rate: 0.5
  param_init: 0.1
  clip_gradients: 5.0
  decay_type: exponential_decay
  decay_rate: 0.8
  decay_steps: 50000
  start_decay_steps: 50000
  min_learning_rate: 0.0001
  beam_width: 5
  maximum_iterations: 250
  replace_unknown_target: true

train:
  batch_size: 128
  bucket_width: 5
  save_checkpoints_steps: 300
  save_summary_steps: 50
  train_steps: 250000
  maximum_features_length: 300
  maximum_labels_length: 300
  keep_checkpoint_max: 10
  average_last_checkpoints: 8

eval:
  eval_delay: 7200
  batch_size: 8
  external_evaluators: BLEU
  exporters: last
  save_eval_predictions: true

infer:
  batch_size: 32
  n_best: 1

In the notebook please use

import sagemaker
from sagemaker.tensorflow import TensorFlow

train_use_spot_instances = False  # the issue happens on both spot and on-demand instances
train_max_run = 3600 * 24 * 5  # five days, in seconds
train_max_wait = 3600 * 24 * 5 if train_use_spot_instances else None

estimator = TensorFlow(entry_point="train.sh",
                       train_instance_type='ml.p3.2xlarge',
                       train_instance_count=1,
                       checkpoint_s3_uri=checkpoint_path,  # Any location you wish
                       role=sagemaker.get_execution_role(),
                       framework_version='1.14',
                       py_version='py3',
                       script_mode=True,
                       train_use_spot_instances=train_use_spot_instances,
                       train_max_run=train_max_run,
                       train_max_wait=train_max_wait,
                       train_volume_size=75)
# training_inputs_location is the S3 location with the directory structure described below
estimator.fit(training_inputs_location, wait=True)

In training_inputs_location, the directory structure is as follows:

config/exp1.yml
data/train.sp.en
data/train.sp.ar
data/val.sp.en
data/val.sp.ar
data/t.vocab
data/t.vocab
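
To rule out a missing input file when reproducing, a small pre-flight check along these lines could run before onmt-main. This is only a sketch: the file list mirrors the layout above, and the helper name is hypothetical. SM_CHANNEL_TRAINING is the environment variable already used in train.sh.

```python
import os

# Relative paths taken from the directory layout listed above.
EXPECTED_FILES = [
    "config/exp1.yml",
    "data/train.sp.en",
    "data/train.sp.ar",
    "data/val.sp.en",
    "data/val.sp.ar",
    "data/t.vocab",
]


def missing_inputs(channel_dir, expected=EXPECTED_FILES):
    """Return the expected relative paths that are not regular files under channel_dir."""
    return [
        rel for rel in expected
        if not os.path.isfile(os.path.join(channel_dir, rel))
    ]


if __name__ == "__main__":
    # Falls back to the current directory when run outside a training container.
    root = os.environ.get("SM_CHANNEL_TRAINING", ".")
    for rel in missing_inputs(root):
        print("missing input:", rel)
```

An empty result only confirms that the files are present. It says nothing about their contents or about the export step itself.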

Best wishes,

@mvsusp
Contributor

mvsusp commented Sep 30, 2019

Thanks @anasir-ureed. Where do I find the data files?

@anasir-ureed
Author

Hello @mvsusp ,

I think any data files can be used to reproduce the issue. It happened to me with several data sources.

I have also tested with the toy-ende data, which can be downloaded by following this link: http://opennmt.net/OpenNMT-tf/quickstart.html

You will need to change the config file to match the data file names.

  train_features_file: data/src-train.txt
  train_labels_file: data/tgt-train.txt
  eval_features_file: data/src-val.txt
  eval_labels_file: data/tgt-val.txt
  source_words_vocabulary: data/src-vocab.txt
  target_words_vocabulary: data/tgt-vocab.txt

Also, please change the pip install OpenNMT-tf line in the train.sh script to:

pip install OpenNMT-tf==1.25.1

A new major version was just released.

Thanks,

@AbdallahNasir

Hello @mvsusp ,

Any updates?

Best wishes,

@chuyang-deng
Contributor

Hi @AbdallahNasir, sorry for the delay. @mvsusp has left the team and we will continue looking into the issue.

@AbdallahNasir

Best wishes to him.

Thank you @ChuyangDeng, appreciate your efforts.

@ChoiByungWook
Contributor

ChoiByungWook commented Oct 10, 2019

Reference: 0411850995

@AbdallahNasir would it be possible for you to open an AWS support ticket?

We would like to get some information to identify your training job that failed, so we can get the stack trace that is hidden behind the "internal server error".

Feel free to reference this Github issue in your AWS support ticket.

In the meantime, can you please provide the name and region of the training job that failed?
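
For anyone who wants to see what the service itself reports for a failed job, a minimal sketch using boto3's describe_training_job is shown here. The helper names are hypothetical, the job name and region are whatever applies to your account, and the call requires credentials allowed to describe training jobs.

```python
def summarize_failure(description):
    """Pick the fields of a DescribeTrainingJob response that help triage a failure."""
    transitions = description.get("SecondaryStatusTransitions", [])
    last = transitions[-1] if transitions else {}
    return {
        "status": description.get("TrainingJobStatus"),
        "failure_reason": description.get("FailureReason"),
        "last_step": last.get("Status"),
        "last_message": last.get("StatusMessage"),
    }


def describe_failed_job(job_name, region):
    """Call the SageMaker API and summarize the response."""
    # Imported lazily so summarize_failure() can be used without AWS access.
    import boto3

    client = boto3.client("sagemaker", region_name=region)
    return summarize_failure(
        client.describe_training_job(TrainingJobName=job_name)
    )
```

The failure_reason field may contain nothing more specific than the generic internal error, in which case the underlying stack trace is only visible to the service team.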

Thanks!

@AbdallahNasir

Hello @ChoiByungWook ,

The training job name is error-exp---2019-01-10--08-54-45-225648 in us-east-1. Our account name alias is "ureed".

Thanks for your assistance.

@ChoiByungWook
Contributor

Hello @AbdallahNasir,

I have passed this information to the respective team.

Please make sure you file an AWS support ticket, to ensure your issue doesn't get lost.

Thanks!

@AbdallahNasir

Dear @ChoiByungWook ,

It looks like I cannot create support tickets, as we have not purchased support yet.

Any chance we can do this from here?

Best wishes,

@ChoiByungWook
Contributor

Hello @AbdallahNasir,

I've reached out to the corresponding team and it looks like we need your AWS account ID as well. However, I would not post that on GitHub.

Can you please message me through the AWS forums? https://forums.aws.amazon.com/forum.jspa?forumID=285

After logging in, please click "Private Messages" -> "Compose Message" -> and send to DanielCAtAWS.

Feel free to also create a forum post in the AWS forums referencing this GitHub issue.

Apologies for the back and forth.

@AbdallahNasir

Hello @ChoiByungWook ,

Thanks for your reply.

I have sent you the message on the forum.

Thanks for your support.

Best wishes,

@ChoiByungWook
Contributor

Hello @AbdallahNasir,

I have forwarded the ID to the corresponding team. Awaiting response from them.

Thank you for your patience.

@AbdallahNasir

Hi @ChoiByungWook,

Thank you so much dear :)

@ChoiByungWook
Contributor

Hello @AbdallahNasir,

The corresponding team has looked into the issue, and it seems to have been an issue on the platform side. A fix was recently deployed. Can you please retry your training job?

Thank you!

@AbdallahNasir

Hello @ChoiByungWook,

I have started another training job, called error-exp---2019-17-10--06-51-04-788983. It failed, producing the same error message.

Thank you,

@AbdallahNasir

Hello @ChoiByungWook,

I hope you are having a great day.

Any updates dear?

Best wishes,

@martinRenou
Collaborator

Closing for triage.

If you read this message and are also seeing this error, please open a new issue or file an AWS support ticket as suggested above.

Thank you.

knikure added a commit to knikure/sagemaker-python-sdk that referenced this issue Oct 26, 2023
…- Galactus (aws#1212)

Co-authored-by: Gary Wang <[email protected]>
Co-authored-by: Gary Wang <[email protected]>
Co-authored-by: Raymond Liu <[email protected]>
Co-authored-by: John Barboza <[email protected]>
Co-authored-by: Malav Shastri <[email protected]>
Co-authored-by: Mufaddal Rohawala <[email protected]>
Co-authored-by: Mike Schneider <[email protected]>
Co-authored-by: Bhupendra Singh <[email protected]>
Co-authored-by: ci <ci>
Co-authored-by: Malav Shastri <[email protected]>
Co-authored-by: Keshav Chandak <[email protected]>
Co-authored-by: Zuoyuan Huang <[email protected]>
Co-authored-by: evakravi <[email protected]>
Co-authored-by: Keshav Chandak <[email protected]>
Co-authored-by: Alexander Pivovarov <[email protected]>
Co-authored-by: SSRraymond <[email protected]>
Co-authored-by: Ruilian Gao <[email protected]>
Co-authored-by: Ao Guo <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>
Co-authored-by: mariumof <[email protected]>
Co-authored-by: matherit <[email protected]>
Co-authored-by: amzn-choeric <[email protected]>
Co-authored-by: Ao Guo <[email protected]>
Co-authored-by: Sally Seok <[email protected]>
Co-authored-by: Erick Benitez-Ramos <[email protected]>
Co-authored-by: Qingzi-Lan <[email protected]>
Co-authored-by: Sally Seok <[email protected]>
Co-authored-by: Manu Seth <[email protected]>
Co-authored-by: Miyoung <[email protected]>
Co-authored-by: Sarah Castillo <[email protected]>
Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: Xin Wang <[email protected]>
Co-authored-by: stacicho <[email protected]>
Co-authored-by: martinRenou <[email protected]>
Co-authored-by: jiapinw <[email protected]>
Co-authored-by: Akash Goel <[email protected]>
Co-authored-by: Joseph Zhang <[email protected]>
Co-authored-by: Harsha Reddy <[email protected]>
Co-authored-by: Haixin Wang <[email protected]>
Co-authored-by: Kalyani Nikure <[email protected]>
Co-authored-by: Xin Wang <[email protected]>
Co-authored-by: Gili Nachum <[email protected]>
Co-authored-by: Jose Pena <[email protected]>
Co-authored-by: cansun <[email protected]>
Co-authored-by: AWS-pratab <[email protected]>
Co-authored-by: shenlongtang <[email protected]>
Co-authored-by: Zach Kimberg <[email protected]>
Co-authored-by: chrivtho-github <[email protected]>
Co-authored-by: Justin <[email protected]>
Co-authored-by: Duc Trung Le <[email protected]>
Co-authored-by: HappyAmazonian <[email protected]>
Co-authored-by: cj-zhang <[email protected]>
Co-authored-by: Matthew <[email protected]>
Co-authored-by: Zach Kimberg <[email protected]>
Co-authored-by: Rohith Nadimpally <[email protected]>
Co-authored-by: rohithn1 <[email protected]>
Co-authored-by: Victor Zhu <[email protected]>
Co-authored-by: jbarz1 <[email protected]>
Co-authored-by: Mohan Gandhi <[email protected]>
Co-authored-by: Mohan Gandhi <[email protected]>
Co-authored-by: Barboza <[email protected]>
Co-authored-by: ruiliann666 <[email protected]>
fixes (aws#963)
fix: skip tensorflow local mode notebook test (aws#4060)
Fix TorchTensorSer/Deser (aws#969)
fix (aws#971)
fix local container mode (aws#972)
Fix auto detect (aws#979)
Fix routing fn (aws#981)
fix: tags for jumpstart model package models (aws#4061)
fix: pipeline variable kms key (aws#4065)
fix: jumpstart cache using sagemaker session s3 client (aws#4051)
fix: gated models unsupported region (aws#4069)
fix local container serialization (aws#989)
fix custom serialiazation with local container. Also remove a  lot of unused code (aws#994)
Fix custom serialization for local container mode (aws#1000)
fix pytorch version (aws#1001)
Fix unit test (aws#990)
Fix unit tests (aws#1018)
Fix happy hf test (aws#1026)
fix logic setup (aws#1034)
fixes (aws#1045)
Fix flake error in init (aws#1050)
fix (aws#1053)
fix: pipeline upsert failed to pass parallelism_config to update (aws#4066)
fix: temporarily skip kmeans notebook (aws#4092)
fixes (aws#1051)
Fix missing absolute import error (aws#1057)
Fix flake8 error in unit test (aws#1058)
fixes (aws#1056)
Fix flake8 error in integ test (aws#1060)
Fix black format error in test_pickle_dependencies (aws#1062)
Fix docstyle error under serve (aws#1065)
Fix docstyle error in builder failure (aws#1066)
fix black and flake8 formatting (aws#1069)
Fix format error (aws#1070)
Fix integ test (aws#1074)
fix: HuggingFaceProcessor parameterized instance_type when image_uri is absent (aws#4072)
fix: log message when sdk defaults not applied (aws#4104)
fix: handle bad jumpstart default session (aws#4109)
Fix the version information, whl and flake8 (aws#1085)
Fix JSON serializer error (aws#1088)
Fix unit test (aws#1091)
fix format (aws#1103)
Fix local mode predictor (aws#1107)
Fix DJLPredictor (aws#1108)
Fix modelbuilder unit tests (aws#1118)
fixes (aws#1136)
fixes (aws#1165)
fixes (aws#1166)
fix: auto ml integ tests and add flaky test markers (aws#4136)
fix model data for JumpStartModel (aws#4135)
fix: transform step  unit test (aws#4151)
fix: Update pipeline.py and selective_execution_config.py with small fixes (aws#1099)
fix: Fixed bug in _create_training_details (aws#4141)
fix: use correct line endings and s3 uris on windows (aws#4118)
fix: js tagging s3 prefix (aws#4167)
fix: Update Ec2 instance type to g5.4xlarge in test_huggingface_torch_distributed.py (aws#4181)
fix: import error in unsupported js regions (aws#4188)
fix: update local mode schema (aws#4185)
fix: fix flaky Inference Recommender integration tests (aws#4156)
fix: clone distribution in validate_distribution (aws#4205)
Fix hyperlinks in feature_processor.scheduler parameter descriptions (aws#4208)
Fix master merge formatting (aws#1186)
Fix master unit tests (aws#1203)
Fix djl unit tests (aws#1204)
Fix merge conflicts (aws#1217)
fix: fix URL links (aws#4217)
fix: bump urllib3 version (aws#4223)
fix: relax upper bound on urllib in local mode requirements (aws#4219)
fixes (aws#1224)
fix formatting (aws#1233)
fix byoc unit tests (aws#1235)
fix byoc unit tests (aws#1236)
benieric added a commit that referenced this issue Nov 29, 2023
Co-authored-by: Gary Wang <[email protected]>
Co-authored-by: Gary Wang <[email protected]>
Co-authored-by: Raymond Liu <[email protected]>
Co-authored-by: John Barboza <[email protected]>
Co-authored-by: Malav Shastri <[email protected]>
Co-authored-by: Mufaddal Rohawala <[email protected]>
Co-authored-by: Mike Schneider <[email protected]>
Co-authored-by: Bhupendra Singh <[email protected]>
Co-authored-by: ci <ci>
Co-authored-by: Malav Shastri <[email protected]>
Co-authored-by: Keshav Chandak <[email protected]>
Co-authored-by: Zuoyuan Huang <[email protected]>
Co-authored-by: evakravi <[email protected]>
Co-authored-by: Keshav Chandak <[email protected]>
Co-authored-by: Alexander Pivovarov <[email protected]>
Co-authored-by: SSRraymond <[email protected]>
Co-authored-by: Ruilian Gao <[email protected]>
Co-authored-by: Ao Guo <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>
Co-authored-by: mariumof <[email protected]>
Co-authored-by: matherit <[email protected]>
Co-authored-by: amzn-choeric <[email protected]>
Co-authored-by: Ao Guo <[email protected]>
Co-authored-by: Sally Seok <[email protected]>
Co-authored-by: Erick Benitez-Ramos <[email protected]>
Co-authored-by: Qingzi-Lan <[email protected]>
Co-authored-by: Sally Seok <[email protected]>
Co-authored-by: Manu Seth <[email protected]>
Co-authored-by: Miyoung <[email protected]>
Co-authored-by: Sarah Castillo <[email protected]>
Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: Xin Wang <[email protected]>
Co-authored-by: stacicho <[email protected]>
Co-authored-by: martinRenou <[email protected]>
Co-authored-by: jiapinw <[email protected]>
Co-authored-by: Akash Goel <[email protected]>
Co-authored-by: Joseph Zhang <[email protected]>
Co-authored-by: Harsha Reddy <[email protected]>
Co-authored-by: Haixin Wang <[email protected]>
Co-authored-by: Kalyani Nikure <[email protected]>
Co-authored-by: Xin Wang <[email protected]>
Co-authored-by: Gili Nachum <[email protected]>
Co-authored-by: Jose Pena <[email protected]>
Co-authored-by: cansun <[email protected]>
Co-authored-by: AWS-pratab <[email protected]>
Co-authored-by: shenlongtang <[email protected]>
Co-authored-by: Zach Kimberg <[email protected]>
Co-authored-by: chrivtho-github <[email protected]>
Co-authored-by: Justin <[email protected]>
Co-authored-by: Duc Trung Le <[email protected]>
Co-authored-by: HappyAmazonian <[email protected]>
Co-authored-by: cj-zhang <[email protected]>
Co-authored-by: Matthew <[email protected]>
Co-authored-by: Zach Kimberg <[email protected]>
Co-authored-by: Rohith Nadimpally <[email protected]>
Co-authored-by: rohithn1 <[email protected]>
Co-authored-by: Victor Zhu <[email protected]>
Co-authored-by: jbarz1 <[email protected]>
Co-authored-by: Mohan Gandhi <[email protected]>
Co-authored-by: Mohan Gandhi <[email protected]>
Co-authored-by: Barboza <[email protected]>
Co-authored-by: ruiliann666 <[email protected]>
fixes (#963)
fix: skip tensorflow local mode notebook test (#4060)
Fix TorchTensorSer/Deser (#969)
fix (#971)
fix local container mode (#972)
Fix auto detect (#979)
Fix routing fn (#981)
fix: tags for jumpstart model package models (#4061)
fix: pipeline variable kms key (#4065)
fix: jumpstart cache using sagemaker session s3 client (#4051)
fix: gated models unsupported region (#4069)
fix local container serialization (#989)
fix custom serialiazation with local container. Also remove a  lot of unused code (#994)
Fix custom serialization for local container mode (#1000)
fix pytorch version (#1001)
Fix unit test (#990)
Fix unit tests (#1018)
Fix happy hf test (#1026)
fix logic setup (#1034)
fixes (#1045)
Fix flake error in init (#1050)
fix (#1053)
fix: pipeline upsert failed to pass parallelism_config to update (#4066)
fix: temporarily skip kmeans notebook (#4092)
fixes (#1051)
Fix missing absolute import error (#1057)
Fix flake8 error in unit test (#1058)
fixes (#1056)
Fix flake8 error in integ test (#1060)
Fix black format error in test_pickle_dependencies (#1062)
Fix docstyle error under serve (#1065)
Fix docstyle error in builder failure (#1066)
fix black and flake8 formatting (#1069)
Fix format error (#1070)
Fix integ test (#1074)
fix: HuggingFaceProcessor parameterized instance_type when image_uri is absent (#4072)
fix: log message when sdk defaults not applied (#4104)
fix: handle bad jumpstart default session (#4109)
Fix the version information, whl and flake8 (#1085)
Fix JSON serializer error (#1088)
Fix unit test (#1091)
fix format (#1103)
Fix local mode predictor (#1107)
Fix DJLPredictor (#1108)
Fix modelbuilder unit tests (#1118)
fixes (#1136)
fixes (#1165)
fixes (#1166)
fix: auto ml integ tests and add flaky test markers (#4136)
fix model data for JumpStartModel (#4135)
fix: transform step  unit test (#4151)
fix: Update pipeline.py and selective_execution_config.py with small fixes (#1099)
fix: Fixed bug in _create_training_details (#4141)
fix: use correct line endings and s3 uris on windows (#4118)
fix: js tagging s3 prefix (#4167)
fix: Update Ec2 instance type to g5.4xlarge in test_huggingface_torch_distributed.py (#4181)
fix: import error in unsupported js regions (#4188)
fix: update local mode schema (#4185)
fix: fix flaky Inference Recommender integration tests (#4156)
fix: clone distribution in validate_distribution (#4205)
Fix hyperlinks in feature_processor.scheduler parameter descriptions (#4208)
Fix master merge formatting (#1186)
Fix master unit tests (#1203)
Fix djl unit tests (#1204)
Fix merge conflicts (#1217)
fix: fix URL links (#4217)
fix: bump urllib3 version (#4223)
fix: relax upper bound on urllib in local mode requirements (#4219)
fixes (#1224)
fix formatting (#1233)
fix byoc unit tests (#1235)
fix byoc unit tests (#1236)
benieric added a commit that referenced this issue Nov 29, 2023
Co-authored-by: Raymond Liu <[email protected]>
Co-authored-by: Ruilian Gao <[email protected]>
Co-authored-by: John Barboza <[email protected]>
Co-authored-by: Gary Wang <[email protected]>
Co-authored-by: Malav Shastri <[email protected]>
Co-authored-by: Keshav Chandak <[email protected]>
Co-authored-by: Zuoyuan Huang <[email protected]>
Co-authored-by: Ao Guo <[email protected]>
Co-authored-by: Mufaddal Rohawala <[email protected]>
Co-authored-by: Mike Schneider <[email protected]>
Co-authored-by: Bhupendra Singh <[email protected]>
Co-authored-by: ci <ci>
Co-authored-by: Malav Shastri <[email protected]>
Co-authored-by: evakravi <[email protected]>
Co-authored-by: Keshav Chandak <[email protected]>
Co-authored-by: Alexander Pivovarov <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>
Co-authored-by: mariumof <[email protected]>
Co-authored-by: matherit <[email protected]>
Co-authored-by: amzn-choeric <[email protected]>
Co-authored-by: Ao Guo <[email protected]>
Co-authored-by: Sally Seok <[email protected]>
Co-authored-by: Erick Benitez-Ramos <[email protected]>
Co-authored-by: Qingzi-Lan <[email protected]>
Co-authored-by: Sally Seok <[email protected]>
Co-authored-by: Manu Seth <[email protected]>
Co-authored-by: Miyoung <[email protected]>
Co-authored-by: Sarah Castillo <[email protected]>
Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: Xin Wang <[email protected]>
Co-authored-by: stacicho <[email protected]>
Co-authored-by: martinRenou <[email protected]>
Co-authored-by: jiapinw <[email protected]>
Co-authored-by: Akash Goel <[email protected]>
Co-authored-by: Joseph Zhang <[email protected]>
Co-authored-by: Harsha Reddy <[email protected]>
Co-authored-by: Haixin Wang <[email protected]>
Co-authored-by: Kalyani Nikure <[email protected]>
Co-authored-by: Xin Wang <[email protected]>
Co-authored-by: Gili Nachum <[email protected]>
Co-authored-by: Jose Pena <[email protected]>
Co-authored-by: cansun <[email protected]>
Co-authored-by: AWS-pratab <[email protected]>
Co-authored-by: shenlongtang <[email protected]>
Co-authored-by: Zach Kimberg <[email protected]>
Co-authored-by: chrivtho-github <[email protected]>
Co-authored-by: Justin <[email protected]>
Co-authored-by: Duc Trung Le <[email protected]>
Co-authored-by: HappyAmazonian <[email protected]>
Co-authored-by: cj-zhang <[email protected]>
Co-authored-by: Matthew <[email protected]>
Co-authored-by: Zach Kimberg <[email protected]>
Co-authored-by: Rohith Nadimpally <[email protected]>
Co-authored-by: rohithn1 <[email protected]>
Co-authored-by: Victor Zhu <[email protected]>
Co-authored-by: Gary Wang <[email protected]>
Co-authored-by: SSRraymond <[email protected]>
Co-authored-by: jbarz1 <[email protected]>
Co-authored-by: Mohan Gandhi <[email protected]>
Co-authored-by: Mohan Gandhi <[email protected]>
Co-authored-by: Barboza <[email protected]>
Co-authored-by: ruiliann666 <[email protected]>
Co-authored-by: Rohan Gujarathi <[email protected]>
Co-authored-by: svia3 <[email protected]>
Co-authored-by: Zhankui Lu <[email protected]>
Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Edward Sun <[email protected]>
Co-authored-by: Stephen Via <[email protected]>
Co-authored-by: Namrata Madan <[email protected]>
Co-authored-by: Stacia Choe <[email protected]>
Co-authored-by: Edward Sun <[email protected]>
Co-authored-by: Edward Sun <[email protected]>
Co-authored-by: Rohan Gujarathi <[email protected]>
Co-authored-by: JohnaAtAWS <[email protected]>
Co-authored-by: Vera Yu <[email protected]>
Co-authored-by: bhaoz <[email protected]>
Co-authored-by: Qing Lan <[email protected]>
Co-authored-by: Namrata Madan <[email protected]>
Co-authored-by: Sirut Buasai <[email protected]>
Co-authored-by: wayneyao <[email protected]>
Co-authored-by: Jacky Lee <[email protected]>
Co-authored-by: haNa-meister <[email protected]>
Co-authored-by: Shailav <[email protected]>
Fix unit tests (#1018)
Fix happy hf test (#1026)
fix logic setup (#1034)
fixes (#1045)
Fix flake error in init (#1050)
fix (#1053)
fix: skip tensorflow local mode notebook test (#4060)
fix: tags for jumpstart model package models (#4061)
fix: pipeline variable kms key (#4065)
fix: jumpstart cache using sagemaker session s3 client (#4051)
fix: gated models unsupported region (#4069)
fix: pipeline upsert failed to pass parallelism_config to update (#4066)
fix: temporarily skip kmeans notebook (#4092)
fixes (#1051)
Fix missing absolute import error (#1057)
Fix flake8 error in unit test (#1058)
fixes (#1056)
Fix flake8 error in integ test (#1060)
Fix black format error in test_pickle_dependencies (#1062)
Fix docstyle error under serve (#1065)
Fix docstyle error in builder failure (#1066)
fix black and flake8 formatting (#1069)
Fix format error (#1070)
Fix integ test (#1074)
fix: HuggingFaceProcessor parameterized instance_type when image_uri is absent (#4072)
fix: log message when sdk defaults not applied (#4104)
fix: handle bad jumpstart default session (#4109)
Fix the version information, whl and flake8 (#1085)
Fix JSON serializer error (#1088)
Fix unit test (#1091)
fix format (#1103)
Fix local mode predictor (#1107)
Fix DJLPredictor (#1108)
Fix modelbuilder unit tests (#1118)
fixes (#1136)
fixes (#1165)
fixes (#1166)
fix: auto ml integ tests and add flaky test markers (#4136)
fix model data for JumpStartModel (#4135)
fix: transform step  unit test (#4151)
fix: Update pipeline.py and selective_execution_config.py with small fixes (#1099)
fix: Fixed bug in _create_training_details (#4141)
fix: use correct line endings and s3 uris on windows (#4118)
fix: js tagging s3 prefix (#4167)
fix: Update Ec2 instance type to g5.4xlarge in test_huggingface_torch_distributed.py (#4181)
fix: import error in unsupported js regions (#4188)
fix: update local mode schema (#4185)
fix: fix flaky Inference Recommender integration tests (#4156)
fix: clone distribution in validate_distribution (#4205)
Fix hyperlinks in feature_processor.scheduler parameter descriptions (#4208)
Fix master merge formatting (#1186)
Fix master unit tests (#1203)
Fix djl unit tests (#1204)
Fix merge conflicts (#1217)
fix: fix URL links (#4217)
fix: bump urllib3 version (#4223)
fix: relax upper bound on urllib in local mode requirements (#4219)
fixes (#1224)
fix formatting (#1233)
fix byoc unit tests (#1235)
fix byoc unit tests (#1236)
Fixed Modelpackage's deploy calling  model's deploy (#1155)
fix: jumpstart unit-test (#1265)
fixes (#963)
Fix TorchTensorSer/Deser (#969)
fix (#971)
fix local container mode (#972)
Fix auto detect (#979)
Fix routing fn (#981)
fix local container serialization (#989)
fix custom serialiazation with local container. Also remove a  lot of unused code (#994)
Fix custom serialization for local container mode (#1000)
fix pytorch version (#1001)
Fix unit test (#990)
fix: Multiple bug fixes including removing unsupported feature. (#1105)
Fix some problems with pipeline compilation (#1125)
fix: Refactor JsonGet s3 URI and add serialize_output_to_json flag (#1164)
fix: invoke_function circular import (#1262)
fix: pylint (#1264)
fix: Add logging for docker build failures (#1267)
Fix session bug when provided in ModelBuilder (#1288)
fixes (#1313)
fix: Gated content bucket env var override (#1280)
fix: Change the library used in pytorch test causing cloudpickle version conflict (#1287)
fix: HMAC signing for ModelBuilder Triton python backend (#1282)
fix: do not delete temp folder generated by sdist (#1291)
fix: Do not require model_server if provided image_uri is a 1p image. (#1303)
fix: check image type vs instance type (#1307)
fix: unit test (#1315)
fix: Fixed model builder's register unable to deploy (#1323)
fix: missing `self._framework` in `InferenceSpec` path (#1325)
fix: enable xgboost integ test in our own pipeline (#1326)
fix: skip py310 (#1328)
fix: Update autodetect dlc logic (#1329)
Fix secret key in the Model object (#1334)
fix: improve error message (#1333)
Fix unit testing (#1340)
fix: Typing and formatting (#1341)
fix: WaiterError on failed pipeline execution. results() (#1337)
Fix tox identified errors (#1344)
Fix issue when the user runs in Python 3.11 (#1345)
fixes (#1346)
fix: use copy instead of move in bootstrap script (#1339)
Resolve keynote3 conflicts (#1351)
Resolve keynote3 conflicts v2 (#1353)
Fix conflicts (#1354)
Fix conflicts v3 (#1355)
fix: get whl from local to run integ tests (#1357)
fix: enable triton pt tests (#1358)
fix: integ test (#1362)
Fix Python 3.11 issue with dataclass decorator (#1345)
fix: remote function include_local_workdir default value (#1342)
fix: error message (#1373)
fixes (#1372)
fix: Remvoe PickleSerializer (#1378)
benieric added a commit that referenced this issue Nov 29, 2023
Co-authored-by: Gary Wang <[email protected]>
Co-authored-by: Gary Wang <[email protected]>
Co-authored-by: Raymond Liu <[email protected]>
Co-authored-by: John Barboza <[email protected]>
Co-authored-by: Malav Shastri <[email protected]>
Co-authored-by: Mufaddal Rohawala <[email protected]>
Co-authored-by: Mike Schneider <[email protected]>
Co-authored-by: Bhupendra Singh <[email protected]>
Co-authored-by: ci <ci>
Co-authored-by: Malav Shastri <[email protected]>
Co-authored-by: Keshav Chandak <[email protected]>
Co-authored-by: Zuoyuan Huang <[email protected]>
Co-authored-by: evakravi <[email protected]>
Co-authored-by: Keshav Chandak <[email protected]>
Co-authored-by: Alexander Pivovarov <[email protected]>
Co-authored-by: SSRraymond <[email protected]>
Co-authored-by: Ruilian Gao <[email protected]>
Co-authored-by: Ao Guo <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>
Co-authored-by: mariumof <[email protected]>
Co-authored-by: matherit <[email protected]>
Co-authored-by: amzn-choeric <[email protected]>
Co-authored-by: Ao Guo <[email protected]>
Co-authored-by: Sally Seok <[email protected]>
Co-authored-by: Erick Benitez-Ramos <[email protected]>
Co-authored-by: Qingzi-Lan <[email protected]>
Co-authored-by: Sally Seok <[email protected]>
Co-authored-by: Manu Seth <[email protected]>
Co-authored-by: Miyoung <[email protected]>
Co-authored-by: Sarah Castillo <[email protected]>
Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: Xin Wang <[email protected]>
Co-authored-by: stacicho <[email protected]>
Co-authored-by: martinRenou <[email protected]>
Co-authored-by: jiapinw <[email protected]>
Co-authored-by: Akash Goel <[email protected]>
Co-authored-by: Joseph Zhang <[email protected]>
Co-authored-by: Harsha Reddy <[email protected]>
Co-authored-by: Haixin Wang <[email protected]>
Co-authored-by: Kalyani Nikure <[email protected]>
Co-authored-by: Xin Wang <[email protected]>
Co-authored-by: Gili Nachum <[email protected]>
Co-authored-by: Jose Pena <[email protected]>
Co-authored-by: cansun <[email protected]>
Co-authored-by: AWS-pratab <[email protected]>
Co-authored-by: shenlongtang <[email protected]>
Co-authored-by: Zach Kimberg <[email protected]>
Co-authored-by: chrivtho-github <[email protected]>
Co-authored-by: Justin <[email protected]>
Co-authored-by: Duc Trung Le <[email protected]>
Co-authored-by: HappyAmazonian <[email protected]>
Co-authored-by: cj-zhang <[email protected]>
Co-authored-by: Matthew <[email protected]>
Co-authored-by: Zach Kimberg <[email protected]>
Co-authored-by: Rohith Nadimpally <[email protected]>
Co-authored-by: rohithn1 <[email protected]>
Co-authored-by: Victor Zhu <[email protected]>
Co-authored-by: jbarz1 <[email protected]>
Co-authored-by: Mohan Gandhi <[email protected]>
Co-authored-by: Barboza <[email protected]>
Co-authored-by: ruiliann666 <[email protected]>
fixes (#963)
fix: skip tensorflow local mode notebook test (#4060)
Fix TorchTensorSer/Deser (#969)
fix (#971)
fix local container mode (#972)
Fix auto detect (#979)
Fix routing fn (#981)
fix: tags for jumpstart model package models (#4061)
fix: pipeline variable kms key (#4065)
fix: jumpstart cache using sagemaker session s3 client (#4051)
fix: gated models unsupported region (#4069)
fix local container serialization (#989)
fix custom serialiazation with local container. Also remove a  lot of unused code (#994)
Fix custom serialization for local container mode (#1000)
fix pytorch version (#1001)
Fix unit test (#990)
Fix unit tests (#1018)
Fix happy hf test (#1026)
fix logic setup (#1034)
fixes (#1045)
Fix flake error in init (#1050)
fix (#1053)
fix: pipeline upsert failed to pass parallelism_config to update (#4066)
fix: temporarily skip kmeans notebook (#4092)
fixes (#1051)
Fix missing absolute import error (#1057)
Fix flake8 error in unit test (#1058)
fixes (#1056)
Fix flake8 error in integ test (#1060)
Fix black format error in test_pickle_dependencies (#1062)
Fix docstyle error under serve (#1065)
Fix docstyle error in builder failure (#1066)
fix black and flake8 formatting (#1069)
Fix format error (#1070)
Fix integ test (#1074)
fix: HuggingFaceProcessor parameterized instance_type when image_uri is absent (#4072)
fix: log message when sdk defaults not applied (#4104)
fix: handle bad jumpstart default session (#4109)
Fix the version information, whl and flake8 (#1085)
Fix JSON serializer error (#1088)
Fix unit test (#1091)
fix format (#1103)
Fix local mode predictor (#1107)
Fix DJLPredictor (#1108)
Fix modelbuilder unit tests (#1118)
fixes (#1136)
fixes (#1165)
fixes (#1166)
fix: auto ml integ tests and add flaky test markers (#4136)
fix model data for JumpStartModel (#4135)
fix: transform step  unit test (#4151)
fix: Update pipeline.py and selective_execution_config.py with small fixes (#1099)
fix: Fixed bug in _create_training_details (#4141)
fix: use correct line endings and s3 uris on windows (#4118)
fix: js tagging s3 prefix (#4167)
fix: Update Ec2 instance type to g5.4xlarge in test_huggingface_torch_distributed.py (#4181)
fix: import error in unsupported js regions (#4188)
fix: update local mode schema (#4185)
fix: fix flaky Inference Recommender integration tests (#4156)
fix: clone distribution in validate_distribution (#4205)
Fix hyperlinks in feature_processor.scheduler parameter descriptions (#4208)
Fix master merge formatting (#1186)
Fix master unit tests (#1203)
Fix djl unit tests (#1204)
Fix merge conflicts (#1217)
fix: fix URL links (#4217)
fix: bump urllib3 version (#4223)
fix: relax upper bound on urllib in local mode requirements (#4219)
fixes (#1224)
fix formatting (#1233)
fix byoc unit tests (#1235)
fix byoc unit tests (#1236)
benieric added a commit that referenced this issue Nov 29, 2023
Co-authored-by: Raymond Liu <[email protected]>
Co-authored-by: Ruilian Gao <[email protected]>
Co-authored-by: John Barboza <[email protected]>
Co-authored-by: Gary Wang <[email protected]>
Co-authored-by: Malav Shastri <[email protected]>
Co-authored-by: Keshav Chandak <[email protected]>
Co-authored-by: Zuoyuan Huang <[email protected]>
Co-authored-by: Ao Guo <[email protected]>
Co-authored-by: Mufaddal Rohawala <[email protected]>
Co-authored-by: Mike Schneider <[email protected]>
Co-authored-by: Bhupendra Singh <[email protected]>
Co-authored-by: ci <ci>
Co-authored-by: Malav Shastri <[email protected]>
Co-authored-by: evakravi <[email protected]>
Co-authored-by: Keshav Chandak <[email protected]>
Co-authored-by: Alexander Pivovarov <[email protected]>
Co-authored-by: qidewenwhen <[email protected]>
Co-authored-by: mariumof <[email protected]>
Co-authored-by: matherit <[email protected]>
Co-authored-by: amzn-choeric <[email protected]>
Co-authored-by: Ao Guo <[email protected]>
Co-authored-by: Sally Seok <[email protected]>
Co-authored-by: Erick Benitez-Ramos <[email protected]>
Co-authored-by: Qingzi-Lan <[email protected]>
Co-authored-by: Sally Seok <[email protected]>
Co-authored-by: Manu Seth <[email protected]>
Co-authored-by: Miyoung <[email protected]>
Co-authored-by: Sarah Castillo <[email protected]>
Co-authored-by: EC2 Default User <[email protected]>
Co-authored-by: Xin Wang <[email protected]>
Co-authored-by: stacicho <[email protected]>
Co-authored-by: martinRenou <[email protected]>
Co-authored-by: jiapinw <[email protected]>
Co-authored-by: Akash Goel <[email protected]>
Co-authored-by: Joseph Zhang <[email protected]>
Co-authored-by: Harsha Reddy <[email protected]>
Co-authored-by: Haixin Wang <[email protected]>
Co-authored-by: Kalyani Nikure <[email protected]>
Co-authored-by: Xin Wang <[email protected]>
Co-authored-by: Gili Nachum <[email protected]>
Co-authored-by: Jose Pena <[email protected]>
Co-authored-by: cansun <[email protected]>
Co-authored-by: AWS-pratab <[email protected]>
Co-authored-by: shenlongtang <[email protected]>
Co-authored-by: Zach Kimberg <[email protected]>
Co-authored-by: chrivtho-github <[email protected]>
Co-authored-by: Justin <[email protected]>
Co-authored-by: Duc Trung Le <[email protected]>
Co-authored-by: HappyAmazonian <[email protected]>
Co-authored-by: cj-zhang <[email protected]>
Co-authored-by: Matthew <[email protected]>
Co-authored-by: Zach Kimberg <[email protected]>
Co-authored-by: Rohith Nadimpally <[email protected]>
Co-authored-by: rohithn1 <[email protected]>
Co-authored-by: Victor Zhu <[email protected]>
Co-authored-by: Gary Wang <[email protected]>
Co-authored-by: SSRraymond <[email protected]>
Co-authored-by: jbarz1 <[email protected]>
Co-authored-by: Mohan Gandhi <[email protected]>
Co-authored-by: Barboza <[email protected]>
Co-authored-by: ruiliann666 <[email protected]>
Co-authored-by: Rohan Gujarathi <[email protected]>
Co-authored-by: svia3 <[email protected]>
Co-authored-by: Zhankui Lu <[email protected]>
Co-authored-by: Dewen Qi <[email protected]>
Co-authored-by: Edward Sun <[email protected]>
Co-authored-by: Stephen Via <[email protected]>
Co-authored-by: Namrata Madan <[email protected]>
Co-authored-by: Stacia Choe <[email protected]>
Co-authored-by: Edward Sun <[email protected]>
Co-authored-by: Rohan Gujarathi <[email protected]>
Co-authored-by: JohnaAtAWS <[email protected]>
Co-authored-by: Vera Yu <[email protected]>
Co-authored-by: bhaoz <[email protected]>
Co-authored-by: Qing Lan <[email protected]>
Co-authored-by: Namrata Madan <[email protected]>
Co-authored-by: Sirut Buasai <[email protected]>
Co-authored-by: wayneyao <[email protected]>
Co-authored-by: Jacky Lee <[email protected]>
Co-authored-by: haNa-meister <[email protected]>
Co-authored-by: Shailav <[email protected]>
Fix unit tests (#1018)
Fix happy hf test (#1026)
fix logic setup (#1034)
fixes (#1045)
Fix flake error in init (#1050)
fix (#1053)
fix: skip tensorflow local mode notebook test (#4060)
fix: tags for jumpstart model package models (#4061)
fix: pipeline variable kms key (#4065)
fix: jumpstart cache using sagemaker session s3 client (#4051)
fix: gated models unsupported region (#4069)
fix: pipeline upsert failed to pass parallelism_config to update (#4066)
fix: temporarily skip kmeans notebook (#4092)
fixes (#1051)
Fix missing absolute import error (#1057)
Fix flake8 error in unit test (#1058)
fixes (#1056)
Fix flake8 error in integ test (#1060)
Fix black format error in test_pickle_dependencies (#1062)
Fix docstyle error under serve (#1065)
Fix docstyle error in builder failure (#1066)
fix black and flake8 formatting (#1069)
Fix format error (#1070)
Fix integ test (#1074)
fix: HuggingFaceProcessor parameterized instance_type when image_uri is absent (#4072)
fix: log message when sdk defaults not applied (#4104)
fix: handle bad jumpstart default session (#4109)
Fix the version information, whl and flake8 (#1085)
Fix JSON serializer error (#1088)
Fix unit test (#1091)
fix format (#1103)
Fix local mode predictor (#1107)
Fix DJLPredictor (#1108)
Fix modelbuilder unit tests (#1118)
fixes (#1136)
fixes (#1165)
fixes (#1166)
fix: auto ml integ tests and add flaky test markers (#4136)
fix model data for JumpStartModel (#4135)
fix: transform step  unit test (#4151)
fix: Update pipeline.py and selective_execution_config.py with small fixes (#1099)
fix: Fixed bug in _create_training_details (#4141)
fix: use correct line endings and s3 uris on windows (#4118)
fix: js tagging s3 prefix (#4167)
fix: Update Ec2 instance type to g5.4xlarge in test_huggingface_torch_distributed.py (#4181)
fix: import error in unsupported js regions (#4188)
fix: update local mode schema (#4185)
fix: fix flaky Inference Recommender integration tests (#4156)
fix: clone distribution in validate_distribution (#4205)
Fix hyperlinks in feature_processor.scheduler parameter descriptions (#4208)
Fix master merge formatting (#1186)
Fix master unit tests (#1203)
Fix djl unit tests (#1204)
Fix merge conflicts (#1217)
fix: fix URL links (#4217)
fix: bump urllib3 version (#4223)
fix: relax upper bound on urllib in local mode requirements (#4219)
fixes (#1224)
fix formatting (#1233)
fix byoc unit tests (#1235)
fix byoc unit tests (#1236)
Fixed Modelpackage's deploy calling  model's deploy (#1155)
fix: jumpstart unit-test (#1265)
fixes (#963)
Fix TorchTensorSer/Deser (#969)
fix (#971)
fix local container mode (#972)
Fix auto detect (#979)
Fix routing fn (#981)
fix local container serialization (#989)
fix custom serialiazation with local container. Also remove a  lot of unused code (#994)
Fix custom serialization for local container mode (#1000)
fix pytorch version (#1001)
Fix unit test (#990)
fix: Multiple bug fixes including removing unsupported feature. (#1105)
Fix some problems with pipeline compilation (#1125)
fix: Refactor JsonGet s3 URI and add serialize_output_to_json flag (#1164)
fix: invoke_function circular import (#1262)
fix: pylint (#1264)
fix: Add logging for docker build failures (#1267)
Fix session bug when provided in ModelBuilder (#1288)
fixes (#1313)
fix: Gated content bucket env var override (#1280)
fix: Change the library used in pytorch test causing cloudpickle version conflict (#1287)
fix: HMAC signing for ModelBuilder Triton python backend (#1282)
fix: do not delete temp folder generated by sdist (#1291)
fix: Do not require model_server if provided image_uri is a 1p image. (#1303)
fix: check image type vs instance type (#1307)
fix: unit test (#1315)
fix: Fixed model builder's register unable to deploy (#1323)
fix: missing `self._framework` in `InferenceSpec` path (#1325)
fix: enable xgboost integ test in our own pipeline (#1326)
fix: skip py310 (#1328)
fix: Update autodetect dlc logic (#1329)
Fix secret key in the Model object (#1334)
fix: improve error message (#1333)
Fix unit testing (#1340)
fix: Typing and formatting (#1341)
fix: WaiterError on failed pipeline execution. results() (#1337)
Fix tox identified errors (#1344)
Fix issue when the user runs in Python 3.11 (#1345)
fixes (#1346)
fix: use copy instead of move in bootstrap script (#1339)
Resolve keynote3 conflicts (#1351)
Resolve keynote3 conflicts v2 (#1353)
Fix conflicts (#1354)
Fix conflicts v3 (#1355)
fix: get whl from local to run integ tests (#1357)
fix: enable triton pt tests (#1358)
fix: integ test (#1362)
Fix Python 3.11 issue with dataclass decorator (#1345)
fix: remote function include_local_workdir default value (#1342)
fix: error message (#1373)
fixes (#1372)
fix: Remvoe PickleSerializer (#1378)