Training with AWS SageMaker stuck if more than one epoch #1353


Closed

hjuhel-cdpq opened this issue Mar 12, 2020 · 12 comments

@hjuhel-cdpq

Describe the bug
Training the simpletransformers.ner model on AWS SageMaker for more than one epoch results in the training idling forever before the end of the first epoch. If trained for a single epoch, the training ends normally.

While idling, GPU usage drops to 0 while disk utilization grows and then begins to plateau (see attached picture).

It might also be a simpletransformers issue, but the same procedure works perfectly when training the same model directly on a SageMaker notebook.
[Screenshot: metrics_sgmk]

To Reproduce

  1. Build a Docker image, based on the pytorch-training image, with Apex and simpletransformers installed:

# Start from the AWS deep learning container for PyTorch 1.4.0 (GPU, Python 3.6, CUDA 10.1)
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.4.0-gpu-py36-cu101-ubuntu16.04

# Install simpletransformers, then build NVIDIA Apex with its C++/CUDA extensions
RUN pip install simpletransformers
RUN git clone https://github.com/NVIDIA/apex
RUN pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex
  2. Push the Docker image to ECR under the name <training_image>

  3. Prepare a minimal training script and upload it to a SageMaker notebook:

import os

import pandas as pd
from simpletransformers.ner import NERModel

if __name__ == "__main__":

    # Retrieve data and labels from the SageMaker training channel
    data = pd.read_csv(os.path.join(os.environ["SM_CHANNEL_TRAINING"], "training.csv"))
    labels = data.labels.unique().tolist()

    # Instantiate a NER model
    args = {
        "output_dir": os.environ["SM_MODEL_DIR"],
        "reprocess_input_data": True,
        "num_train_epochs": 2,
        "train_batch_size": 8,
    }
    model = NERModel("bert", "bert-base-uncased", args=args, labels=labels)

    model.train_model(data)

  4. Start the training from the SageMaker notebook:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorch

ROLE = get_execution_role()

pytorch_estimator = PyTorch(entry_point="train.py",
                            image_name="<docker_Image>",
                            train_instance_type="ml.p3.2xlarge",
                            train_instance_count=1,
                            role=ROLE,
                           )

pytorch_estimator.fit({"training": "<path_to_training.csv>"})

Expected behavior
I expected training with a larger number of epochs to work.

Desktop (please complete the following information):

  • OS: Ubuntu 16.04, from SageMaker's base PyTorch image

Additional context
Please find the logs attached. I have removed the log lines between the start of the first step of the first epoch and the last event received during the first epoch:

01:10:37
bash: cannot set terminal process group (-1): Inappropriate ioctl for device

01:10:37
bash: no job control in this shell

01:10:39
2020-03-12 01:10:38,648 sagemaker-containers INFO Imported framework sagemaker_pytorch_container.training

01:10:39
2020-03-12 01:10:38,727 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.

01:10:39
2020-03-12 01:10:38,728 sagemaker_pytorch_container.training INFO Invoking user training script.

01:10:39
2020-03-12 01:10:39,071 sagemaker-containers INFO Module default_user_module_name does not provide a setup.py.

01:10:39
Generating setup.py

01:10:39
2020-03-12 01:10:39,071 sagemaker-containers INFO Generating setup.cfg

01:10:39
2020-03-12 01:10:39,071 sagemaker-containers INFO Generating MANIFEST.in

01:10:39
2020-03-12 01:10:39,071 sagemaker-containers INFO Installing module with the following command:

01:10:39
/opt/conda/bin/python -m pip install .

01:10:40
Processing /tmp/tmpvf5izqsk/module_dir

01:10:40
Building wheels for collected packages: default-user-module-name Building wheel for default-user-module-name (setup.py): started Building wheel for default-user-module-name (setup.py): finished with status 'done' Created wheel for default-user-module-name: filename=default_user_module_name-1.0.0-py2.py3-none-any.whl size=4365 sha256=a7c076c4c020f4b8b9f85a40721d629b752f9b1006cdb0ba2bf27132f9b

01:10:40
Successfully built default-user-module-name

01:10:41
Installing collected packages: default-user-module-name

01:10:41
Successfully installed default-user-module-name-1.0.0

01:10:41
2020-03-12 01:10:41,498 sagemaker-containers INFO Invoking user script

01:10:41
Training Env:

01:10:41
{ "additional_framework_parameters": {}, "channel_input_dirs": { "training": "/opt/ml/input/data/training" }, "current_host": "algo-1", "framework_module": "sagemaker_pytorch_container.training:main", "hosts": [ "algo-1" ], "hyperparameters": { "n_gpu": 8, "batch_size": 32, "seed": 2, "lower": true, "epochs": 5

01:10:41
}

01:10:41
Environment variables:

01:10:41
SM_HOSTS=["algo-1"]

01:10:41
SM_NETWORK_INTERFACE_NAME=eth0

01:10:41
SM_HPS={"batch_size":32,"epochs":50,"lower":true,"n_gpu":8,"seed":2}

01:10:41
SM_USER_ENTRY_POINT=train.py

01:10:41
SM_FRAMEWORK_PARAMS={}

01:10:41
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}

01:10:41
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}

01:10:41
SM_OUTPUT_DATA_DIR=/opt/ml/output/data

01:10:41
SM_CHANNELS=["training"]

01:10:41
SM_CURRENT_HOST=algo-1

01:10:41
SM_MODULE_NAME=train

01:10:41
SM_LOG_LEVEL=20

01:10:41
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main

01:10:41
SM_INPUT_DIR=/opt/ml/input

01:10:41
SM_INPUT_CONFIG_DIR=/opt/ml/input/config

01:10:41
SM_OUTPUT_DIR=/opt/ml/output

01:10:41
SM_NUM_CPUS=64

01:10:41
SM_NUM_GPUS=8

01:10:41
SM_MODEL_DIR=/opt/ml/model

01:10:41
SM_MODULE_DIR=<path>

01:10:41
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"batch_size":32,"epochs":50,"lower":true,"n_gpu":8,"seed":2},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"

01:10:41
SM_USER_ARGS=["--batch_size","32","--epochs","50","--lower","True","--n_gpu","8","--seed","2"]

01:10:41
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate

01:10:41
SM_CHANNEL_TRAINING=/opt/ml/input/data/training

01:10:41
SM_HP_N_GPU=8

01:10:41
SM_HP_BATCH_SIZE=32

01:10:41
SM_HP_SEED=2

01:10:41
SM_HP_LOWER=true

01:10:41
SM_HP_EPOCHS=50

01:10:41
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages

01:10:41
Invoking script with the following command:

01:10:41
/opt/conda/bin/python train.py --batch_size 32 --epochs 50 --lower True --n_gpu 8 --seed 2

01:11:03
Converting to features started.

01:11:03
2020-03-12 01:10:45,583 [train.py ] INFO Starting training...

01:11:07
[2020-03-12 01:11:07.095 algo-1:102 INFO json_config.py:90] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.

01:11:07
2020-03-12 01:10:45,584 [train.py ] INFO No dataset provided for testing...training will be splitted

01:11:07
[2020-03-12 01:11:07.096 algo-1:102 INFO hook.py:152] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.

01:11:07
2020-03-12 01:10:45,584 [train.py ] INFO Reading training data at /opt/ml/input/data/training

01:11:07
[2020-03-12 01:11:07.096 algo-1:102 INFO hook.py:197] Saving to /opt/ml/output/tensors

01:11:07
2020-03-12 01:10:45,584 [train.py ] INFO Reading file : 00_data_IOB.csv, located at /opt/ml/input/data/training/00_data_IOB.csv

01:11:07
[2020-03-12 01:11:07.116 algo-1:102 INFO hook.py:326] Monitoring the collections: losses

01:11:07
train.py:95: SettingWithCopyWarning:

01:11:08
[2020-03-12 01:11:08.055 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.0.attention.self NoneType

01:11:08
A value is trying to be set on a copy of a slice from a DataFrame.

01:11:08
[2020-03-12 01:11:08.055 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.0.attention.self NoneType

01:11:08
Try using .loc[row_indexer,col_indexer] = value instead

01:11:08
[2020-03-12 01:11:08.055 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.0.attention.self NoneType

01:11:08
[2020-03-12 01:11:08.055 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.0.attention NoneType

01:11:08
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

01:11:08
[2020-03-12 01:11:08.064 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.0 NoneType training["words"] = training["words"].apply(str)

01:11:08
[2020-03-12 01:11:08.064 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.0 NoneType

01:11:08
train.py:96: SettingWithCopyWarning:

01:11:08
[2020-03-12 01:11:08.064 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.0 NoneType

01:11:08
A value is trying to be set on a copy of a slice from a DataFrame.

01:11:08
[2020-03-12 01:11:08.069 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.1.attention.self NoneType

01:11:08
Try using .loc[row_indexer,col_indexer] = value instead

01:11:08
[2020-03-12 01:11:08.069 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.1.attention.self NoneType

01:11:08
[2020-03-12 01:11:08.070 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.1.attention.self NoneType

01:11:08
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

01:11:08
[2020-03-12 01:11:08.070 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.1.attention NoneType testing["words"] = testing["words"].apply(str)

01:11:08
[2020-03-12 01:11:08.073 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.1 NoneType

01:11:08
train.py:99: SettingWithCopyWarning:

01:11:08
[2020-03-12 01:11:08.073 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.1 NoneType

01:11:08
A value is trying to be set on a copy of a slice from a DataFrame.

01:11:08
[2020-03-12 01:11:08.073 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.1 NoneType

01:11:08
Try using .loc[row_indexer,col_indexer] = value instead

01:11:08
[2020-03-12 01:11:08.078 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.2.attention.self NoneType

01:11:08
[2020-03-12 01:11:08.078 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.2.attention.self NoneType

01:11:08
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

01:11:08
[2020-03-12 01:11:08.078 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.2.attention.self NoneType training["words"] = training["words"].apply(lambda x: x.lower().strip())

01:11:08
[2020-03-12 01:11:08.079 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.2.attention NoneType

01:11:08
train.py:100: SettingWithCopyWarning:

01:11:08
[2020-03-12 01:11:08.081 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.2 NoneType

01:11:08
A value is trying to be set on a copy of a slice from a DataFrame.

01:11:08
[2020-03-12 01:11:08.081 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.2 NoneType

01:11:08
Try using .loc[row_indexer,col_indexer] = value instead

01:11:08
[2020-03-12 01:11:08.081 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.2 NoneType

01:11:08
[2020-03-12 01:11:08.087 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.3.attention.self NoneType

01:11:08
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

01:11:08
[2020-03-12 01:11:08.087 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.3.attention.self NoneType testing["words"] = testing["words"].apply(lambda x: x.lower().strip())

01:11:08
[2020-03-12 01:11:08.087 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.3.attention.self NoneType

01:11:08
2020-03-12 01:10:45,676 [train.py ] INFO Training datapoitns : 71712

01:11:08
[2020-03-12 01:11:08.087 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.3.attention NoneType

01:11:08
2020-03-12 01:10:45,676 [train.py ] INFO TraininTestingg datapoitns : 18001

01:11:08
[2020-03-12 01:11:08.090 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.3 NoneType

01:11:08
2020-03-12 01:10:45,681 [train.py ] INFO Training the model with label : ['O', 'B-E', 'I-E', 'B-C', 'I-C']

01:11:08
[2020-03-12 01:11:08.090 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.3 NoneType

01:11:08
2020-03-12 01:10:45,681 [train.py ] INFO Training the model with arguments : {'output_dir': '/opt/ml/model/model', 'reprocess_input_data': True, 'num_train_epochs': 50, 'train_batch_size': 32, 'fp16': False, 'save_eval_checkpoints': False, 'save_steps': 8223372036854775807, 'save_model_every_epoch': False, 'overwrite_output_dir': True, 'logging_steps': 500, 'silent': False, 'use_early_stopp

01:11:08
[2020-03-12 01:11:08.090 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.3 NoneType

01:11:08
#015Downloading: 0%| | 0.00/361 [00:00<?, ?B/s]#015Downloading: 100%|██████████| 361/361 [00:00<00:00, 366kB/s]

01:11:08
[2020-03-12 01:11:08.095 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.4.attention.self NoneType

01:11:08
#015Downloading: 0%| | 0.00/440M [00:00<?, ?B/s]#015Downloading: 1%| | 4.70M/440M [00:00<00:09, 47.0MB/s]#015Downloading: 2%|▏ | 9.55M/440M [00:00<00:09, 47.4MB/s]#015Downloading: 3%|▎ | 14.4M/440M [00:00<00:08, 47.9MB/s]#015Downloading: 4%|▍ | 19.0M/440M [00:00<00:08, 47.2MB/s]#015Downloading: 5%|▌ | 23.5M/440M [00:00<00:08, 46.5MB/s]#

01:11:08
[2020-03-12 01:11:08.095 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.4.attention.self NoneType

01:11:08
#015Downloading: 0%| | 0.00/232k [00:00<?, ?B/s]#015Downloading: 100%|██████████| 232k/232k [00:00<00:00, 26.4MB/s]

01:11:08
[2020-03-12 01:11:08.095 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.4.attention.self NoneType

01:11:08
#015 0%| | 0/5234 [00:00<?, ?it/s]#015 0%| | 1/5234 [00:00<46:04, 1.89it/s]#015 57%|█████▋ | 3001/5234 [00:00<13:45, 2.70it/s]#015100%|██████████| 5234/5234 [00:00<00:00, 6824.67it/s]

01:11:08
[2020-03-12 01:11:08.096 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.4.attention NoneType

01:11:08
#015Epoch: 0%| | 0/50 [00:00<?, ?it/s]

01:11:08
[2020-03-12 01:11:08.098 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.4 NoneType

01:11:08
#015Current iteration: 0%| | 0/164 [00:00<?, ?it/s]#033[A

01:11:08
[2020-03-12 01:11:08.099 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.4 NoneType

01:11:08
#015Current iteration: 1%| | 1/164 [00:01<03:49, 1.41s/it]#033[A

01:11:08
[2020-03-12 01:11:08.099 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.4 NoneType

01:11:09
#015Current iteration: 1%| | 2/164 [00:01<02:51, 1.06s/it]#033[A

01:11:09
[2020-03-12 01:11:08.104 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.5.attention.self NoneType

01:11:09
#015Current iteration: 2%|▏ | 3/164 [00:01<02:10, 1.23it/s]#033[A

...



01:13:09
[2020-03-12 01:13:09.473 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.11 NoneType

01:13:09
#015Current iteration: 82%|████████▏ | 134/164 [00:33<00:07, 4.15it/s]#033[A

01:13:09
[2020-03-12 01:13:09.473 algo-1:102 WARNING hook.py:808] var is not Tensor or list or tuple of Tensors, module_name:bert.encoder.layer.11 NoneType

01:13:09
#015Current iteration: 82%|████████▏ | 135/164 [00:33<00:07, 4.14it/s]#033[A

The training is stuck after the last event received. At the same time, GPU usage drops to 0.


* Tested on 3 different SageMaker instance types

@mmaybeno

I experienced something like this as well using the PyTorch 1.4.0 image. I am doing some image fine-tuning, and it works on small subsets, but when I try to scale it up it starts to hang, even if I keep the batch sizes low. I've run the same order of operations in a Google Colab and they've been fine.

@hjuhel-cdpq
Author

Hi @mmaybeno,

Thanks for your answer! Have you tried running your program with a previous version of PyTorch? I will give it a try and keep you posted.

@mmaybeno

mmaybeno commented Mar 13, 2020

I have not. I did, however, run it locally (not on AWS, not with their image) and it ran fine. The issues I'm having are all based on either their own image or a build I created from their Dockerfile. I had to do my own build, as I think there is a bug with their custom PyTorch 1.4.0 Python wheel.

aws/sagemaker-pytorch-training-toolkit#181

@mmaybeno

I've been doing some experimenting with my custom-built image. Using the CPU version of the image, it starts successfully and then crashes without error. After reading #1225, I suspect the reason is a resource limit (my Docker environment was pinned to a max of 8GB). After increasing that limit (to 14GB), it no longer crashed, but it would hit ~12GB of RAM usage and just hang there. If I run it outside of the container, it progresses fine with ~8GB usage. I wonder if this hanging is similar on the GPU version as well.

@hjuhel-cdpq
Author

Hi @mmaybeno,

I have tried with a custom PyTorch image built from one of NVIDIA's, but it did not fix the problem; the training still gets stuck before the end of the first epoch.

Regarding the resource limit: the dataset I am using is quite tiny, with only 2k sentences (all with fewer than 50 tokens), and the largest machine I used for training was the ml.p3dn.24xlarge (with 768GB of RAM and 8 V100 GPUs), so I am doubtful the problem is related to resources. Plus, the training starts to hang during the first epoch IF the number of epochs is larger than one, but completes IF the total number of epochs is one. If resource starvation were somehow limiting the training, shouldn't it be independent of the number of epochs (modulo some caching)?

@mmaybeno

Well, I found a few issues on my end that may be related. I was able to get my custom image working locally, but without using the sagemaker-python-sdk. It turns out that running locally uses the default Docker shm-size of 64MB. If you manually create the container with a larger shm-size and run the training command from within the container, it works fine. However, AWS says they use larger shm-size volumes when they run the instances on their own infrastructure.
pytorch/pytorch#2244
pytorch/pytorch#5040
#937

That led me to look at the dataloaders: pytorch/pytorch#1355 (comment).
I have to experiment with this more, but so far no luck. What you were mentioning is similar to what I'm experiencing, but for me it doesn't even start the first epoch: the GPU just spikes for several minutes, then everything drops. A sketch of the DataLoader-side mitigation those threads suggest is below.
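
For reference, here is a minimal sketch of that DataLoader-side mitigation, assuming the hang really comes from worker processes filling the container's small /dev/shm (the dataset below is a hypothetical stand-in, not the one from this issue):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in dataset; in the real runs the data comes from a CSV.
dataset = TensorDataset(torch.randn(256, 8), torch.randint(0, 2, (256,)))

# Worker subprocesses hand batches back to the parent through shared memory
# (/dev/shm). With Docker's default 64MB shm-size that segment can fill up
# and the loader blocks. num_workers=0 keeps all loading in the main
# process, trading throughput for not touching /dev/shm at all.
loader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=0)

for features, targets in loader:
    pass  # the training step would go here

The alternative, when you control the container yourself, is to start it with a larger segment, e.g. docker run --shm-size=2g.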

@mmaybeno

@hjuhel-cdpq have you tried checking whether it stops at a particular iteration in your training? I noticed it happened after training about 6325 images. If I created a small dataset, it would continue through multiple epochs but still stop at that number (epoch 0: 5000 images, epoch 1: 1325 images). If I only wanted one epoch and fell under that image threshold, it would complete.

@hjuhel-cdpq
Author

Hi @mmaybeno,

Great news: we managed to contact someone at AWS directly to get an update on this issue, since they were not answering it on GitHub. We have just been informed that they were aware of the issue and have fixed it in the latest version of their Docker image. The corresponding PR is here. The version of the container to use from now on is here. Can you try rebuilding your Docker image with a fresh version of the container?

@mmaybeno

mmaybeno commented Apr 7, 2020

That's good to hear, @hjuhel-cdpq. Unfortunately, there are some other issues with the prebuilt images that have not been resolved on my end, namely the torchvision::nms operator error. I'll try rebuilding based on these images to see if the epoch issue has been solved for me, though. 🤞

@mmaybeno

mmaybeno commented Apr 7, 2020

Just tried it with a custom image built on their updated image, and the epoch issue is indeed resolved for me. I agree it was related to that PR about streaming stderr. Thanks @hjuhel-cdpq

@yifeim
Contributor

yifeim commented Jun 7, 2020

Came across this discussion. I think now you can do this:

Default behavior and opting out
For TensorFlow, Keras, MXNet, PyTorch and XGBoost estimators, the DebuggerHookConfig is always initialized regardless of specification while initializing the estimator. This is done to minimize code changes needed to get useful debugging information.

To disable the hook initialization, you can do so by specifying False for value of debugger_hook_config in your framework estimator’s initialization:

estimator = TensorFlow(
    role=role,
    train_instance_count=1,
    train_instance_type=train_instance_type,
    debugger_hook_config=False
)
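
Applied to the PyTorch estimator from the repro at the top of this issue, the same flag would look like the sketch below; this assumes the v1 SDK keywords used earlier in the thread, and that debugger_hook_config behaves the same across the framework estimators:

from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorch

ROLE = get_execution_role()

pytorch_estimator = PyTorch(entry_point="train.py",
                            image_name="<docker_Image>",      # same custom image as above
                            train_instance_type="ml.p3.2xlarge",
                            train_instance_count=1,
                            role=ROLE,
                            debugger_hook_config=False,       # do not attach the SMDebug hook
                           )

This should keep the sagemaker-debugger hook (the smdebug hook.py messages visible in the logs above) from being attached to the training job at all.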

@hjuhel-cdpq
Author

Hi @yifeim,
Thanks for your answer. Since AWS released their fixed image (April 2020), we no longer have the bug. I'm curious about your solution and will try it if I find a previous build of their image.

Thanks!
