sagemaker-containers ERROR ExecuteUserScriptError without error details #1225


Closed
ivankeller opened this issue Jan 11, 2020 · 5 comments

@ivankeller

Please fill out the form below.

System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): SKLearn
  • Framework Version: N/A
  • Python Version: 3.6
  • CPU or GPU: CPU
  • Python SDK Version: 3.6
  • Are you using a custom image: No

Describe the problem

The SageMaker training job fails without a traceback, giving only a message like the one below:

UnexpectedStatusException: Error for Training job sagemaker-scikit-learn-2020-01-11-00-25-23-491: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/miniconda3/bin/python -m mailswitch-train-and-calibrate-lsvc --regularization_C 0.00038335633948752675"

Minimal repro / logs

Please provide any logs and a bare minimum reproducible test case, as this will be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

  • Exact command to reproduce:

I instantiate an SKLearn estimator:

# create estimator
from sagemaker.sklearn.estimator import SKLearn

script_path = 'mailswitch-train-and-calibrate-lsvc.py'

sklearn_estimator = SKLearn(
    entry_point=script_path,
    train_instance_type='ml.c4.xlarge',   # use 'local' for local mode (and comment out the sagemaker_session line below)
    role=role,
    sagemaker_session=sagemaker_session,
    metric_definitions=[
        {'Name': 'micro-average-precision',
         'Regex': "micro-average-precision: (0.[0-9]+)"}],    # extract tracked metric from logs with regexp 
    hyperparameters={'regularization_C': 0.00038335633948752675}
)

The provided training script mailswitch-train-and-calibrate-lsvc.py is:

import argparse
import pandas as pd
import os

from scipy.sparse import load_npz, vstack
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.externals import joblib
from sklearn import metrics


if __name__ == '__main__':
    
    # ----- parse script arguments
    print('extracting arguments')
    parser = argparse.ArgumentParser()

    # Hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument('--regularization_C', type=float, default=1.0)
    
    # Data, model, and output directories
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--val', type=str, default=os.environ.get('SM_CHANNEL_VAL'))
    parser.add_argument('--calib', type=str, default=os.environ.get('SM_CHANNEL_CALIB'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--train-features-file', type=str, default='X_train_tune.npz')
    parser.add_argument('--train-labels-file', type=str, default='y_train_tune.parquet')
    parser.add_argument('--val-features-file', type=str, default='X_val_tune.npz')
    parser.add_argument('--val-labels-file', type=str, default='y_val_tune.parquet') 
    parser.add_argument('--calib-features-file', type=str, default='X_calib.npz')
    parser.add_argument('--calib-labels-file', type=str, default='y_calib.parquet') 
    parser.add_argument('--test-features-file', type=str, default='X_thres.npz')
    parser.add_argument('--test-labels-file', type=str, default='y_thres.parquet') 

    args, _ = parser.parse_known_args()
    
    # ----- read data
    # we read the training and validation data sets used in hyperparameter tuning and merge them to form a training set
    print('reading data')
    X_train1 = load_npz(os.path.join(args.train, args.train_features_file)) 
    y_train1 = pd.read_parquet(os.path.join(args.train, args.train_labels_file)).iloc[:, 0]  # to convert to Series
    X_train2 = load_npz(os.path.join(args.val, args.val_features_file)) 
    y_train2 = pd.read_parquet(os.path.join(args.val, args.val_labels_file)).iloc[:, 0]  # to convert to Series
    X_calib = load_npz(os.path.join(args.calib, args.calib_features_file)) 
    y_calib = pd.read_parquet(os.path.join(args.calib, args.calib_labels_file)).iloc[:, 0]  # to convert to Series    
    X_test = load_npz(os.path.join(args.test, args.test_features_file)) 
    y_test = pd.read_parquet(os.path.join(args.test, args.test_labels_file)).iloc[:, 0]  # to convert to Series
    
    # ----- prepare training set (merge 1 and 2)
    X_train = vstack([X_train1, X_train2])
    y_train = pd.concat([y_train1, y_train2])
    
    # ----- train uncalibrated model
    # use scikit-learn's LinearSVC classifier to train the model.
    C = args.regularization_C
    print('training uncalibrated model with hyperparameter C = {}'.format(C))
    lsvc = LinearSVC(C=C, max_iter=10000)
    uncalibrated_clf = lsvc.fit(X_train, y_train)
    
    # ----- train calibrated model
    print('training calibrated model')
    nb_calib_samples = X_calib.shape[0]
    if nb_calib_samples > 1000:
        method = 'isotonic'
    else:
        method = 'sigmoid'
    print('training calibrated model with {} method and prefit uncalibrated model'.format(method))
    calibrated_clf = CalibratedClassifierCV(base_estimator=uncalibrated_clf, method=method, cv='prefit') 
    calibrated_clf = calibrated_clf.fit(X_calib, y_calib)

    # ----- evaluation
    print('evaluate uncalibrated model')
    y_pred_uncalibrated = uncalibrated_clf.predict(X_test)
    print("micro-average-precision: " + str(round(metrics.precision_score(y_test, y_pred_uncalibrated, average='micro'), 3)))
    print('evaluate calibrated model')
    y_pred_calibrated = calibrated_clf.predict(X_test)
    print("micro-average-precision: " + str(round(metrics.precision_score(y_test, y_pred_calibrated, average='micro'), 3)))
    
    # ----- persist models
    path_uncalibrated = os.path.join(args.model_dir, "model_uncalibrated.joblib")
    joblib.dump(uncalibrated_clf, path_uncalibrated)
    print('uncalibrated model persisted at', path_uncalibrated)
    path_calibrated = os.path.join(args.model_dir, "model_calibrated.joblib")
    joblib.dump(calibrated_clf , path_calibrated)
    print('calibrated model persisted at', path_calibrated)

The error occurs when fitting the estimator:

# launch training job
sklearn_estimator.fit({
    'train': train_input, 
    'val': val_input, 
    'calib': calib_input,
    'test': test_input
})
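For context on how the failing command line is built: SageMaker passes the estimator's hyperparameters dict to the entry point as command-line flags (visible in the logs below as SM_USER_ARGS), and the script's argparse block picks them up. A minimal standalone sketch of that mechanism, with the argument list hard-coded here purely for illustration:

```python
import argparse

# Mirror of the training script's hyperparameter argument.
parser = argparse.ArgumentParser()
parser.add_argument('--regularization_C', type=float, default=1.0)

# SageMaker invokes the entry point roughly as:
#   /miniconda3/bin/python -m <module> --regularization_C 0.00038335633948752675
argv = ['--regularization_C', '0.00038335633948752675']

# parse_known_args ignores any extra arguments the container may add.
args, _ = parser.parse_known_args(argv)
print(args.regularization_C)
```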

All the input paths are valid S3 locations.
Note that training succeeds in local mode when specifying train_instance_type='local' in the SKLearn estimator instantiation (and commenting out the line defining sagemaker_session).

The complete log of the failed fit execution is:

2020-01-11 00:25:24 Starting - Starting the training job...
2020-01-11 00:25:25 Starting - Launching requested ML instances...
2020-01-11 00:26:23 Starting - Preparing the instances for training......
2020-01-11 00:27:17 Downloading - Downloading input data......
2020-01-11 00:28:11 Training - Training image download completed. Training in progress.
2020-01-11 00:28:12,270 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2020-01-11 00:28:12,273 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-01-11 00:28:12,284 sagemaker_sklearn_container.training INFO     Invoking user training script.
2020-01-11 00:28:12,551 sagemaker-containers INFO     Module mailswitch-train-and-calibrate-lsvc does not provide a setup.py. 
Generating setup.py
2020-01-11 00:28:12,551 sagemaker-containers INFO     Generating setup.cfg
2020-01-11 00:28:12,551 sagemaker-containers INFO     Generating MANIFEST.in
2020-01-11 00:28:12,551 sagemaker-containers INFO     Installing module with the following command:
/miniconda3/bin/python -m pip install . 
Processing /opt/ml/code
Building wheels for collected packages: mailswitch-train-and-calibrate-lsvc
  Building wheel for mailswitch-train-and-calibrate-lsvc (setup.py): started
  Building wheel for mailswitch-train-and-calibrate-lsvc (setup.py): finished with status 'done'
  Created wheel for mailswitch-train-and-calibrate-lsvc: filename=mailswitch_train_and_calibrate_lsvc-1.0.0-py2.py3-none-any.whl size=7687 sha256=5764f87a457b9e8efac835e1d0c4b7f2486f78a5f1df1e6c53fc54730674f04a
  Stored in directory: /tmp/pip-ephem-wheel-cache-q5juxd4g/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
Successfully built mailswitch-train-and-calibrate-lsvc
Installing collected packages: mailswitch-train-and-calibrate-lsvc
Successfully installed mailswitch-train-and-calibrate-lsvc-1.0.0
2020-01-11 00:28:13,798 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-01-11 00:28:13,808 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "val": "/opt/ml/input/data/val",
        "test": "/opt/ml/input/data/test",
        "calib": "/opt/ml/input/data/calib",
        "train": "/opt/ml/input/data/train"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_sklearn_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "regularization_C": 0.00038335633948752675
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "val": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        },
        "test": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        },
        "calib": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        },
        "train": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "sagemaker-scikit-learn-2020-01-11-00-25-23-491",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-eu-central-1-314189662144/sagemaker-scikit-learn-2020-01-11-00-25-23-491/source/sourcedir.tar.gz",
    "module_name": "mailswitch-train-and-calibrate-lsvc",
    "network_interface_name": "eth0",
    "num_cpus": 4,
    "num_gpus": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1"
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "mailswitch-train-and-calibrate-lsvc.py"
}

Environment variables:

SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"regularization_C":0.00038335633948752675}
SM_USER_ENTRY_POINT=mailswitch-train-and-calibrate-lsvc.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"calib":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"val":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["calib","test","train","val"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=mailswitch-train-and-calibrate-lsvc
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_sklearn_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-eu-central-1-314189662144/sagemaker-scikit-learn-2020-01-11-00-25-23-491/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"calib":"/opt/ml/input/data/calib","test":"/opt/ml/input/data/test","train":"/opt/ml/input/data/train","val":"/opt/ml/input/data/val"},"current_host":"algo-1","framework_module":"sagemaker_sklearn_container.training:main","hosts":["algo-1"],"hyperparameters":{"regularization_C":0.00038335633948752675},"input_config_dir":"/opt/ml/input/config","input_data_config":{"calib":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"val":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"sagemaker-scikit-learn-2020-01-11-00-25-23-491","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-eu-central-1-314189662144/sagemaker-scikit-learn-2020-01-11-00-25-23-491/source/sourcedir.tar.gz","module_name":"mailswitch-train-and-calibrate-lsvc","network_interface_name":"eth0","num_cpus":4,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"mailswitch-train-and-calibrate-lsvc.py"}
SM_USER_ARGS=["--regularization_C","0.00038335633948752675"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_VAL=/opt/ml/input/data/val
SM_CHANNEL_TEST=/opt/ml/input/data/test
SM_CHANNEL_CALIB=/opt/ml/input/data/calib
SM_CHANNEL_TRAIN=/opt/ml/input/data/train
SM_HP_REGULARIZATION_C=0.00038335633948752675
PYTHONPATH=/miniconda3/bin:/miniconda3/lib/python37.zip:/miniconda3/lib/python3.7:/miniconda3/lib/python3.7/lib-dynload:/miniconda3/lib/python3.7/site-packages

Invoking script with the following command:

/miniconda3/bin/python -m mailswitch-train-and-calibrate-lsvc --regularization_C 0.00038335633948752675


/miniconda3/lib/python3.7/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
extracting arguments
reading data
training uncalibrated model with hyperparameter C = 0.00038335633948752675
2020-01-11 00:28:33,612 sagemaker-containers ERROR    ExecuteUserScriptError:
Command "/miniconda3/bin/python -m mailswitch-train-and-calibrate-lsvc --regularization_C 0.00038335633948752675"

2020-01-11 00:28:39 Uploading - Uploading generated training model
2020-01-11 00:28:39 Failed - Training job failed

And the Traceback is:

---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-16-8b1c1d92127e> in <module>()
      4     'val': val_input,
      5     'calib': calib_input,
----> 6     'test': test_input
      7 })

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    462         self.jobs.append(self.latest_training_job)
    463         if wait:
--> 464             self.latest_training_job.wait(logs=logs)
    465 
    466     def _compilation_job_name(self):

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   1059         # If logs are requested, call logs_for_jobs.
   1060         if logs != "None":
-> 1061             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   1062         else:
   1063             self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   2972 
   2973         if wait:
-> 2974             self._check_job_status(job_name, description, "TrainingJobStatus")
   2975             if dot:
   2976                 print()

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   2566                 ),
   2567                 allowed_statuses=["Completed", "Stopped"],
-> 2568                 actual_status=status,
   2569             )
   2570 

UnexpectedStatusException: Error for Training job sagemaker-scikit-learn-2020-01-11-00-25-23-491: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/miniconda3/bin/python -m mailswitch-train-and-calibrate-lsvc --regularization_C 0.00038335633948752675"
@laurenyu
Contributor

I've reached out to the team that owns SKLearn to see if they have any insight. Thanks for the detailed information.

@ivankeller
Author

ivankeller commented Jan 13, 2020

It was a computing resource issue: training succeeded in local mode on an ml.t3.xlarge notebook instance (4 CPUs / 16 GB RAM) but failed on an ml.c4.xlarge training instance (4 CPUs / 7.5 GB RAM).
To be sure, I tried a more powerful training instance, ml.m5.2xlarge (8 CPUs / 32 GB RAM), and it succeeded.

The logs are not meaningful when the training job fails for this reason.
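For anyone else hitting this, one way to confirm memory pressure before the container dies is to log the process's peak memory from inside the training script at a few checkpoints (e.g. after loading data, after vstack). A minimal sketch using only the standard library; note that ru_maxrss is reported in KiB on Linux but in bytes on macOS:

```python
import resource
import sys

def log_peak_memory(tag):
    """Print this process's peak resident set size so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == 'darwin':
        peak //= 1024  # macOS reports bytes; normalize to KiB
    print('{}: peak RSS ~ {:.1f} MiB'.format(tag, peak / 1024))

# Call at checkpoints in the training script, e.g.:
log_peak_memory('after loading data')
```

Comparing these prints against the instance's RAM (7.5 GB on ml.c4.xlarge) makes an out-of-memory kill much easier to spot than the bare ExecuteUserScriptError.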

@yunzhe-tao

Does anyone know the root cause of this issue? I ran into a similar issue in another notebook.

@Harsh462

I am also having a similar issue. I tried changing the GPU instance type for my region, but the issue persists.
Does anyone have a solution?
Help is appreciated!

@sagarhadole

I am unable to resolve the error even when using a more powerful ml.m5.2xlarge training instance. How can this be solved? Any suggestion is appreciated.
