sagemaker-containers ERROR ExecuteUserScriptError without error details #1225


Closed
ivankeller opened this issue Jan 11, 2020 · 5 comments

@ivankeller

Please fill out the form below.

System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): SKLearn
  • Framework Version: N/A
  • Python Version: 3.6
  • CPU or GPU: CPU
  • Python SDK Version: 3.6
  • Are you using a custom image: No

Describe the problem

The SageMaker training job fails without a traceback, giving only a message like the one below:

UnexpectedStatusException: Error for Training job sagemaker-scikit-learn-2020-01-11-00-25-23-491: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/miniconda3/bin/python -m mailswitch-train-and-calibrate-lsvc --regularization_C 0.00038335633948752675"

Minimal repro / logs

Please provide any logs and a bare minimum reproducible test case, as this will be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

  • Exact command to reproduce:

I instantiate an SKLearn estimator:

# create estimator
from sagemaker.sklearn.estimator import SKLearn

script_path = 'mailswitch-train-and-calibrate-lsvc.py'

sklearn_estimator = SKLearn(
    entry_point=script_path,
    train_instance_type='ml.c4.xlarge',   # use 'local' for local mode (and comment out the sagemaker_session line below)
    role=role,
    sagemaker_session=sagemaker_session,
    metric_definitions=[
        {'Name': 'micro-average-precision',
         'Regex': "micro-average-precision: (0.[0-9]+)"}],    # extract tracked metric from logs with regexp 
    hyperparameters={'regularization_C': 0.00038335633948752675}
)

The provided training script mailswitch-train-and-calibrate-lsvc.py is:

import argparse
import pandas as pd
import os

from scipy.sparse import load_npz, vstack
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.externals import joblib
from sklearn import metrics


if __name__ == '__main__':
    
    # ----- parse script arguments
    print('extracting arguments')
    parser = argparse.ArgumentParser()

    # Hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument('--regularization_C', type=float, default=1.0)
    
    # Data, model, and output directories
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--val', type=str, default=os.environ.get('SM_CHANNEL_VAL'))
    parser.add_argument('--calib', type=str, default=os.environ.get('SM_CHANNEL_CALIB'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--train-features-file', type=str, default='X_train_tune.npz')
    parser.add_argument('--train-labels-file', type=str, default='y_train_tune.parquet')
    parser.add_argument('--val-features-file', type=str, default='X_val_tune.npz')
    parser.add_argument('--val-labels-file', type=str, default='y_val_tune.parquet') 
    parser.add_argument('--calib-features-file', type=str, default='X_calib.npz')
    parser.add_argument('--calib-labels-file', type=str, default='y_calib.parquet') 
    parser.add_argument('--test-features-file', type=str, default='X_thres.npz')
    parser.add_argument('--test-labels-file', type=str, default='y_thres.parquet') 

    args, _ = parser.parse_known_args()
    
    # ----- read data
    # we read the training and validation data sets used in hyperparameter tuning and merge them to form a training set
    print('reading data')
    X_train1 = load_npz(os.path.join(args.train, args.train_features_file)) 
    y_train1 = pd.read_parquet(os.path.join(args.train, args.train_labels_file)).iloc[:, 0]  # to convert to Series
    X_train2 = load_npz(os.path.join(args.val, args.val_features_file)) 
    y_train2 = pd.read_parquet(os.path.join(args.val, args.val_labels_file)).iloc[:, 0]  # to convert to Series
    X_calib = load_npz(os.path.join(args.calib, args.calib_features_file)) 
    y_calib = pd.read_parquet(os.path.join(args.calib, args.calib_labels_file)).iloc[:, 0]  # to convert to Series    
    X_test = load_npz(os.path.join(args.test, args.test_features_file)) 
    y_test = pd.read_parquet(os.path.join(args.test, args.test_labels_file)).iloc[:, 0]  # to convert to Series
    
    # ----- prepare training set (merge 1 and 2)
    X_train = vstack([X_train1, X_train2])
    y_train = pd.concat([y_train1, y_train2])
    
    # ----- train uncalibrated model
    # use scikit-learn's LinearSVC classifier to train the model.
    C = args.regularization_C
    print('training uncalibrated model with hyperparameter C = {}'.format(C))
    lsvc = LinearSVC(C=C, max_iter=10000)
    uncalibrated_clf = lsvc.fit(X_train, y_train)
    
    # ----- train calibrated model
    print('training calibrated model')
    nb_calib_samples = X_calib.shape[0]
    if nb_calib_samples > 1000:
        method = 'isotonic'
    else:
        method = 'sigmoid'
    print('training calibrated model with {} method and prefit uncalibrated model'.format(method))
    calibrated_clf = CalibratedClassifierCV(base_estimator=uncalibrated_clf, method=method, cv='prefit') 
    calibrated_clf = calibrated_clf.fit(X_calib, y_calib)

    # ----- evaluation
    print('evaluate uncalibrated model')
    y_pred_uncalibrated = uncalibrated_clf.predict(X_test)
    print("micro-average-precision: " + str(round(metrics.precision_score(y_test, y_pred_uncalibrated, average='micro'), 3)))
    print('evaluate calibrated model')
    y_pred_calibrated = calibrated_clf.predict(X_test)
    print("micro-average-precision: " + str(round(metrics.precision_score(y_test, y_pred_calibrated, average='micro'), 3)))
    
    # ----- persist models
    path_uncalibrated = os.path.join(args.model_dir, "model_uncalibrated.joblib")
    joblib.dump(uncalibrated_clf, path_uncalibrated)
    print('uncalibrated model persisted at', path_uncalibrated)
    path_calibrated = os.path.join(args.model_dir, "model_calibrated.joblib")
    joblib.dump(calibrated_clf , path_calibrated)
    print('calibrated model persisted at', path_calibrated)

The error occurs when fitting the estimator:

# launch training job
sklearn_estimator.fit({
    'train': train_input, 
    'val': val_input, 
    'calib': calib_input,
    'test': test_input
})
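For context on how the failing command line is built: SageMaker passes the estimator's hyperparameters dict to the entry point as command-line flags (visible in the logs below as SM_USER_ARGS), and the script's argparse block picks them up. A minimal standalone sketch of that mechanism, with the argument list hard-coded here purely for illustration:

```python
import argparse

# Mirror of the training script's hyperparameter argument.
parser = argparse.ArgumentParser()
parser.add_argument('--regularization_C', type=float, default=1.0)

# SageMaker invokes the entry point roughly as:
#   /miniconda3/bin/python -m <module> --regularization_C 0.00038335633948752675
argv = ['--regularization_C', '0.00038335633948752675']

# parse_known_args ignores any extra arguments the container may add.
args, _ = parser.parse_known_args(argv)
print(args.regularization_C)
```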

All the input paths are valid S3 locations.
Note that training succeeds in local mode when specifying train_instance_type='local' in the SKLearn estimator instantiation (and commenting out the line defining sagemaker_session).

The complete log of the failed fit execution is:

2020-01-11 00:25:24 Starting - Starting the training job...
2020-01-11 00:25:25 Starting - Launching requested ML instances...
2020-01-11 00:26:23 Starting - Preparing the instances for training......
2020-01-11 00:27:17 Downloading - Downloading input data......
2020-01-11 00:28:11 Training - Training image download completed. Training in progress.
2020-01-11 00:28:12,270 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2020-01-11 00:28:12,273 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-01-11 00:28:12,284 sagemaker_sklearn_container.training INFO     Invoking user training script.
2020-01-11 00:28:12,551 sagemaker-containers INFO     Module mailswitch-train-and-calibrate-lsvc does not provide a setup.py. 
Generating setup.py
2020-01-11 00:28:12,551 sagemaker-containers INFO     Generating setup.cfg
2020-01-11 00:28:12,551 sagemaker-containers INFO     Generating MANIFEST.in
2020-01-11 00:28:12,551 sagemaker-containers INFO     Installing module with the following command:
/miniconda3/bin/python -m pip install . 
Processing /opt/ml/code
Building wheels for collected packages: mailswitch-train-and-calibrate-lsvc
  Building wheel for mailswitch-train-and-calibrate-lsvc (setup.py): started
  Building wheel for mailswitch-train-and-calibrate-lsvc (setup.py): finished with status 'done'
  Created wheel for mailswitch-train-and-calibrate-lsvc: filename=mailswitch_train_and_calibrate_lsvc-1.0.0-py2.py3-none-any.whl size=7687 sha256=5764f87a457b9e8efac835e1d0c4b7f2486f78a5f1df1e6c53fc54730674f04a
  Stored in directory: /tmp/pip-ephem-wheel-cache-q5juxd4g/wheels/35/24/16/37574d11bf9bde50616c67372a334f94fa8356bc7164af8ca3
Successfully built mailswitch-train-and-calibrate-lsvc
Installing collected packages: mailswitch-train-and-calibrate-lsvc
Successfully installed mailswitch-train-and-calibrate-lsvc-1.0.0
2020-01-11 00:28:13,798 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-01-11 00:28:13,808 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "val": "/opt/ml/input/data/val",
        "test": "/opt/ml/input/data/test",
        "calib": "/opt/ml/input/data/calib",
        "train": "/opt/ml/input/data/train"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_sklearn_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "regularization_C": 0.00038335633948752675
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "val": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        },
        "test": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        },
        "calib": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        },
        "train": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "sagemaker-scikit-learn-2020-01-11-00-25-23-491",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-eu-central-1-314189662144/sagemaker-scikit-learn-2020-01-11-00-25-23-491/source/sourcedir.tar.gz",
    "module_name": "mailswitch-train-and-calibrate-lsvc",
    "network_interface_name": "eth0",
    "num_cpus": 4,
    "num_gpus": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1"
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "mailswitch-train-and-calibrate-lsvc.py"
}

Environment variables:

SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"regularization_C":0.00038335633948752675}
SM_USER_ENTRY_POINT=mailswitch-train-and-calibrate-lsvc.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"calib":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"val":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["calib","test","train","val"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=mailswitch-train-and-calibrate-lsvc
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_sklearn_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=4
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-eu-central-1-314189662144/sagemaker-scikit-learn-2020-01-11-00-25-23-491/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"calib":"/opt/ml/input/data/calib","test":"/opt/ml/input/data/test","train":"/opt/ml/input/data/train","val":"/opt/ml/input/data/val"},"current_host":"algo-1","framework_module":"sagemaker_sklearn_container.training:main","hosts":["algo-1"],"hyperparameters":{"regularization_C":0.00038335633948752675},"input_config_dir":"/opt/ml/input/config","input_data_config":{"calib":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"test":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"},"val":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"sagemaker-scikit-learn-2020-01-11-00-25-23-491","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-eu-central-1-314189662144/sagemaker-scikit-learn-2020-01-11-00-25-23-491/source/sourcedir.tar.gz","module_name":"mailswitch-train-and-calibrate-lsvc","network_interface_name":"eth0","num_cpus":4,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"mailswitch-train-and-calibrate-lsvc.py"}
SM_USER_ARGS=["--regularization_C","0.00038335633948752675"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_VAL=/opt/ml/input/data/val
SM_CHANNEL_TEST=/opt/ml/input/data/test
SM_CHANNEL_CALIB=/opt/ml/input/data/calib
SM_CHANNEL_TRAIN=/opt/ml/input/data/train
SM_HP_REGULARIZATION_C=0.00038335633948752675
PYTHONPATH=/miniconda3/bin:/miniconda3/lib/python37.zip:/miniconda3/lib/python3.7:/miniconda3/lib/python3.7/lib-dynload:/miniconda3/lib/python3.7/site-packages

Invoking script with the following command:

/miniconda3/bin/python -m mailswitch-train-and-calibrate-lsvc --regularization_C 0.00038335633948752675


/miniconda3/lib/python3.7/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:47: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
extracting arguments
reading data
training uncalibrated model with hyperparameter C = 0.00038335633948752675
2020-01-11 00:28:33,612 sagemaker-containers ERROR    ExecuteUserScriptError:
Command "/miniconda3/bin/python -m mailswitch-train-and-calibrate-lsvc --regularization_C 0.00038335633948752675"

2020-01-11 00:28:39 Uploading - Uploading generated training model
2020-01-11 00:28:39 Failed - Training job failed

And the Traceback is:

---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-16-8b1c1d92127e> in <module>()
      4     'val': val_input,
      5     'calib': calib_input,
----> 6     'test': test_input
      7 })

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    462         self.jobs.append(self.latest_training_job)
    463         if wait:
--> 464             self.latest_training_job.wait(logs=logs)
    465 
    466     def _compilation_job_name(self):

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   1059         # If logs are requested, call logs_for_jobs.
   1060         if logs != "None":
-> 1061             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   1062         else:
   1063             self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   2972 
   2973         if wait:
-> 2974             self._check_job_status(job_name, description, "TrainingJobStatus")
   2975             if dot:
   2976                 print()

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   2566                 ),
   2567                 allowed_statuses=["Completed", "Stopped"],
-> 2568                 actual_status=status,
   2569             )
   2570 

UnexpectedStatusException: Error for Training job sagemaker-scikit-learn-2020-01-11-00-25-23-491: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/miniconda3/bin/python -m mailswitch-train-and-calibrate-lsvc --regularization_C 0.00038335633948752675"
@laurenyu
Contributor

I've reached out to the team that owns SKLearn to see if they have any insight. Thanks for the detailed information.

@ivankeller
Author

ivankeller commented Jan 13, 2020

It was a computing resource issue: training succeeded in local mode on an ml.t3.xlarge notebook instance (4 CPUs / 16 GB RAM) but failed on an ml.c4.xlarge training instance (4 CPUs / 7.5 GB RAM).
To be sure, I tried a more powerful training instance, ml.m5.2xlarge (8 CPUs / 32 GB RAM), and it succeeded.

The logs are not meaningful when the training job fails for this reason.
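For anyone else hitting this, one way to confirm memory pressure before the container dies is to log the process's peak memory from inside the training script at a few checkpoints (e.g. after loading data, after vstack). A minimal sketch using only the standard library; note that ru_maxrss is reported in KiB on Linux but in bytes on macOS:

```python
import resource
import sys

def log_peak_memory(tag):
    """Print this process's peak resident set size so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == 'darwin':
        peak //= 1024  # macOS reports bytes; normalize to KiB
    print('{}: peak RSS ~ {:.1f} MiB'.format(tag, peak / 1024))

# Call at checkpoints in the training script, e.g.:
log_peak_memory('after loading data')
```

Comparing these prints against the instance's RAM (7.5 GB on ml.c4.xlarge) makes an out-of-memory kill much easier to spot than the bare ExecuteUserScriptError.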

@yunzhe-tao

Does anyone know the root cause of this issue? I ran into a similar issue in another notebook.

@Harsh462

I am also having a similar issue. I tried changing the GPU instance type for my region, but the issue persists.
Does anyone have a solution?
Help is appreciated!

@sagarhadole

I am unable to resolve the error even when using a more powerful ml.m5.2xlarge training instance. How can this be solved? Any suggestion is appreciated.
