entry point not detected in Pytorch script mode when in source folder #1325

hjuhel-cdpq · 2020-03-04T22:13:59Z

Describe the bug
While using:

A custom image, forked from the official PyTorch image
Some custom training code, located in the code folder (code/train.py)

the entry point train.py is never detected.

Expected behavior
As stated in the documentation, the entry point should be correctly detected.

System information

SageMaker Python SDK version:
Framework name (eg. PyTorch) or algorithm (eg. KMeans): Pytorch
Framework version: 1.4.0
Python version: 3.6
CPU or GPU: GPU
Custom Docker image (Y/N): Y

To reproduce

Create a docker image, adding requirements to the official pytorch image and push it to ECR

FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.4.0-gpu-py36-cu101-ubuntu16.04

## Install and Download Spacy models
RUN pip install spacy
RUN python -m spacy download "en"
RUN python -m spacy download "en_core_web_lg"
RUN python -m spacy download "en_core_web_md"

In SageMaker Jupyter, create a folder named code containing the following train.py script:

import argparse
import sys


def train(args):
    print("done")


if __name__ == "__main__":

    # Retrieve hyperparameters
    parser = argparse.ArgumentParser()

    # Parse the hyperparameters
    parser.add_argument("--embedding_dim", type=int, default=300)

    args = parser.parse_args()

    try:
        train(args)
    except Exception:
        # A non-zero exit code marks the training job as failed
        sys.exit(1)
    else:
        sys.exit(0)

The folder tree must look like:

$ tree code
code
|-- train.py

Instantiate a PyTorch training job as follows.

Please note that source_dir is the folder containing the entry point:

import sagemaker
from sagemaker.pytorch import PyTorch
from sagemaker.estimator import Estimator
from sagemaker import get_execution_role

ROLE = get_execution_role()
INSTANCE = 'ml.p3.2xlarge'

output_path = '<path_to_s3_location>'

pytorch_estimator = PyTorch(entry_point='train.py',
                            source_dir="code",
                            image_name="<path_to_docker_image>",
                            base_job_name="foobar",
                            train_instance_type=INSTANCE,
                            train_instance_count=1,
                            role=ROLE
                           )

pytorch_estimator.fit()

The job fails with the error:

/opt/conda/bin/python train.py

/opt/conda/bin/python: can't open file 'train.py': [Errno 2] No such file or directory
2020-03-04 18:05:55,332 sagemaker-containers ERROR    ExecuteUserScriptError:
Command "/opt/conda/bin/python train.py"
/opt/conda/bin/python: can't open file 'train.py': [Errno 2] No such file or directory

Screenshots or logs
The complete log is:

No framework_version specified, defaulting to version 0.4. This is not the latest supported version. If you would like to use version 1.3.1, please add framework_version=1.3.1 to your constructor.
2020-03-04 18:00:29 Starting - Starting the training job...
2020-03-04 18:00:30 Starting - Launching requested ML instances......
2020-03-04 18:01:35 Starting - Preparing the instances for training......
2020-03-04 18:02:45 Downloading - Downloading input data
2020-03-04 18:02:45 Training - Downloading the training image..................
2020-03-04 18:05:56 Uploading - Uploading generated training model
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-03-04 18:05:52,689 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-03-04 18:05:52,713 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-03-04 18:05:52,714 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-03-04 18:05:53,164 sagemaker-containers INFO     Module default_user_module_name does not provide a setup.py. 
Generating setup.py
2020-03-04 18:05:53,164 sagemaker-containers INFO     Generating setup.cfg
2020-03-04 18:05:53,164 sagemaker-containers INFO     Generating MANIFEST.in
2020-03-04 18:05:53,165 sagemaker-containers INFO     Installing module with the following command:
/opt/conda/bin/python -m pip install . 
Processing /tmp/tmpzvar_c2z/module_dir
Building wheels for collected packages: default-user-module-name
  Building wheel for default-user-module-name (setup.py): started
  Building wheel for default-user-module-name (setup.py): finished with status 'done'
  Created wheel for default-user-module-name: filename=default_user_module_name-1.0.0-py2.py3-none-any.whl size=4364 sha256=cbf38ce610a10a5f5588edd12c6984667076665eaadbcfb858505e580ff243a5
  Stored in directory: /tmp/pip-ephem-wheel-cache-lfxv8mpn/wheels/52/eb/a1/d0d0d1552198986736428c8139b9ba7e9d91768a9bfb9b8094
Successfully built default-user-module-name
Installing collected packages: default-user-module-name
Successfully installed default-user-module-name-1.0.0
2020-03-04 18:05:55,303 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {},
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "training": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "sentiment10k-sgmk-cnn-2-2020-03-04-18-00-28-905",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://leo-data/sentiment10k-sgmk-cnn-2-2020-03-04-18-00-28-905/source/sourcedir.tar.gz",
    "module_name": "/opt/ml/code/train",
    "network_interface_name": "eth0",
    "num_cpus": 8,
    "num_gpus": 1,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1"
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "/opt/ml/code/train.py"
}

Environment variables:

SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={}
SM_USER_ENTRY_POINT=/opt/ml/code/train.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["training"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=/opt/ml/code/train
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=8
SM_NUM_GPUS=1
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://.../source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"...","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://.../source/sourcedir.tar.gz","module_name":"/opt/ml/code/train","network_interface_name":"eth0","num_cpus":8,"num_gpus":1,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"/opt/ml/code/train.py"}
SM_USER_ARGS=[]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages

Invoking script with the following command:

/opt/conda/bin/python train.py

/opt/conda/bin/python: can't open file 'train.py': [Errno 2] No such file or directory
2020-03-04 18:05:55,332 sagemaker-containers ERROR    ExecuteUserScriptError:
Command "/opt/conda/bin/python train.py"
/opt/conda/bin/python: can't open file 'train.py': [Errno 2] No such file or directory
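The log reports SM_USER_ENTRY_POINT=/opt/ml/code/train.py, yet the script is launched with the relative command python train.py, so the file has to exist in the directory the command runs from. A minimal diagnostic sketch, meant to be run inside the training container, compares what the environment announces with what is actually on disk. The helper name check_entry_point is hypothetical; the /opt/ml/code default and the SM_USER_ENTRY_POINT variable are taken from the log above.

```python
import os


def check_entry_point(env=None, code_dir="/opt/ml/code"):
    """Report whether the configured entry point exists in code_dir.

    check_entry_point is a hypothetical helper, not part of any SDK.
    """
    env = os.environ if env is None else env
    # SM_USER_ENTRY_POINT may hold a bare file name or a full path,
    # so only the base name is compared against the directory listing.
    configured = env.get("SM_USER_ENTRY_POINT", "")
    script_name = os.path.basename(configured)
    present = sorted(os.listdir(code_dir)) if os.path.isdir(code_dir) else []
    return {
        "configured": configured,
        "script_name": script_name,
        "code_dir_contents": present,
        "found": bool(script_name) and script_name in present,
    }


if __name__ == "__main__":
    print(check_entry_point())
```

If found is False while code_dir_contents is not empty, the directory holds something other than the uploaded sources, which is worth investigating first.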

Could you please help me figure out what I'm doing wrong?


ehsanmok · 2020-03-09T18:05:07Z

Examine your SAGEMAKER_PROGRAM env variable and see what's missing.

General development tips:

Use local mode.
Enable debug-level logging with container_log_level=10.
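As a rough sketch of these two tips, the debug settings can be layered on top of existing estimator arguments. This assumes the SDK v1 parameter names used in the report (train_instance_type, container_log_level); the helper with_debug_settings is hypothetical, and local mode additionally needs Docker on the machine that launches the job.

```python
import logging


def with_debug_settings(estimator_kwargs):
    """Return a copy of the estimator arguments tuned for debugging.

    with_debug_settings is a hypothetical helper, not part of the SDK.
    """
    debug_kwargs = dict(estimator_kwargs)  # leave the caller's dict untouched
    # "local" asks the SDK to run the training container on this machine.
    debug_kwargs["train_instance_type"] = "local"
    # logging.DEBUG equals 10, the value suggested above.
    debug_kwargs["container_log_level"] = logging.DEBUG
    return debug_kwargs


base = {
    "entry_point": "train.py",
    "source_dir": "code",
    "train_instance_type": "ml.p3.2xlarge",
    "train_instance_count": 1,
}
debug = with_debug_settings(base)
print(debug["train_instance_type"], debug["container_log_level"])
```

The resulting dictionary could then be unpacked into the estimator constructor together with the image and role, for example PyTorch(image_name=..., role=..., **debug).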

hjuhel-cdpq · 2020-03-12T13:55:48Z

Thanks @ehsanmok for your answer.

SAGEMAKER_PROGRAM was set up correctly. The issue was caused by a requirements.txt file. It's unclear to me why this was happening, but copying the requirements to /opt/ml/code was somehow causing the bug.

Switching from

COPY requirements.txt /opt/ml/code/requirements.txt
RUN pip install -r /opt/ml/code/requirements.txt

to:

COPY requirements.txt /home/requirements.txt
RUN pip install -r /home/requirements.txt

resolved the issue.
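One possible explanation, which this thread does not confirm, is that the container unpacks sourcedir.tar.gz into /opt/ml/code only when that directory is empty, so a file baked into the image would prevent train.py from ever arriving. The sketch below simulates that assumed behavior with temporary directories; extract_if_empty and make_source_archive are hypothetical stand-ins, not the container toolkit's actual code.

```python
import io
import os
import tarfile
import tempfile


def extract_if_empty(archive_path, target_dir):
    """Hypothetical stand-in: unpack only when target_dir has no entries."""
    os.makedirs(target_dir, exist_ok=True)
    if os.listdir(target_dir):
        return False  # something is already there, so nothing is unpacked
    with tarfile.open(archive_path, mode="r:gz") as archive:
        archive.extractall(path=target_dir)
    return True


def make_source_archive(archive_path):
    """Build a tiny sourcedir.tar.gz that contains only train.py."""
    payload = b'print("done")\n'
    info = tarfile.TarInfo(name="train.py")
    info.size = len(payload)
    with tarfile.open(archive_path, mode="w:gz") as archive:
        archive.addfile(info, io.BytesIO(payload))


with tempfile.TemporaryDirectory() as workdir:
    archive_path = os.path.join(workdir, "sourcedir.tar.gz")
    make_source_archive(archive_path)

    # Case 1: the image ships an empty code directory.
    clean_dir = os.path.join(workdir, "clean_code")
    extract_if_empty(archive_path, clean_dir)
    clean_contents = sorted(os.listdir(clean_dir))

    # Case 2: the image already copied requirements.txt into the directory.
    baked_dir = os.path.join(workdir, "baked_code")
    os.makedirs(baked_dir)
    open(os.path.join(baked_dir, "requirements.txt"), "w").close()
    extract_if_empty(archive_path, baked_dir)
    baked_contents = sorted(os.listdir(baked_dir))

print("empty code dir:", clean_contents)
print("pre-populated code dir:", baked_contents)
```

Under this assumption, the second case ends up without train.py, which would match the "No such file or directory" error in the log. Keeping /opt/ml/code empty in the image, as in the fix above, avoids the situation either way.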

LecJackS · 2020-10-27T17:58:48Z

Just for the record, this also happened to me while cloning a repo (with a requirements.txt file) inside /opt/ml/code

Cloning it outside /code (in /opt/ml/) solved the problem for me.


hjuhel-cdpq closed this as completed Mar 12, 2020
