Not detect requirements.txt in TensorFlow script mode #911

xkumiyu · 2019-07-07T07:07:47Z

System Information

Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): TensorFlow
Framework Version: 1.12.0
Python Version: 3.6
CPU or GPU: CPU
Python SDK Version: 1.32.1
Are you using a custom image: No

Describe the problem

In #839, it was mentioned that Docker Image detected requirements.txt and installed python libraries, but in my experiment it was not installed.
I think the Image did not detect requirements.txt, but I confirmed that requirements.txt exists in sourcedir.tar.gz uploaded to S3.

Minimal repro / logs

2019-07-07 06:54:53 Starting - Starting the training job...
2019-07-07 06:54:58 Starting - Launching requested ML instances......
2019-07-07 06:56:05 Starting - Preparing the instances for training...
2019-07-07 06:56:43 Downloading - Downloading input data...
2019-07-07 06:57:31 Training - Training image download completed. Training in progress.
2019-07-07 06:57:31 Uploading - Uploading generated training model
2019-07-07 06:57:31 Failed - Training job failed

2019-07-07 06:57:21,299 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2019-07-07 06:57:21,306 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-07-07 06:57:21,547 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-07-07 06:57:21,561 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-07-07 06:57:21,572 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_tensorflow_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "model_dir": "s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model"
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "training": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/source/sourcedir.tar.gz",
    "module_name": "train",
    "network_interface_name": "eth0",
    "num_cpus": 2,
    "num_gpus": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1"
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "train.py"
}

Environment variables:

SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"model_dir":"s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model"}
SM_USER_ENTRY_POINT=train.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["training"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=train
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_tensorflow_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=2
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","framework_module":"sagemaker_tensorflow_container.training:main","hosts":["algo-1"],"hyperparameters":{"model_dir":"s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model"},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":2,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
SM_USER_ARGS=["--model_dir","s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_HP_MODEL_DIR=s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model
PYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/lib/python36.zip:/usr/lib/python3.6:/usr/lib/python3.6/lib-dynload:/usr/local/lib/python3.6/dist-packages:/usr/lib/python3/dist-packages

Invoking script with the following command:

/usr/bin/python train.py --model_dir s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model


Traceback (most recent call last):
  File "train.py", line 1, in <module>
    import matplotlib.pyplot as plt
ModuleNotFoundError: No module named 'matplotlib'
2019-07-07 06:57:21,597 sagemaker-containers ERROR    ExecuteUserScriptError:
Command "/usr/bin/python train.py --model_dir s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model"

Exact command to reproduce:

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='train.py',
                       source_dir='src',
                       role=role,
                       train_instance_type='ml.m5.large',
                       train_instance_count=1,
                       framework_version='1.12.0',
                       py_version='py3')
estimator.fit(input_data)

$ tree src
src
|-- train.py
`-- requirements.txt

train.py

import matplotlib.pyplot as plt

if __name__ == "__main__":
    pass

requirements.txt

-i https://pypi.org/simple
matplotlib==3.1.1

The text was updated successfully, but these errors were encountered:

ChoiByungWook · 2019-07-09T21:14:01Z

@xkumiyu,

Thanks for bringing this up to our attention and for providing a quick minimal repro.

I'm going to attempt to reproduce this and see why this is happening. I'll report back with my results.

Thank you for your patience.

ChoiByungWook · 2019-07-09T21:22:21Z

Was able to repro the same error. Investigating now.

algo-1-5phqw_1 | Traceback (most recent call last): algo-1-5phqw_1 | File "train.py", line 1, in <module> algo-1-5phqw_1 | import matplotlib.pyplot as plt algo-1-5phqw_1 | ModuleNotFoundError: No module named 'matplotlib' algo-1-5phqw_1 | 2019-07-09 21:19:03,170 sagemaker-containers ERROR ExecuteUserScriptError: algo-1-5phqw_1 | Command "/usr/bin/python train.py --data_dir /opt/ml/input/data/training --model_dir s3://sagemaker-us-west-2-633083500428/sagemaker-tensorflow-scriptmode-2019-07-09-21-18-36-624/model --num_epochs 1" tmpyehcq8ep_algo-1-5phqw_1 exited with code 1 Aborting on container exit...

ChoiByungWook · 2019-07-09T22:06:41Z

Okay, it looks like this is intended. Apologies on the mixed message in #839

There a few workarounds to enable requirements.txt

Change the entry_point to be a bash script, which does the following:
1. pip install your requirements.txt
2. Call your python file with the right arguments
  The container will call it as defined here: https://github.com/aws/sagemaker-containers/blob/master/src/sagemaker_containers/_process.py#L82
Add a setup.py in our source_dir, to fulfill the conditional that enables installing your requirements.txt as shown here: https://github.com/aws/sagemaker-containers/blob/master/src/sagemaker_containers/_entry_point_type.py#L37. This conditional enables installing shown here: https://github.com/aws/sagemaker-containers/blob/master/src/sagemaker_containers/entry_point.py#L112.

Please let me know if there is anything I can clarify.

xkumiyu · 2019-07-10T15:17:25Z

Hi @ChoiByungWook ,

I add setup.py in src/, and it worked.
Thank you very much!!

Then I created an example repository containing setup.py.
https://github.com/xkumiyu/sagemaker-keras-example

ralphbrooks · 2019-09-16T16:28:18Z

@xkumiyu The sagemaker-keras-example that you put together worked for me as well. I confirmed the approach that you wrote which places setup.py and requirements.txt in the src directory; the combination of those two files will pull in the dependencies that are needed.

gilinachum · 2019-12-01T11:48:35Z

Just wondering: Why is setup.py required for requirements.txt to be installed (said it was intentional)?

marianokamp · 2020-04-16T15:35:52Z

Just wondering: Why is setup.py required for requirements.txt to be installed (said it was intentional)?

Also, I am wondering how this fits to the documentation? Using setup.py is not mentioned, right? Also, when using PyTorch with SageMaker this is not necessary.

talgos1 · 2020-05-25T14:47:55Z

Nothing is mentioned about the setup.py (had to spend hours until finding this workaround)

BTW, this is the misleading doc: https://sagemaker.readthedocs.io/en/stable/using_tf.html#use-third-party-libraries

yuankunluo · 2020-06-08T18:04:54Z

Yes, it it very confution about tf2.2,

simple_classifier_estimator = TensorFlow(entry_point='simple_regression_model.py', role=role, train_instance_count=2, train_instance_type='ml.m4.xlarge', framework_version='2.2.0', py_version='py3', source_dir="models", distributions={'parameter_server': {'enabled': True}})
I got error
UnexpectedStatusException: Error for Training job tensorflow-training-2020-06-08-16-39-20-042: Failed. Reason: ClientError: Cannot pull algorithm container. Either the image does not exist or its permissions are incorrect.

filthysocks · 2020-06-11T09:13:10Z

I found another problem. Using the setup.py will only work if the entrypoint is directly in the source dir.
Using the example from https://github.com/xkumiyu/sagemaker-keras-example
assuming the train.py would be under utils then the following code would not work:

estimator = TensorFlow(
    source_dir='src',
    entry_point='utils/train.py',
    model_dir=model_dir,
    train_instance_type=train_instance_type,
    train_instance_count=1,
    hyperparameters=hyperparameters,
    role=execution_role,
    base_job_name='tf-2-workflow',
    framework_version='2.1',
    py_version='py3',
)

the reason is that it sagemaker trys to execute the command: python3 -m utils/train but it should be python3 -m utils.train. You cannot change the entrypoint to utils.train though.

cfregly · 2020-08-04T18:19:07Z

Can somebody explain why setup.py is required in this case? This isnt consistent with other parts of the sdk.

And this seems to break quite often. There are many GitHub issues around the simple - and very common - need to specify a requirements.txt.

Some clarity on this feature, it’s various workarounds, and stability would be greatly appreciated. @ChoiByungWook

kunalkishore · 2020-08-14T04:38:54Z

Yes, it it very confution about tf2.2,

simple_classifier_estimator = TensorFlow(entry_point='simple_regression_model.py', role=role, train_instance_count=2, train_instance_type='ml.m4.xlarge', framework_version='2.2.0', py_version='py3', source_dir="models", distributions={'parameter_server': {'enabled': True}})
I got error
UnexpectedStatusException: Error for Training job tensorflow-training-2020-06-08-16-39-20-042: Failed. Reason: ClientError: Cannot pull algorithm container. Either the image does not exist or its permissions are incorrect.

I was getting the same error but then it worked with the usual guide as in doc when I used py37 with tf2.2.

ranking_estimator = TensorFlow(entry_point='train_model.py',
                             role=role,
                             train_instance_count=1,
                             train_instance_type='ml.m5.xlarge',
                             framework_version='2.2.0',
                             py_version='py37',
                             distributions={'parameter_server': {'enabled': True}},
                              source_dir="model_files")

m-ali-awan · 2020-11-04T04:03:45Z

You, can also install external modules/libraries in the following way:
Inclue this in your training script

import subprocess
import sys

package='matplolib'
def install(package):
subprocess.check_call([sys.executable, "-q", "-m", "pip", "install", package])

install('matplotlib')

ChoiByungWook added the type: bug label Jul 9, 2019

xkumiyu closed this as completed Jul 10, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Not detect requirements.txt in TensorFlow script mode #911

Not detect requirements.txt in TensorFlow script mode #911

xkumiyu commented Jul 7, 2019 •

edited

Loading

ChoiByungWook commented Jul 9, 2019

ChoiByungWook commented Jul 9, 2019 •

edited

Loading

ChoiByungWook commented Jul 9, 2019

xkumiyu commented Jul 10, 2019 •

edited

Loading

ralphbrooks commented Sep 16, 2019

gilinachum commented Dec 1, 2019

marianokamp commented Apr 16, 2020

talgos1 commented May 25, 2020 •

edited

Loading

yuankunluo commented Jun 8, 2020

filthysocks commented Jun 11, 2020

cfregly commented Aug 4, 2020

kunalkishore commented Aug 14, 2020

m-ali-awan commented Nov 4, 2020

Not detect requirements.txt in TensorFlow script mode #911

Not detect requirements.txt in TensorFlow script mode #911

Comments

xkumiyu commented Jul 7, 2019 • edited Loading

System Information

Describe the problem

Minimal repro / logs

ChoiByungWook commented Jul 9, 2019

ChoiByungWook commented Jul 9, 2019 • edited Loading

ChoiByungWook commented Jul 9, 2019

xkumiyu commented Jul 10, 2019 • edited Loading

ralphbrooks commented Sep 16, 2019

gilinachum commented Dec 1, 2019

marianokamp commented Apr 16, 2020

talgos1 commented May 25, 2020 • edited Loading

yuankunluo commented Jun 8, 2020

filthysocks commented Jun 11, 2020

cfregly commented Aug 4, 2020

kunalkishore commented Aug 14, 2020

m-ali-awan commented Nov 4, 2020

xkumiyu commented Jul 7, 2019 •

edited

Loading

ChoiByungWook commented Jul 9, 2019 •

edited

Loading

xkumiyu commented Jul 10, 2019 •

edited

Loading

talgos1 commented May 25, 2020 •

edited

Loading