Not detect requirements.txt in TensorFlow script mode #911

Closed
xkumiyu opened this issue Jul 7, 2019 · 13 comments

Comments

@xkumiyu

xkumiyu commented Jul 7, 2019

System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): TensorFlow
  • Framework Version: 1.12.0
  • Python Version: 3.6
  • CPU or GPU: CPU
  • Python SDK Version: 1.32.1
  • Are you using a custom image: No

Describe the problem

In #839, it was mentioned that the Docker image detects requirements.txt and installs the listed Python libraries, but in my experiment they were not installed.
I think the image did not detect requirements.txt, even though I confirmed that requirements.txt exists in the sourcedir.tar.gz uploaded to S3.

Minimal repro / logs

2019-07-07 06:54:53 Starting - Starting the training job...
2019-07-07 06:54:58 Starting - Launching requested ML instances......
2019-07-07 06:56:05 Starting - Preparing the instances for training...
2019-07-07 06:56:43 Downloading - Downloading input data...
2019-07-07 06:57:31 Training - Training image download completed. Training in progress.
2019-07-07 06:57:31 Uploading - Uploading generated training model
2019-07-07 06:57:31 Failed - Training job failed

2019-07-07 06:57:21,299 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2019-07-07 06:57:21,306 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-07-07 06:57:21,547 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-07-07 06:57:21,561 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-07-07 06:57:21,572 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_tensorflow_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "model_dir": "s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model"
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "training": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/source/sourcedir.tar.gz",
    "module_name": "train",
    "network_interface_name": "eth0",
    "num_cpus": 2,
    "num_gpus": 0,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "hosts": [
            "algo-1"
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "train.py"
}

Environment variables:

SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"model_dir":"s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model"}
SM_USER_ENTRY_POINT=train.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["training"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=train
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_tensorflow_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=2
SM_NUM_GPUS=0
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"training":"/opt/ml/input/data/training"},"current_host":"algo-1","framework_module":"sagemaker_tensorflow_container.training:main","hosts":["algo-1"],"hyperparameters":{"model_dir":"s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model"},"input_config_dir":"/opt/ml/input/config","input_data_config":{"training":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/source/sourcedir.tar.gz","module_name":"train","network_interface_name":"eth0","num_cpus":2,"num_gpus":0,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"eth0"},"user_entry_point":"train.py"}
SM_USER_ARGS=["--model_dir","s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAINING=/opt/ml/input/data/training
SM_HP_MODEL_DIR=s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model
PYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/lib/python36.zip:/usr/lib/python3.6:/usr/lib/python3.6/lib-dynload:/usr/local/lib/python3.6/dist-packages:/usr/lib/python3/dist-packages

Invoking script with the following command:

/usr/bin/python train.py --model_dir s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model


Traceback (most recent call last):
  File "train.py", line 1, in <module>
    import matplotlib.pyplot as plt
ModuleNotFoundError: No module named 'matplotlib'
2019-07-07 06:57:21,597 sagemaker-containers ERROR    ExecuteUserScriptError:
Command "/usr/bin/python train.py --model_dir s3://sagemaker-ap-northeast-1-xxx/sagemaker-tensorflow-scriptmode-2019-07-07-06-54-50-259/model"

  • Exact command to reproduce:

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='train.py',
                       source_dir='src',
                       role=role,
                       train_instance_type='ml.m5.large',
                       train_instance_count=1,
                       framework_version='1.12.0',
                       py_version='py3')
estimator.fit(input_data)

$ tree src
src
|-- train.py
`-- requirements.txt

  • train.py

import matplotlib.pyplot as plt

if __name__ == "__main__":
    pass

  • requirements.txt

-i https://pypi.org/simple
matplotlib==3.1.1

@ChoiByungWook
Contributor

@xkumiyu,

Thanks for bringing this to our attention and for providing a quick minimal repro.

I'm going to attempt to reproduce this and see why this is happening. I'll report back with my results.

Thank you for your patience.

@ChoiByungWook
Contributor

ChoiByungWook commented Jul 9, 2019

Was able to repro the same error. Investigating now.

algo-1-5phqw_1 | Traceback (most recent call last):
algo-1-5phqw_1 |   File "train.py", line 1, in <module>
algo-1-5phqw_1 |     import matplotlib.pyplot as plt
algo-1-5phqw_1 | ModuleNotFoundError: No module named 'matplotlib'
algo-1-5phqw_1 | 2019-07-09 21:19:03,170 sagemaker-containers ERROR ExecuteUserScriptError:
algo-1-5phqw_1 | Command "/usr/bin/python train.py --data_dir /opt/ml/input/data/training --model_dir s3://sagemaker-us-west-2-633083500428/sagemaker-tensorflow-scriptmode-2019-07-09-21-18-36-624/model --num_epochs 1"
tmpyehcq8ep_algo-1-5phqw_1 exited with code 1
Aborting on container exit...

@ChoiByungWook
Contributor

Okay, it looks like this is intended. Apologies for the mixed message in #839.

There are a few workarounds to enable requirements.txt:

  1. Change the entry_point to a bash script that does the following:

    1. pip install your requirements.txt
    2. Call your Python file with the right arguments
      The container will call the script as defined here: https://github.com/aws/sagemaker-containers/blob/master/src/sagemaker_containers/_process.py#L82
  2. Add a setup.py to your source_dir to satisfy the conditional that enables installing your requirements.txt, as shown here: https://github.com/aws/sagemaker-containers/blob/master/src/sagemaker_containers/_entry_point_type.py#L37. That conditional triggers the install shown here: https://github.com/aws/sagemaker-containers/blob/master/src/sagemaker_containers/entry_point.py#L112.

Please let me know if there is anything I can clarify.
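
To make the second workaround concrete, here is a rough sketch of the kind of check the linked conditional appears to perform. It is an approximation written for illustration, not the library's code: the function name `entry_point_type` and the three string labels are hypothetical, and the rule that only the "package" case leads to requirements.txt being installed is taken from the description in this thread.

```python
import os
import tempfile

# Hypothetical labels for illustration; the real library defines its own types.
PYTHON_PACKAGE = "python_package"
PYTHON_PROGRAM = "python_program"
COMMAND = "command"


def entry_point_type(source_dir, entry_point):
    """Classify an entry point roughly as the thread describes the conditional."""
    # A setup.py in the source dir marks the code as a package; per this
    # thread, that is the case in which requirements.txt gets installed.
    if "setup.py" in os.listdir(source_dir):
        return PYTHON_PACKAGE
    # A bare .py entry point is run as a plain script.
    if entry_point.endswith(".py"):
        return PYTHON_PROGRAM
    # Anything else (for example a bash script) is run as a command.
    return COMMAND


def demo():
    """Show how adding setup.py changes the classification of the same source dir."""
    with tempfile.TemporaryDirectory() as src:
        for name in ("train.py", "requirements.txt"):
            open(os.path.join(src, name), "w").close()
        before = entry_point_type(src, "train.py")
        open(os.path.join(src, "setup.py"), "w").close()
        after = entry_point_type(src, "train.py")
        return before, after
```

With only train.py and requirements.txt present, the sketch classifies the entry point as a plain program, which matches the repro above where matplotlib was never installed.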

@xkumiyu
Author

xkumiyu commented Jul 10, 2019

Hi @ChoiByungWook ,

I added setup.py to src/, and it worked.
Thank you very much!!

Then I created an example repository containing setup.py.
https://github.com/xkumiyu/sagemaker-keras-example
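
For readers who do not want to open the repository, a setup.py for this workaround can probably be very small. The sketch below is not copied from the linked repository; the template contents, the package name `training-code`, the version string, and the helper `ensure_setup_py` are all assumptions made for illustration.

```python
import os
import tempfile

# Assumed minimal setup.py; name and version are placeholders.
SETUP_PY_TEMPLATE = '''from setuptools import setup

setup(
    name="{name}",
    version="0.1.0",
    py_modules=["{module}"],
)
'''


def ensure_setup_py(source_dir, entry_point="train.py", name="training-code"):
    """Write a minimal setup.py into source_dir unless one already exists."""
    path = os.path.join(source_dir, "setup.py")
    if not os.path.exists(path):
        # Derive the module name from the entry point file, e.g. train.py -> train.
        module = os.path.splitext(os.path.basename(entry_point))[0]
        with open(path, "w") as f:
            f.write(SETUP_PY_TEMPLATE.format(name=name, module=module))
    return path
```

Calling `ensure_setup_py('src')` before constructing the estimator would leave src/ with train.py, requirements.txt, and setup.py side by side, which is the layout reported to work in this thread.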

@xkumiyu xkumiyu closed this as completed Jul 10, 2019
@ralphbrooks

@xkumiyu The sagemaker-keras-example that you put together worked for me as well. I confirmed the approach that you wrote which places setup.py and requirements.txt in the src directory; the combination of those two files will pull in the dependencies that are needed.

@gilinachum
Contributor

Just wondering: Why is setup.py required for requirements.txt to be installed (said it was intentional)?

@marianokamp
Collaborator

Just wondering: Why is setup.py required for requirements.txt to be installed (said it was intentional)?

Also, I am wondering how this fits with the documentation. Using setup.py is not mentioned there, right? And when using PyTorch with SageMaker, this is not necessary.

@talgos1

talgos1 commented May 25, 2020

Nothing is mentioned about the setup.py (had to spend hours until finding this workaround)

BTW, this is the misleading doc: https://sagemaker.readthedocs.io/en/stable/using_tf.html#use-third-party-libraries

@yuankunluo

Yes, this is very confusing with TF 2.2.

simple_classifier_estimator = TensorFlow(entry_point='simple_regression_model.py',
                                         role=role,
                                         train_instance_count=2,
                                         train_instance_type='ml.m4.xlarge',
                                         framework_version='2.2.0',
                                         py_version='py3',
                                         source_dir="models",
                                         distributions={'parameter_server': {'enabled': True}})

I got this error:
UnexpectedStatusException: Error for Training job tensorflow-training-2020-06-08-16-39-20-042: Failed. Reason: ClientError: Cannot pull algorithm container. Either the image does not exist or its permissions are incorrect.

@filthysocks

I found another problem. Using setup.py only works if the entry point is directly in the source dir.
Take the example from https://github.com/xkumiyu/sagemaker-keras-example.
Assuming train.py were under utils, the following code would not work:

estimator = TensorFlow(
    source_dir='src',
    entry_point='utils/train.py',
    model_dir=model_dir,
    train_instance_type=train_instance_type,
    train_instance_count=1,
    hyperparameters=hyperparameters,
    role=execution_role,
    base_job_name='tf-2-workflow',
    framework_version='2.1',
    py_version='py3',
)

The reason is that SageMaker tries to execute the command python3 -m utils/train, but it should be python3 -m utils.train. You cannot change the entry point to utils.train, though.
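
The conversion that seems to be missing can be sketched in a few lines. This only illustrates the path-to-module mapping described in this comment; it is not the container's code, and `entry_point_to_module` is a hypothetical helper name.

```python
import os


def entry_point_to_module(entry_point):
    """Turn a relative script path such as 'utils/train.py' into 'utils.train'."""
    # Drop the file extension, then treat each path component as a package level.
    stem, _ext = os.path.splitext(entry_point)
    parts = [p for p in stem.replace("\\", "/").split("/") if p and p != "."]
    return ".".join(parts)
```

With a mapping like this, the module name passed to `python3 -m` would be `utils.train` rather than `utils/train`.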

@cfregly

cfregly commented Aug 4, 2020

Can somebody explain why setup.py is required in this case? This isn't consistent with other parts of the SDK.

And this seems to break quite often. There are many GitHub issues around the simple, and very common, need to specify a requirements.txt.

Some clarity on this feature, its various workarounds, and its stability would be greatly appreciated. @ChoiByungWook

@kunalkishore

Yes, this is very confusing with TF 2.2.

simple_classifier_estimator = TensorFlow(entry_point='simple_regression_model.py',
                                         role=role,
                                         train_instance_count=2,
                                         train_instance_type='ml.m4.xlarge',
                                         framework_version='2.2.0',
                                         py_version='py3',
                                         source_dir="models",
                                         distributions={'parameter_server': {'enabled': True}})

I got this error:
UnexpectedStatusException: Error for Training job tensorflow-training-2020-06-08-16-39-20-042: Failed. Reason: ClientError: Cannot pull algorithm container. Either the image does not exist or its permissions are incorrect.

I was getting the same error, but it worked when I followed the usual guide in the docs and used py37 with TF 2.2.

ranking_estimator = TensorFlow(entry_point='train_model.py',
                               role=role,
                               train_instance_count=1,
                               train_instance_type='ml.m5.xlarge',
                               framework_version='2.2.0',
                               py_version='py37',
                               distributions={'parameter_server': {'enabled': True}},
                               source_dir="model_files")

@m-ali-awan

You can also install external modules/libraries in the following way.
Include this at the top of your training script, before importing the package:

import subprocess
import sys

def install(package):
    # Run pip with the same interpreter that is running this script.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])

install('matplotlib')
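
The same idea can be extended to install everything listed in requirements.txt instead of one package at a time. This is a sketch under assumptions: it presumes the training script runs with the extracted source_dir as its working directory (the log above shows `python train.py` invoked with /opt/ml/code on PYTHONPATH), and `pip_install_command` and `install_requirements` are hypothetical helper names.

```python
import os
import subprocess
import sys


def pip_install_command(requirements_path):
    """Build the pip command, using the interpreter that runs this script."""
    return [sys.executable, "-m", "pip", "install", "-q", "-r", requirements_path]


def install_requirements(source_dir=None):
    """Install requirements.txt from source_dir; return False if it is absent."""
    # Assumption: the working directory is where the source_dir was extracted.
    source_dir = source_dir or os.getcwd()
    requirements_path = os.path.join(source_dir, "requirements.txt")
    if not os.path.isfile(requirements_path):
        return False
    subprocess.check_call(pip_install_command(requirements_path))
    return True
```

Any third-party imports, such as `import matplotlib.pyplot as plt`, would need to come after the call to `install_requirements()`; otherwise the script fails with the same ModuleNotFoundError before the install runs.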
