error: Invalid distribution name or version syntax: #966

Closed
hung-ai-dev opened this issue Aug 5, 2019 · 6 comments

Comments

@hung-ai-dev

Please fill out the form below.

System Information

  • Framework (e.g. TensorFlow) / Algorithm (e.g. KMeans): Pytorch
  • Framework Version: 1.0.0
  • Python Version: py3
  • CPU or GPU: GPU
  • Python SDK Version: 1.35.1
  • Are you using a custom image: No

Describe the problem

I use PyTorch to train a model. After the training image download completed, the error below occurred.

Minimal repro / logs

2019-08-05 07:38:31 Starting - Starting the training job...
2019-08-05 07:38:32 Starting - Launching requested ML instances.........
2019-08-05 07:40:05 Starting - Preparing the instances for training...
2019-08-05 07:41:01 Downloading - Downloading input data......
2019-08-05 07:41:59 Training - Training image download completed. Training in progress..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-08-05 07:42:00,068 sagemaker-containers INFO Imported framework sagemaker_pytorch_container.training
2019-08-05 07:42:00,093 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
2019-08-05 07:42:00,094 sagemaker_pytorch_container.training INFO Invoking user training script.
2019-08-05 07:42:02,809 sagemaker-containers INFO Module ./process/train does not provide a setup.py.
Generating setup.py
2019-08-05 07:42:02,809 sagemaker-containers INFO Generating setup.cfg
2019-08-05 07:42:02,809 sagemaker-containers INFO Generating MANIFEST.in
2019-08-05 07:42:02,809 sagemaker-containers INFO Installing module with the following command:
/usr/bin/python -m pip install -U .
Processing /opt/ml/code
Complete output from command python setup.py egg_info:
WARNING: '' not a valid package name; please use only .-separated package names in setup.py
running egg_info
error: Invalid distribution name or version syntax: .-process-train-1.0.0

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-req-build-cdtylol6/
You are using pip version 18.1, however version 19.2.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
2019-08-05 07:42:04,058 sagemaker-containers ERROR InstallModuleError:
Command "/usr/bin/python -m pip install -U ."

@icywang86rui
Contributor

@NDHUNG
Could you show me the code you run to start the training job? The container tries to install your training script as a Python module, using the name of the script as the module name. It looks like the script name is not a valid module name, and that tripped up pip.
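
As a rough illustration of the point above, here is a small pre-flight check one could run before submitting a job. It is only a sketch: it assumes the module name is the entry point with its `.py` extension stripped (which is what the log line `Module ./process/train does not provide a setup.py` suggests), and the helper names `module_name_for` and `is_plain_module_name` are hypothetical, not part of the SageMaker SDK or the container.

```python
import os.path as osp


def module_name_for(entry_point):
    """Assumed derivation: drop the .py extension, keep everything else."""
    root, _ext = osp.splitext(entry_point)
    return root


def is_plain_module_name(entry_point):
    """True only if the derived name is a single valid Python identifier."""
    return module_name_for(entry_point).isidentifier()


# Entry points taken from this thread, plus a top-level one for comparison.
for candidate in ("./process/train.py", "src/train.py", "train.py"):
    print(candidate, "->", is_plain_module_name(candidate))
```

An entry point that contains a directory component fails this check, which matches the two failing configurations reported in this issue.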

@hung-ai-dev
Author

hung-ai-dev commented Aug 6, 2019

@icywang86rui here is my code

hyperparams = {'data_root': "/opt/ml/input/data/train"}

pt_estimator = PyTorch(entry_point='./process/train.py',
                            source_dir=".",
                            code_location=code_location,
                            output_path=output_path,
                            role=role, #sagemaker.get_execution_role(),
                            train_instance_type=instance_type,
                            train_instance_count=1,
                            base_job_name='hector-graphkv',
                            train_max_run=8*60*60,
                            train_volume_size=20,
                            framework_version='1.0.0',
                            py_version="py3",
                            hyperparameters=hyperparams
                        )
pt_estimator.fit({'train': train_data_path})

./process/train.py

import argparse
import subprocess
import json
import os
import os.path as osp
import codecs

# Install extra dependencies shipped alongside the training data, if present.
requirements_path = '/opt/ml/input/data/train/requirement.txt'
if osp.exists(requirements_path):
    print('requirement file exists')
    subprocess.run('pip install -r {0}'.format(requirements_path), shell=True)

@jesterhazy
Contributor

@NDHUNG,

There is an issue that prevents entry_point values that include a path from being handled correctly within the training container, when you also provide a source_dir value.

You can work around this a few ways:

  1. change entry_point to process/train.py and remove source_dir
  2. change entry_point to train.py and use source_dir="./process"
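
The second workaround can be sketched as a small path rewrite. The helper `split_entry_point` below is hypothetical (it is not part of the SageMaker SDK); it only moves the directory part of `entry_point` into `source_dir`, so the script ends up at the top level of the source directory.

```python
import os.path as osp


def split_entry_point(entry_point, source_dir="."):
    """Hypothetical helper: return (entry_point, source_dir) with a bare file name."""
    full_path = osp.normpath(osp.join(source_dir, entry_point))
    new_source_dir, new_entry_point = osp.split(full_path)
    # An empty directory part means the script already sits in the current directory.
    return new_entry_point, new_source_dir or "."


# Values from the original report.
entry_point, source_dir = split_entry_point("./process/train.py", ".")
print(entry_point, source_dir)

# The results would then be passed to the estimator in place of the originals:
# PyTorch(entry_point=entry_point, source_dir=source_dir, ...)  # other arguments unchanged
```

One caveat with this approach: narrowing `source_dir` to the script's own folder presumably means files that live outside that folder are no longer packaged with the job, so it fits best when the script does not depend on sibling directories.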

@laurenyu
Contributor

laurenyu commented Sep 6, 2019

closing due to inactivity

@laurenyu closed this as completed on Sep 6, 2019

@ViktorMalesevic

ViktorMalesevic commented Feb 4, 2020

Hello, I am facing the exact same issue here:

My source_dir is a folder that contains all my dependencies:

BertSum
   |--src
       |--train.py
   |--stanford_tokenizer
   |--requirements.txt
   | etc...

source_dir = 'BertSum'
entry_point = 'src/train.py'

I get the final error:

UnexpectedStatusException: Error for Training job sagemaker-pytorch-2020-02-04-10-39-39-174: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/usr/bin/python -m src/train"
/usr/bin/python: No module named src/train

A few lines above, there is this warning:

Running setup.py bdist_wheel for src-train: finished with status 'error'
  Complete output from command /usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-req-build-bcuvdt78/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-t9sjhhrx --python-tag cp36:
  WARNING: '' not a valid package name; please use only .-separated package names in setup.py
  running bdist_wheel

Yet a few lines below, I have this:

Failed to build src-train
Installing collected packages: regex, pytorch-pretrained-bert, protobuf, tensorboardX, dill, multiprocess, pyrouge, src-train
  Running setup.py install for src-train: started
    Running setup.py install for src-train: finished with status 'done'
Successfully installed dill-0.3.1.1 multiprocess-0.70.9 protobuf-3.11.3 pyrouge-0.1.3 pytorch-pretrained-bert-0.6.2 regex-2020.1.8 src-train-1.0.0 tensorboardX-2.0

@laurenyu
Contributor

laurenyu commented Feb 4, 2020

@ViktorMalesevic, unfortunately there is currently no support for an entry point that isn't in the top-level directory of the source. Please refer to the workarounds that @jesterhazy posted above.

There is ongoing work to enable paths in the entrypoint: #941
