[FATAL tini (7)] exec train failed: No such file or directory #254

Open
@celsofranssa

Bug description
I'm trying to automate and scale a large collection of experiments using AWS SageMaker via the Python SDK. However, the training job fails with an error that gives no indication of how to resolve it.

To reproduce

    from sagemaker.pytorch import PyTorch

    role = "arn:..."

    # hparams is a dict of hyperparameters defined elsewhere in the script
    estimator = PyTorch(
        image_uri="1...ecr...amazonaws.com/...:prototype",
        entry_point="main.py",
        role=role,
        region="us-...",
        instance_type="ml...xlarge",
        instance_count=1,
        volume_size=225,
        hyperparameters=hparams
    )
    estimator.fit()

Expected behavior
The model should start training and log metrics and losses.

Screenshots or logs

[2023-09-10 23:08:59,329][sagemaker][INFO] - Creating training-job with name: xmtc-2023-09-11-02-08-56-094
2023-09-11 02:09:00 Starting - Starting the training job...
2023-09-11 02:09:18 Starting - Preparing the instances for training......
2023-09-11 02:10:27 Downloading - Downloading input data
2023-09-11 02:10:27 Training - Downloading the training image..................
2023-09-11 02:13:33 Training - Training image download completed. Training in progress..[FATAL tini (7)] exec train failed: No such file or directory

2023-09-11 02:14:15 Uploading - Uploading generated training model
2023-09-11 02:14:15 Failed - Training job failed
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/celso/projects/LightningPrototype/run_on_sagemaker.py", line 32, in run_on_sagemaker
    estimator.fit()
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/workflow/pipeline_context.py", line 311, in wrapper
    return run_func(*args, **kwargs)
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/estimator.py", line 1292, in fit
    self.latest_training_job.wait(logs=logs)
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/estimator.py", line 2474, in wait
    self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/session.py", line 4849, in logs_for_job
    _logs_for_job(self.boto_session, job_name, wait, poll, log_type, timeout)
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/session.py", line 6760, in _logs_for_job
    _check_job_status(job_name, description, "TrainingJobStatus")
  File "/home/celso/projects/LightningPrototype/venv/lib/python3.10/site-packages/sagemaker/session.py", line 6813, in _check_job_status
    raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job xmtc-2023-09-11-02-08-56-094: Failed. Reason: AlgorithmError: , exit code: 127

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Process finished with exit code 1
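
The failure reason above carries only an exit code. As a small aid for reading it, the sketch below pulls the code out of the reason string and maps it to its conventional shell meaning; 127 usually means "command not found", which is consistent with the `exec train failed: No such file or directory` line printed by tini. The helper name `explain_exit_code` and the hint table are illustrative assumptions, not part of the SageMaker SDK.

```python
import re

# Conventional shell-style exit codes. This table is an assumption made
# for illustration; the SageMaker SDK does not provide it.
EXIT_CODE_HINTS = {
    126: "command found but not executable",
    127: "command not found",
}


def explain_exit_code(reason: str) -> str:
    """Extract 'exit code: N' from a failure reason and describe it."""
    match = re.search(r"exit code:\s*(\d+)", reason)
    if match is None:
        return "no exit code found in reason"
    code = int(match.group(1))
    hint = EXIT_CODE_HINTS.get(code, "no conventional meaning known")
    return f"exit code {code}: {hint}"


# Reason string copied from the exception message in the log above.
print(explain_exit_code("AlgorithmError: , exit code: 127"))
# prints "exit code 127: command not found"
```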

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: sagemaker 2.177.1
  • Framework name (e.g. PyTorch) or algorithm (e.g. KMeans): PyTorch 2.0.1
  • Python version: Python 3.10
  • Custom Docker image (Y/N): Yes, on ECR.
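
Because the image is custom, one thing that may be worth checking is whether it contains an executable named `train` on its `PATH`. This rests on the assumption, suggested by the tini message, that the container is started with `train` as its command and that tini could not find it. A minimal sketch of such a check, meant to be run inside the container; the helper name `find_train_command` is hypothetical.

```python
import os
import shutil
from typing import Optional


def find_train_command(search_path: Optional[str] = None) -> Optional[str]:
    """Return the full path of an executable named 'train', or None."""
    # When search_path is None, shutil.which falls back to the PATH
    # environment variable of the current process.
    return shutil.which("train", path=search_path)


if __name__ == "__main__":
    location = find_train_command()
    if location is None:
        print("no 'train' executable on PATH:", os.environ.get("PATH", ""))
    else:
        print("found 'train' at", location)
```

If the check reports nothing, the image would need to provide a `train` executable itself, or install a package that does so.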
