Pipeline step 'inference-bodyscore' FAILED. Failure message is: PermissionError: [Errno 13] Permission denied #31

Closed
harshraj22 opened this issue Mar 30, 2023 · 2 comments

harshraj22 commented Mar 30, 2023

Describe the bug
Trying SageMaker Local Mode.
One step of the pipeline (the inference step) fails with: PermissionError: [Errno 13] Permission denied: 'filename.json'. The step is expected to create a few JSON files. Note that the directory /opt/ml/processing/output was not created automatically and had to be created manually inside the script. After the step failed I cross-checked: the JSON files had in fact been uploaded to S3, and they have 777 permissions.
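For context, here is a minimal sketch of what the relevant part of inference.py is assumed to do (this is not the author's actual script; the file name filename.json is taken from the error message and the payload is hypothetical):

# Minimal sketch (assumed, not the author's actual inference.py): create the
# output directory manually and write the JSON results that the processing
# job later uploads to S3.
import json
import os

OUTPUT_DIR = "/opt/ml/processing/output"  # matches the ProcessingOutput source below

os.makedirs(OUTPUT_DIR, exist_ok=True)  # not created automatically in local mode

result = {"bodyscore": 3}  # hypothetical payload
out_file = os.path.join(OUTPUT_DIR, "filename.json")
with open(out_file, "w") as f:
    json.dump(result, f)

os.chmod(out_file, 0o777)  # the report states the files end up with 777 permissions
print("permission for file is: ", oct(os.stat(out_file).st_mode & 0o777))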

To reproduce
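The snippet below references sess, role, and several pipeline parameters that are not shown in the report. A minimal sketch of the assumed setup (parameter names are taken from the snippet; the default values and the role ARN are placeholders):

# Assumed setup for the reproduction below (not part of the original report).
import json

from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.pytorch import PyTorch
from sagemaker.pytorch.processing import PyTorchProcessor
from sagemaker.workflow.parameters import ParameterFloat, ParameterInteger, ParameterString
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_context import LocalPipelineSession
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

sess = LocalPipelineSession()  # runs the pipeline steps in local containers
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder ARN

# Pipeline parameters referenced below; default values are illustrative only.
instance_count = ParameterInteger(name="InstanceCount", default_value=1)
instance_type_inference = ParameterString(name="InstanceTypeInference", default_value="ml.g4dn.xlarge")
instance_type_processing = ParameterString(name="InstanceTypeProcessing", default_value="ml.m5.xlarge")
checkpoint_path = ParameterString(name="CheckpointPath", default_value="s3://my-bucket/checkpoints/")
output_path = ParameterString(name="OutputPath", default_value="s3://my-bucket/output/")
dataset_path = ParameterString(name="DatasetPath", default_value="s3://my-bucket/dataset/")
num_epochs = ParameterInteger(name="NumEpochs", default_value=10)
batch_size = ParameterInteger(name="BatchSize", default_value=8)
learning_rate = ParameterFloat(name="LearningRate", default_value=0.001)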

estimator = PyTorch(
    entry_point="train.py",
    source_dir="code",
    role=role,
    # instance_type=instance_type_inference,
    instance_type='local_gpu',
    instance_count=1, #instance_count,
    framework_version='1.8.0',
    py_version='py3',
    output_path=output_path,
    # checkpoint_local_path="/opt/ml/checkpoints",
    # checkpoint_s3_uri=checkpoint_path,
    max_run=4 * 24 * 60 * 60,  # 4days
    metric_definitions=[
       {'Name': 'train:error', 'Regex': 'train Loss: (.*)'},
       {'Name': 'validation:error', 'Regex': 'val Loss: (.*)'}
    ],
    hyperparameters={
        'num_epochs': num_epochs.to_string(),
        'batch_size': batch_size.to_string(),
        'lr': learning_rate.to_string()
    },
    sagemaker_session=sess
)

step_train_model = TrainingStep(
    name="train-bodyscore",
    estimator=estimator,
    inputs={
        "train": TrainingInput(
            s3_data=dataset_path,
        )
    },
)

model_saved_path = step_train_model.properties.ModelArtifacts.S3ModelArtifacts

inference_processor = PyTorchProcessor(
    framework_version="1.8.0",
    role=role,
    # instance_type=instance_type_inference,
    instance_type='local_gpu',
    instance_count=1, #instance_count,
    sagemaker_session=sess
)

inference_args = inference_processor.get_run_args(
    inputs=[
                ProcessingInput(
                    input_name ='models',
                    source=model_saved_path,
                    destination="/opt/ml/processing/models"
                ),
                ProcessingInput(
                    input_name="input-images",
                    source=dataset_path,
                    destination="/opt/ml/processing/input"
                ),
            ],
    outputs=[
                ProcessingOutput(
                    output_name="inference-output",
                    # files written under this path are uploaded to S3 when the job ends
                    source="/opt/ml/processing/output"
                )
            ],
    code='inference.py',
    source_dir='code',
)

step_inference = ProcessingStep(
    name="inference-bodyscore",
    processor=inference_processor,
    # step_args=processor_args,
    inputs=inference_args.inputs,
    outputs=inference_args.outputs,
    job_arguments=inference_args.arguments,
    code=inference_args.code,
    # property_files=[evaluation_report]
)

pipeline = Pipeline(
    name="tlnk-bodyscore-pipeline",
    parameters=[
        instance_count,
        instance_type_inference,
        instance_type_processing,
        checkpoint_path,
        output_path,
        dataset_path,
        num_epochs,
        batch_size,
        learning_rate,
        ],
    steps=[
            # step_data_analysis,
            step_train_model,
            step_inference,
            # step_visualize,
            # step_evaluate,
           ],
    sagemaker_session=sess
)

definition = json.loads(pipeline.definition())
print(json.dumps(definition, indent=2))
pipeline.upsert(role_arn=role)

execution = pipeline.start()
# execution.wait()

Logs

7dc4or5zgg-algo-1-ink52 | INFO:__main__:calculated bodyscore: 3
7dc4or5zgg-algo-1-ink52 | permission for file is:  0o777
7dc4or5zgg-algo-1-ink52 exited with code 0
Aborting on container exit...
Pipeline step 'inference-bodyscore' FAILED. Failure message is: PermissionError: [Errno 13] Permission denied: 'filename.json'
Pipeline execution xyzabc FAILED because step 'inference-bodyscore' failed.
harshraj22 (Author) commented

Not sure what kind of permission is being denied here. Is it the permission to copy the files to some local path, or something on the AWS side (though I believe I have granted all the required permissions)?

harshraj22 (Author) commented

I was using an older version of the SageMaker SDK. The issue has been fixed in the latest version: aws/sagemaker-python-sdk#2647
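For anyone hitting the same error, a quick way to check which SDK version is installed before upgrading the sagemaker package:

# Print the installed SageMaker Python SDK version; the fix referenced above
# requires a release that includes aws/sagemaker-python-sdk#2647.
import sagemaker

print(sagemaker.__version__)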
