Skip to content

fix: localmode subprocess parent process not sending SIGTERM to child #2572

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Aug 30, 2021

Conversation

ranzvi
Copy link
Contributor

@ranzvi ranzvi commented Aug 12, 2021

Issue #, if available:

Description of changes:

Due to an unknown issue, on some machines executing predictor.delete_endpoint() simply hangs.
This is because the SIGTERM sent to the subprocess doesn't propagate to the child process.

image

E.g pid 7521 doesn't propagate SIGTERM to child 7522.

This means that self.container.join() is waiting for a container which hasn't received a signal to stop execution and hangs indefinitely.

By sending SIGTERM to all child processes of the subprocess.Popen process, we can guarantee that the container will be Gracefully stopped...

Testing done:

  • Executed all unit and integration tests
  • Verified behaviour remains the same on machines without the bug
  • Verified bug fixed on faulty machines by executing multiple endpoints sequentially

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the CONTRIBUTING doc
  • I used the commit message format described in CONTRIBUTING
  • I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
  • I have updated any necessary documentation, including READMEs and API docs (if appropriate)

Tests

  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • I have checked that my tests are not configured for a specific region or account (if appropriate)
  • I have used unique_name_from_base to create resource names in integ tests (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 145aef2
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-notebook-tests
  • Commit ID: 145aef2
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-slow-tests
  • Commit ID: 145aef2
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-local-mode-tests
  • Commit ID: 145aef2
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-pr
  • Commit ID: 145aef2
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@ranzvi ranzvi changed the title Fix: localmode subprocess parent process not sending SIGTERM to child fix: localmode subprocess parent process not sending SIGTERM to child Aug 15, 2021
@ranzvi ranzvi force-pushed the fix-localmode-subprocess-termination branch from 145aef2 to 9207f5b Compare August 15, 2021 07:03
@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 9207f5b
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: d7f3ed6
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: a2699a5
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 355db78
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 719eae1
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@ahsan-z-khan ahsan-z-khan merged commit 09f9908 into aws:master Aug 30, 2021
@parsons-kyle-89
Copy link

I believe this broke version 2.58.0. In a fresh environment with 2.58.0 installed, the command import sagemaker fails with the error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/kyle/code/sagemaker/.venv/lib/python3.7/site-packages/sagemaker/__init__.py", line 18, in <module>
    from sagemaker import estimator, parameter, tuner  # noqa: F401
  File "/home/kyle/code/sagemaker/.venv/lib/python3.7/site-packages/sagemaker/estimator.py", line 28, in <module>
    from sagemaker import git_utils, image_uris
  File "/home/kyle/code/sagemaker/.venv/lib/python3.7/site-packages/sagemaker/image_uris.py", line 22, in <module>
    from sagemaker.spark import defaults
  File "/home/kyle/code/sagemaker/.venv/lib/python3.7/site-packages/sagemaker/spark/__init__.py", line 16, in <module>
    from sagemaker.spark.processing import PySparkProcessor, SparkJarProcessor  # noqa: F401
  File "/home/kyle/code/sagemaker/.venv/lib/python3.7/site-packages/sagemaker/spark/processing.py", line 35, in <module>
    from sagemaker.local.image import _ecr_login_if_needed, _pull_image
  File "/home/kyle/code/sagemaker/.venv/lib/python3.7/site-packages/sagemaker/local/__init__.py", line 16, in <module>
    from .local_session import (  # noqa: F401
  File "/home/kyle/code/sagemaker/.venv/lib/python3.7/site-packages/sagemaker/local/local_session.py", line 23, in <module>
    from sagemaker.local.image import _SageMakerContainer
  File "/home/kyle/code/sagemaker/.venv/lib/python3.7/site-packages/sagemaker/local/image.py", line 37, in <module>
    import psutil
ModuleNotFoundError: No module named 'psutil'

It appears that psutil should be a required package or perhaps the psutil import should be guarded in some way? This probably wasn't caught by CI due to psutil being a test dependency.

ahsan-z-khan added a commit to ahsan-z-khan/sagemaker-python-sdk that referenced this pull request Aug 31, 2021
@ahsan-z-khan
Copy link
Member

@ranzvi Can you create a separate PR to solve the issue you had without adding any required packages?

@ranzvi
Copy link
Contributor Author

ranzvi commented Sep 1, 2021

@ahsan-z-khan

Sure, here is the fixed PR
#2613

shreyapandit pushed a commit that referenced this pull request Sep 1, 2021
* update required packages

* revert #2572

* remove psutil
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants