Skip to content

fix: localmode subprocess parent process not sending SIGTERM to child #2613

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Nov 1, 2021

Conversation

ranzvi
Copy link
Contributor

@ranzvi ranzvi commented Sep 1, 2021

Issue #, if available:

Description of changes:

Due to an unknown issue, on some machines executing predictor.delete_endpoint() simply hangs.
This is because the SIGTERM sent to the subprocess doesn't propagate to the child process.

image

E.g pid 7521 doesn't propagate SIGTERM to child 7522.

This means that self.container.join() is waiting for a container which hasn't received a signal to stop execution and hangs indefinitely.

By sending SIGTERM to all child processes of the subprocess.Popen process, we can guarantee that the container will be Gracefully stopped...

Testing done:

Executed all unit and integration tests
Verified behaviour remains the same on machines without the bug
Verified bug fixed on faulty machines by executing multiple endpoints sequentially
Testing done:

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • I have read the CONTRIBUTING doc
  • I used the commit message format described in CONTRIBUTING
  • I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
  • I have updated any necessary documentation, including READMEs and API docs (if appropriate)

Tests

  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • I have checked that my tests are not configured for a specific region or account (if appropriate)
  • I have used unique_name_from_base to create resource names in integ tests (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 193c23e
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 268357f
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 5e33b4f
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 23ef42e
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@ranzvi ranzvi closed this Sep 1, 2021
@ranzvi ranzvi force-pushed the fix-localmode-subprocess-termination branch from 268357f to 7b07b90 Compare September 1, 2021 11:58
@ranzvi
Copy link
Contributor Author

ranzvi commented Sep 1, 2021

@ahsan-z-khan This PR doesn't contain and new dependencies

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: c3708bd
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 7b07b90
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 37eeb5e
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@gilinachum
Copy link
Contributor

@shreyapandit Hi! can you have a look if the failing tests are relevant to the PR? And assign someone to review it?

@ranzvi
Copy link
Contributor Author

ranzvi commented Oct 28, 2021

@gilinachum any updates?

@gilinachum
Copy link
Contributor

@gilinachum any updates?

Sorry about that. Checking again :\

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: b6cb578
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

BasilBeirouti
BasilBeirouti previously approved these changes Oct 28, 2021
@@ -841,6 +840,8 @@ def run(self):

def down(self):
"""Placeholder docstring"""
if os.name != "nt":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this check for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's checking if the os is a windows machine, since I'm using unix commands to kill the processes

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-local-mode-tests
  • Commit ID: b6cb578
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-slow-tests
  • Commit ID: b6cb578
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@shreyapandit
Copy link
Contributor

shreyapandit commented Oct 30, 2021

Hi @ranzvi, thanks for your contribution!

Please follow the one time setup for enabling git hooks, and this should allow you to reproduce the errors, and fix the tests cc @ranzvi @gilinachum. The unit tests are failing on linting issues.
Here is the setup can you do. Please change brew install to whatever is compatible with your OS.

# Install using brew:
brew install pyenv pyenv-virtualenv pyenv-virtualenvwrapper

# Make sure to add the pyenv and virtualenv init functions to your corresponding shell init (.bashrc, .zshrc, etc):
eval "*$(*pyenv init --path*)*"      # old usage was: $ eval "$(pyenv init -)"
eval "*$(*pyenv virtualenv-init -*)*"

# Start a new shell once you do that to pick up your changes.
# Install following Python versions:
pyenv install 3.8.8
pyenv install 3.7.10
pyenv install 3.6.13

# Set them as global versions:
pyenv global 3.8.8
pyenv global 3.7.10
pyenv global 3.6.13

# Verify they show up when you do:
pyenv versions

# Restart your shell and run the command again to verify that it persists across shell sessions.

# Install tox to run our tests:
pip install tox

# Set all versions as global:
pyenv global $(pyenv versions --bare)
pyenv rehash

# Setup GitHub Pre Push hooks
# To enable all git hooks in the .githooks directory.
# Run these commands in the GitHub repository directory:

find .git/hooks -type l -exec rm {} \;
find .githooks -type f -exec ln -sf ../../{} .git/hooks/ \;```

Copy link
Contributor

@shreyapandit shreyapandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As outlined in comments

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: dd7f66c
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: 4166ceb
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-unit-tests
  • Commit ID: a2ddd4c
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-slow-tests
  • Commit ID: 4166ceb
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-slow-tests
  • Commit ID: a2ddd4c
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-slow-tests
  • Commit ID: dd7f66c
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-local-mode-tests
  • Commit ID: dd7f66c
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-local-mode-tests
  • Commit ID: a2ddd4c
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@sagemaker-bot
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: sagemaker-python-sdk-local-mode-tests
  • Commit ID: 4166ceb
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@ranzvi
Copy link
Contributor Author

ranzvi commented Nov 1, 2021

@shreyapandit Thanks for the insights. I fixed the linting issues and executed the unittest locally and they pass.

Do you have any idea why the rest of the ci is still failing?

@navinsoni
Copy link
Contributor

@ranzvi I am restarting your tests

Copy link
Contributor

@shreyapandit shreyapandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for your contribution

Copy link
Contributor

@jeniyat jeniyat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me.

@jeniyat jeniyat merged commit c4075bb into aws:master Nov 1, 2021
EthanShouhanCheng pushed a commit to SissiChenxy/sagemaker-python-sdk that referenced this pull request Jan 11, 2022
@wcarpenter1-godaddy
Copy link
Contributor

@ranzvi did this fix your issue? Interestingly, this new code introduces an error log for us during delete_endpoint:

Gracefully stopping... (press Ctrl+C again to force)
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/Users/wcarpenter1/Library/Caches/pypoetry/virtualenvs/cerbo-model-bynQLcpi-py3.9/lib/python3.9/site-packages/sagemaker/local/image.py", line 852, in run
    _stream_output(self.process)
  File "/Users/wcarpenter1/Library/Caches/pypoetry/virtualenvs/cerbo-model-bynQLcpi-py3.9/lib/python3.9/site-packages/sagemaker/local/image.py", line 914, in _stream_output
    raise RuntimeError("Process exited with code: %s" % exit_code)
RuntimeError: Process exited with code: 2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/wcarpenter1/.pyenv/versions/3.9.11/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/Users/wcarpenter1/Library/Caches/pypoetry/virtualenvs/cerbo-model-bynQLcpi-py3.9/lib/python3.9/site-packages/sagemaker/local/image.py", line 857, in run
    raise RuntimeError(msg)
RuntimeError: Failed to run: ['docker-compose', '-f', '/private/var/folders/90/2dm4q2zj72ncdgmgvmgykjy00000gp/T/tmpn531lijb/docker-compose.yaml', 'up', '--build', '--abort-on-container-exit'], Process exited with code: 2

Commenting out these lines fixes our issue.

For us, self.process.terminate(), successfully kills both the process and it's children. So kill_child_processes is not needed. But not only is it not needed, it is causing the process to exit return 2, which generates the RuntimeError above.

I'd be interesting in potentially rolling back these changes if that was OK with you.

@ranzvi
Copy link
Contributor Author

ranzvi commented Jul 20, 2022

@wcarpenter1-godaddy This did fix our issue, we had a problem with the self.process.terminate() which is why I added kill_child_processes. Could you investigate where this error is originating and why?

@wcarpenter1-godaddy
Copy link
Contributor

@ranzvi what os are you on?

@ranzvi
Copy link
Contributor Author

ranzvi commented Jul 22, 2022

@wcarpenter1-godaddy MacOS, but this works on ubuntu as well. the code has a safeguard for windows

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants