Terminate on SIGTERM #217


Merged
merged 1 commit into from Nov 12, 2019

Conversation

consideRatio
Member

Closes #215 by listening for a SIGTERM signal.

This has been verified to make a difference by issuing a docker stop command as described in #215. The container now shuts down very quickly, while before it took the full timeout duration to shut down, which I interpret as it previously requiring a SIGKILL signal to be sent before it reacted at all.

I wrote out an exit code of 2, like for SIGINT; I'm not sure what makes sense here or whether it will matter anywhere.

Having this merged and released could improve Z2JH / BinderHub performance during upgrades, when draining nodes, etc., avoiding, I believe, a BinderHub going unresponsive for 30 seconds.

Closes jupyterhub#215 by listening for a SIGTERM signal.
@betatim
Member

betatim commented Nov 11, 2019

In general faster shutdowns/reaction to instructions is a good thing.

Why would the CHP be "unresponsive" in these 30s? I assumed it would just keep running, the Service would keep sending traffic and then suddenly when the SIGKILL comes it stops instantaneously.

I don't see a reason for us not to quit on SIGTERM already. It would be great if we could switch into a "connection draining" mode, or otherwise stop accepting new connections but still serve existing ones. However, I think this gets us into full-blown HAProxy land.


Education: I didn't have the k8s "pod deletion" sequence of signals ready in my head, so I googled and found https://pracucci.com/graceful-shutdown-of-kubernetes-pods.html. It mentions SIGTERM as the first signal that gets sent, with SIGKILL following 30s (or a "configurable grace period") later. Where does SIGINT fit in? It seems SIGINT is what a user sends when they press Ctrl-C.

@consideRatio
Member Author

Regarding unresponsive BinderHub: I recall there was a time when the pods on a node stopped getting traffic from the Kubernetes Service because they were located on a cordoned node or similar. Wasn't it so?

Or is it only terminating pods? This pod will be terminating for 30 seconds due to the termination grace period defaulting to that, so if the proxy needs to relocate, it will take at least those 30 seconds during its terminating sequence, which starts with a SIGTERM to its containers, I think.


I don't know if we can shut down more gracefully than this; I'm not at all read up on CHP.

@betatim
Member

betatim commented Nov 12, 2019

“I don't know if we can shut down more gracefully than this; I'm not at all read up on CHP.”

I don't think so, but that is a guess. Should be a new PR/issue though and not for this PR.

We do have the problem that a Service will only consider backends (the pods that actually serve the traffic) which are on nodes that aren't cordoned. In addition, Pods don't unschedule themselves from cordoned nodes, because the NoExecute:30s toleration doesn't work on GKE (yet).

Merging now because I think this is good as it is.

betatim merged commit 5745f40 into jupyterhub:master on Nov 12, 2019
consideRatio mentioned this pull request on Nov 12, 2019
@minrk
Member

minrk commented Nov 12, 2019

I don't understand why it's not terminating on SIGTERM already. This is the default behavior for a process. Has anyone confirmed that this change actually accomplishes shutdown on SIGTERM where it wasn't occurring before? I can locally verify that earlier releases of CHP exit on SIGTERM already.

@consideRatio
Member Author

@minrk I didn't verify 100% that it was SIGTERM specifically, but I tested docker stop on a container of the image built from the Dockerfile in this repo. The test can be reproduced quickly with a one-liner I wrote down in #215 (comment).

To reproduce the different behavior with docker stop, you could run that one-liner against various checked-out git commits.


I read this about docker stop (source):

[screenshot of the docker stop documentation]

@consideRatio
Member Author

@minrk I believe SIGTERM is passed to a process as a kind of input, and since SIGTERM is a friendly "please terminate", it can be ignored, while SIGKILL is managed by the kernel, which will simply kill the process no matter what it says. This is my current understanding. I assume the node process does not itself handle the SIGTERM signal, requiring us to consider it.

I'm inexperienced in this domain though, in both Linux termination signals and Node.

@minrk
Member

minrk commented Nov 13, 2019

You're absolutely right! I had assumed that node.js behaves like nearly every other program, where SIGTERM results in shutdown by default (you will see this with any Python or C program) and a SIGTERM handler is only used to register additional actions prior to a clean shutdown. But indeed SIGTERM is ignored by default in node.js, which is very unusual.

Lessons I've learned here:

  • Never assume a new environment (node.js) is like the one you're familiar with (Python or C)
  • Test your assumptions!

Successfully merging this pull request may close these issues.

SIGTERM ignored - we should handle it