Terminate on SIGTERM #217


Merged
merged 1 commit into from Nov 12, 2019

Conversation

consideRatio
Member

Closes #215 by listening for a SIGTERM signal.

This has been verified to make a difference by issuing a docker stop command as described in #215. The container now shuts down very quickly, while before it took the full timeout duration to shut down, which I interpret as it previously requiring a SIGKILL signal to be sent before it reacted at all.

I wrote out an exit code of 2, like for SIGINT; I'm not sure what makes sense here or whether it will matter anywhere.

Having this merged and released could improve Z2JH / BinderHub performance during upgrades, when draining nodes, etc., avoiding, I believe, a BinderHub going unresponsive for 30 seconds.

Closes jupyterhub#215 by listening for a SIGTERM signal.
@betatim
Member

betatim commented Nov 11, 2019

In general faster shutdowns/reaction to instructions is a good thing.

Why would the CHP be "unresponsive" in these 30s? I assumed it would just keep running, the Service would keep sending traffic and then suddenly when the SIGKILL comes it stops instantaneously.

I don't see a reason for us not to quit on SIGTERM already. It would be great if we could switch into a "connection draining" mode, or otherwise stop accepting new connections but still serve existing ones. However, I think this gets us into full-blown HAProxy land.


Education: I didn't have the k8s "pod deletion" sequence of signals ready in my head, so I googled and found https://pracucci.com/graceful-shutdown-of-kubernetes-pods.html. It mentions SIGTERM as the first signal that gets sent, with SIGKILL following 30s (or a "configurable grace period") later. Where does SIGINT fit in? It seems SIGINT is what a user sends when they press Ctrl-C.

@consideRatio
Member Author

Regarding unresponsive BinderHub: I recall there was a time when the pods on a node stopped getting traffic from the Kubernetes Service because they were located on a cordoned node or similar. Wasn't it so?

Or is it only terminating pods? This pod will be terminating for 30 seconds due to the termination grace period defaulting to that, so if the proxy needs to relocate, it will take at least those 30 seconds during its terminating sequence, which starts with a SIGTERM to its containers, I think.


I don't know if we can shut down more gracefully than this; I'm not at all read up on CHP.

@betatim
Member

betatim commented Nov 12, 2019

“I don't know if we can shut down more gracefully than this; I'm not at all read up on CHP.”

I don't think so, but that is a guess. Should be a new PR/issue though and not for this PR.

We do have the problem that a Service will only consider backends (the pods that actually serve the traffic) which are on nodes that aren't cordoned. In addition, Pods don't unschedule themselves from cordoned nodes, because the NoExecute:30s toleration doesn't work on GKE (yet).

Merging now because I think this is good as it is.

betatim merged commit 5745f40 into jupyterhub:master on Nov 12, 2019
consideRatio mentioned this pull request on Nov 12, 2019
@minrk
Member

minrk commented Nov 12, 2019

I don't understand why it's not terminating on SIGTERM already. This is the default behavior for a process. Has anyone confirmed that this change actually accomplishes shutdown on SIGTERM where it wasn't occurring before? I can locally verify that earlier releases of CHP exit on SIGTERM already.

@consideRatio
Member Author

@minrk I didn't verify 100% that it was SIGTERM specifically, but I tested docker stop on a container of the image built from the Dockerfile in this repo. The test can be reproduced quickly with a one-liner I wrote down in #215 (comment).

To reproduce the different behavior with docker stop, you could run that one-liner against various checked-out git commits.


I read this about docker stop (source):

[screenshot of the docker stop documentation]

@consideRatio
Member Author

@minrk I believe SIGTERM is passed to a process as a kind of input, and since SIGTERM is a friendly "please terminate", it can be ignored, while SIGKILL is managed by the kernel, which will simply kill the process no matter what it says. This is my current understanding. I assume the node process does not itself handle the SIGTERM signal, requiring us to consider it.

I'm inexperienced in this domain though, in both Linux termination signals and Node.

@minrk
Member

minrk commented Nov 13, 2019

You're absolutely right! I had assumed that node.js behaves like nearly every other program, where SIGTERM results in shutdown by default (you will see this with any Python or C program) and a SIGTERM handler is only used to register additional actions prior to a clean shutdown. But indeed SIGTERM is ignored by default in node.js, which is very unusual.

Lessons I've learned here:

  • Never assume a new environment (node.js) is like the one you're familiar with (Python or C)
  • Test your assumptions!

Successfully merging this pull request may close these issues.

SIGTERM ignored - we should handle it