Introduce keep alive and fix AWS Load Balancer 502 errors #491
Conversation
Thanks for submitting your first pull request! You are awesome! 🤗
Thanks! I think enabling keep-alive makes sense, and exposing the timeout as an option is sensible as well. What I'm trying to understand is the addition of the keepaliveagent package instead of using the standard-library server.keepAliveTimeout. Can you speak to why it's needed beyond setting that?
#481 enables keep-alive, which I think makes sense, and combining it with this PR to expose the keep-alive timeout as an option seems like the right way to go, unless keepaliveagent solves a problem I'm not quite seeing.
I have done some experiments and concluded that keep-alive should be supported in both directions (client side: the Load Balancer, and server side: JupyterHub or Jupyter Server). But my experiment was not well organized enough to be shared. I will do the experiment again and share it here.
I have done a simple experiment again. I opened a shell inside the proxy pod which is deployed by Z2JH. Then I executed CASE 1)
Can you test with #492? It seems to enable keep-alive all the way through for proxied requests from Tornado.
Actually, there seems to be something weird where we can't use a single agent for keep-alive on both http and https with the standard library (bizarre), so I think maybe this PR is the way to go.
For the
Ok. I will post. Also, I set up TLS termination on the LB, so all my tests are done using HTTP.
Fix: #492 has an issue.
I added a curl test and a netcat test.

Curl Test

ps result: keep-alive works.

Netcat Test

The nc test was done manually, very naive.

CASE 1: Timeout=15000, request, wait 10 seconds, request again.
ps result: it should keep the connection alive after 10 seconds, and it actually does.

CASE 2: Timeout=15000, request, wait 20 seconds, request again.
ps result: it should close the connection after 20 seconds, and it actually does.

Conclusion

I think #492 works. I thought #492 would not respect the given timeout. However, the standard library does respect it.
I'm running a Z2JH-based service with about 1,000 DAU. It is deployed in AWS EKS behind an AWS ALB.

As DAU grew, users started to get 502 responses from the LB.

This is a well-known problem related to keep-alive settings. (AWS Article)
Unfortunately, configurable-http-proxy does not support keep-alive. So I implemented it and tested it in a production environment.
After the deployment, the number of 502 errors decreased.

Technical/Implementation detail

It is very important to allow keep-alive on both the client side and the server side. That's why Agent and keepAliveTimeout are both needed. JupyterHub and Jupyter Server support keep-alive by default, because they are Tornado servers.

chp is given these parameters. They are AWS-specific values.