Ext-Proc Cluster Health Checking #240

Open
danehans opened this issue Jan 28, 2025 · 4 comments
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.

Comments

@danehans
Contributor

Currently, the Envoy static config used for e2e testing does not implement health checking. Envoy and EPP are configured for Kubelet health checking:

Envoy logs:

[2025-01-28 22:03:20.933][24][debug][http] [source/common/http/conn_manager_impl.cc:1183] [Tags: "ConnectionId":"168","StreamId":"3271118745624594159"] request headers complete (end_stream=true):
':authority', '10.244.0.174:19001'
':path', '/ready'
':method', 'GET'
'user-agent', 'kube-probe/1.27'
'accept', '*/*'
'connection', 'close'

EPP logs:

I0128 21:58:29.531032       1 health.go:22] gRPC health check serving: service:"inference-extension"

However, no active health checking is configured for the ext_proc cluster. Should this be added?

xref

@danehans
Contributor Author

cc: @ahg-g

@ahg-g
Contributor

ahg-g commented Jan 28, 2025

What would that look like?

@danehans
Contributor Author

gRPC health_checks for the ext-proc cluster. Something like this:

clusters:
  - name: ext_proc
    ...
    health_checks:
      - timeout: 2s                # How long Envoy waits for a health-check response
        interval: 5s               # How often Envoy sends health checks
        unhealthy_threshold: 2     # Number of consecutive failures to mark EPP unhealthy
        healthy_threshold: 2       # Number of consecutive successes to mark EPP healthy
        grpc_health_check:
          # See https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/manifests/ext_proc.yaml#L91-L93
          service_name: "inference-extension" 
          authority: "$INFER_EXT_SVC_NAME.$INFER_EXT_SVC_NS:$INFER_EXT_HEALTH_PORT"
...

If the config above works as expected with the static Envoy config, consider including it in the ext-proc protocol spec.
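
One detail worth checking when testing this: in Envoy, the `authority` field of `grpc_health_check` only sets the `:authority` header of the health-check request. It does not change which port Envoy connects to. By default, health checks go to the same address and port as the cluster endpoint. If the EPP serves gRPC health on a different port than ext_proc (an assumption here, suggested by the separate `$INFER_EXT_HEALTH_PORT` placeholder), the endpoint would likely need a `health_check_config` override. A minimal sketch, where `$INFER_EXT_PORT` is a hypothetical placeholder for the ext_proc serving port and all other cluster settings are omitted:

```yaml
clusters:
  - name: ext_proc
    # Other cluster settings (type, timeouts, HTTP/2 options) omitted from this sketch.
    health_checks:
      - timeout: 2s
        interval: 5s
        unhealthy_threshold: 2
        healthy_threshold: 2
        grpc_health_check:
          service_name: "inference-extension"
    load_assignment:
      cluster_name: ext_proc
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: $INFER_EXT_SVC_NAME.$INFER_EXT_SVC_NS
                    # Hypothetical placeholder: the port that serves ext_proc traffic.
                    port_value: $INFER_EXT_PORT
                # Assumption: health is served on a separate port, so
                # health checks are redirected there instead of port_value above.
                health_check_config:
                  port_value: $INFER_EXT_HEALTH_PORT
```

If health and ext_proc share one port, the `health_check_config` block can be dropped and the original snippet should be sufficient.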

@ahg-g
Contributor

ahg-g commented Feb 3, 2025

Sounds good to me.

@kfswain kfswain added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Feb 19, 2025