🐛 Fix webhook issue when there are multiple replication of controller #217

vincent-pli · 2021-09-15T09:46:31Z

What this PR does / why we need it:
#215

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

k8s-ci-robot · 2021-09-15T09:46:39Z

Hi @vincent-pli. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2021-09-15T09:46:42Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vincent-pli
To complete the pull request process, please assign christopherhein after the PR has been reviewed.
You can assign the PR to them by writing /assign @christopherhein in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

virtualcluster/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gyliu513 · 2021-09-15T09:47:40Z

/ok-to-test

christopherhein

Rather than implementing this way can we maybe make this about removing the self-signing of certificates and setup the kustomize configs to auto annotate the controller to use a cert-manager based certificate like we get with normal Kubebuilder/controller runtime deployments?

vincent-pli · 2021-09-17T08:24:39Z

@christopherhein I'm OK for both solution.
but cert-manger itself is a big application and consist of many CRDs to handle its operation, is that a good idea to expect user install cert-manager just to handle admission webhook TLS certificate and CA?

christopherhein · 2021-09-17T15:51:26Z

Yes, but the idea behind moving away from the vc manager auto publishing the validatingwebhookconfiguration it's self and delegating that to the configuration will make this controller work "properly" eg as you found you can't have multiple replicas cause it independently deploys separate self-signed certs, we need a layer above at provisioning that writes the cert for us so we can just use it. By moving to cert-manager as a default means that someone could inject their own certificate management in the same phase and not need to care about using cert-manager. So this opens up more options for the user.

kfox1111 · 2021-09-17T16:00:47Z

+1 to cert-manager. Better to rely on a project designed to handle certs for handling certs. And, cluster api already depends on it.

christopherhein · 2021-09-17T17:07:48Z

We don't need to force users to use cert-manager but we should expect that if they choose not to use it that they manage certificates themselves.

cc @Fei-Guo @charleszheng44

christopherhein · 2021-09-17T17:09:00Z

I would take inspiration from a default scaffolded kubebuilder project or even how the capn base project is setup with the configs/ directory it will use kustomize to auto-inject the right information but allows a user to also "opt out" of using cert-manager.

Fei-Guo · 2021-09-17T18:12:55Z

I agree with building cert-manager integration for cert management in CAPN. Coming back to the fix, I think it is ok to have self-cert as long as it works without breaking fundamental architecture. We just don't recommend people to handle cert this way and provide cert-manager option next.

christopherhein · 2021-09-17T21:57:48Z

I think the no-op that this is causing users is that if they use the ENABLE_WEBHOOK flag it will auto deploy the ValidatingWebhookConfiguration from each replica of the vc-manager that they run, meaning the certificate that is supplied to the webhook might only work for one of the running replicas in the cluster and your webhook request might hit the wrong instance causing it to return https://virtualcluster-webhook-service.kube-system.svc:9443/validate-tenancy-x-k8s-io-v1alpha1-virtualcluster?timeout=30s": x509: certificate signed by unknown authority.

So right now the architecture doesn't allow vc-manager to run as more than one replica for higher availability.

vincent-pli · 2021-09-18T00:19:14Z

@christopherhein thanks for your comments.
Actually with this fix, the issue you presented will gone.
In HA case, only the replication who win the leader election will create admission webhook and sign the cert.
When fail-over occurred, the new winner will delete the older webhook and re-sign the cert, so the original issue will gone, we can have HA.

For cert-manager, I agree we can make integration at CAPN level, should be next step.

christopherhein · 2021-09-18T01:10:17Z

Hmm, are you sure? I am 90% sure when you use the LeaderElectedRunnable interface like you are in this PR you are only exposing the webhook handlers on the leader meaning, running multiple replicas doesn't help you scale the webhook horizontally like in a traditional webhook deployment. So this doesn't actually make it HA you now just can support hot standby controllers/webhook handler which isn't the same.

christopherhein · 2021-09-18T01:11:47Z

Maybe you could provide some samples showing a curl to the webhook endpoint on a non-leader running in a cluster then show the same against the leader and that would help me.

crazywill · 2021-09-18T03:59:11Z

I test in my env, it doesn't work because virtualcluster-webhook-service use selector: virtualcluster-webhook=true.
How about setting Endpoints instead of using selector. I will try and test it.

christopherhein · 2021-09-18T07:12:06Z

You should be able to port-forward and then curl the pod on the webhook port and see if it's listening.

crazywill · 2021-09-18T08:28:52Z

It works when i delete the label selector in virtualcluster-webhook-service and make winner create Endpoints itself (delete first if exists).

vc-manager has three replication, and endpoint points to the winner pod as expected.

vincent-pli · 2021-09-19T03:43:57Z

Thanks @christopherhein, I 100% use the LeaderElectedRunnable : )

The fix just resolve the conflict of cert generated but I forget the SVC, it act a role of load balance so will redirect validation request to a replication who has not set the http server(loser of leader election).

There is several solution to resolve the problem(as @crazywill mentioned) but I think the easiest way is to adopt cert-manager.

christopherhein · 2021-09-20T19:16:33Z

Agreed, how do you want to handle this, did you want to use this PR to move over to cert-manager based?

vincent-pli · 2021-09-22T00:11:51Z

No, I think we should have another one, this one will be close

christopherhein · 2021-09-22T00:50:16Z

Sounds good, thank you!

Make sure to take a look at the CAPN config directory for inspiration and direction so that it will auto-deploy everything for you.

christopherhein · 2021-09-22T00:51:13Z

Check out #194 for when that was implemented.

fix webhook issue when there are multiple replication of controller

2315dcd

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 15, 2021

k8s-ci-robot requested review from charleszheng44 and Fei-Guo September 15, 2021 09:46

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 15, 2021

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 15, 2021

christopherhein suggested changes Sep 16, 2021

View reviewed changes

christopherhein changed the title ~~Fix webhook issue when there are multiple replication of controller~~ 🐛 Fix webhook issue when there are multiple replication of controller Sep 16, 2021

vincent-pli closed this Sep 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🐛 Fix webhook issue when there are multiple replication of controller #217

🐛 Fix webhook issue when there are multiple replication of controller #217

vincent-pli commented Sep 15, 2021

k8s-ci-robot commented Sep 15, 2021

k8s-ci-robot commented Sep 15, 2021

gyliu513 commented Sep 15, 2021

christopherhein left a comment

vincent-pli commented Sep 17, 2021

christopherhein commented Sep 17, 2021

kfox1111 commented Sep 17, 2021

christopherhein commented Sep 17, 2021

christopherhein commented Sep 17, 2021

Fei-Guo commented Sep 17, 2021

christopherhein commented Sep 17, 2021

vincent-pli commented Sep 18, 2021

christopherhein commented Sep 18, 2021

christopherhein commented Sep 18, 2021

crazywill commented Sep 18, 2021

christopherhein commented Sep 18, 2021

crazywill commented Sep 18, 2021

vincent-pli commented Sep 19, 2021

christopherhein commented Sep 20, 2021

vincent-pli commented Sep 22, 2021

christopherhein commented Sep 22, 2021

christopherhein commented Sep 22, 2021

🐛 Fix webhook issue when there are multiple replication of controller #217

🐛 Fix webhook issue when there are multiple replication of controller #217

Conversation

vincent-pli commented Sep 15, 2021

k8s-ci-robot commented Sep 15, 2021

k8s-ci-robot commented Sep 15, 2021

gyliu513 commented Sep 15, 2021

christopherhein left a comment

Choose a reason for hiding this comment

vincent-pli commented Sep 17, 2021

christopherhein commented Sep 17, 2021

kfox1111 commented Sep 17, 2021

christopherhein commented Sep 17, 2021

christopherhein commented Sep 17, 2021

Fei-Guo commented Sep 17, 2021

christopherhein commented Sep 17, 2021

vincent-pli commented Sep 18, 2021

christopherhein commented Sep 18, 2021

christopherhein commented Sep 18, 2021

crazywill commented Sep 18, 2021

christopherhein commented Sep 18, 2021

crazywill commented Sep 18, 2021

vincent-pli commented Sep 19, 2021

christopherhein commented Sep 20, 2021

vincent-pli commented Sep 22, 2021

christopherhein commented Sep 22, 2021

christopherhein commented Sep 22, 2021