Skip to content
This repository was archived by the owner on Oct 28, 2024. It is now read-only.

🐛 Fix webhook issue when there are multiple replication of controller #217

Conversation

vincent-pli
Copy link
Contributor

What this PR does / why we need it:
#215

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

@k8s-ci-robot
Copy link
Contributor

Hi @vincent-pli. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 15, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vincent-pli
To complete the pull request process, please assign christopherhein after the PR has been reviewed.
You can assign the PR to them by writing /assign @christopherhein in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 15, 2021
@gyliu513
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 15, 2021
Copy link
Contributor

@christopherhein christopherhein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than implementing this way can we maybe make this about removing the self-signing of certificates and setup the kustomize configs to auto annotate the controller to use a cert-manager based certificate like we get with normal Kubebuilder/controller runtime deployments?

@christopherhein christopherhein changed the title Fix webhook issue when there are multiple replication of controller 🐛 Fix webhook issue when there are multiple replication of controller Sep 16, 2021
@vincent-pli
Copy link
Contributor Author

@christopherhein I'm OK for both solution.
but cert-manger itself is a big application and consist of many CRDs to handle its operation, is that a good idea to expect user install cert-manager just to handle admission webhook TLS certificate and CA?

@christopherhein
Copy link
Contributor

Yes, but the idea behind moving away from the vc manager auto publishing the validatingwebhookconfiguration it's self and delegating that to the configuration will make this controller work "properly" eg as you found you can't have multiple replicas cause it independently deploys separate self-signed certs, we need a layer above at provisioning that writes the cert for us so we can just use it. By moving to cert-manager as a default means that someone could inject their own certificate management in the same phase and not need to care about using cert-manager. So this opens up more options for the user.

@kfox1111
Copy link

+1 to cert-manager. Better to rely on a project designed to handle certs for handling certs. And, cluster api already depends on it.

@christopherhein
Copy link
Contributor

We don't need to force users to use cert-manager but we should expect that if they choose not to use it that they manage certificates themselves.

cc @Fei-Guo @charleszheng44

@christopherhein
Copy link
Contributor

I would take inspiration from a default scaffolded kubebuilder project or even how the capn base project is setup with the configs/ directory it will use kustomize to auto-inject the right information but allows a user to also "opt out" of using cert-manager.

@Fei-Guo
Copy link

Fei-Guo commented Sep 17, 2021

I agree with building cert-manager integration for cert management in CAPN. Coming back to the fix, I think it is ok to have self-cert as long as it works without breaking fundamental architecture. We just don't recommend people to handle cert this way and provide cert-manager option next.

@christopherhein
Copy link
Contributor

I think the no-op that this is causing users is that if they use the ENABLE_WEBHOOK flag it will auto deploy the ValidatingWebhookConfiguration from each replica of the vc-manager that they run, meaning the certificate that is supplied to the webhook might only work for one of the running replicas in the cluster and your webhook request might hit the wrong instance causing it to return https://virtualcluster-webhook-service.kube-system.svc:9443/validate-tenancy-x-k8s-io-v1alpha1-virtualcluster?timeout=30s": x509: certificate signed by unknown authority.

So right now the architecture doesn't allow vc-manager to run as more than one replica for higher availability.

@vincent-pli
Copy link
Contributor Author

@christopherhein thanks for your comments.
Actually with this fix, the issue you presented will gone.
In HA case, only the replication who win the leader election will create admission webhook and sign the cert.
When fail-over occurred, the new winner will delete the older webhook and re-sign the cert, so the original issue will gone, we can have HA.

For cert-manager, I agree we can make integration at CAPN level, should be next step.

@christopherhein
Copy link
Contributor

Hmm, are you sure? I am 90% sure when you use the LeaderElectedRunnable interface like you are in this PR you are only exposing the webhook handlers on the leader meaning, running multiple replicas doesn't help you scale the webhook horizontally like in a traditional webhook deployment. So this doesn't actually make it HA you now just can support hot standby controllers/webhook handler which isn't the same.

@christopherhein
Copy link
Contributor

Maybe you could provide some samples showing a curl to the webhook endpoint on a non-leader running in a cluster then show the same against the leader and that would help me.

@crazywill
Copy link
Contributor

I test in my env, it doesn't work because virtualcluster-webhook-service use selector: virtualcluster-webhook=true.
How about setting Endpoints instead of using selector. I will try and test it.

@christopherhein
Copy link
Contributor

You should be able to port-forward and then curl the pod on the webhook port and see if it's listening.

@crazywill
Copy link
Contributor

It works when i delete the label selector in virtualcluster-webhook-service and make winner create Endpoints itself (delete first if exists).

vc-manager has three replication, and endpoint points to the winner pod as expected.
image

image

@vincent-pli
Copy link
Contributor Author

Thanks @christopherhein, I 100% use the LeaderElectedRunnable : )

The fix just resolve the conflict of cert generated but I forget the SVC, it act a role of load balance so will redirect validation request to a replication who has not set the http server(loser of leader election).

There is several solution to resolve the problem(as @crazywill mentioned) but I think the easiest way is to adopt cert-manager.

@christopherhein
Copy link
Contributor

Agreed, how do you want to handle this, did you want to use this PR to move over to cert-manager based?

@vincent-pli
Copy link
Contributor Author

No, I think we should have another one, this one will be close

@christopherhein
Copy link
Contributor

Sounds good, thank you!

Make sure to take a look at the CAPN config directory for inspiration and direction so that it will auto-deploy everything for you.

@christopherhein
Copy link
Contributor

Check out #194 for when that was implemented.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants