KEP-2699: Add webhook hosting capability to CCM framework

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP will detail enhancing the CCM framework to support cloud provider specific webhooks. The intent is to make it easy to either generate a binary or enhance the existing CCM binary to host such webhooks. We also intend to allow for easily linking in "standard" webhooks needed by other SIGs which need to be customized for particular cloud providers.

Motivation

The Cloud (Provider) Controller Manager (CCM) is the binary into which the Cloud Provider places all the controllers needed to make a Kubernetes cluster work correctly on their Cloud. There are also occasions when it makes sense for a Cloud Provider to want these customizations to be applied inline rather than asynchronously after a change has already been applied.

Our initial example of this comes from SIG Storage. They would like the functionality of the PVL admission controller (kubernetes/kubernetes#52617). This work needs to be finished before cloud provider extraction can complete. Several Cloud Providers have indicated that this should be done inline, especially as the existing deprecated solution is itself an inline solution.

Goals

Our immediate goal is to allow in-tree Cloud Providers to stop using the existing PVL admission controller and to provide the same functionality using this framework. However, we want to build a framework which will also be usable by solutions to similar problems. This KEP is about the framework needed to support the PVL webhook and not the webhook itself.

Non-Goals

We are not intending to create a general admission webhook solution. This is just intended to host Cloud Provider specific webhooks as part of the Control Plane.

Proposal

We will start by adding extension hooks which can be registered in cmd/cloud-controller-manager/main.go. This would be similar to the mechanism we already use to register new controllers. The existing sample demonstrates that mechanism by registering the nodeipamcontroller, which is not normally installed in the cloud controller manager. In a similar way, we will provide a sample that integrates a PVL mutating webhook into the sample CCM. The system will also automatically detect whether both controllers and webhooks are registered in the binary. If both are registered, it will automatically add command line flags allowing the webhooks and the controllers to be disabled. The controller flag will default to the controllers being enabled. The webhook flag will default to the webhooks being disabled. We would also like to provide a builder pattern for registering both the controller and webhook extensions.
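The flag behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the framework's implementation: the `registry` and `modes` types, the `resolveModes` function, and the flag names `--enable-controllers` and `--enable-webhooks` are all assumptions made for the example. It also assumes that a binary with only one kind of extension registered simply runs that kind and exposes no flags.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// registry is a hypothetical stand-in for what the Cloud Provider linked into
// the binary; it is not a type from the CCM framework.
type registry struct {
	controllers []string
	webhooks    []string
}

// modes records which halves of the binary should run.
type modes struct {
	runControllers bool
	runWebhooks    bool
}

// resolveModes adds the enable/disable flags only when both kinds of extension
// are registered, then parses args. Flag names are illustrative assumptions.
func resolveModes(r registry, args []string) (modes, error) {
	fs := flag.NewFlagSet("cloud-controller-manager", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep parse errors off stderr in this sketch

	// Assumption: with only one kind registered there is nothing to choose, so it runs.
	m := modes{
		runControllers: len(r.controllers) > 0,
		runWebhooks:    len(r.webhooks) > 0,
	}
	if len(r.controllers) > 0 && len(r.webhooks) > 0 {
		// Defaults follow the proposal: controllers enabled, webhooks disabled.
		fs.BoolVar(&m.runControllers, "enable-controllers", true, "run the registered controllers")
		fs.BoolVar(&m.runWebhooks, "enable-webhooks", false, "serve the registered webhooks")
	}
	if err := fs.Parse(args); err != nil {
		return modes{}, err
	}
	return m, nil
}

func main() {
	both := registry{controllers: []string{"nodeipamcontroller"}, webhooks: []string{"pvl"}}

	// The same binary can play two roles, one invocation per process.
	controllerProc, _ := resolveModes(both, nil)
	webhookProc, _ := resolveModes(both, []string{"--enable-controllers=false", "--enable-webhooks=true"})
	fmt.Printf("controller process: %+v\n", controllerProc)
	fmt.Printf("webhook process:    %+v\n", webhookProc)
}
```

Keeping the flags absent when only one kind is registered means existing controller-only binaries would see no change to their command line, which is one way to read the backward-compatibility intent of the proposal.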

Another issue to consider is how the mutating/admission webhook configuration is written into the cluster. This may depend on whether the Cloud Provider intends to run the webhook on the Control Plane or on the Cluster. We recommend running it on the Control Plane. However, for some Cloud Providers that can lead to special issues with the configuration. As such, we will provide a flag which enables the service to automatically register the webhooks as part of startup. That functionality can be disabled, allowing the Cloud Provider to do their own custom registration as part of cluster setup.
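As an illustration of what automatic registration might write, the sketch below builds the JSON for a MutatingWebhookConfiguration that targets PersistentVolume creation. The structs mirror a subset of the admissionregistration.k8s.io/v1 fields so that the example needs only the standard library; a real implementation would more likely use the typed Kubernetes client. The webhook name, the URL, the choice of CREATE only, and the `registerOnStartup` helper with its `autoRegister` switch are assumptions for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// These structs mirror a subset of admissionregistration.k8s.io/v1 so that the
// sketch stays free of external dependencies.
type rule struct {
	Operations  []string `json:"operations"`
	APIGroups   []string `json:"apiGroups"`
	APIVersions []string `json:"apiVersions"`
	Resources   []string `json:"resources"`
}

type clientConfig struct {
	URL      string `json:"url"`
	CABundle []byte `json:"caBundle,omitempty"` // encoding/json emits []byte as base64
}

type webhook struct {
	Name                    string       `json:"name"`
	ClientConfig            clientConfig `json:"clientConfig"`
	Rules                   []rule       `json:"rules"`
	FailurePolicy           string       `json:"failurePolicy"`
	SideEffects             string       `json:"sideEffects"`
	AdmissionReviewVersions []string     `json:"admissionReviewVersions"`
	TimeoutSeconds          int32        `json:"timeoutSeconds"`
}

type metadata struct {
	Name string `json:"name"`
}

type mutatingWebhookConfiguration struct {
	APIVersion string    `json:"apiVersion"`
	Kind       string    `json:"kind"`
	Metadata   metadata  `json:"metadata"`
	Webhooks   []webhook `json:"webhooks"`
}

// pvlWebhookConfiguration returns a configuration for a PV-labelling webhook.
// The names and the URL passed in are hypothetical.
func pvlWebhookConfiguration(url string, caBundle []byte) mutatingWebhookConfiguration {
	return mutatingWebhookConfiguration{
		APIVersion: "admissionregistration.k8s.io/v1",
		Kind:       "MutatingWebhookConfiguration",
		Metadata:   metadata{Name: "cloud-provider-pvl"},
		Webhooks: []webhook{{
			Name:         "pvl.cloud-provider.example.com",
			ClientConfig: clientConfig{URL: url, CABundle: caBundle},
			Rules: []rule{{
				Operations:  []string{"CREATE"},
				APIGroups:   []string{""}, // core API group
				APIVersions: []string{"v1"},
				Resources:   []string{"persistentvolumes"},
			}},
			FailurePolicy:           "Ignore", // the default this KEP proposes
			SideEffects:             "None",
			AdmissionReviewVersions: []string{"v1"},
			TimeoutSeconds:          10,
		}},
	}
}

// registerOnStartup applies cfg only when automatic registration is enabled;
// otherwise the Cloud Provider is expected to install it during cluster setup.
func registerOnStartup(autoRegister bool, cfg mutatingWebhookConfiguration, apply func(mutatingWebhookConfiguration) error) error {
	if !autoRegister {
		return nil
	}
	return apply(cfg)
}

func main() {
	cfg := pvlWebhookConfiguration("https://ccm-webhook.example.com/persistentvolumes", nil)
	err := registerOnStartup(true, cfg, func(c mutatingWebhookConfiguration) error {
		out, err := json.MarshalIndent(c, "", "  ")
		if err != nil {
			return err
		}
		fmt.Println(string(out)) // stand-in for a create-or-update call to the API server
		return nil
	})
	if err != nil {
		fmt.Println("registration failed:", err)
	}
}
```

Passing the apply step in as a function keeps the startup decision separate from the mechanism, so a provider that disables automatic registration never needs credentials to write webhook configurations.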

User Stories (Optional)

The users of this KEP are Cloud Providers and feature developers whose code impacts Cloud Providers. The intent is to make it easy for them both to develop features and to maintain the CCM controllers and webhooks across multiple versions. At the same time, we are attempting to make it easy for the SIGs to write controllers or webhooks which do what they know needs to be done and which can be integrated into Cloud Provider specific processes. We would like to do that in a way which makes merging upgrades relatively painless.

Story 1

Some Cloud Providers would prefer to keep controllers and webhooks in different processes. They have concerns about running batch controllers in the same process as webhooks, which are "inline" and time sensitive. For these users it is easy to either build two different binaries or have the same binary act as two different binaries based on command line flags.

Story 2

For Cloud Providers who would like to keep things simple, it is easy to create a single process which handles both controllers and webhooks.

Story 3

PVL use case. Cloud Providers want to allow customers to migrate an existing workload to Kubernetes. That workload uses an existing persistent volume. To migrate the workload, the end user needs to be able to link the existing PV into the cluster. However, for certain kinds of storage this requires an association which can only be established by calling out to the cloud provider. Ideally, the lookup and labelling of the PV against that pre-existing storage happens inline when the PV is written. That ensures the right volume is attached to the Node/Pod when it is scheduled and that there are no race conditions.

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

Running webhooks and controllers in the same process could cause problems. Delays of 10 seconds or more can cause webhooks to fail. It is important to understand that, irrespective of the failure mode in the webhook configuration, timeouts will always turn a webhook call into a FAIL. For that reason we are making it easy to split the CCM into two processes. It will be up to the Cloud Provider to determine whether they want the webhook policy to be FAIL or IGNORE. We will default to IGNORE as it is the safe option. Incorrectly setting FAIL can quickly lead to a non-functional cluster. Having a FAIL policy on Pods, for example, can prevent the system from bringing up the webhook service, which prevents the webhook from ever passing.
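One way a webhook handler might protect itself against the timeout risk described above is to give the cloud call its own budget, shorter than the webhook timeout, and fall back to an unlabelled admit when the budget is exhausted. The sketch below illustrates that idea under stated assumptions: `cloudLookup`, `lookupWithBudget`, `fakeCloud`, and `admit` are hypothetical names, and the fallback behaviour is a design choice for the example, not something this KEP prescribes.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// cloudLookup is a hypothetical signature for a provider call that resolves a
// label value such as a zone.
type cloudLookup func(ctx context.Context) (string, error)

// lookupWithBudget runs lookup but stops waiting once budget has elapsed, so
// the handler can still answer before the API server gives up on the webhook.
func lookupWithBudget(parent context.Context, budget time.Duration, lookup cloudLookup) (string, error) {
	ctx, cancel := context.WithTimeout(parent, budget)
	defer cancel()

	type result struct {
		zone string
		err  error
	}
	ch := make(chan result, 1) // buffered so the goroutine never blocks after a timeout
	go func() {
		z, err := lookup(ctx)
		ch <- result{z, err}
	}()

	select {
	case r := <-ch:
		return r.zone, r.err
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// fakeCloud simulates a provider API that answers after delay unless the
// context is cancelled first.
func fakeCloud(delay time.Duration, zone string) cloudLookup {
	return func(ctx context.Context) (string, error) {
		select {
		case <-time.After(delay):
			return zone, nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
}

// admit reports the zone to apply, or labelled=false when the lookup did not
// finish within budget and the object should be admitted unchanged.
func admit(budget, cloudDelay time.Duration) (zone string, labelled bool) {
	z, err := lookupWithBudget(context.Background(), budget, fakeCloud(cloudDelay, "zone-a"))
	if err != nil {
		return "", false
	}
	return z, true
}

func main() {
	fmt.Println(admit(500*time.Millisecond, time.Millisecond)) // cloud answers in time
	fmt.Println(admit(20*time.Millisecond, 2*time.Second))     // cloud is too slow
}
```

A real handler would pass the incoming request's context as the parent instead of `context.Background()`, so that a cancelled request also stops the cloud call.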

Webhooks are configured by a runtime resource. As a consequence, this configuration can be modified or deleted at runtime. That means that an admin on the cluster can disable or alter the functionality, which potentially makes it harder for a Cloud Provider to enforce that this logic is being applied. It also means that there needs to be a deployment mechanism for the webhook. It is left to the Cloud Provider to determine whether the need for an inline request is sufficient to override these concerns. The Cloud Provider can alternatively use the controller route, which is not inline, or use an actual admission controller built into the APIServer.

We are actually changing the framework which generates the CCM and not the CCM itself. It has been pointed out that it is not the role of the controller manager to run webhooks: controller managers should run controllers, and webhooks are not controllers. As we are modifying the framework, we should keep this in mind, since the framework can be used to create two processes: a CCM which contains only controllers, and a separate Cloud Webhook Manager. Building the latter is left as homework for the Cloud Provider. However, the sample CCM which demonstrates how this will be done will have both in the same binary to keep the sample simple.

Design Details

A sample of how the Builder pattern might look is:

cmOptions, err := options.NewCloudManagerOptions()
if err != nil {
	klog.Fatalf("unable to initialize command options: %v", err)
}
fss := cliflag.NamedFlagSets{}
cloudManagerBuilder := app.NewCloudManagerBuilder("name")
cloudManagerBuilder.setOptions(cmOptions)
cloudManagerBuilder.setFlags(fss)
// Each registration pairs a list of GroupVersionKinds with its handler.
cloudManagerBuilder.registerWebhook(gvkList, handler)
cloudManagerBuilder.registerWebhook(gvkSecondList, secondHandler)
manager, err := cloudManagerBuilder(wait.NeverStop)
if err != nil {
	klog.Fatalf("unable to construct cloud manager: %v", err)
}
// Start the manager constructed above, reusing the existing err variable.
err = manager.start()
if err != nil {
	klog.Fatalf("unable to start cloud manager: %v", err)
}

This will not alter the existing extension hooks in the controller manager framework, as they are critical for backward compatibility. The builders are meant to be an abstraction layer on top to make the extensions easier to use. So for the existing controller manager code you might see changes like:

cloudControllerManagerBuilder.registerController("nodeipamcontroller", handler)
cloudControllerManagerBuilder.deregisterController("servicecontroller")
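To make the idea of a builder layered over the existing hooks more concrete, here is a minimal, self-contained sketch. The method names follow the samples above, but everything else is an assumption for illustration: the `controllerInit` and `webhookHandler` signatures, the use of plain strings for GroupVersionKinds, the rule that a name or GVK may be registered only once, and the explicit `build` method.

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical signatures; the real framework types may differ.
type controllerInit func(stop <-chan struct{}) error
type webhookHandler func(admissionReview []byte) ([]byte, error)

// cloudManagerBuilder collects registrations and reports conflicts at build time.
type cloudManagerBuilder struct {
	name        string
	controllers map[string]controllerInit
	webhooks    map[string]webhookHandler // keyed by GroupVersionKind string
	errs        []error
}

// newCloudManagerBuilder starts from the controllers the framework installs by default.
func newCloudManagerBuilder(name string, defaults map[string]controllerInit) *cloudManagerBuilder {
	b := &cloudManagerBuilder{
		name:        name,
		controllers: map[string]controllerInit{},
		webhooks:    map[string]webhookHandler{},
	}
	for n, c := range defaults {
		b.controllers[n] = c
	}
	return b
}

func (b *cloudManagerBuilder) registerController(name string, fn controllerInit) *cloudManagerBuilder {
	if _, dup := b.controllers[name]; dup {
		b.errs = append(b.errs, fmt.Errorf("controller %q already registered", name))
		return b
	}
	b.controllers[name] = fn
	return b
}

func (b *cloudManagerBuilder) deregisterController(name string) *cloudManagerBuilder {
	delete(b.controllers, name)
	return b
}

func (b *cloudManagerBuilder) registerWebhook(gvks []string, h webhookHandler) *cloudManagerBuilder {
	for _, gvk := range gvks {
		if _, dup := b.webhooks[gvk]; dup {
			b.errs = append(b.errs, fmt.Errorf("webhook for %q already registered", gvk))
			continue
		}
		b.webhooks[gvk] = h
	}
	return b
}

// cloudManager lists what the built binary would run.
type cloudManager struct {
	controllers []string
	webhooks    []string
}

// build validates the registrations and returns the assembled manager.
func (b *cloudManagerBuilder) build() (*cloudManager, error) {
	if len(b.errs) > 0 {
		return nil, b.errs[0]
	}
	m := &cloudManager{}
	for n := range b.controllers {
		m.controllers = append(m.controllers, n)
	}
	for g := range b.webhooks {
		m.webhooks = append(m.webhooks, g)
	}
	sort.Strings(m.controllers)
	sort.Strings(m.webhooks)
	return m, nil
}

func main() {
	noop := func(stop <-chan struct{}) error { return nil }
	echo := func(in []byte) ([]byte, error) { return in, nil }

	m, err := newCloudManagerBuilder("sample", map[string]controllerInit{"servicecontroller": noop}).
		registerController("nodeipamcontroller", noop).
		deregisterController("servicecontroller").
		registerWebhook([]string{"/v1, Kind=PersistentVolume"}, echo).
		build()
	if err != nil {
		fmt.Println("build failed:", err)
		return
	}
	fmt.Println("controllers:", m.controllers)
	fmt.Println("webhooks:", m.webhooks)
}
```

Collecting errors in the builder and surfacing them from `build` keeps the registration calls chainable, at the cost of reporting a conflict later than the call that caused it.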

Test Plan

Graduation Criteria

Alpha

  • Have the sample CCM come up and able to serve PVL mutating webhook.

Beta

  • TBD

GA

  • TBD

Note: Generally we also wait at least two releases between beta and GA/stable, because there's no opportunity for user feedback, or even bug reports, in back-to-back releases.

For non-optional features moving to GA, the graduation criteria must include conformance tests.

Deprecation

  • Not deprecated

Upgrade / Downgrade Strategy

  • Upgrade is not believed to be an issue at this point.
  • Currently we are leaving upgrade as an issue for the Cloud Provider

Version Skew Strategy

  • We are currently assuming that this will be deployed as part of the control plane. We assume it will be upgraded with the KAS, KCM and CCM.

Production Readiness Review Questionnaire

  • TBD

Feature Enablement and Rollback

  • TBD
How can this feature be enabled / disabled in a live cluster?

This will be built into the CCM by the Cloud Provider. Code must be written specifically by the Cloud Provider to enable this feature.

Does enabling the feature change any default behavior?

This cannot just be "enabled".

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

If you build using our framework, then you will be able to disable it using a command line flag. It can also be disabled by changing the admission webhook configuration.

What happens if we reenable the feature if it was previously rolled back?

For new update requests it will work. However it will not change any persisted resources, unless they are rewritten.

Are there any tests for feature enablement/disablement?
  • TBD

Rollout, Upgrade and Rollback Planning

  • TBD
How can a rollout or rollback fail? Can it impact already running workloads?
  • TBD
What specific metrics should inform a rollback?
  • TBD
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
  • TBD
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
  • TBD

Monitoring Requirements

  • TBD
How can an operator determine if the feature is in use by workloads?

By examining the admission webhook configuration.

How can someone using this feature know that it is working for their instance?
  • TBD
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
  • TBD
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name:
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
  • TBD

Dependencies

  • TBD
Does this feature depend on any specific services running in the cluster?
  • It relies on mutating/validating admission webhooks.

Scalability

  • The webhooks have an advantage that they can be more easily scaled than controllers.
Will enabling / using this feature result in any new API calls?

It requires a new admission webhook call.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

Depends on the Cloud Provider's implementation.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Yes, in the same way that any additional admission webhook call does. It is worth noting that the Cloud Provider has the option of using a controller instead, at least for the PVL case. However, that is not the preferred mechanism. This is an optional extension mechanism.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
  • TBD

Troubleshooting

  • This is an admission webhook server. Such servers already exist, and their troubleshooting mechanisms should apply here as well.
How does this feature react if the API server and/or etcd is unavailable?
  • This feature does not apply unless the API server is functional.
What are other known failure modes?

Timeouts on webhooks act as failures, so any resource sent to the CCM will fail if it times out.

What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

  • TBD

Drawbacks

  • TBD

Alternatives

The primary alternative is to use controllers to solve all the problems. This is an issue for things which need to be done inline. If it is not acceptable for state to be missing from a resource between creation and usage, then controllers are a problem.

Initializers solve the problem between creation and usage, however this solution has been deprecated.

Infrastructure Needed (Optional)

  • TBD