This repository was archived by the owner on Oct 28, 2024. It is now read-only.

✨ Control Plane Upgrades #290

Merged · 13 commits · Jul 20, 2022

Conversation

@m-messiah (Contributor) commented Jul 19, 2022

What this PR does / why we need it:
The PR adds the featureGate ClusterVersionPartialUpgrade, which allows the native provisioner to reconcile currently running virtual clusters and re-apply ClusterVersion specs to the control plane.

### Changes to the default behaviour or code

The PR introduces changes to the current native provisioner (the Aliyun provisioner just returns "not implemented" for the feature):

  1. Using server-side Apply instead of Create lets us reuse the same methods to re-apply state, instead of failing with AlreadyExists conflicts.
  2. Kubernetes client requests now share the current context instead of context.TODO().
  3. Each Service and StatefulSet defined in a ClusterVersion must contain apiVersion and kind (they are set in tests, but could be omitted in the real world; using Patch requires them to be set).
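To illustrate point 3, here is a sketch of a ClusterVersion fragment with apiVersion and kind set explicitly on each embedded template; the exact schema and field names are assumptions for illustration, not taken from this PR:

```yaml
# Hypothetical ClusterVersion fragment; field names and schema are assumptions.
apiVersion: tenancy.x-k8s.io/v1alpha1
kind: ClusterVersion
metadata:
  name: v1.19-test
spec:
  apiServer:
    statefulset:
      apiVersion: apps/v1    # must be set explicitly: Patch/Apply needs TypeMeta
      kind: StatefulSet
      metadata:
        name: apiserver
    service:
      apiVersion: v1         # likewise for the embedded Service template
      kind: Service
      metadata:
        name: apiserver-svc
```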

### Features of the featureGate

  1. The change requires control plane StatefulSets to use RollingUpdateStatefulSetStrategyType instead of OnDelete, to allow rollouts after an upgrade.
  2. The APIServer and ControllerManager do not watch for certificate changes on disk, so the StatefulSets are annotated to force pod recreation when the certificates change. In practice, the certificates are refreshed on each upgrade, while the root CA remains untouched.
  3. etcd already reloads certificates from disk, so it does not need to be annotated.
  4. Two metrics are introduced:
```
# HELP clusters_upgrade_seconds Duration of cluster upgrade by reconciler in featuregate.ClusterVersionPartialUpgrade
# TYPE clusters_upgrade_seconds histogram
clusters_upgrade_seconds_bucket{cluster_version="v1.19-test",resource_version="857463552",le="10"} 1
clusters_upgrade_seconds_bucket{cluster_version="v1.19-test",resource_version="857463552",le="+Inf"} 1
clusters_upgrade_seconds_sum{cluster_version="v1.19-test",resource_version="857463552"} 7.5169857780000005
clusters_upgrade_seconds_count{cluster_version="v1.19-test",resource_version="857463552"} 1
# HELP clusters_upgraded Amount of clusters upgraded by reconciler in featuregate.ClusterVersionPartialUpgrade
# TYPE clusters_upgraded counter
clusters_upgraded{cluster_version="v1.19-test",resource_version="857463552"} 1
```
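Forcing pod recreation on certificate change (point 2 in the feature list above) is typically done by stamping the pod template with a hash of the certificate data, so any rotation changes the template and triggers a rolling update. A minimal stdlib-only sketch of that idea follows; the annotation key and helper names are hypothetical, not the PR's actual code:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// certHashAnnotation is a hypothetical annotation key; the real key used by
// the PR may differ.
const certHashAnnotation = "tenancy.x-k8s.io/cert-hash"

// hashCerts returns a deterministic hash over the certificate files, sorted
// by name so that map iteration order does not affect the result.
func hashCerts(certs map[string][]byte) string {
	names := make([]string, 0, len(certs))
	for name := range certs {
		names = append(names, name)
	}
	sort.Strings(names)

	h := sha256.New()
	for _, name := range names {
		h.Write([]byte(name))
		h.Write(certs[name])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	certs := map[string][]byte{
		"apiserver.crt": []byte("---cert v1---"),
		"apiserver.key": []byte("---key v1---"),
	}
	// Setting this annotation on the pod template changes the template,
	// which triggers a rolling update of the StatefulSet.
	annotations := map[string]string{certHashAnnotation: hashCerts(certs)}
	fmt.Println(annotations[certHashAnnotation])

	// Rotating a certificate produces a different hash, forcing pod recreation.
	certs["apiserver.crt"] = []byte("---cert v2---")
	fmt.Println(hashCerts(certs) != annotations[certHashAnnotation]) // true
}
```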

@k8s-ci-robot (Contributor)

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with `/test all`

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 19, 2022
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 19, 2022
@m-messiah (Contributor, Author)

The PR is currently a draft and awaits further integration tests for the new feature.

@Fei-Guo commented Jul 19, 2022

I am trying to understand this feature. In the original design, we intended to make the clusterversionName field immutable in the VC CRD. Are you targeting a use case where the user changes clusterversionName, or do you want to handle cases where the ClusterVersion CR itself is updated and populate the change to all running VCs?

@m-messiah (Contributor, Author)

> I am trying to understand this feature. In the original design, we intend to make clusterversionName field immutable in VC CRD. Are you targeting a user case where user changes the clusterversionName? or you want to handle cases where the clusterversion CR itself is updated and you want to populate the change to all the running VCs?

The second. As a cluster operator, I want to release new minor versions of a ClusterVersion by adding/updating apiserver flags, bumping minor versions of the apiserver or controller-manager, or changing service parameters, following the normal process of software upgrades or sudden security patches. So with this feature we are trying to implement a way to populate the current ClusterVersion (CR) state to virtual clusters that use that version and have agreed to be upgraded (by setting a label).

For the foreseeable future it is intended to be used for minor upgrades only, keeping different ClusterVersion CRs for different major Kubernetes versions.

Even now, it is already useful for changing log verbosity or enabling/disabling some authorisation parameters, which are in any case "managed from outside".

@Fei-Guo commented Jul 19, 2022

> > I am trying to understand this feature. In the original design, we intend to make clusterversionName field immutable in VC CRD. Are you targeting a user case where user changes the clusterversionName? or you want to handle cases where the clusterversion CR itself is updated and you want to populate the change to all the running VCs?
>
> The second. I want (as a cluster operator) to release new minor versions of clusterversion by adding/updating apiserver flags, updating minor versions of apiserver or controller-manager or manipulate the service params according to the normal process of software upgrades or sudden security patches. So, in this feature we trying to implement a way to populate the current clusterVersion (CR) state to virtual clusters that use that version and agreed to be upgraded (by setting a label).
>
> In foreseeable future it is intended to be used for minor upgrades only, keeping different clusterVersion CRs for different major k8s versions.
>
> Just now, it is already useful to change log verbosity or enable/disable some authorisation params, which anyway "are managed from outside"

Thanks. Just FYI, we are trying to resolve the problems you mentioned using CAPN. We didn't handle VC updates in the native provisioner because it was designed for PoC purposes, and we don't recommend using it in production. We want to keep it simple. As you may see, the etcd pod there does not even use persistent storage. I am glad you are looking into the update problem, but I would suggest enhancing the CAPN provider.

@christopherhein (Contributor)

@Fei-Guo Per our chats in Slack, this is more of a stop-gap while we haven't been able to move over to CAPN. I think it's worthwhile in the interim to still move this forward, since it doesn't hurt anything necessarily. We've been able to overcome a lot of those challenges by maintaining our own custom ClusterVersions (with persistence, for example), and we're trying to keep etcd upgrades out of scope for the vc-manager.

Do you still object to this development?

@Fei-Guo commented Jul 19, 2022

> @Fei-Guo Per our chats in Slack, this is more of stop-gap while we haven't been able to move over to CAPN, I think it's worthwhile in the interim to still move this forward since it doesn't hurt anything necessarily. We've been able to overcome a lot of those challenges by maintaining our own custom ClusterVersions with persistence for example and we're trying to keep etcd upgrades out of the scope for the vc-manager.
>
> Do you still object to this development?

Thanks for the clarification. I have no problem with enhancing the ClusterVersion-based VC lifecycle management, but the feature scope should be very clear. As we discussed offline, we may want to name the feature more specifically, something like "ClusterVersionPartialUpdate" or a better name.

@m-messiah m-messiah marked this pull request as ready for review July 20, 2022 13:01
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 20, 2022
@k8s-ci-robot k8s-ci-robot requested a review from zhuangqh July 20, 2022 13:01
@christopherhein (Contributor) left a comment


This is looking really good; a couple of comments, with the main theme being: remove the etcd upgrades. Thanks!

@christopherhein (Contributor) left a comment


/lgtm
/approve

Amazing work! 🎉

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 20, 2022
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: christopherhein, m-messiah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 20, 2022
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 20, 2022
@christopherhein (Contributor)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 20, 2022
@k8s-ci-robot k8s-ci-robot merged commit 7ad275a into kubernetes-retired:main Jul 20, 2022
@m-messiah m-messiah deleted the cluster-upgrade branch July 20, 2022 17:46
@Fei-Guo commented Jul 20, 2022

Also, please add a document to the tutorial describing the workflow. Based on my understanding, the workflow for triggering an update is:

  1. Change the ClusterVersion CR.
  2. Add a label to the VC CR.

This is not obvious for many users.
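The two-step workflow above might look roughly like this; the label key/value below are hypothetical placeholders, not the names used by this PR:

```yaml
# Step 1: edit the ClusterVersion CR (e.g. bump an apiserver flag or image tag).
# Step 2: opt the VirtualCluster in to the upgrade with a label:
apiVersion: tenancy.x-k8s.io/v1alpha1
kind: VirtualCluster
metadata:
  name: my-vc
  labels:
    tenancy.x-k8s.io/upgrade: "true"   # hypothetical opt-in label
spec:
  clusterVersionName: v1.19-test
```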
