Skip to content

feat: add cluster-autoscaler CRS addon #423

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Mar 8, 2024

Conversation

dkoshkin
Copy link
Contributor

@dkoshkin dkoshkin commented Mar 5, 2024

Adding the cluster-autoscaler addon, starting with ClusterResourceSet first.

The cluster-autoscaler addon is different than the other addons because it needs access to both the CAPI objects and the workload Cluster. Because of this, it needs to always be deployed on the management Cluster and use the CAPI kubeconfig Secret to watch the workload Cluster.

Tested locally, but was not able to work on e2e tests yet, since there is no mechanism yet to make the Cluster self-managed.

@dkoshkin dkoshkin requested a review from jimmidyson March 5, 2024 05:43
@dkoshkin
Copy link
Contributor Author

dkoshkin commented Mar 5, 2024

Would love some foeedback on this before going much further 🙏

@dkoshkin dkoshkin force-pushed the dkoshkin/feat-cluster-autoscaler branch 4 times, most recently from a6b53c8 to 0044549 Compare March 6, 2024 00:39
@jimmidyson
Copy link
Member

More discussion on cluster-autoscaler and ClusterClass in kubernetes-sigs/cluster-api#8217

@dkoshkin
Copy link
Contributor Author

dkoshkin commented Mar 6, 2024

More discussion on cluster-autoscaler and ClusterClass in kubernetes-sigs/cluster-api#8217

Interesting! We solved this in DKP by having the CLI client update both the annotations AND the MachineDeployment's replicas with an imperative command.

But I think the --enforce-node-group-min-size flag is a better solution here.

Not a huge fan that the internal annotation details are exposed to the user, but maybe thats ok for the initial implementation.

@dkoshkin dkoshkin force-pushed the dkoshkin/feat-cluster-autoscaler branch 3 times, most recently from ceceda7 to 72cdf4b Compare March 6, 2024 23:07
@dkoshkin dkoshkin changed the title feat: add cluster-autoscaler addon feat: add cluster-autoscaler CRS addon Mar 6, 2024
@github-actions github-actions bot added feature and removed feature labels Mar 6, 2024
Copy link
Member

@jimmidyson jimmidyson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

@jimmidyson jimmidyson enabled auto-merge (squash) March 7, 2024 13:28
@jimmidyson jimmidyson force-pushed the dkoshkin/feat-cluster-autoscaler branch from e51ce78 to 1be244d Compare March 7, 2024 13:29
@jimmidyson
Copy link
Member

CRS is not applied from bootstrap to workload:

- lastTransitionTime: "2024-03-07T14:34:49Z"
    message: '[creating object /v1, Kind=ServiceAccount quick-start-zu4cbq/cluster-autoscaler-quick-start-la8a9u:
      namespaces "quick-start-zu4cbq" not found, creating object policy/v1, Kind=PodDisruptionBudget
      quick-start-zu4cbq/cluster-autoscaler-quick-start-la8a9u: namespaces "quick-start-zu4cbq"
      not found, creating object rbac.authorization.k8s.io/v1, Kind=Role quick-start-zu4cbq/cluster-autoscaler-quick-start-la8a9u:
      namespaces "quick-start-zu4cbq" not found, creating object rbac.authorization.k8s.io/v1,
      Kind=RoleBinding quick-start-zu4cbq/cluster-autoscaler-quick-start-la8a9u: namespaces
      "quick-start-zu4cbq" not found, creating object /v1, Kind=Service quick-start-zu4cbq/cluster-autoscaler-quick-start-la8a9u:
      namespaces "quick-start-zu4cbq" not found, creating object apps/v1, Kind=Deployment
      quick-start-zu4cbq/cluster-autoscaler-quick-start-la8a9u: namespaces "quick-start-zu4cbq"
      not found]'

@jimmidyson
Copy link
Member

Need to specify the namespace for CA and create all resources in there.

@jimmidyson
Copy link
Member

With this approach, the CA will always be deployed on the CP nodes because workers will be set to 0, CA will be deployed and only then will worker nodes be scaled up.

dkoshkin and others added 4 commits March 8, 2024 07:40
To change the number of replicas,
a user can set the min annotation to the desired number of replicas
and cluster-autoscaler will scale up the MachineDeployment.
@dkoshkin dkoshkin force-pushed the dkoshkin/feat-cluster-autoscaler branch from 1be244d to e9c7342 Compare March 8, 2024 17:58
@dkoshkin
Copy link
Contributor Author

dkoshkin commented Mar 8, 2024

Need to specify the namespace for CA and create all resources in there.

Yep, thank you e2e tests, should be fixed in e9c7342
When I tested locally I created the management cluster in the default namespace so missed that.

@dkoshkin
Copy link
Contributor Author

dkoshkin commented Mar 8, 2024

With this approach, the CA will always be deployed on the CP nodes because workers will be set to 0, CA will be deployed and only then will worker nodes be scaled up.

The CAPI webhook will set the MD replicas based on the CA annotation https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/webhooks/machinedeployment.go#L289-L291

I also see this behavior running locally when setting --worker-machine-count=3

k get md
NAME                              CLUSTER                REPLICAS   READY   UPDATED   UNAVAILABLE   PHASE     AGE   VERSION
dkoshkin-ca-test-200-md-0-zzxt7   dkoshkin-ca-test-200   3          3       3         0             Running   19m   v1.28.7

We may want to support clusters with 0 replicas, but I think we should do that as a separate effort since other addons would need similar tolerations.

@dkoshkin dkoshkin force-pushed the dkoshkin/feat-cluster-autoscaler branch from e9c7342 to 0920fb0 Compare March 8, 2024 18:06
@jimmidyson jimmidyson merged commit 30bc784 into main Mar 8, 2024
@jimmidyson jimmidyson deleted the dkoshkin/feat-cluster-autoscaler branch March 8, 2024 18:21
@github-actions github-actions bot mentioned this pull request Mar 8, 2024
@dkoshkin
Copy link
Contributor Author

dkoshkin commented Mar 8, 2024

Oops, I didn't see the auto-merge was enabled but the e2e tests passed 🎉

dkoshkin referenced this pull request Mar 19, 2024
🤖 I have created a release *beep* *boop*
---


## 0.6.0 (2024-03-19)

<!-- Release notes generated using configuration in .github/release.yaml
at main -->

## What's Changed
### Exciting New Features 🎉
* feat: Support HelmAddon strategy to deploy NFD by @jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/390
* feat: Upgrade AWS ESB CSI and switch to using Helm chart by
@jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/393
* feat: CAPA 2.4.0 APIs and e2e by @jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/415
* feat: Single defaults namespace flag by @jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/426
* feat: add cluster-autoscaler CRS addon by @dkoshkin in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/423
* feat: add Cluster Autoscaler Addon with HelmAddon by @dkoshkin in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/427
* feat: NFD v0.15.2 by @jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/442
* feat: Include CABPK APIs by @dlipovetsky in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/445
### Fixes 🔧
* fix: Ensure addons defaults namespaces are correctly wired up by
@jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/409
* fix: Disable hubble in Cilium deployment via CRS by @jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/411
* fix: Fix Cilium helm values to use kubernetes IPAM by @jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/413
* fix: don't use an SSH key in AWS clusters by @dkoshkin in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/425
* fix: set default priorityClassName on Deployment by @dkoshkin in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/431
* fix: set default tolerations on Deployment by @dkoshkin in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/430
* fix: Remove vendored types for core CAPI providers (CAPD, CABPK, KCP)
by @dlipovetsky in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/452
### Other Changes
* test: Add initial e2e tests by @jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/360
* test(e2e): Add CNI e2e tests by @jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/383
* test(e2e): Resolve latest upstream provider releases in e2e config by
@jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/388
* test(e2e): Add test for NFD addon by @jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/389
* build: Ignore controller-runtime upgrades by @jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/403
* test(e2e): Use ghcr.io/mesosphere/kind-node for bootstrap by
@jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/406
* build: Update AWS CPI manifest filenames by @jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/410
* revert: Temporarily disable GOPROXY to workaround dodgy CAPA release
(#395) by @jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/407
* build: Ensure release namespace is use in kustomize helm inflator by
@jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/412
* docs: Update menu ordering and add some icons by @jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/414
* test(e2e): Add AWS e2e tests by @jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/408
* build: clusterawsadm v2.4.0 by @jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/424
* docs: simplify running examples in README by @dkoshkin in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/422
* ci: Add dependabot for api module by @jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/432
* build: Fix up third-party CAPD go.mod CAPI dependency by @jimmidyson
in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/441
* build: controller-runtime v0.17.2 by @jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/440
* ci: Fix up release workflow by specifying workflow-dispatch version by
@jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/451
* docs: Update docsy module by @jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/455
* build: Rename module to
d2iq-labs/cluster-api-runtime-extensions-nutanix by @jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/454
* test(e2e): Update test config with new repo name by @jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/457
* build: Reorg example kustomizations by @jimmidyson in
https://github.com/d2iq-labs/cluster-api-runtime-extensions-nutanix/pull/453


**Full Changelog**:
d2iq-labs/cluster-api-runtime-extensions-nutanix@v0.5.0...v0.6.0

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants