bug: failed instances with no port breaks the whole CAPO controller #1805

Closed · okozachenko1203 opened this issue Dec 28, 2023 · 0 comments · Fixed by #1818
Labels
kind/bug Categorizes issue or PR as related to a bug.


/kind bug

What steps did you take and what happened:
A VM instance (corresponding to an OpenStackMachine CR) failed to create for some reason, and no VIF was bound to that instance. The OpenStackMachine is marked as unhealthy and the capo-controller tries to replace it. During the deletion reconcile loop there is a task to remove the ports bound to the instance, and that port deletion task fails with the following errors in the capo-controller:

I1218 12:40:46.844270       1 controller.go:114] "Observed a panic in reconciler: runtime error: index out of range [0] with length 0" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="magnum-system/kube-a1d9n-default-worker-infra-t4kdt-f9xfs" namespace="magnum-system" name="kube-a1d9n-default-worker-infra-t4kdt-f9xfs" reconcileID=67e5c4eb-0d10-4d4f-ad5d-acd383736303
panic: runtime error: index out of range [0] with length 0 [recovered]
        panic: runtime error: index out of range [0] with length 0
 
goroutine 460 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:115 +0x1fa
panic({0x19480c0, 0xc0014870f8})
        /usr/local/go/src/runtime/panic.go:884 +0x212
sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/services/networking.(*Service).GarbageCollectErrorInstancesPort(0xc008e98f40, {0x1d09a90, 0xc00b825900}, {0xc000b7ecc0, 0x2b}, {0xc001cdc200, 0x1, 0x0?})
        /workspace/pkg/cloud/services/networking/port.go:318 +0x249
sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/services/compute.(*Service).DeleteInstance(0xc000c99800, 0xc00af36280?, {0x1d09a90, 0xc00b825900}, 0xc008e988e0, 0xc00067e690)
        /workspace/pkg/cloud/services/compute/instance.go:620 +0x60c
sigs.k8s.io/cluster-api-provider-openstack/controllers.(*OpenStackMachineReconciler).reconcileDelete(0x1d1cfa0?, {0x1d1de00, 0xc001f53a40}, 0xc007fa9520, 0xc0039b8000, 0xc00af36280, 0xc00b825900)
        /workspace/controllers/openstackmachine_controller.go:278 +0x61d
sigs.k8s.io/cluster-api-provider-openstack/controllers.(*OpenStackMachineReconciler).Reconcile(0xc00048c360, {0x1d19838, 0xc00aa251d0}, {{{0xc000c00e10?, 0x10?}, {0xc000b7ecc0?, 0x40dc07?}}})
        /workspace/controllers/openstackmachine_controller.go:150 +0xa8d
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1d19838?, {0x1d19838?, 0xc00aa251d0?}, {{{0xc000c00e10?, 0x17a1c00?}, {0xc000b7ecc0?, 0x10?}}})
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:118 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0009780a0, {0x1d19790, 0xc00053e080}, {0x18bd520?, 0xc000bd2960?})
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:314 +0x3a5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0009780a0, {0x1d19790, 0xc00053e080})
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:265 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:226 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:222 +0x333

Now a single failed OpenStackMachine breaks the whole capo-controller: the controller ends up in CrashLoopBackOff, which in turn causes validation webhook failures for operations against OpenStack-related CRs.
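
For illustration, the panic in `GarbageCollectErrorInstancesPort` is the classic pattern of indexing into an empty slice. A minimal sketch of that pattern (hypothetical names, not the actual CAPO code):

```go
package main

import "fmt"

// Hypothetical illustration of the failure mode (not the actual CAPO code):
// indexing the first element of a possibly empty slice panics with
// "index out of range [0] with length 0".
func firstPortID(portIDs []string) string {
	return portIDs[0] // panics when the failed instance has no ports attached
}

func main() {
	var noPorts []string // a failed instance with no VIF yields no port IDs
	fmt.Println(firstPortID(noPorts))
}
```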

What did you expect to happen:
Failed instances with no port should be deleted without any issue.
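
One way to get that behaviour is to guard the port cleanup with a length check and treat an instance without ports as having nothing to clean up. A minimal sketch under that assumption (hypothetical names; not necessarily what #1818 does):

```go
package main

import "fmt"

// Hypothetical guard sketching the expected behaviour; the names and the
// Neutron call are placeholders, not the actual fix.
func garbageCollectPorts(portIDs []string) error {
	if len(portIDs) == 0 {
		return nil // nothing to delete; let instance deletion continue
	}
	for _, id := range portIDs {
		fmt.Printf("deleting port %s\n", id) // real code would call Neutron here
	}
	return nil
}

func main() {
	if err := garbageCollectPorts(nil); err != nil { // no panic for an instance without ports
		fmt.Println(err)
	}
}
```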

Anything else you would like to add:

Environment:

  • Cluster API Provider OpenStack version (Or git rev-parse HEAD if manually built): >= 0.9.0
  • Cluster-API version: 1.5.x
  • OpenStack version: stable/zed
  • Minikube/KIND version:
  • Kubernetes version (use kubectl version): 1.27
  • OS (e.g. from /etc/os-release): Ubuntu 22.04