Stuck at deleting cluster #1792

nguyenhuukhoi · 2023-12-16T01:46:37Z

/kind bug

What steps did you take and what happened:

Create a cluster with a user that lacks the load_balancer_member role.
Delete the cluster; the delete fails. Then add the load_balancer_member role to this user.
Delete the cluster again; it stays in DELETE_IN_PROGRESS forever.

What did you expect to happen:

The cluster is deleted successfully.

Environment:

Cluster API Provider OpenStack version (Or git rev-parse HEAD if manually built): v0.8.0
Cluster-API version: v1.5.4
OpenStack version: Yoga
Minikube/KIND version:
Kubernetes version (use kubectl version): v1.27.4
OS (e.g. from /etc/os-release): Ubuntu 22.04


k8s-triage-robot · 2024-03-15T02:07:55Z

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

lentzi90 · 2024-04-05T12:03:01Z

I think I ran into essentially the same issue today. What I did was this:

Create a cluster with a non-existent external network ID specified
The cluster enters a failed state, as it should, since an external network that cannot be found is a permanent error
Attempt to delete the cluster -> stuck forever, since CAPO gets stuck on the machine being nil

I0405 11:46:19.963688       1 openstackcluster_controller.go:343] "Reconciling Cluster" controller="openstackcluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackCluster" OpenStackCluster="default/lennart-test" namespace="default" name="lennart-test" reconcileID="08026d49-cf57-4f21-b195-3a8de3c886d5" cluster="lennart-test"
I0405 11:46:19.963739       1 openstackcluster_controller.go:630] "Reconciling network components" controller="openstackcluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackCluster" OpenStackCluster="default/lennart-test" namespace="default" name="lennart-test" reconcileID="08026d49-cf57-4f21-b195-3a8de3c886d5" cluster="lennart-test"
E0405 11:46:20.131835       1 controller.go:329] "Reconciler error" err="failed to reconcile external network: failed to get external network: Resource not found: [GET https://fra1.citycloud.com:9696/v2.0/networks/fba95253-5543-4078-b793-e2de58c31378], error message: {\"NeutronError\": {\"type\": \"NetworkNotFound\", \"message\": \"Network fba95253-5543-4078-b793-e2de58c31378 could not be found.\", \"detail\": \"\"}}" controller="openstackcluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackCluster" OpenStackCluster="default/lennart-test" namespace="default" name="lennart-test" reconcileID="08026d49-cf57-4f21-b195-3a8de3c886d5"
I0405 11:46:25.353483       1 openstackmachine_controller.go:247] "Reconciling Machine delete" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/lennart-test-md-0-v1-28-6-vk9d5" namespace="default" name="lennart-test-md-0-v1-28-6-vk9d5" reconcileID="84ef9bf7-928f-40d3-8735-c23231d61bf4" openStackMachine="lennart-test-md-0-v1-28-6-vk9d5" machine="lennart-test-md-0-mfgvp-4ff6m" cluster="lennart-test" openStackCluster="lennart-test"
E0405 11:46:25.590991       1 controller.go:329] "Reconciler error" err="machine resolved is nil" controller="openstackmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackMachine" OpenStackMachine="default/lennart-test-md-0-v1-28-6-vk9d5" namespace="default" name="lennart-test-md-0-v1-28-6-vk9d5" reconcileID="84ef9bf7-928f-40d3-8735-c23231d61bf4"

The issue (for me anyway) seems to be here. I think we should check if the machine was resolved, and only try to get the instance spec if it was. Otherwise we can skip instance deletion.
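
To make the proposal concrete, here is a minimal sketch of the kind of guard I mean. It is not CAPO's actual code: `Machine`, `ResolvedSpec`, `reconcileDelete` and the `deleteInstance` callback are hypothetical, simplified stand-ins, and the extra instance ID check is only one assumed way to stay conservative when we cannot be sure that nothing was created.

```go
package main

import (
	"errors"
	"fmt"
)

// ResolvedSpec is a hypothetical stand-in for whatever the controller
// resolves before it can build an instance spec (image, ports, and so on).
type ResolvedSpec struct {
	ImageID string
}

// Machine is a simplified, hypothetical model of an OpenStackMachine.
type Machine struct {
	Name       string
	Resolved   *ResolvedSpec // nil if resolution never succeeded
	InstanceID string        // empty if no instance was ever recorded
}

// errCannotProveAbsent marks the case where skipping would not be safe.
var errCannotProveAbsent = errors.New("machine not resolved but an instance ID is recorded")

// reconcileDelete reports whether instance deletion was attempted.
// If the machine was never resolved and no instance is recorded, it
// skips deletion instead of failing, so the delete can proceed.
func reconcileDelete(m *Machine, deleteInstance func(*Machine) error) (bool, error) {
	if m.Resolved == nil {
		if m.InstanceID != "" {
			// Assumption: a recorded ID means something may exist, so do not skip.
			return false, fmt.Errorf("%s: %w", m.Name, errCannotProveAbsent)
		}
		// Nothing was resolved and nothing is recorded: skip instance deletion.
		return false, nil
	}
	if err := deleteInstance(m); err != nil {
		return false, fmt.Errorf("deleting instance for %s: %w", m.Name, err)
	}
	return true, nil
}

func main() {
	del := func(m *Machine) error {
		fmt.Println("deleting instance for", m.Name)
		return nil
	}
	machines := []*Machine{
		{Name: "never-resolved"},
		{Name: "resolved", Resolved: &ResolvedSpec{ImageID: "img"}, InstanceID: "abc"},
		{Name: "ambiguous", InstanceID: "abc"},
	}
	for _, m := range machines {
		deleted, err := reconcileDelete(m, del)
		fmt.Printf("%s: deleted=%v err=%v\n", m.Name, deleted, err)
	}
}
```

The point of the sketch is only the first branch: a nil resolved machine becomes a reason to skip instance deletion rather than a reconcile error that is retried forever.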

Does this sound reasonable? @mdbooth
/remove-lifecycle stale

mdbooth · 2024-04-05T15:39:44Z

If we can prove that the machine cannot have been created, it's safe to skip deletion. In fact, this is absolutely a direction I want us to move in.

I can't check right now, but I think there are still some edge cases where we can't be sure. However, these may be a lesser evil. I'll try to look at this in detail on Monday.

k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 16, 2023

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 15, 2024

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 5, 2024

mdbooth mentioned this issue Apr 9, 2024

🐛 Don't try to resolve machine on delete if cluster not ready #2006

Merged

k8s-ci-robot closed this as completed in #2006 Apr 11, 2024
