🐛 Fix port name after port creation failure #1941
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: mdbooth.
```diff
@@ -58,7 +58,7 @@ func (s *Service) GetPortFromInstanceIP(instanceID string, ip string) ([]ports.P
 	return s.client.ListPort(portOpts)
 }

-func (s *Service) CreatePort(eventObject runtime.Object, clusterName string, portName string, portOpts *infrav1.PortOpts, instanceSecurityGroups []string, instanceTags []string) (*ports.Port, error) {
+func (s *Service) CreatePort(eventObject runtime.Object, clusterName string, portName string, portOpts *infrav1.PortOpts, defaultSecurityGroups []string, baseTags []string) (*ports.Port, error) {
```
Just a minor note about the name `defaultSecurityGroups`: since there is a default security group inside OpenStack, this may cause confusion at first glance. Maybe there is a good reason for not sticking with the `managedSecurityGroups` name from the openstackmachine_controller?
good idea 👍
Because they're not managed security groups. They're the managed security group if defined + any security groups defined explicitly on the parent object (machine or bastion).
The problem is they really are 'default security groups'. From the existing code:
cluster-api-provider-openstack/pkg/cloud/services/networking/port.go (lines 87 to 90 in 654d714):

```go
// inherit port security groups from the instance if not explicitly specified
if len(securityGroups) == 0 {
	securityGroups = instanceSecurityGroups
}
```
So... if the port has no security groups defined, use these ones instead.
I appreciate the overloaded term isn't ideal, though. I'll have a think, but I'll gladly take suggestions. In the absence of anything better, though, this is what I've got.
It's an internal API anyway, I'm fine with it.
Agreed, we can leave it as is after the explanation.
/test pull-cluster-api-provider-openstack-e2e-full-test

Feel free to /hold cancel when the upgrade test has passed and when you think it's ready to be merged. Thanks for that work Matt, sorry I missed it in the initial iteration.
It looks like the full test failed to clean up a couple of ports on deletion. I'm not convinced that was a flake, but running it again to get another data point:

/test pull-cluster-api-provider-openstack-e2e-full-test

Unfortunately, if this is a bug on deletion we don't have controller logs from CI, because it happens after the logs are collected.
#1949 should fix the panic seen in the upgrade test. I don't understand what's triggering it, though.
/test pull-cluster-api-provider-openstack-e2e-full-test
/test pull-cluster-api-provider-openstack-e2e-full-test
This PR broke cluster deletion when the cluster had a bastion, and the only test which caught it was the upgrade test. This is a bit worrying.

/test pull-cluster-api-provider-openstack-e2e-full-test
CreatePorts was creating a port name based on the index of the port in the *current reconcile*. This could be different to the absolute index of the port if ports had been partially created in a previous reconcile. We fix this by passing all current state into CreatePorts so it can create an absolute index. This also ensures that partially created ports will be persisted on failure so we don't have to rely on adoption in the next reconcile.
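A minimal sketch of the fix under stated assumptions — `PortSpec`, `createPort`, and `createMissingPorts` are hypothetical stand-ins for the real CAPO types and functions, not the actual implementation. It shows the two ideas from the commit message: the name is derived from the port's absolute position in the full desired list, and partially created ports are returned to the caller so they can be persisted on failure.

```go
package ports

import "fmt"

// PortSpec and createPort are illustrative stand-ins for the real CAPO
// types and Neutron calls.
type PortSpec struct{ Description string }

func createPort(name string, spec PortSpec) error {
	fmt.Printf("creating port %q\n", name)
	return nil
}

// createMissingPorts names each port by its absolute index in the full
// desired list, so a reconcile that resumes after a partial failure does
// not restart numbering at zero.
func createMissingPorts(instanceName string, created []string, desired []PortSpec) ([]string, error) {
	for i := len(created); i < len(desired); i++ {
		name := fmt.Sprintf("%s-%d", instanceName, i)
		if err := createPort(name, desired[i]); err != nil {
			// Return what was created so far so the caller can
			// persist it to status before the error propagates.
			return created, err
		}
		created = append(created, name)
	}
	return created, nil
}
```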
Firstly, we remove the resource reconcile calls from the cluster flow before calling reconcileNormal/reconcileDelete, because the guards around them and various other guards throughout the code are heavily inter-dependent and hard to reason about. Instead, we push them to the places where we:

* know they are required
* know we are sufficiently initialised that they can work

First, we resolve references at the top of reconcileBastion, as sketched below. We know the cluster has been initialised at this point, so we don't need to guard against it. This also means that it is always called when entering that function, so we don't need to guard against it not having been called during first cluster initialisation. We also force that function to re-reconcile if it calls deleteBastion(), because deleteBastion() removes the bastion status. We reconcile again, so we always know that it is set.

We also add an explicit call to resource reconcile in the reconcileDelete flow. This is the only place we now need a 'weird' guard against the cluster network not having been set. We add a comment about that, appropriate to its weirdness.
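A rough sketch of that reconcileBastion flow — `Scope`, `Cluster`, `resolveBastionReferences`, and `deleteBastion` are hypothetical names standing in for the real controller code, not the actual implementation:

```go
package controllers

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

// Scope and Cluster stand in for the provider's real types; the helpers
// are hypothetical. This sketches the flow, not the actual controller code.
type Scope struct{}
type Cluster struct{ BastionEnabled bool }

func resolveBastionReferences(s *Scope, c *Cluster) error { return nil }
func deleteBastion(s *Scope, c *Cluster) error            { return nil }

func reconcileBastion(s *Scope, c *Cluster) (ctrl.Result, error) {
	// The cluster is always initialised by the time we get here, so
	// references can be resolved unconditionally on entry.
	if err := resolveBastionReferences(s, c); err != nil {
		return ctrl.Result{}, err
	}

	if !c.BastionEnabled {
		if err := deleteBastion(s, c); err != nil {
			return ctrl.Result{}, err
		}
		// deleteBastion clears the bastion status, so requeue and
		// re-enter with freshly resolved state rather than continuing.
		return ctrl.Result{Requeue: true}, nil
	}

	// ... create or adopt the bastion ...
	return ctrl.Result{}, nil
}
```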
/test pull-cluster-api-provider-openstack-e2e-full-test
Note that #1951 fixes this much more comprehensively by pre-resolving everything about the port prior to creation, including its name. That PR is based on this one, though, so we still need to merge this first.
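For context, a tiny sketch of what pre-resolving names could look like — `resolvePortNames` is a hypothetical helper, not the #1951 implementation: every name is computed up front, so the create step has nothing left to derive.

```go
package ports

import "fmt"

// resolvePortNames computes all port names before any create call, so
// creation becomes a pure "make reality match this list" step.
// Hypothetical sketch only.
func resolvePortNames(instanceName string, portCount int) []string {
	names := make([]string, portCount)
	for i := range names {
		names[i] = fmt.Sprintf("%s-%d", instanceName, i)
	}
	return names
}
```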
/hold cancel
Thanks for this work, I'm glad you sorted this out.
CreatePorts was creating a port name based on the index of the port in the current reconcile. This could be different to the absolute index of the port if ports had been partially created in a previous reconcile.
We fix this by passing all current state into CreatePorts so it can create an absolute index. This also ensures that partially created ports will be persisted on failure so we don't have to rely on adoption in the next reconcile.
Fixes: #1940
This PR includes a second commit which refactors and simplifies how we initialise Bastion resources. The motivation for it was that the first commit introduces a bug which causes a panic due to an assumption about state having been initialised. The new commit makes this mistake harder to make. It's worth looking at separately.