🐛 Fix port name after port creation failure #1941
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: mdbooth.
```diff
@@ -58,7 +58,7 @@ func (s *Service) GetPortFromInstanceIP(instanceID string, ip string) ([]ports.P
 	return s.client.ListPort(portOpts)
 }

-func (s *Service) CreatePort(eventObject runtime.Object, clusterName string, portName string, portOpts *infrav1.PortOpts, instanceSecurityGroups []string, instanceTags []string) (*ports.Port, error) {
+func (s *Service) CreatePort(eventObject runtime.Object, clusterName string, portName string, portOpts *infrav1.PortOpts, defaultSecurityGroups []string, baseTags []string) (*ports.Port, error) {
```
Just a minor note about the name `defaultSecurityGroups`: since there is a default security group inside OpenStack, this may cause confusion at first glance. Maybe there is a good reason for not sticking with the `managedSecurityGroups` name from the openstackmachine_controller?
good idea 👍
Because they're not managed security groups. They're the managed security group if defined + any security groups defined explicitly on the parent object (machine or bastion).
The problem is they really are 'default security groups'. From the existing code:
cluster-api-provider-openstack/pkg/cloud/services/networking/port.go (lines 87 to 90 in 654d714):

```go
// inherit port security groups from the instance if not explicitly specified
if len(securityGroups) == 0 {
	securityGroups = instanceSecurityGroups
}
```
So... if the port has no security groups defined, use these ones instead.
I appreciate the overloaded term isn't ideal, though. I'll have a think, but I'll gladly take suggestions. In the absence of anything better, though, this is what I've got.
It's an internal API anyway, I'm fine with it.
Agreed, we can leave it as is after the explanation.
/test pull-cluster-api-provider-openstack-e2e-full-test

Feel free to /hold cancel when the upgrade test has passed and when you think it's ready to be merged. Thanks for that work Matt, sorry I missed it in the initial iteration.
It looks like the full test failed to clean up a couple of ports on deletion. I'm not convinced that was a flake, but running it again to get another data point:

/test pull-cluster-api-provider-openstack-e2e-full-test

Unfortunately, if this is a bug on deletion we don't have controller logs from CI, because it happens after the logs are collected.
#1949 should fix the panic seen in the upgrade test. I don't understand what's triggering it, though.
/test pull-cluster-api-provider-openstack-e2e-full-test
/test pull-cluster-api-provider-openstack-e2e-full-test
This PR broke cluster deletion when the cluster had a bastion, and the only test which caught it was the upgrade test. This is a bit worrying.

/test pull-cluster-api-provider-openstack-e2e-full-test
CreatePorts was creating a port name based on the index of the port in the *current reconcile*. This could be different to the absolute index of the port if ports had been partially created in a previous reconcile. We fix this by passing all current state into CreatePorts so it can create an absolute index. This also ensures that partially created ports will be persisted on failure so we don't have to rely on adoption in the next reconcile.
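A minimal sketch of the fix under stated assumptions — `PortSpec`, `createPort`, and `createMissingPorts` are hypothetical stand-ins for the real CAPO types and functions, not the actual implementation. It shows the two ideas from the commit message: the name is derived from the port's absolute position in the full desired list, and partially created ports are returned to the caller so they can be persisted on failure.

```go
package ports

import "fmt"

// PortSpec and createPort are illustrative stand-ins for the real CAPO
// types and Neutron calls.
type PortSpec struct{ Description string }

func createPort(name string, spec PortSpec) error {
	fmt.Printf("creating port %q\n", name)
	return nil
}

// createMissingPorts names each port by its absolute index in the full
// desired list, so a reconcile that resumes after a partial failure does
// not restart numbering at zero.
func createMissingPorts(instanceName string, created []string, desired []PortSpec) ([]string, error) {
	for i := len(created); i < len(desired); i++ {
		name := fmt.Sprintf("%s-%d", instanceName, i)
		if err := createPort(name, desired[i]); err != nil {
			// Return what was created so far so the caller can
			// persist it to status before the error propagates.
			return created, err
		}
		created = append(created, name)
	}
	return created, nil
}
```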
Firstly, we remove the resource reconcile calls from the cluster flow before calling reconcileNormal/reconcileDelete, because the guards around them and various other guards throughout the code are heavily inter-dependent and hard to reason about. Instead, we push them to the places where we:

* know they are required
* know we are sufficiently initialised that they can work

First, we resolve references at the top of reconcileBastion, as sketched below. We know the cluster has been initialised at this point, so we don't need to guard against it. This also means that it is always called when entering that function, so we don't need to guard against it not having been called during first cluster initialisation. We also force that function to re-reconcile if it calls deleteBastion(), because deleteBastion() removes the bastion status. We reconcile again, so we always know that it is set.

We also add an explicit call to resource reconcile in the reconcileDelete flow. This is the only place we now need a 'weird' guard against the cluster network not having been set. We add a comment about that, appropriate to its weirdness.
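A rough sketch of that reconcileBastion flow — `Scope`, `Cluster`, `resolveBastionReferences`, and `deleteBastion` are hypothetical names standing in for the real controller code, not the actual implementation:

```go
package controllers

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

// Scope and Cluster stand in for the provider's real types; the helpers
// are hypothetical. This sketches the flow, not the actual controller code.
type Scope struct{}
type Cluster struct{ BastionEnabled bool }

func resolveBastionReferences(s *Scope, c *Cluster) error { return nil }
func deleteBastion(s *Scope, c *Cluster) error            { return nil }

func reconcileBastion(s *Scope, c *Cluster) (ctrl.Result, error) {
	// The cluster is always initialised by the time we get here, so
	// references can be resolved unconditionally on entry.
	if err := resolveBastionReferences(s, c); err != nil {
		return ctrl.Result{}, err
	}

	if !c.BastionEnabled {
		if err := deleteBastion(s, c); err != nil {
			return ctrl.Result{}, err
		}
		// deleteBastion clears the bastion status, so requeue and
		// re-enter with freshly resolved state rather than continuing.
		return ctrl.Result{Requeue: true}, nil
	}

	// ... create or adopt the bastion ...
	return ctrl.Result{}, nil
}
```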
/test pull-cluster-api-provider-openstack-e2e-full-test
Note that #1951 fixes this much more comprehensively by pre-resolving everything about the port prior to creation, including its name. That PR is based on this one, though, so we still need to merge this first.
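For context, a tiny sketch of what pre-resolving names could look like — `resolvePortNames` is a hypothetical helper, not the #1951 implementation: every name is computed up front, so the create step has nothing left to derive.

```go
package ports

import "fmt"

// resolvePortNames computes all port names before any create call, so
// creation becomes a pure "make reality match this list" step.
// Hypothetical sketch only.
func resolvePortNames(instanceName string, portCount int) []string {
	names := make([]string, portCount)
	for i := range names {
		names[i] = fmt.Sprintf("%s-%d", instanceName, i)
	}
	return names
}
```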
/hold cancel
Thanks for this work, I'm glad you sorted this out.
CreatePorts was creating a port name based on the index of the port in the current reconcile. This could be different to the absolute index of the port if ports had been partially created in a previous reconcile.
We fix this by passing all current state into CreatePorts so it can create an absolute index. This also ensures that partially created ports will be persisted on failure so we don't have to rely on adoption in the next reconcile.
Fixes: #1940
This PR includes a second commit which refactors and simplifies how we initialise Bastion resources. The motivation for it was that the first commit introduces a bug which causes a panic due to an assumption about state having been initialised. The new commit makes this mistake harder to make. It's worth looking at separately.