
✨ Reconcile SecurityGroups <-> Instances separately #1774


Closed
wants to merge 5 commits

Conversation

EmilienM
Contributor

@EmilienM EmilienM commented Dec 5, 2023

What this PR does / why we need it:

Originally, the openstackmachine controller was adding the SecurityGroups to the instanceSpec.
Now, the openstackmachine controller will first create the instances
without security groups. Then a separate loop reconciles the
desired security groups that we want to apply in the cluster[1] with the
ones that are attached to the instance's ports.

[1] For now, we have two sets of security groups: the ones that can be added via
OpenStackMachine.Spec.SecurityGroups and the ones managed
by OpenStackCluster.Spec.ManagedSecurityGroups.
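
To make that loop concrete, here is a minimal sketch written directly against Gophercloud (the package layout and function names are illustrative, not the actual helpers added by this PR):

import (
	"fmt"
	"sort"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/networking/v2/ports"
)

// sameStringSet reports whether two slices contain the same elements, ignoring order.
func sameStringSet(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	a = append([]string(nil), a...)
	b = append([]string(nil), b...)
	sort.Strings(a)
	sort.Strings(b)
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// reconcilePortSecurityGroups updates each of the instance's ports so that it
// carries exactly the desired set of security group IDs.
func reconcilePortSecurityGroups(client *gophercloud.ServiceClient, portIDs, desiredSGIDs []string) error {
	for _, portID := range portIDs {
		port, err := ports.Get(client, portID).Extract()
		if err != nil {
			return fmt.Errorf("get port %s: %w", portID, err)
		}
		if sameStringSet(port.SecurityGroups, desiredSGIDs) {
			continue // port already has exactly the desired security groups
		}
		sgs := desiredSGIDs
		if _, err := ports.Update(client, portID, ports.UpdateOpts{SecurityGroups: &sgs}).Extract(); err != nil {
			return fmt.Errorf("update security groups on port %s: %w", portID, err)
		}
	}
	return nil
}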

TODOs:

  • Handle the case where OpenStackMachineSpec.Ports is defined with a SecurityGroupFilter. Probably I need to rework this so the work happens via Status and Ports are put in Status.
  • For an update to applied security groups, think about how the cluster status could summarize how many machines have been updated.
  • Add unit tests, at least for the openstackmachine_controller.
  • Handle the "default" security group applied to all created ports, by passing an empty slice of strings when no security group is required at port creation.
  • Update documentation.
  • Squash commits.


netlify bot commented Dec 5, 2023

Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

🔨 Latest commit: 8138e6a
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-cluster-api-openstack/deploys/65732e56589bf800082faacd
😎 Deploy Preview: https://deploy-preview-1774--kubernetes-sigs-cluster-api-openstack.netlify.app

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 5, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: EmilienM
Once this PR has been reviewed and has the lgtm label, please assign cecilerobertmichon for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 5, 2023
@EmilienM EmilienM changed the title Reconcile SecurityGroups <-> Instances separately ✨ Reconcile SecurityGroups <-> Instances separately Dec 5, 2023
@EmilienM EmilienM force-pushed the machine-sg branch 2 times, most recently from 1bc5fd9 to f9f1bae on December 6, 2023 at 01:33
@@ -107,6 +107,11 @@ type OpenStackMachineStatus struct {
	// +optional
	InstanceState *InstanceState `json:"instanceState,omitempty"`

	// SecurityGroups is the list of security groups that were applied to the ports
Contributor Author

Note to self: in a code comment I'll explain that we only populate this once all the rules have been applied.

Note for reviewers: do we want another field, SecurityGroupsReady, set once the reconciler has finished its job successfully? Say we managed to reconcile successfully once and Status.SecurityGroups was populated; then the reconciler fails but we still have Status.SecurityGroups, and we mark the instance as NotReady. Is that good enough?

Contributor

Implementation note: something like this would be implemented as a Condition rather than a status field. I don't think it's worth having a condition for this unless we think we have a consumer for it.

The question is: who consumes the Ready signal and what are their expectations of it? The answer is defined here: https://cluster-api.sigs.k8s.io/developer/architecture/controllers/machine:

The status object must have at least one field defined:

  • ready - a boolean field indicating if the infrastructure is ready to be used or not.

For machine creation I think it's clear that we would only set ready after security groups have been applied.

For an update to applied security groups it's less clear, and the CAPI definition doesn't indicate if setting it to false after having set it to true is meaningful or allowed. I would not be inclined to set it to false during a rollout of new managed security groups. I think this would be better indicated in the cluster status as a summary of how many machines have been updated.

@EmilienM EmilienM Dec 6, 2023

> For machine creation I think it's clear that we would only set ready after security groups have been applied.

Exactly. That's what I meant when writing that code:

err = r.reconcileSecurityGroupsToInstance(scope.Logger(), openStackCluster, machine, openStackMachine, instanceStatus, computeService, networkingService)
if err != nil {
	conditions.MarkFalse(openStackMachine, infrav1.InstanceReadyCondition, infrav1.SecurityGroupApplyFailedReason, clusterv1.ConditionSeverityError, "Applying security groups failed: %v", err)
	return ctrl.Result{}, fmt.Errorf("add security groups to instance: %w", err)
}

> For an update to applied security groups it's less clear, and the CAPI definition doesn't indicate if setting it to false after having set it to true is meaningful or allowed. I would not be inclined to set it to false during a rollout of new managed security groups. I think this would be better indicated in the cluster status as a summary of how many machines have been updated.

OK, I'll try to think about how the number of machines that have been updated could be reflected in the OpenStackClusterStatus.

I've updated my TODO in the PR description and I'll resolve this comment once I come up with something.

if !contains(desiredSecurityGroupIDs, securityGroupID) {
	logger.Info("Port has security group that is not desired", "portID", portID, "securityGroupID", securityGroupID)
	portUpdateNeeded = true
	break
}
Contributor Author

Note to self: I'll remove the break so we get all the logs in the loop. We know we need to update the port at this point anyway.

Contributor Author

I checked in CI logs and I could see:

I1206 03:01:32.375132       1 securitygroups.go:434] "Port has security group that is not desired" controller="openstackcluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackCluster" OpenStackCluster="e2e-hwo2dd/cluster-e2e-hwo2dd" namespace="e2e-hwo2dd" name="cluster-e2e-hwo2dd" reconcileID=8ce25157-7075-4f45-96db-1a5bc66468f3 cluster="cluster-e2e-hwo2dd" portID="06b4adb5-05b9-405f-ad05-a8f5e5eea001" securityGroupID="0f239ce1-0ab5-4e6e-9f91-1afe29e21e5d"

And 0f239ce1-0ab5-4e6e-9f91-1afe29e21e5d is the default security group (source). Do we want to ignore this one when checking unwanted SGs attached to the ports?

Contributor

Hmm. Is it a bug that the default security group is attached to the ports?

Contributor Author

Ports that are created when port security is enabled will have the default security group, which is itself managed by Neutron.

I see two options:

  • We keep it: look up its ID (once and early, to avoid many Neutron API calls) and put it in desiredSecurityGroupIDs, so it's no longer a problem.
  • We remove it in this loop.

I personally prefer option 1, for safety and because that's the default OpenStack behaviour here.

cc @stephenfin with your OpenStack dev hat, any preference?
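
For reference, a minimal sketch of what option 1 could look like with Gophercloud (the function name is made up; real code would cache the result so the Neutron API is only hit once per reconcile):

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/networking/v2/extensions/security/groups"
)

// defaultSecurityGroupID looks up the ID of the Neutron-managed "default"
// security group for a project so it can be appended to desiredSecurityGroupIDs.
func defaultSecurityGroupID(client *gophercloud.ServiceClient, projectID string) (string, error) {
	allPages, err := groups.List(client, groups.ListOpts{Name: "default", ProjectID: projectID}).AllPages()
	if err != nil {
		return "", fmt.Errorf("list security groups: %w", err)
	}
	allGroups, err := groups.ExtractGroups(allPages)
	if err != nil {
		return "", err
	}
	if len(allGroups) != 1 {
		return "", fmt.Errorf("expected one default security group, got %d", len(allGroups))
	}
	return allGroups[0].ID, nil
}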

Contributor

tbh, until today I didn't know that adding a port would add the default security group. What is the source of these ports? If they are ports that we are creating ourselves then we can instruct neutron not to assign a security group to the port. From openstackclient this is done by e.g.:

openstack port create --no-security-group ...

which under the hood sets the security_group_ids field to [] (empty list) when creating the port. Could we do the same thing?

If these ports are coming from elsewhere then it's possible to have other security groups than the default one assigned to the port (openstack port create --security-group <sec-group> ...) so I'm not sure how we can ignore everything? Do we forbid specifying ports with security groups defined?

Contributor

Given the above, we probably need to decide what's authoritative: the security groups on the port, or the security groups defined on the Machine. I assume the latter, but if this is the case then we're going to modify something that the user gave us. If we go with the former, we're either modifying or ignoring something else that the user gave us.

Contributor Author

> What is the source of these ports?

This is where we pre-build ports before creating an instance:

func (s *Service) constructPorts(openStackCluster *infrav1.OpenStackCluster, instanceSpec *InstanceSpec) ([]infrav1.PortOpts, error) {
	// Ensure user-specified ports have all required fields
	ports, err := s.normalizePorts(instanceSpec.Ports, openStackCluster, instanceSpec)
	if err != nil {
		return nil, err
	}

	// no networks or ports found in the spec, so create a port on the cluster network
	if len(ports) == 0 {
		port := infrav1.PortOpts{
			Network: &infrav1.NetworkFilter{
				ID: openStackCluster.Status.Network.ID,
			},
			Trunk: &instanceSpec.Trunk,
		}
		for _, subnet := range openStackCluster.Status.Network.Subnets {
			port.FixedIPs = append(port.FixedIPs, infrav1.FixedIP{
				Subnet: &infrav1.SubnetFilter{
					ID: subnet.ID,
				},
			})
		}
		ports = []infrav1.PortOpts{port}
	}

	// trunk support is required if any port has trunk enabled
	portUsesTrunk := func() bool {
		for _, port := range ports {
			if port.Trunk != nil && *port.Trunk {
				return true
			}
		}
		return false
	}
	if portUsesTrunk() {
		trunkSupported, err := s.isTrunkExtSupported()
		if err != nil {
			return nil, err
		}
		if !trunkSupported {
			return nil, fmt.Errorf("there is no trunk support. please ensure that the trunk extension is enabled in your OpenStack deployment")
		}
	}

	return ports, nil
}

We don't pass any security groups.

> Could we do the same thing?

I don't think so, because we don't provide a plain list of security group IDs but a list of SecurityGroupFilter, so I suspect that nothing will be passed to the Neutron API if we pass an empty list of filters.

> we probably need to decide what's authoritative:

I want CAPO to have the authority if possible: if users provide a security group for the Machine, then we add it to the port in the reconcile (not at instance creation), so the instance and its initial ports have no SGs, in theory.

Contributor Author

I've pushed e651abb. I have little hope that it will fix this "default" SG issue, but it does address a real problem. We can probably classify this change as unrelated to this comment, but I wanted to make it visible for review.

Contributor Author

I checked with @pierreprinetti and the good news is that port.CreateOpts.SecurityGroups in Gophercloud is *[]string, so if we pass an empty slice of strings, the port should be created with no security group. I'll test it.
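
A quick sketch of that, reusing the Gophercloud ports package from above (the network ID and client are placeholders; the point is passing a pointer to an empty slice rather than nil):

// createPortWithoutSecurityGroups creates a Neutron port with an explicitly
// empty security group list, so Neutron does not attach the project's
// "default" security group to it.
func createPortWithoutSecurityGroups(client *gophercloud.ServiceClient, networkID string) (*ports.Port, error) {
	noSecurityGroups := []string{}
	return ports.Create(client, ports.CreateOpts{
		NetworkID:      networkID,
		SecurityGroups: &noSecurityGroups, // empty slice, not nil: "no security groups"
	}).Extract()
}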


@EmilienM EmilienM force-pushed the machine-sg branch 2 times, most recently from 051f8ea to 22ac367 on December 7, 2023 at 18:25
Originally, we were adding the SecurityGroups to the instanceSpec.
Now, the openstackmachine controller will first create the instances
without security groups. Then we have a loop that will reconcile the
desired security groups that we want to apply in the cluster[1] with the
ones that are added to the instance's ports.

[1] for now, we have 2 sets of rules: the one that can be added to
  `OpenStackMachine.Spec.SecurityGroups` and the one managed
  by `OpenStackCluster.Spec.ManagedSecurityGroups`.
When defining an OpenStackMachine with `Ports`, we no longer allow the
use of `SecurityGroupFilters` because we now have a separate loop to
reconcile ports based on 2 things:

* If `OpenStackCluster.Spec.ManagedSecurityGroups`, machine ports
  will be reconciled to have the security group of their role
  (cp/worker).
* If `OpenStackMachine.Spec.SecurityGroups` is specified for the
  machine, we'll reconcile the machine ports to ensure they have
  the Security Groups asked by the user.

However, if a user provides `OpenStackMachine.Spec.Ports`, we will return
an error if the Port has `SecurityGroupFilters`, because we have a
separate reconcile which updates *all* ports with the same Security
Groups.
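
Roughly, that validation could look like the following sketch (it assumes the PortOpts field is called SecurityGroupFilters, as described above; the helper name and error text are illustrative):

// validateMachinePorts rejects ports that define SecurityGroupFilters, since
// security groups are now reconciled in a separate loop for all ports.
func validateMachinePorts(machinePorts []infrav1.PortOpts) error {
	for i := range machinePorts {
		if len(machinePorts[i].SecurityGroupFilters) > 0 {
			return fmt.Errorf("port %d: SecurityGroupFilters is not supported; use OpenStackMachine.Spec.SecurityGroups or OpenStackCluster.Spec.ManagedSecurityGroups instead", i)
		}
	}
	return nil
}
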
@EmilienM EmilienM mentioned this pull request Dec 7, 2023
@EmilienM
Contributor Author

EmilienM commented Dec 8, 2023

/test pull-cluster-api-provider-openstack-e2e-test

1 similar comment
@EmilienM
Contributor Author

EmilienM commented Dec 8, 2023

/test pull-cluster-api-provider-openstack-e2e-test

if err != nil {
	return fmt.Errorf("error getting port %s: %v", portID, err)
}

@EmilienM EmilienM Dec 8, 2023

Instead of doing 1593e43, we could:

  • Put a new struct in OpenStackMachine.Status that maps ports to their security group IDs, so we do the lookup once (see the sketch after this list).
  • Not add these SG IDs to the slices of SG IDs to reconcile that we created in the openstackmachine_controller, because they are specific to individual ports and are not applied to all ports (unlike ManagedSecurityGroups & Spec.SecurityGroups).
  • Here in the reconcile, if a port is found in the Status with SG IDs, simply add them to the desired SGs for that port.

This is a bit complex, but it addresses the 3 current options that we have to attach security groups to machines:

  • OpenStackCluster.Spec.ManagedSecurityGroups
  • OpenStackMachine.Spec.SecurityGroups
  • OpenStackMachine.Spec.Ports.SecurityGroupFilters
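
As a sketch of the first bullet above (purely illustrative; neither the field name nor its shape is part of this PR):

// PortSecurityGroups would live in OpenStackMachineStatus and record, per
// port, the security group IDs requested directly on that port via
// Spec.Ports, so the lookup happens once and the IDs are merged into the
// desired set for that port only.
type PortSecurityGroups struct {
	// PortID is the Neutron ID of the port.
	PortID string `json:"portID"`
	// SecurityGroupIDs are the security groups attached to this specific
	// port, in addition to the cluster- and machine-wide ones.
	SecurityGroupIDs []string `json:"securityGroupIDs,omitempty"`
}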

Contributor Author

Another option is that we don't reconcile the security groups from ports defined in OpenStackMachine.Spec.Ports because we consider them immutable. That feels odd to me, and I can't yet find a good option here that doesn't bring complexity.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 8, 2023
Required to use the slices package.
@k8s-ci-robot
Contributor

@EmilienM: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-cluster-api-provider-openstack-e2e-test
Commit: 8138e6a
Details: link
Required: true
Rerun command: /test pull-cluster-api-provider-openstack-e2e-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@EmilienM
Contributor Author

I'm taking a break from this PR, as I figured that I really want the ports to be managed outside of the instance creation; see #1788.
That will simplify the workflow:

Reconcile Ports -> Reconcile Security groups -> Reconcile Instances

@EmilienM
Contributor Author

/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Dec 11, 2023
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@EmilienM
Contributor Author

Note: this one is not stale; I'll come back to it at some point this year, now that we have a new API for Security Groups.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 21, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 20, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
