
✨ Reconcile SecurityGroups <-> Instances separately #1774


Closed
wants to merge 5 commits

Conversation

EmilienM
Contributor

@EmilienM EmilienM commented Dec 5, 2023

What this PR does / why we need it:

Originally, the openstackmachine controller was adding the SecurityGroups to the instanceSpec.
Now, the openstackmachine controller will first create the instances
without security groups. Then a separate loop reconciles the
desired security groups that we want to apply in the cluster[1] with the
ones that are attached to the instance's ports.

[1] For now, we have two sets of security groups: the ones that can be added via
OpenStackMachine.Spec.SecurityGroups and the ones managed
by OpenStackCluster.Spec.ManagedSecurityGroups.
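
To make that loop concrete, here is a minimal sketch written directly against Gophercloud (the package layout and function names are illustrative, not the actual helpers added by this PR):

import (
	"fmt"
	"sort"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/networking/v2/ports"
)

// sameStringSet reports whether two slices contain the same elements, ignoring order.
func sameStringSet(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	a = append([]string(nil), a...)
	b = append([]string(nil), b...)
	sort.Strings(a)
	sort.Strings(b)
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// reconcilePortSecurityGroups updates each of the instance's ports so that it
// carries exactly the desired set of security group IDs.
func reconcilePortSecurityGroups(client *gophercloud.ServiceClient, portIDs, desiredSGIDs []string) error {
	for _, portID := range portIDs {
		port, err := ports.Get(client, portID).Extract()
		if err != nil {
			return fmt.Errorf("get port %s: %w", portID, err)
		}
		if sameStringSet(port.SecurityGroups, desiredSGIDs) {
			continue // port already has exactly the desired security groups
		}
		sgs := desiredSGIDs
		if _, err := ports.Update(client, portID, ports.UpdateOpts{SecurityGroups: &sgs}).Extract(); err != nil {
			return fmt.Errorf("update security groups on port %s: %w", portID, err)
		}
	}
	return nil
}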

TODOs:

  • Handle the case where OpenStackMachineSpec.Ports is defined with a SecurityGroupFilter. Probably I need to rework this so the work happens via Status and Ports are put in Status.
  • For an update to applied security groups, think about how the cluster status could summarize how many machines have been updated.
  • Add unit tests, at least for the openstackmachine_controller.
  • Handle the "default" security group applied to all created ports, by passing an empty slice of strings when no security group is required at port creation.
  • Update documentation.
  • Squash commits.


netlify bot commented Dec 5, 2023

Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

🔨 Latest commit: 8138e6a
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-cluster-api-openstack/deploys/65732e56589bf800082faacd
😎 Deploy Preview: https://deploy-preview-1774--kubernetes-sigs-cluster-api-openstack.netlify.app

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 5, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: EmilienM
Once this PR has been reviewed and has the lgtm label, please assign cecilerobertmichon for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 5, 2023
@EmilienM EmilienM changed the title Reconcile SecurityGroups <-> Instances separately ✨ Reconcile SecurityGroups <-> Instances separately Dec 5, 2023
@EmilienM EmilienM force-pushed the machine-sg branch 2 times, most recently from 1bc5fd9 to f9f1bae on December 6, 2023 at 01:33
@@ -107,6 +107,11 @@ type OpenStackMachineStatus struct {
	// +optional
	InstanceState *InstanceState `json:"instanceState,omitempty"`

	// SecurityGroups is the list of security groups that were applied to the ports
Contributor Author

Note to self: in a code comment I'll explain that we only populate this once all the rules have been applied.

Note for reviewers: do we want another field, SecurityGroupsReady, set once the reconciler has finished its job successfully? Say we managed to reconcile successfully once and Status.SecurityGroups was populated; then the reconciler fails but we still have Status.SecurityGroups, and we mark the instance as NotReady. Is that good enough?

Contributor

Implementation note: something like this would be implemented as a Condition rather than a status field. I don't think it's worth having a condition for this unless we think we have a consumer for it.

The question is: who consumes the Ready signal and what are their expectations of it? The answer is defined here: https://cluster-api.sigs.k8s.io/developer/architecture/controllers/machine:

The status object must have at least one field defined:

  • ready - a boolean field indicating if the infrastructure is ready to be used or not.

For machine creation I think it's clear that we would only set ready after security groups have been applied.

For an update to applied security groups it's less clear, and the CAPI definition doesn't indicate if setting it to false after having set it to true is meaningful or allowed. I would not be inclined to set it to false during a rollout of new managed security groups. I think this would be better indicated in the cluster status as a summary of how many machines have been updated.

@EmilienM EmilienM Dec 6, 2023

> For machine creation I think it's clear that we would only set ready after security groups have been applied.

Exactly. That's what I meant when writing that code:

err = r.reconcileSecurityGroupsToInstance(scope.Logger(), openStackCluster, machine, openStackMachine, instanceStatus, computeService, networkingService)
if err != nil {
	conditions.MarkFalse(openStackMachine, infrav1.InstanceReadyCondition, infrav1.SecurityGroupApplyFailedReason, clusterv1.ConditionSeverityError, "Applying security groups failed: %v", err)
	return ctrl.Result{}, fmt.Errorf("add security groups to instance: %w", err)
}

> For an update to applied security groups it's less clear, and the CAPI definition doesn't indicate if setting it to false after having set it to true is meaningful or allowed. I would not be inclined to set it to false during a rollout of new managed security groups. I think this would be better indicated in the cluster status as a summary of how many machines have been updated.

OK, I'll try to think about how the number of machines that have been updated could be reflected in the OpenStackClusterStatus.

I've updated my TODO in the PR description and I'll resolve this comment once I come up with something.

if !contains(desiredSecurityGroupIDs, securityGroupID) {
	logger.Info("Port has security group that is not desired", "portID", portID, "securityGroupID", securityGroupID)
	portUpdateNeeded = true
	break
}
Contributor Author

Note to self: I'll remove the break so we get all the logs in the loop. We know we need to update the port at this point anyway.

Contributor Author

I checked in CI logs and I could see:

I1206 03:01:32.375132       1 securitygroups.go:434] "Port has security group that is not desired" controller="openstackcluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackCluster" OpenStackCluster="e2e-hwo2dd/cluster-e2e-hwo2dd" namespace="e2e-hwo2dd" name="cluster-e2e-hwo2dd" reconcileID=8ce25157-7075-4f45-96db-1a5bc66468f3 cluster="cluster-e2e-hwo2dd" portID="06b4adb5-05b9-405f-ad05-a8f5e5eea001" securityGroupID="0f239ce1-0ab5-4e6e-9f91-1afe29e21e5d"

And 0f239ce1-0ab5-4e6e-9f91-1afe29e21e5d is the default security group (source). Do we want to ignore this one when checking unwanted SGs attached to the ports?

Contributor

Hmm. Is it a bug that the default security group is attached to the ports?

Contributor Author

Ports that are created when port security is enabled will have the default security group, which is itself managed by Neutron.

I see two options:

  • We keep it: look up its ID (once and early, to avoid many Neutron API calls) and put it in desiredSecurityGroupIDs, so it's no longer a problem.
  • We remove it in this loop.

I personally prefer option 1, for safety and because that's the default OpenStack behaviour here.

cc @stephenfin with your OpenStack dev hat, any preference?
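
For reference, a minimal sketch of what option 1 could look like with Gophercloud (the function name is made up; real code would cache the result so the Neutron API is only hit once per reconcile):

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/networking/v2/extensions/security/groups"
)

// defaultSecurityGroupID looks up the ID of the Neutron-managed "default"
// security group for a project so it can be appended to desiredSecurityGroupIDs.
func defaultSecurityGroupID(client *gophercloud.ServiceClient, projectID string) (string, error) {
	allPages, err := groups.List(client, groups.ListOpts{Name: "default", ProjectID: projectID}).AllPages()
	if err != nil {
		return "", fmt.Errorf("list security groups: %w", err)
	}
	allGroups, err := groups.ExtractGroups(allPages)
	if err != nil {
		return "", err
	}
	if len(allGroups) != 1 {
		return "", fmt.Errorf("expected one default security group, got %d", len(allGroups))
	}
	return allGroups[0].ID, nil
}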

Contributor

tbh, until today I didn't know that adding a port would add the default security group. What is the source of these ports? If they are ports that we are creating ourselves then we can instruct neutron not to assign a security group to the port. From openstackclient this is done by e.g.:

openstack port create --no-security-group ...

which under the hood sets the security_group_ids field to [] (empty list) when creating the port. Could we do the same thing?

If these ports are coming from elsewhere then it's possible to have other security groups than the default one assigned to the port (openstack port create --security-group <sec-group> ...) so I'm not sure how we can ignore everything? Do we forbid specifying ports with security groups defined?

Contributor

Given the above, we probably need to decide what's authoritative: the security groups on the port, or the security groups defined on the Machine. I assume the latter, but if this is the case then we're going to modify something that the user gave us. If we go with the former, we're either modifying or ignoring something else that the user gave us.

Contributor Author

> What is the source of these ports?

This is where we pre-build ports before creating an instance:

func (s *Service) constructPorts(openStackCluster *infrav1.OpenStackCluster, instanceSpec *InstanceSpec) ([]infrav1.PortOpts, error) {
	// Ensure user-specified ports have all required fields
	ports, err := s.normalizePorts(instanceSpec.Ports, openStackCluster, instanceSpec)
	if err != nil {
		return nil, err
	}

	// no networks or ports found in the spec, so create a port on the cluster network
	if len(ports) == 0 {
		port := infrav1.PortOpts{
			Network: &infrav1.NetworkFilter{
				ID: openStackCluster.Status.Network.ID,
			},
			Trunk: &instanceSpec.Trunk,
		}
		for _, subnet := range openStackCluster.Status.Network.Subnets {
			port.FixedIPs = append(port.FixedIPs, infrav1.FixedIP{
				Subnet: &infrav1.SubnetFilter{
					ID: subnet.ID,
				},
			})
		}
		ports = []infrav1.PortOpts{port}
	}

	// trunk support is required if any port has trunk enabled
	portUsesTrunk := func() bool {
		for _, port := range ports {
			if port.Trunk != nil && *port.Trunk {
				return true
			}
		}
		return false
	}
	if portUsesTrunk() {
		trunkSupported, err := s.isTrunkExtSupported()
		if err != nil {
			return nil, err
		}
		if !trunkSupported {
			return nil, fmt.Errorf("there is no trunk support. please ensure that the trunk extension is enabled in your OpenStack deployment")
		}
	}

	return ports, nil
}

We don't pass any security groups.

> Could we do the same thing?

I don't think so, because we don't provide a plain list of security group IDs but a list of SecurityGroupFilter, so I suspect that nothing will be passed to the Neutron API if we pass an empty list of filters.

> we probably need to decide what's authoritative:

I want CAPO to have the authority if possible: if users provide a security group for the Machine, then we add it to the port in the reconcile (not at instance creation), so the instance and its initial ports have no SGs, in theory.

Contributor Author

I've pushed e651abb. I have little hope that it will fix this "default" SG issue, but it does address a real problem. We can probably classify this change as unrelated to this comment, but I wanted to make it visible for review.

Contributor Author

I checked with @pierreprinetti and the good news is that port.CreateOpts.SecurityGroups in Gophercloud is *[]string, so if we pass an empty slice of strings, the port should be created with no security group. I'll test it.
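
A quick sketch of that, reusing the Gophercloud ports package from above (the network ID and client are placeholders; the point is passing a pointer to an empty slice rather than nil):

// createPortWithoutSecurityGroups creates a Neutron port with an explicitly
// empty security group list, so Neutron does not attach the project's
// "default" security group to it.
func createPortWithoutSecurityGroups(client *gophercloud.ServiceClient, networkID string) (*ports.Port, error) {
	noSecurityGroups := []string{}
	return ports.Create(client, ports.CreateOpts{
		NetworkID:      networkID,
		SecurityGroups: &noSecurityGroups, // empty slice, not nil: "no security groups"
	}).Extract()
}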


@EmilienM EmilienM force-pushed the machine-sg branch 2 times, most recently from 051f8ea to 22ac367 on December 7, 2023 at 18:25
Originally, we were adding the SecurityGroups to the instanceSpec.
Now, the openstackmachine controller will first create the instances
without security groups. Then we have a loop that will reconcile the
desired security groups that we want to apply in the cluster[1] with the
ones that are added to the instance's ports.

[1] for now, we have 2 sets of rules: the one that can be added to
  `OpenStackMachine.Spec.SecurityGroups` and the one managed
  by `OpenStackCluster.Spec.ManagedSecurityGroups`.
When defining an OpenStackMachine with `Ports`, we no longer allow the
use of `SecurityGroupFilters` because we now have a separate loop to
reconcile ports based on 2 things:

* If `OpenStackCluster.Spec.ManagedSecurityGroups`, machine ports
  will be reconciled to have the security group of their role
  (cp/worker).
* If `OpenStackMachine.Spec.SecurityGroups` is specified for the
  machine, we'll reconcile the machine ports to ensure they have
  the Security Groups asked by the user.

However, if a user provides `OpenStackMachine.Spec.Ports`, we will return
an error if the Port has `SecurityGroupFilters`, because we have a
separate reconcile which updates *all* ports with the same Security
Groups.
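
Roughly, that validation could look like the following sketch (it assumes the PortOpts field is called SecurityGroupFilters, as described above; the helper name and error text are illustrative):

// validateMachinePorts rejects ports that define SecurityGroupFilters, since
// security groups are now reconciled in a separate loop for all ports.
func validateMachinePorts(machinePorts []infrav1.PortOpts) error {
	for i := range machinePorts {
		if len(machinePorts[i].SecurityGroupFilters) > 0 {
			return fmt.Errorf("port %d: SecurityGroupFilters is not supported; use OpenStackMachine.Spec.SecurityGroups or OpenStackCluster.Spec.ManagedSecurityGroups instead", i)
		}
	}
	return nil
}
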
@EmilienM EmilienM mentioned this pull request Dec 7, 2023
@EmilienM
Contributor Author

EmilienM commented Dec 8, 2023

/test pull-cluster-api-provider-openstack-e2e-test

1 similar comment
@EmilienM
Contributor Author

EmilienM commented Dec 8, 2023

/test pull-cluster-api-provider-openstack-e2e-test

if err != nil {
	return fmt.Errorf("error getting port %s: %v", portID, err)
}

@EmilienM EmilienM Dec 8, 2023

Instead of doing 1593e43, we could:

  • Put a new struct in OpenStackMachine.Status that maps ports to their security group IDs, so we do the lookup once (see the sketch after this list).
  • Not add these SG IDs to the slices of SG IDs to reconcile that we created in the openstackmachine_controller, because they are specific to individual ports and are not applied to all ports (unlike ManagedSecurityGroups & Spec.SecurityGroups).
  • Here in the reconcile, if a port is found in the Status with SG IDs, simply add them to the desired SGs for that port.

This is a bit complex, but it addresses the 3 current options that we have to attach security groups to machines:

  • OpenStackCluster.Spec.ManagedSecurityGroups
  • OpenStackMachine.Spec.SecurityGroups
  • OpenStackMachine.Spec.Ports.SecurityGroupFilters
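
As a sketch of the first bullet above (purely illustrative; neither the field name nor its shape is part of this PR):

// PortSecurityGroups would live in OpenStackMachineStatus and record, per
// port, the security group IDs requested directly on that port via
// Spec.Ports, so the lookup happens once and the IDs are merged into the
// desired set for that port only.
type PortSecurityGroups struct {
	// PortID is the Neutron ID of the port.
	PortID string `json:"portID"`
	// SecurityGroupIDs are the security groups attached to this specific
	// port, in addition to the cluster- and machine-wide ones.
	SecurityGroupIDs []string `json:"securityGroupIDs,omitempty"`
}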

Contributor Author

Another option is that we don't reconcile the security groups from ports defined in OpenStackMachine.Spec.Ports because we consider them immutable. That feels odd to me, and I can't yet find a good option here that doesn't bring complexity.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 8, 2023
Required to use the slices package.
@k8s-ci-robot
Contributor

@EmilienM: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-cluster-api-provider-openstack-e2e-test
Commit: 8138e6a
Details: link
Required: true
Rerun command: /test pull-cluster-api-provider-openstack-e2e-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@EmilienM
Contributor Author

I'm taking a break from this PR, as I figured that I really want the ports to be managed outside of the instance creation; see #1788.
That will simplify the workflow:

Reconcile Ports -> Reconcile Security groups -> Reconcile Instances

@EmilienM
Contributor Author

/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Dec 11, 2023
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@EmilienM
Contributor Author

Note: this one is not stale; I'll come back to it at some point this year, now that we have a new API for Security Groups.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 21, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 20, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
