This repository was archived by the owner on Jul 30, 2021. It is now read-only.

🐛 Fix init lock #232

Merged
merged 1 commit into from
Sep 19, 2019

Conversation

accepting
Contributor

Signed-off-by: JunLi [email protected]

What this PR does / why we need it:
Fix a bug in the init lock.

Consider the following case:
Two control plane machines (call them C1 and C2) are created concurrently. Assume C1 acquires the init lock first. Since C2 fails to acquire the lock, it exits the reconcile loop and releases the init lock.
Before C1 finishes initializing itself, we create another control plane machine (call it C3). Obviously, C3 acquires the init lock successfully and also generates a controlplane-init config, which is what we do not want to see.
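
To make the race concrete, here is a minimal, self-contained sketch of the pre-fix behavior (toy types and names for illustration, not the actual control_plane_init_mutex.go code): a machine that fails to acquire the lock still releases it on its way out of the reconcile loop, which is what lets C3 start a second init.

```go
package main

import (
	"errors"
	"fmt"
)

// initLock is a toy stand-in for the cluster-wide, ConfigMap-backed init lock.
type initLock struct {
	holder string // machine name currently holding the lock; "" means free
}

// tryAcquire takes the lock for machine if it is free.
func (l *initLock) tryAcquire(machine string) bool {
	if l.holder == "" {
		l.holder = machine
		return true
	}
	return false
}

// release frees the lock unconditionally, mirroring the pre-fix behavior:
// the caller never checks whether it is actually the holder.
func (l *initLock) release() {
	l.holder = ""
}

// reconcile sketches the buggy flow described above.
func reconcile(l *initLock, machine string) error {
	if !l.tryAcquire(machine) {
		l.release() // BUG: C2 releases a lock it never held
		return errors.New("another control plane is initializing")
	}
	// ...generate the controlplane-init configuration here...
	return nil
}

func main() {
	l := &initLock{}
	_ = reconcile(l, "C1") // C1 acquires the lock
	_ = reconcile(l, "C2") // C2 fails to acquire it, but wrongly frees it
	_ = reconcile(l, "C3") // C3 acquires it while C1 is still initializing
	fmt.Println("holder:", l.holder) // prints "holder: C3": two init configs exist
}
```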

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 17, 2019
@k8s-ci-robot
Contributor

Welcome @accepting!

It looks like this is your first PR to kubernetes-sigs/cluster-api-bootstrap-provider-kubeadm 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/cluster-api-bootstrap-provider-kubeadm has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 17, 2019
@k8s-ci-robot
Contributor

Hi @accepting. Thanks for your PR.

I'm waiting for a kubernetes-sigs or kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. labels Sep 17, 2019
Contributor

@fabriziopandini fabriziopandini left a comment

@accepting thanks for this PR!
/ok-to-test
IMO this use case deserves a unit test so that it is properly detected and guarded against over time.

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 17, 2019
@chuckha
Contributor

chuckha commented Sep 17, 2019

The second control plane can't release the lock that the first control plane is holding. If it does release the lock, that is a bug.

@vincepri
Contributor

Reading this code https://github.com/kubernetes-sigs/cluster-api-bootstrap-provider-kubeadm/blob/master/internal/locking/control_plane_init_mutex.go#L100-L126 it seems like we release the lock without checking if the Machine that has the lock is the same as the current one?
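
A hedged sketch of the check being suggested here (illustrative names, not the actual control_plane_init_mutex.go code): since the lock ConfigMap records the holder's machine name, the release path could refuse to delete it unless the requesting machine is that holder.

```go
package locking

// lockInfo mirrors the information stored in the lock ConfigMap: the name of
// the machine that acquired the init lock. (Hypothetical type for illustration.)
type lockInfo struct {
	MachineName string
}

// release frees the lock only if requestingMachine is the recorded holder.
// It reports whether the lock is free after the call.
func release(info *lockInfo, requestingMachine string) bool {
	switch {
	case info == nil:
		return true // no lock exists, nothing to release
	case info.MachineName != requestingMachine:
		return false // held by a different machine: leave it alone
	default:
		info.MachineName = ""
		return true
	}
}
```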

@vincepri
Contributor

@chuckha
Contributor

chuckha commented Sep 17, 2019

@vincepri !!! yeah that would be a bug 😞

@chuckha chuckha changed the title Fix init lock 🐛 Fix init lock Sep 17, 2019
@accepting
Contributor Author

@vincepri @chuckha I don't think it's a good idea that only the owner machine can release the lock. As far as I know, the lock should be held from the first acquisition until cluster.Status.ControlPlaneInitialized becomes true. Therefore, the lock owner may have no chance to release it.

@chuckha
Contributor

chuckha commented Sep 18, 2019

In the easy case when there is simply one control plane machine:

  1. The control plane machine acquires the lock.
  2. The control plane machine does not error out and sets bootstrap data. The lock is held.
  3. Some time passes and the control plane becomes initialized, at which point the kubeadm config will be reconciled again (due to the cluster watch).
  4. The control plane machine returns early because it has bootstrap data set.

Now a new bootstrap config is created for a worker machine.

  1. The control plane is initialized, so the lock is released. Not that it matters at this point.

Let's say there are two control planes and two managers. This situation should never happen since the deployment uses leader election to ensure only a single manager is running. But let's say this does happen.

The case you originally outlined indeed holds true. C1 acquires the lock. C2 attempts to acquire the lock, cannot, errors out, and thus removes the lock. C3 is created and acquires the lock before C1 finishes initializing. Now you have two clusters. Very bad.

What if we never released the lock in the successful case?

I think one clean way to do this is to acquire the lock and never unlock it. The only time we want to unlock the lock is if we hit an error condition after the lock is created but before the bootstrap data is generated.

This simplifies the code a bit. This change is an improvement but we should still delete the Unlock code on line 265.

Interested to hear your thoughts @accepting
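
A minimal sketch of that proposal, assuming a reconcile function with a named error return (rerr) and an interface over the init lock (names are illustrative, not the exact CABPK types): the lock is released only when an error occurs after acquisition but before bootstrap data is generated, and is otherwise left in place.

```go
package sketch

import (
	"context"
	"errors"
)

// initLocker is a hypothetical interface over the ConfigMap-backed init lock.
type initLocker interface {
	Lock(ctx context.Context, clusterName, machineName string) bool
	Unlock(ctx context.Context, clusterName string) bool
}

// reconcileControlPlaneInit acquires the lock and never releases it on success;
// it releases it only if an error occurs before bootstrap data is generated.
func reconcileControlPlaneInit(ctx context.Context, lock initLocker, clusterName, machineName string) (rerr error) {
	if !lock.Lock(ctx, clusterName, machineName) {
		// Another machine is initializing; back off without touching the lock.
		return errors.New("control plane init lock is held by another machine")
	}
	defer func() {
		if rerr != nil {
			lock.Unlock(ctx, clusterName)
		}
	}()

	// ...generate the kubeadm init configuration and bootstrap data here...
	return nil
}
```

The hunk in the review thread below appears to have the same shape, with the defer moved after the Lock call.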

@accepting
Contributor Author

@chuckha In my opinion, the right place to release the lock is where cluster.Status.ControlPlaneInitialized is set to true in the cluster controller. If we cannot make the cluster controller do that work, the Unlock code on line 265 may be the most appropriate place in CABPK.

The only bad effect I can see is that the Unlock code on line 265 will be invoked every time a new bootstrap config is created afterwards. That is really unnecessary.

If we never released the lock in the successful case, we could avoid the bad effect above. But it may lead to other bad effects. For example:

  1. Create a cluster with only one control plane machine; call it C1.
  2. C1 acquires the init lock successfully and holds it forever. That means the configmap {cluster-namespace}/{cluster-name}-lock always exists, and the configmap contains C1's machine name.
  3. After some days, we find problems in this cluster and want to rebuild it. We delete the cluster and all machines in this cluster.
  4. We recreate the cluster with the same cluster name but with different control planes.
  5. Create control plane NC1 and its bootstrap config. NC1 tries to acquire the init lock. Since the configmap {cluster-namespace}/{cluster-name}-lock exists and the machine name recorded in it is C1's, which does not equal NC1's, NC1 fails to acquire the init lock. Hence, NC1 can never generate bootstrap data.

@chuckha
Contributor

chuckha commented Sep 18, 2019

@accepting the config map has an owner reference so it will be deleted when we delete the owning machine
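
A hedged sketch of what that owner reference looks like (illustrative helper, not the exact CABPK code): when the lock ConfigMap's metadata.ownerReferences points at an object, the Kubernetes garbage collector deletes the ConfigMap when that owner is deleted, so what it points at decides when the lock goes away.

```go
package locking

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// lockConfigMap builds the lock ConfigMap with an owner reference. If the owner
// is the Cluster, the ConfigMap outlives individual machines (the concern
// accepting raises further down); if it is the owning Machine, deleting that
// machine also removes the lock.
func lockConfigMap(namespace, clusterName, ownerKind, ownerName string, ownerUID types.UID) *corev1.ConfigMap {
	return &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Namespace: namespace,
			Name:      clusterName + "-lock",
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "cluster.x-k8s.io/v1alpha2", // assumed group/version for CAPI v1alpha2
				Kind:       ownerKind,                   // "Cluster" or "Machine"
				Name:       ownerName,
				UID:        ownerUID,
			}},
		},
	}
}
```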

if !cluster.Status.ControlPlaneInitialized {
	defer func() {
		if rerr != nil {
Contributor

I'm interested in this change: why do we only unlock when rerr has an error?

Contributor Author

Since any error case in this block means the bootstrap data was not generated successfully, we release the lock. Do you have any better suggestion?

Contributor Author

@chuckha I have moved this code behind the Lock code.

@chuckha
Contributor

chuckha commented Sep 18, 2019

Another way this would work:

If the bootstrap config already has bootstrap data, the bootstrap config is holding the lock, and cluster.Status.ControlPlaneInitialized is true, we can delete the lock at that point.
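
A tiny sketch of that condition, with hypothetical names, just to make the three-part check explicit:

```go
package sketch

// maybeReleaseInitLock deletes the lock only once initialization has fully
// succeeded: bootstrap data exists, this config holds the lock, and the
// cluster reports ControlPlaneInitialized.
func maybeReleaseInitLock(hasBootstrapData, holdsLock, controlPlaneInitialized bool, unlock func()) {
	if hasBootstrapData && holdsLock && controlPlaneInitialized {
		unlock()
	}
}
```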

@accepting
Contributor Author

@chuckha You are right, I missed that. But the owner reference is set to the cluster now. We may need to change it to the related machine.

Signed-off-by: JunLi <[email protected]>
Contributor

@chuckha chuckha left a comment

/approve

This definitely cleans up the logic, thanks

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: accepting, chuckha

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 19, 2019
Contributor

@vincepri vincepri left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 19, 2019
@k8s-ci-robot k8s-ci-robot merged commit b7c1ea3 into kubernetes-retired:master Sep 19, 2019