Skip to content
This repository was archived by the owner on Dec 6, 2024. It is now read-only.

Fix bucket* controller framework #25

Merged
merged 1 commit into from
Mar 10, 2021

Conversation

wlan0
Copy link
Contributor

@wlan0 wlan0 commented Mar 8, 2021

  • backoff exponentially on errors
  • ensure correct behavior when multiple operations on the same object pile on top of one another

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wlan0

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 8, 2021

// Ensure that multiple operations on different versions of the same object
// do not happen in parallel
c.OpLock(op)
defer c.OpUnlock(op)
c.OpLock(uuid)
Copy link

@jeffvance jeffvance Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I am missing something here. How does tracking uuid prevent operating in parallel on different versions of the same object? Wouldn't this serialize on all versions of the object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our controller, we run a single shared informer between multiple crd resource event handlers. The single shared informer reports changes in bucket* objects as they occur. The events are processed by first encoding them in operation structs, and putting them into the tail of a workqueue. The workqueue implementation is what this PR mostly touches on.

The workqueue was previously holding operation objects (addOp, updateOp, deleteOp). On any event, these operation objects were put into the queue, and waited to be processed by the next available thread.

These objects were not comparable with one another, and that lead to a few problems:

  • Gven an object, if update occurs while add is waiting in the queue, it should only result in 1 event - add, with the updated fields
  • Gven an object, if update occurs while another update is waiting in the queue, then the second update is the only one that should be processed

The previous implementation, albeit technically eventually consistent, could not ensure the above dynamics. This was simply because it was not possible to determine new events on an object already present in the queue.

Now, for the uuid. This is a string field that is easily comparable (==), and by using the kubernetes UID as the unique value in this field, it is possible to compare two events occurring on the same object.

This change also allowed me to improve the exponential backoff characteristics. If new events occur while an operation on the same object is in back-off, the new event will not be processed before the back-off period is complete. i.e. the uuid positionin the queue is not changed. Instead, the value of opMap[uuid] is set to the new operation structure denoting the event. By decoupling the operation itself from the queuing order, we gain a lot of flexibility.

Now to answer your direct question,

Sorry, I am missing something here. How does tracking uuid prevent operating in parallel on different versions of the same object? Wouldn't this serialize on all versions of the object?

By using uuid, we do not serialize the steps. We either replace old events with new ones if both are updates or handle them differently based on the type of the events.

o.Indexer.Delete(o.Object)
} else {
glog.Errorf("Error deleting %s %s: %v", ns, name, err)
}
Copy link

@jeffvance jeffvance Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in default block can we insert the operator that was not handled, or is that logged earlier?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

func (c *ObjectStorageController) GetOpLock(op types.UID) *sync.Mutex {
lockKey := op
c.lockerLock.Lock()
defer c.lockerLock.Unlock()
Copy link

@jeffvance jeffvance Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the defer be after L326 to handle the case of c.locker == nil?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though the docs say that deferred funcs can be called anytime after the defer statement itself, my understanding is that it always gets called only after returning from the function it is called from. i.e. only after the defer starts going out of scope.

 - backoff exponentially on errors
 - ensure correct behavior when multiple operations on the same object pile on top of one another
@brahmaroutu
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 10, 2021
@k8s-ci-robot k8s-ci-robot merged commit 05acb5e into kubernetes-retired:master Mar 10, 2021
shanduur pushed a commit to shanduur/container-object-storage-interface-api that referenced this pull request Jun 6, 2024
BlaineEXE pushed a commit to BlaineEXE/cosi-api that referenced this pull request Jun 14, 2024
BlaineEXE pushed a commit to BlaineEXE/cosi-api that referenced this pull request Jun 14, 2024
remove opaque parameters from delete request and return bucket id o…
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants