Improve csi-snapshotter VolumeSnapshotContent requeue fairness #1282

Open
pwschuurman opened this issue Mar 21, 2025 · 0 comments · May be fixed by #1284

Is your feature request related to a problem?/Why is this needed

This enhancement proposes improving the requeue behavior for syncing VolumeSnapshotContent resources.

VolumeSnapshotContent resources are reconciled via the contentQueue, whose rate limiter applies exponential backoff. For long-running snapshots this backoff is useful, since it reduces the amount of polling required to determine whether a snapshot is readyToUse=true. However, the exponential nature of this backoff means the contentQueue rate limiter can quickly reach its maximum delay. The current default base delay is 1 second and the current maximum is 300 seconds, so only 9 requeue events are needed to reach the maximum. This limit can be reached quickly today if a VolumeSnapshotContent is updated: updates (especially re-entrant updates) trigger a resync and requeue, which quickly bumps up the rate limiter's retry count, resulting in long requeue wait times.
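
For illustration, here is a minimal sketch against client-go's workqueue package (using the pre-generics API; newer client-go releases expose a typed equivalent) showing how quickly the per-item delay saturates under these defaults:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Same shape as the contentQueue defaults: 1s base delay, 300s cap.
	rl := workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 300*time.Second)
	for i := 1; i <= 10; i++ {
		// The delay doubles on each call for the same item:
		// 1s, 2s, 4s, 8s, ... until it is capped at 300s.
		fmt.Printf("requeue %d -> wait %v\n", i, rl.When("snapcontent-example"))
	}
}
```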

Describe the solution you'd like in detail

There are two things that should be fixed here:

  1. Prevent updates from bumping the requeue rate limiter limit: Ideally, an additional call to contentQueue.AddRateLimited() should not increase the rate limiter exponent if an item is already scheduled to be requeued. It should either maintain the same requeue schedule, or be adjusted to requeue further into the future, but with the same backoff exponent (see the sketch after this list).
  2. Reduce the number of re-entrant updates. This can reduce the number of requeues (which can lead to the problem above). Some updates are necessary for tracking the lifecycle of a VolumeSnapshotContent. However, it appears that the snapshot.storage.kubernetes.io/volumesnapshot-being-created annotation can be removed earlier, prior to the snapshot actually being marked as readyToUse.
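
One possible shape for (1) is a wrapper rate limiter that remembers when each item is next due and reuses that schedule instead of consulting the inner limiter again. This is only a sketch against the pre-generics workqueue.RateLimiter interface; dedupRateLimiter and its constructor are hypothetical names, not existing csi-snapshotter code:

```go
package ratelimiter

import (
	"sync"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// dedupRateLimiter (hypothetical) wraps another rate limiter and keeps the
// existing requeue schedule for items that are already waiting, so that
// extra AddRateLimited calls do not grow the backoff exponent.
type dedupRateLimiter struct {
	inner   workqueue.RateLimiter
	mu      sync.Mutex
	pending map[interface{}]time.Time // item -> time it is due to be requeued
}

func newDedupRateLimiter(inner workqueue.RateLimiter) workqueue.RateLimiter {
	return &dedupRateLimiter{inner: inner, pending: map[interface{}]time.Time{}}
}

func (d *dedupRateLimiter) When(item interface{}) time.Duration {
	d.mu.Lock()
	defer d.mu.Unlock()
	now := time.Now()
	if due, ok := d.pending[item]; ok && due.After(now) {
		// Already scheduled: keep the existing schedule rather than asking
		// the inner limiter, which would bump the exponent.
		return due.Sub(now)
	}
	delay := d.inner.When(item)
	d.pending[item] = now.Add(delay)
	return delay
}

func (d *dedupRateLimiter) Forget(item interface{}) {
	d.mu.Lock()
	defer d.mu.Unlock()
	delete(d.pending, item)
	d.inner.Forget(item)
}

func (d *dedupRateLimiter) NumRequeues(item interface{}) int {
	return d.inner.NumRequeues(item)
}
```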

Describe alternatives you've considered

A quick-fix alternative is to decrease the maximum exponential backoff of the contentQueue to a lower default (e.g., 30 or 60 seconds). This can be used by a CO to reduce the likelihood of higher-latency VolumeSnapshotContent reconciliation.
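
Roughly, assuming a 60-second cap (the queue name below is illustrative, not the exact one used in csi-snapshotter):

```go
// Hypothetical construction: cap the contentQueue backoff at 60s instead of 300s.
contentQueue := workqueue.NewNamedRateLimitingQueue(
	workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 60*time.Second),
	"csi-snapshotter-content",
)
```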

Additional context
