Is your feature request related to a problem?/Why is this needed
This enhancement is to improve the requeue behavior for syncing VolumeSnapshotContent resources.
VolumeSnapshotContent resources are reconciled via the contentQueue, which requeues items with exponential backoff. For long-running snapshots this backoff is useful: it reduces the amount of polling required to determine whether a snapshot has reached readyToUse=true. However, the exponential nature of the backoff means the contentQueue rate limiter can reach its maximum delay very quickly. The current default base delay is 1 second and the current maximum is 300 seconds, so only [9 requeue events] are needed to reach the maximum. This limit can be reached quickly today whenever a VolumeSnapshotContent is updated: updates (especially re-entrant updates) trigger a resync and requeue, which bumps the rate limiter's retry count and results in long requeue wait times.
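For illustration, here is a minimal Go sketch (not snapshotter code) that constructs client-go's ItemExponentialFailureRateLimiter with the base and maximum described above and prints the per-requeue delays; the item name snapcontent-example is hypothetical.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Same parameters as the defaults described above: 1s base delay,
	// 300s maximum, doubling on every rate-limited requeue.
	rl := workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 300*time.Second)

	// "snapcontent-example" is a hypothetical VolumeSnapshotContent name.
	for i := 1; i <= 10; i++ {
		fmt.Printf("requeue %2d -> wait %v\n", i, rl.When("snapcontent-example"))
	}
	// Prints 1s, 2s, 4s, ..., 4m16s and then the 5m0s cap: roughly nine
	// requeue events are enough to push every later retry to the maximum.
}
```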
Describe the solution you'd like in detail
There are two things that should be fixed here:
1. Prevent updates from bumping the requeue rate limiter: ideally, an additional call to contentQueue.AddRateLimited() should not increase the rate limiter exponent if the item is already scheduled to be requeued. The item should either keep its existing requeue schedule, or be pushed further into the future with the same backoff exponent (a rough sketch of this idea follows the list).
2. Reduce the number of re-entrant updates, which in turn reduces the number of requeues that feed the problem above. Some updates are necessary for tracking the lifecycle of a VolumeSnapshotContent, but it appears that the snapshot.storage.kubernetes.io/volumesnapshot-being-created annotation can be removed earlier, before the snapshot is actually marked readyToUse.
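As referenced in item 1, below is a rough sketch (not the snapshotter's actual implementation) of one way to stop repeat AddRateLimited calls from bumping the backoff exponent: wrap the rate-limited workqueue and ignore AddRateLimited for items that are already waiting to be requeued. The type and item names (dedupingQueue, snapcontent-example) are hypothetical.

```go
package main

import (
	"fmt"
	"sync"

	"k8s.io/client-go/util/workqueue"
)

// dedupingQueue wraps a rate-limited workqueue so that AddRateLimited is a
// no-op for items that are already waiting to be requeued; only the first
// call consults (and therefore bumps) the rate limiter.
type dedupingQueue struct {
	workqueue.RateLimitingInterface

	mu      sync.Mutex
	pending map[interface{}]bool
}

func newDedupingQueue(rl workqueue.RateLimiter) *dedupingQueue {
	return &dedupingQueue{
		RateLimitingInterface: workqueue.NewRateLimitingQueue(rl),
		pending:               map[interface{}]bool{},
	}
}

// AddRateLimited keeps the existing requeue schedule and backoff exponent if
// the item is already pending; otherwise it behaves like the wrapped queue.
func (q *dedupingQueue) AddRateLimited(item interface{}) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.pending[item] {
		return
	}
	q.pending[item] = true
	q.RateLimitingInterface.AddRateLimited(item)
}

// Get clears the pending mark so the next failure starts a new requeue cycle.
func (q *dedupingQueue) Get() (interface{}, bool) {
	item, shutdown := q.RateLimitingInterface.Get()
	q.mu.Lock()
	delete(q.pending, item)
	q.mu.Unlock()
	return item, shutdown
}

func main() {
	q := newDedupingQueue(workqueue.DefaultControllerRateLimiter())
	q.AddRateLimited("snapcontent-example") // schedules one retry
	q.AddRateLimited("snapcontent-example") // ignored: a retry is already pending
	fmt.Println("requeues recorded:", q.NumRequeues("snapcontent-example")) // 1, not 2
}
```

A wrapper like this keeps the existing requeue schedule; the other behavior described in item 1 (pushing the requeue further out without raising the exponent) could instead re-schedule via AddAfter using the current, unbumped delay.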
Describe alternatives you've considered
A quick-fix alternative is simply to decrease the maximum exponential backoff of the contentQueue to a lower default (e.g. 30 or 60 seconds). A CO could use this to reduce the likelihood of high-latency VolumeSnapshotContent reconciliation.
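A minimal sketch of this alternative, assuming the cap is simply lowered to 60 seconds where the rate limiter is constructed; the queue name content-queue-example and item name snapcontent-example are illustrative, not the snapshotter's actual names. Where the deployed snapshot controller or sidecar exposes startup flags for the retry interval start/max, those flags map to the same two values (check the binary's --help to confirm).

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Same limiter as the default, but capped at 60s instead of 300s.
	rl := workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 60*time.Second)
	contentQueue := workqueue.NewNamedRateLimitingQueue(rl, "content-queue-example")
	defer contentQueue.ShutDown()

	// Backoff becomes 1s, 2s, 4s, 8s, 16s, 32s, 60s, 60s, ... so even a
	// requeue storm delays a resync by at most one minute.
	for i := 1; i <= 8; i++ {
		fmt.Printf("requeue %d -> wait %v\n", i, rl.When("snapcontent-example"))
	}
}
```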
Additional context