[14 Jan 2025] - [sidecar] Logs are getting flooded because the sidecar code has an endless retry mechanism #5
Comments
The sidecar is expected to be an operator of sorts, so I do expect it to retry until it's successful. That's part of the control theory that keeps Kubernetes and its systems stable. That said, I normally also expect some sort of backoff mechanism. I would propose that the fix for this issue focus on ensuring a reasonable backoff. We can probably start with the backoff strategy and timing recommended by controller-runtime (reference needed) and readjust in the future if this continues to come up.
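For illustration, here is a minimal sketch (not the sidecar's actual code) of what that could look like, reusing client-go's default controller rate limiter, which, if I recall correctly, is the same per-item exponential backoff controller-runtime applies by default. The bucketKey value and the tryProvision helper are hypothetical placeholders:

```go
package main

import (
	"time"

	"k8s.io/client-go/util/workqueue"
	"k8s.io/klog/v2"
)

// tryProvision stands in for the real driver call (e.g. a gRPC request);
// it is a placeholder for this sketch only.
func tryProvision(key string) error {
	return nil
}

func main() {
	// Exponential per-item backoff (5ms base, capped) plus an overall rate
	// limit; this is client-go's stock controller rate limiter.
	limiter := workqueue.DefaultControllerRateLimiter()
	bucketKey := "default/my-bucket" // hypothetical work item

	for {
		if err := tryProvision(bucketKey); err != nil {
			delay := limiter.When(bucketKey) // grows with each consecutive failure
			klog.V(3).InfoS("provisioning failed, backing off",
				"key", bucketKey, "retries", limiter.NumRequeues(bucketKey), "delay", delay, "err", err)
			time.Sleep(delay)
			continue
		}
		limiter.Forget(bucketKey) // success resets the backoff for this item
		return
	}
}
```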
One of my first thoughts when triaging/planning bugs is what sort of testing is needed. Regression tests are very important, but for bugs related to reconcile retry timing and logging, I have found that timing-related log-output expectations are hard to codify. In Rook, we tend not to create regression tests for these cases and instead do our best to make sure system internals log helpful info without frequent spam. @shanduur, my inclination here is to not require deeply involved unit/e2e tests, but I'm curious about your input here as well.
FYI, we have also discussed this in relation to other v1alpha2 work, and we plan to focus on it during the implementation portion of v1alpha2. There are other places where COSI needs to wait for something to happen and does so with a wait loop rather than watching dependent resources. Switching to the controller-runtime framework will give us better tools for waiting on dependent resources without active waits like this one. This is planned work, but it will likely happen under the umbrella of switching to controller-runtime and the subsequent follow-up work for the v1alpha2 implementation. I'm not sure this is necessarily a bug (more of an optimization, in my opinion), but it won't hurt to classify it as one for those who view it through the bug lens.
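To make the direction concrete, a rough sketch of the pattern controller-runtime enables is below: the reconciler is re-triggered whenever watched objects change, and returning an error lets the workqueue requeue the item with backoff instead of actively waiting. corev1.Secret stands in for the COSI Bucket CRD purely so the snippet is self-contained; the real wiring would use the COSI API types:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type bucketReconciler struct {
	client.Client
}

func (r *bucketReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Call the driver here. On a transient failure, return the error and let
	// the controller requeue the item with exponential backoff instead of
	// looping and logging the same message forever.
	return ctrl.Result{}, nil
}

func setupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Secret{}). // stand-in for the Bucket CRD
		Complete(&bucketReconciler{Client: mgr.GetClient()})
}
```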
What happened:
Logs are getting flooded because the sidecar retries failed operations endlessly; in effect, the failing command spams the log with the same message.
What you expected to happen:
Log messages should include a retry counter when errors occur during bucket creation/access/grant.
The same error message is repeated day and night if the failure is not fixed. The user should have control over how many times the system retries.
How to reproduce this bug (as minimally and precisely as possible):
If the issue persists for a couple of days, the repeated retries and log output eventually consume all of the system's memory and disk space.
The issue is:
The retry and log handling for the COSI APIs lives in the sidecar (https://github.com/kubernetes-sigs/container-object-storage-interface-provisioner-sidecar): when an error occurs, the sidecar goes into an endless retry loop and never stops retrying after some time. There should be a configurable retry counter for this.
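For the sake of discussion, here is a minimal sketch of the kind of bounded, counted retry being asked for. The --max-retries flag, the attempt counter in the log line, and the provisionBucket helper are hypothetical illustrations, not existing sidecar options:

```go
package main

import (
	"flag"
	"fmt"
	"time"

	"k8s.io/klog/v2"
)

var maxRetries = flag.Int("max-retries", 10, "maximum provisioning attempts before giving up (0 = unlimited)")

// provisionBucket is a placeholder for the real driver call.
func provisionBucket(name string) error {
	return fmt.Errorf("backend unavailable")
}

func provisionWithLimit(name string) error {
	delay := time.Second
	for attempt := 1; *maxRetries == 0 || attempt <= *maxRetries; attempt++ {
		err := provisionBucket(name)
		if err == nil {
			return nil
		}
		// The attempt counter makes repeated failures visible without flooding
		// the log with identical lines forever.
		klog.ErrorS(err, "bucket provisioning failed", "bucket", name, "attempt", attempt, "maxRetries", *maxRetries)
		time.Sleep(delay)
		if delay < 5*time.Minute {
			delay *= 2 // exponential backoff between attempts
		}
	}
	return fmt.Errorf("giving up on bucket %q after %d attempts", name, *maxRetries)
}

func main() {
	flag.Parse()
	if err := provisionWithLimit("my-bucket"); err != nil {
		klog.ErrorS(err, "provisioning did not succeed within the retry budget")
	}
}
```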