-
Notifications
You must be signed in to change notification settings - Fork 91
Fix Test Flakiness by adding short sleep in TestMetricsRefresh #824
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi @LukeAVanDrie. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
✅ Deploy Preview for gateway-api-inference-extension ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
Fixes: #719 |
/ok-to-test |
I agree with @liu-cong's general point, making this function synchronous doesn't sound to me like the right solution.
I think a very easy and straight forward fix to the above problem is to let the test wait until the final call is completed. pm.StopRefreshLoop()
time.Sleep(fetchMetricsTimeout) // THIS IS THE ONLY ADDED LINE
pmc.SetRes(map[types.NamespacedName]*MetricsState{namespacedName: updated}) |
Nir's suggestion makes sense. Though in theory we could still run into a
flakiness due to the async nature, adding that timeout should reduce
flakiness to near zero in practice I believe.
…On Tue, May 13, 2025 at 12:34 AM Nir Rozenbaum ***@***.***> wrote:
*nirrozenbaum* left a comment
(kubernetes-sigs/gateway-api-inference-extension#824)
<#824 (comment)>
I agree with @liu-cong <https://github.com/liu-cong>'s general point,
making this function synchronous doesn't sound to me like the right
solution.
thinking forward about data layer and extensible scrapers, we might need
to run multiple routines for different scrapers.
I wouldn't touch the existing pod metrics code. this is an issue with the
test itself.
@LukeAVanDrie <https://github.com/LukeAVanDrie> your analysis of the
issue is great:
1. Signal the metrics refresh goroutine to stop via StopRefreshLoop().
2. Immediately update the mock PodMetricsClient to return different
metric values.
3. Assert that the PodMetrics instance still held the original metrics.
Occasionally, the refresh goroutine would execute one final time after
StopRefreshLoop() was called but before it fully terminated, picking up the
new metrics. This led to the assertion failing.
I think a very easy and straight forward fix to the above problem is to
let the test wait until the final call is completed.
taking the scraping timeout as a worst case is sufficient.
so I would fix the bug by adding this single code line in the test file:
pm.StopRefreshLoop()time.Sleep(fetchMetricsTimeout) // THIS IS THE ONLY ADDED LINEpmc.SetRes(map[types.NamespacedName]*MetricsState{namespacedName: updated})
—
Reply to this email directly, view it on GitHub
<#824 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABUVECT6YT2YXAXQKOPUYEL26GOA5AVCNFSM6AAAAAB467IC3WVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDQNZVGM3TOMZSGQ>
.
You are receiving this because you were mentioned.Message ID:
<kubernetes-sigs/gateway-api-inference-extension/pull/824/c2875377324@
github.com>
|
this is what i was going to suggest. This solves the problem of the test flake and if we want to change how the metrics loop is handled we can do that later. For the test: we can even add another second as a grace period which would basically reduce the probability of race conditions to zero, even though it will already be very very low |
I incorporated this change and removed the wait group. I did, however, leave the |
The TestMetricsRefresh test in pod_metrics_test.go was flaky due to a race condition. The `StopRefreshLoop` method would signal the metrics refresh goroutine to stop but did not wait for its actual termination. If the test updated the mock metrics client immediately after calling `StopRefreshLoop`, the refresh goroutine could, in rare cases, perform a final metrics fetch with the new data before fully exiting. This resulted in the test asserting against unexpected metric values. This commit resolves the issue by making adding a sleep for the metrics refresh interval in TestMetricsRefresh. Additionally, it adds the following for robustness in `StopRefreshLoop`. - `stopOnce` is used to ensure the `done` channel is only closed once (for idempotency and protection against concurrent calls). This change ensures that the refresh goroutine is guaranteed to have stopped before any test assertions are made, eliminating the race condition.
/lgtm |
/approve Thanks! |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: kfswain, LukeAVanDrie The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Ah, yeah no problem! Most of our slash commands are a signal for the Either way, I manually closed the issue, thanks for doin this |
The TestMetricsRefresh test in pod_metrics_test.go was flaky due to a race condition. The `StopRefreshLoop` method would signal the metrics refresh goroutine to stop but did not wait for its actual termination. If the test updated the mock metrics client immediately after calling `StopRefreshLoop`, the refresh goroutine could, in rare cases, perform a final metrics fetch with the new data before fully exiting. This resulted in the test asserting against unexpected metric values. This commit resolves the issue by making adding a sleep for the metrics refresh interval in TestMetricsRefresh. Additionally, it adds the following for robustness in `StopRefreshLoop`. - `stopOnce` is used to ensure the `done` channel is only closed once (for idempotency and protection against concurrent calls). This change ensures that the refresh goroutine is guaranteed to have stopped before any test assertions are made, eliminating the race condition.
This PR addresses flakiness observed in the TestMetricsRefresh test within the pkg/epp/backend/metrics package.
Problem
The root cause of the flakiness was a race condition stemming from the asynchronous nature of the
StopRefreshLoop
method. The test would:StopRefreshLoop()
.PodMetricsClient
to return different metric values.PodMetrics
instance still held the original metrics.Occasionally, the refresh goroutine would execute one final time after
StopRefreshLoop()
was called but before it fully terminated, picking up the new metrics. This led to the assertion failing.Solution
To resolve this, I added a
time.Sleep(pmf.refreshMetricsInterval * 2 /* small buffer for robustness */)
before any test assertions are made. This causes the test to sleep for just 2ms. I also added astopOnce
around closing thedone
channel for robustness.