Fix Test Flakiness by adding short sleep in TestMetricsRefresh #824

Merged
merged 1 commit into kubernetes-sigs:main from flakes on May 13, 2025

Conversation

LukeAVanDrie (Contributor) commented May 12, 2025

This PR addresses flakiness observed in the TestMetricsRefresh test within the pkg/epp/backend/metrics package.

Problem

The root cause of the flakiness was a race condition stemming from the asynchronous nature of the StopRefreshLoop method. The test would:

  1. Signal the metrics refresh goroutine to stop via StopRefreshLoop().
  2. Immediately update the mock PodMetricsClient to return different metric values.
  3. Assert that the PodMetrics instance still held the original metrics.

Occasionally, the refresh goroutine would execute one final time after StopRefreshLoop() was called but before it fully terminated, picking up the new metrics. This led to the assertion failing.
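For illustration only, here is a minimal sketch of the racy shape described above; the identifiers (podMetrics, refreshLoop, done, fetch) and the loop structure are assumptions for this sketch, not the actual pkg/epp/backend/metrics code:

package metricssketch

import "time"

// Hypothetical, simplified stand-in for the real PodMetrics type; only the
// pieces needed to show the race are included.
type podMetrics struct {
    interval time.Duration
    done     chan struct{}
    fetch    func() // stands in for the real fetch against the PodMetricsClient
}

// refreshLoop polls for metrics until done is closed.
func (pm *podMetrics) refreshLoop() {
    ticker := time.NewTicker(pm.interval)
    defer ticker.Stop()
    for {
        select {
        case <-pm.done:
            return
        case <-ticker.C:
            // If a tick wins this select just as done is being closed, one
            // final fetch still runs after StopRefreshLoop() has returned;
            // that final fetch is the window the test was racing against.
            pm.fetch()
        }
    }
}

// StopRefreshLoop only signals shutdown; it does not wait for refreshLoop
// to actually exit.
func (pm *podMetrics) StopRefreshLoop() {
    close(pm.done)
}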

Solution

To resolve this, I added a time.Sleep(pmf.refreshMetricsInterval * 2 /* small buffer for robustness */) before any test assertions are made; with the test's refresh interval, this sleep is only 2ms. I also added a stopOnce guard around closing the done channel for robustness.
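As a hedged sketch of the stopOnce piece of the change (the type and field names here are illustrative assumptions; only the stopOnce guard and the done channel come from the description above):

package metricssketch

import "sync"

// Hypothetical, simplified stand-in for the real type.
type podMetricsWithOnce struct {
    done     chan struct{}
    stopOnce sync.Once
}

// StopRefreshLoop still only signals shutdown, but sync.Once makes closing
// the done channel idempotent and safe against repeated or concurrent calls,
// which would otherwise panic on a double close.
func (pm *podMetricsWithOnce) StopRefreshLoop() {
    pm.stopOnce.Do(func() { close(pm.done) })
}

On the test side, the added sleep simply gives any in-flight final refresh time to finish before the mock client is swapped out and the assertions run.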

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 12, 2025
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and robscott May 12, 2025 21:40
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 12, 2025
k8s-ci-robot (Contributor) commented:

Hi @LukeAVanDrie. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

netlify bot commented May 12, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 5e482aa
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/6823b93e719617000835a30c
😎 Deploy Preview: https://deploy-preview-824--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 12, 2025
kfswain (Collaborator) commented May 12, 2025

Fixes: #719

kfswain (Collaborator) commented May 12, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 12, 2025
nirrozenbaum (Contributor) commented May 13, 2025

I agree with @liu-cong's general point: making this function synchronous doesn't sound like the right solution to me. Thinking forward to the data layer and extensible scrapers, we might need to run multiple routines for different scrapers. I wouldn't touch the existing pod metrics code; this is an issue with the test itself.
@LukeAVanDrie, your analysis of the issue is great:

  1. Signal the metrics refresh goroutine to stop via StopRefreshLoop().
  2. Immediately update the mock PodMetricsClient to return different metric values.
  3. Assert that the PodMetrics instance still held the original metrics.

Occasionally, the refresh goroutine would execute one final time after StopRefreshLoop() was called but before it fully terminated, picking up the new metrics. This led to the assertion failing.

I think a very easy and straightforward fix to the above problem is to let the test wait until the final call is completed. Taking the scraping timeout as a worst case is sufficient, so I would fix the flakiness by adding this single line of code in the test file:

pm.StopRefreshLoop()
time.Sleep(fetchMetricsTimeout) // THIS IS THE ONLY ADDED LINE
pmc.SetRes(map[types.NamespacedName]*MetricsState{namespacedName: updated})

liu-cong (Contributor) commented May 13, 2025 via email

kfswain (Collaborator) commented May 13, 2025

time.Sleep(fetchMetricsTimeout) // THIS IS THE ONLY ADDED LINE

This is what I was going to suggest. It solves the test flake, and if we want to change how the metrics loop is handled, we can do that later. For the test, we could even add another second as a grace period, which would reduce the probability of a race condition essentially to zero, even though it will already be very low.
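For illustration, that grace period would be a small tweak to the snippet quoted above (a sketch only; pm, pmc, and fetchMetricsTimeout are the identifiers from that snippet):

pm.StopRefreshLoop()
time.Sleep(fetchMetricsTimeout + time.Second) // worst-case scrape plus a 1s grace period
pmc.SetRes(map[types.NamespacedName]*MetricsState{namespacedName: updated})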

LukeAVanDrie (Contributor, Author) commented May 13, 2025

I think a very easy and straightforward fix to the above problem is to let the test wait until the final call is completed. Taking the scraping timeout as a worst case is sufficient, so I would fix the flakiness by adding this single line of code in the test file:

pm.StopRefreshLoop()
time.Sleep(fetchMetricsTimeout) // THIS IS THE ONLY ADDED LINE
pmc.SetRes(map[types.NamespacedName]*MetricsState{namespacedName: updated})

I incorporated this change and removed the wait group. I did, however, leave the stopOnce for robustness, but I can also remove this if you feel it is unnecessary.

The TestMetricsRefresh test in pod_metrics_test.go was flaky due to a
race condition. The `StopRefreshLoop` method would signal the metrics
refresh goroutine to stop but did not wait for its actual termination.
If the test updated the mock metrics client immediately after calling
`StopRefreshLoop`, the refresh goroutine could, in rare cases, perform
a final metrics fetch with the new data before fully exiting. This
resulted in the test asserting against unexpected metric values.

This commit resolves the issue by adding a sleep for the metrics
refresh interval in TestMetricsRefresh. Additionally, it adds the
following to `StopRefreshLoop` for robustness:
- `stopOnce` is used to ensure the `done` channel is only closed once
  (for idempotency and protection against concurrent calls).

This change ensures that the refresh goroutine is guaranteed to have
stopped before any test assertions are made, eliminating the race
condition.
@LukeAVanDrie LukeAVanDrie changed the title Fix Test Flakiness by Making Pod Metrics StopRefreshLoop Synchronous Fix Test Flakiness by adding short sleep in TestMetricsRefresh May 13, 2025
liu-cong (Contributor) commented:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 13, 2025
kfswain (Collaborator) commented May 13, 2025

/approve

Thanks!

k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kfswain, LukeAVanDrie

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 13, 2025
@k8s-ci-robot k8s-ci-robot merged commit 8baf74c into kubernetes-sigs:main May 13, 2025
8 checks passed
@LukeAVanDrie LukeAVanDrie deleted the flakes branch May 13, 2025 22:55
@LukeAVanDrie
Copy link
Contributor Author

LukeAVanDrie commented May 13, 2025

Resolves #719

@kfswain I cannot seem to get this to work.

@kfswain
Copy link
Collaborator

kfswain commented May 13, 2025

Ah, yeah, no problem! Most of our slash commands are a signal for the k8s-ci-robot to take some sort of action. The fixes or resolves keyword is a GitHub concept, so no slash is needed: https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/using-keywords-in-issues-and-pull-requests#linking-a-pull-request-to-an-issue

Either way, I manually closed the issue. Thanks for doing this!

kaushikmitr pushed a commit to kaushikmitr/llm-instance-gateway that referenced this pull request May 15, 2025