Fix Test Flakiness by adding short sleep in TestMetricsRefresh #824

Merged
merged 1 commit into kubernetes-sigs:main from flakes on May 13, 2025

Conversation

LukeAVanDrie (Contributor) commented May 12, 2025

This PR addresses flakiness observed in the TestMetricsRefresh test within the pkg/epp/backend/metrics package.

Problem

The root cause of the flakiness was a race condition stemming from the asynchronous nature of the StopRefreshLoop method. The test would:

  1. Signal the metrics refresh goroutine to stop via StopRefreshLoop().
  2. Immediately update the mock PodMetricsClient to return different metric values.
  3. Assert that the PodMetrics instance still held the original metrics.

Occasionally, the refresh goroutine would execute one final time after StopRefreshLoop() was called but before it fully terminated, picking up the new metrics. This led to the assertion failing.
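For illustration only, here is a minimal sketch of the racy shape described above; the identifiers (podMetrics, refreshLoop, done, fetch) and the loop structure are assumptions for this sketch, not the actual pkg/epp/backend/metrics code:

package metricssketch

import "time"

// Hypothetical, simplified stand-in for the real PodMetrics type; only the
// pieces needed to show the race are included.
type podMetrics struct {
    interval time.Duration
    done     chan struct{}
    fetch    func() // stands in for the real fetch against the PodMetricsClient
}

// refreshLoop polls for metrics until done is closed.
func (pm *podMetrics) refreshLoop() {
    ticker := time.NewTicker(pm.interval)
    defer ticker.Stop()
    for {
        select {
        case <-pm.done:
            return
        case <-ticker.C:
            // If a tick wins this select just as done is being closed, one
            // final fetch still runs after StopRefreshLoop() has returned;
            // that final fetch is the window the test was racing against.
            pm.fetch()
        }
    }
}

// StopRefreshLoop only signals shutdown; it does not wait for refreshLoop
// to actually exit.
func (pm *podMetrics) StopRefreshLoop() {
    close(pm.done)
}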

Solution

To resolve this, I added a time.Sleep(pmf.refreshMetricsInterval * 2 /* small buffer for robustness */) before any test assertions are made; with the test's refresh interval, this sleep is only 2ms. I also added a stopOnce guard around closing the done channel for robustness.
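As a hedged sketch of the stopOnce piece of the change (the type and field names here are illustrative assumptions; only the stopOnce guard and the done channel come from the description above):

package metricssketch

import "sync"

// Hypothetical, simplified stand-in for the real type.
type podMetricsWithOnce struct {
    done     chan struct{}
    stopOnce sync.Once
}

// StopRefreshLoop still only signals shutdown, but sync.Once makes closing
// the done channel idempotent and safe against repeated or concurrent calls,
// which would otherwise panic on a double close.
func (pm *podMetricsWithOnce) StopRefreshLoop() {
    pm.stopOnce.Do(func() { close(pm.done) })
}

On the test side, the added sleep simply gives any in-flight final refresh time to finish before the mock client is swapped out and the assertions run.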

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 12, 2025
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and robscott May 12, 2025 21:40
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 12, 2025
k8s-ci-robot (Contributor) commented:

Hi @LukeAVanDrie. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

netlify bot commented May 12, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 5e482aa
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/6823b93e719617000835a30c
😎 Deploy Preview: https://deploy-preview-824--gateway-api-inference-extension.netlify.app

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 12, 2025
kfswain (Collaborator) commented May 12, 2025

Fixes: #719

kfswain (Collaborator) commented May 12, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 12, 2025
nirrozenbaum (Contributor) commented May 13, 2025

I agree with @liu-cong's general point: making this function synchronous doesn't sound like the right solution to me. Thinking forward to the data layer and extensible scrapers, we might need to run multiple routines for different scrapers. I wouldn't touch the existing pod metrics code; this is an issue with the test itself.
@LukeAVanDrie, your analysis of the issue is great:

  1. Signal the metrics refresh goroutine to stop via StopRefreshLoop().
  2. Immediately update the mock PodMetricsClient to return different metric values.
  3. Assert that the PodMetrics instance still held the original metrics.

Occasionally, the refresh goroutine would execute one final time after StopRefreshLoop() was called but before it fully terminated, picking up the new metrics. This led to the assertion failing.

I think a very easy and straightforward fix to the above problem is to let the test wait until the final call is completed. Taking the scraping timeout as a worst case is sufficient, so I would fix the flakiness by adding this single line of code in the test file:

pm.StopRefreshLoop()
time.Sleep(fetchMetricsTimeout) // THIS IS THE ONLY ADDED LINE
pmc.SetRes(map[types.NamespacedName]*MetricsState{namespacedName: updated})

liu-cong (Contributor) commented May 13, 2025 via email

kfswain (Collaborator) commented May 13, 2025

time.Sleep(fetchMetricsTimeout) // THIS IS THE ONLY ADDED LINE

This is what I was going to suggest. It solves the test flake, and if we want to change how the metrics loop is handled, we can do that later. For the test, we could even add another second as a grace period, which would reduce the probability of a race condition essentially to zero, even though it will already be very low.
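For illustration, that grace period would be a small tweak to the snippet quoted above (a sketch only; pm, pmc, and fetchMetricsTimeout are the identifiers from that snippet):

pm.StopRefreshLoop()
time.Sleep(fetchMetricsTimeout + time.Second) // worst-case scrape plus a 1s grace period
pmc.SetRes(map[types.NamespacedName]*MetricsState{namespacedName: updated})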

LukeAVanDrie (Contributor, Author) commented May 13, 2025

I think a very easy and straightforward fix to the above problem is to let the test wait until the final call is completed. Taking the scraping timeout as a worst case is sufficient, so I would fix the flakiness by adding this single line of code in the test file:

pm.StopRefreshLoop()
time.Sleep(fetchMetricsTimeout) // THIS IS THE ONLY ADDED LINE
pmc.SetRes(map[types.NamespacedName]*MetricsState{namespacedName: updated})

I incorporated this change and removed the wait group. I did, however, leave the stopOnce for robustness, but I can also remove this if you feel it is unnecessary.

The TestMetricsRefresh test in pod_metrics_test.go was flaky due to a
race condition. The `StopRefreshLoop` method would signal the metrics
refresh goroutine to stop but did not wait for its actual termination.
If the test updated the mock metrics client immediately after calling
`StopRefreshLoop`, the refresh goroutine could, in rare cases, perform
a final metrics fetch with the new data before fully exiting. This
resulted in the test asserting against unexpected metric values.

This commit resolves the issue by adding a sleep for the metrics
refresh interval in TestMetricsRefresh. Additionally, it adds the
following to `StopRefreshLoop` for robustness:
- `stopOnce` is used to ensure the `done` channel is only closed once
  (for idempotency and protection against concurrent calls).

This change ensures that the refresh goroutine is guaranteed to have
stopped before any test assertions are made, eliminating the race
condition.
@LukeAVanDrie LukeAVanDrie changed the title Fix Test Flakiness by Making Pod Metrics StopRefreshLoop Synchronous Fix Test Flakiness by adding short sleep in TestMetricsRefresh May 13, 2025
liu-cong (Contributor) commented:

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 13, 2025
kfswain (Collaborator) commented May 13, 2025

/approve

Thanks!

k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kfswain, LukeAVanDrie

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 13, 2025
@k8s-ci-robot k8s-ci-robot merged commit 8baf74c into kubernetes-sigs:main May 13, 2025
8 checks passed
@LukeAVanDrie LukeAVanDrie deleted the flakes branch May 13, 2025 22:55
@LukeAVanDrie
Copy link
Contributor Author

LukeAVanDrie commented May 13, 2025

Resolves #719

@kfswain I cannot seem to get this to work.

@kfswain
Copy link
Collaborator

kfswain commented May 13, 2025

Ah, yeah, no problem! Most of our slash commands are a signal for the k8s-ci-robot to take some sort of action. The fixes or resolves keyword is a GitHub concept, so no slash is needed: https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/using-keywords-in-issues-and-pull-requests#linking-a-pull-request-to-an-issue

Either way, I manually closed the issue. Thanks for doing this!

kaushikmitr pushed a commit to kaushikmitr/llm-instance-gateway that referenced this pull request May 15, 2025