Skip to content

Cache and print devices for debugging future outages #2097

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

julianKatz
Copy link
Contributor

@julianKatz julianKatz commented May 21, 2025

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug

/kind cleanup # Is this right??

/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

This PR adds a cache that periodically (configured to every minute currently) looks at the /dev/disk/by-id/ directory and evaluates the symlinks there. It maintains a cache of the symlink and the real path it points to.

This will help with debugging future filesystem issues. In a past OMG, we found that our insight into changes in symlinks for specific disks hampered our ability to debug. Logging marked the real path of the disk at mount and unmount, but the change in between couldn't be detected.

This PR will print those links every minute, also logging when elements of the cache change.

An example:

I0521 00:23:51.418261      12 cache.go:62] periodic symlink cache read: /dev/disk/by-id/google-persistent-disk-0 -> /dev/sda; /dev/disk/by-id/google-pvc-f5418f78-dc07-4d69-9487-6c4a7232dd67 -> /dev/sdb; /dev/disk/by-id/scsi-0Google_PersistentDisk_persistent-disk-0 -> /dev/sda; /dev/disk/by-id/scsi-0Google_PersistentDisk_pvc-f5418f78-dc07-4d69-9487-6c4a7232dd67 -> /dev/sdb

The cache will also note if a symlink is broken.

NOTE: Currently this filters out any thing in by-id/ that ends with -part[0-9]*$. This removes partitions, which are noise. Mounting partitions directly isn't well supported in GKE, but we may want to test that in the future.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

None

@k8s-ci-robot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 21, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: julianKatz
Once this PR has been reviewed and has the lgtm label, please assign saikat-royc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 21, 2025
@k8s-ci-robot k8s-ci-robot requested a review from tonyzhc May 21, 2025 00:29
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 21, 2025
@julianKatz julianKatz force-pushed the logs-for-device-mappings branch from 3a8aba1 to 8f34a1f Compare May 21, 2025 18:01
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 21, 2025
@julianKatz
Copy link
Contributor Author

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label May 21, 2025
@julianKatz julianKatz marked this pull request as ready for review May 21, 2025 20:49
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 21, 2025
@julianKatz julianKatz changed the title Cache and print devices Cache and print devices for debugging future outages May 21, 2025
@julianKatz julianKatz force-pushed the logs-for-device-mappings branch from d4bde0a to e38ef72 Compare May 21, 2025 22:33
@tonyzhc
Copy link
Contributor

tonyzhc commented May 21, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. labels May 21, 2025
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@julianKatz julianKatz force-pushed the logs-for-device-mappings branch from 5330d24 to ae96677 Compare May 28, 2025 17:40
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 30, 2025
changes and printing the full list.

example: periodic symlink cache read:
/dev/disk/by-id/google-persistent-disk-0 -> /dev/sda;
/dev/disk/by-id/google-pvc-f5418f78-dc07-4d69-9487-6c4a7232dd67 -> /dev/sdb;
/dev/disk/by-id/scsi-0Google_PersistentDisk_persistent-disk-0 -> /dev/sda;
/dev/disk/by-id/scsi-0Google_PersistentDisk_pvc-f5418f78-dc07-4d69-9487-6c4a7232dd67 -> /dev/sdb
@julianKatz julianKatz force-pushed the logs-for-device-mappings branch from 0644523 to e58b7e5 Compare June 6, 2025 21:39
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 6, 2025
@k8s-ci-robot
Copy link
Contributor

@julianKatz: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-gcp-compute-persistent-disk-csi-driver-e2e 701d3f9 link true /test pull-gcp-compute-persistent-disk-csi-driver-e2e
pull-gcp-compute-persistent-disk-csi-driver-e2e-windows-2022 701d3f9 link false /test pull-gcp-compute-persistent-disk-csi-driver-e2e-windows-2022
pull-gcp-compute-persistent-disk-csi-driver-e2e-windows-2019 701d3f9 link false /test pull-gcp-compute-persistent-disk-csi-driver-e2e-windows-2019

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 13, 2025
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants