
pull-npd-e2e-test failing ssh handshake #970


Open
wangzhen127 opened this issue Oct 9, 2024 · 14 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@wangzhen127
Member

https://testgrid.k8s.io/presubmits-node-problem-detector#pull-npd-e2e-test started failing recently.

[1] NPD should export Prometheus metrics. When OOM kills and docker hung happen 
[1]   NPD should update problem_counter and problem_gauge
[1]   /home/prow/go/src/k8s.io/node-problem-detector/test/e2e/metriconly/metrics_test.go:158
[2] error dialing [email protected]:22: 'ssh: handshake failed: read tcp 10.32.2.7:54804->35.184.209.153:22: read: connection reset by peer', retrying
[2] error dialing [email protected]:22: 'ssh: handshake failed: read tcp 10.32.2.7:52980->35.184.209.153:22: read: connection reset by peer', retrying
[2] error dialing [email protected]:22: 'ssh: handshake failed: read tcp 10.32.2.7:53002->35.184.209.153:22: read: connection reset by peer', retrying
[2] error dialing [email protected]:22: 'ssh: handshake failed: read tcp 10.32.2.7:44696->35.184.209.153:22: read: connection reset by peer', retrying
[2] Error storing debugging data to test artifacts: [Error running command: {prow 35.184.209.153 curl http://localhost:20257/metrics   0 error getting SSH client to [email protected]:22: 'ssh: handshake failed: read tcp 10.32.2.7:52990->35.184.209.153:22: read: connection reset by peer'}
[2]  Error running command: {prow 35.184.209.153 sudo journalctl -u node-problem-detector.service   0 error getting SSH client to [email protected]:22: 'ssh: handshake failed: read tcp 10.32.2.7:44688->35.184.209.153:22: read: connection reset by peer'}
[2]  Error running command: {prow 35.184.209.153 sudo journalctl -k   0 error getting SSH client to [email protected]:22: 'ssh: handshake failed: read tcp 10.32.2.7:44708->35.184.209.153:22: read: connection reset by peer'}
[2] ]

This is affecting several different PRs: #955, #961, #969.

@wangzhen127
Member Author

This looks like an infra issue. @BenTheElder Do you know who we should talk to?

CC @hakman

@BenTheElder
Member

It's a problem with the jobs. SIG K8S infra does not create your test VMs. The test is attempting to SSH to a disposable test VM created by your job.

Seems like the VM is not serving SSH, or something similar.

@wangzhen127
Member Author

CC @DigitalVeer

@BenTheElder
Member

If these are like node e2e tests, folks in SIG Node might be familiar.

SIG Testing strongly discourages SSH usage in cluster e2e tests, relying instead on hostexec pods when necessary, but for some node-style testing that's not sufficient, and mostly folks in SIG Node work with this.

@ameukam
Member

ameukam commented Oct 10, 2024

It's possible there is an issue with the GCP projects rented by this test. It's unclear to me why the SSH connection is not working, but I'll try to debug with @hakman.

@hakman
Member

hakman commented Oct 13, 2024

This is an issue with cos-stable-117. SSH works pretty well in all other tests (which are similar).
I tried to reproduce what happens with the ext4 test and found out that the command used in the test is:

echo "fake filesystem error from problem-maker" > /sys/fs/ext4/sda1/trigger_fs_error

Once this runs, the filesystem is remounted read-only and SSH stops working with Connection reset by peer:

[  169.101160] EXT4-fs error (device sda1): trigger_test_error:127: comm bash: fake filesystem error from problem-maker
[  169.108852] Aborting journal on device sda1-8.
[  169.115130] EXT4-fs (sda1): Remounting filesystem read-only

There may be some recent changes that affect the behaviour of trigger_fs_error.
https://lore.kernel.org/all/[email protected]/t/#u
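
For anyone else poking at a repro VM (from the serial console or an already-open session, since new SSH connections fail), a quick way to confirm the remount is to look for the "ro" option in /proc/mounts. A minimal Go sketch; the device name sda1 comes from the logs above and may differ on other VMs:

// Minimal sketch: report whether a device is mounted read-only, by scanning
// /proc/mounts. The device name ("sda1") is taken from the logs above.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func isMountedReadOnly(device string) (bool, error) {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		return false, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// /proc/mounts format: device mountpoint fstype options dump pass
		fields := strings.Fields(scanner.Text())
		if len(fields) < 4 || !strings.HasSuffix(fields[0], device) {
			continue
		}
		for _, opt := range strings.Split(fields[3], ",") {
			if opt == "ro" {
				return true, nil
			}
		}
	}
	return false, scanner.Err()
}

func main() {
	ro, err := isMountedReadOnly("sda1")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("read-only:", ro)
}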

@wangzhen127
Member Author

New updates:

Talked to the COS team and found the root cause: https://www.spinics.net/lists/linux-ext4/msg90066.html

The kernel commit changes EXT4_MF_FS_ABORTED to EXT4_FLAGS_SHUTDOWN when an fs error happens, so although the fs is remounted read-only, files can't be read by anyone and SSH connections fail.

This is an intentional change in the upstream kernel, so the COS team won't change it on their side. The path forward is updating the NPD test case for newer kernel versions (>= 6.5.0-rc3).
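
A rough sketch of what that version gate could look like; this is a hypothetical helper, not existing NPD test code, and it assumes the test can read the kernel release via uname -r:

// Hypothetical kernel-version gate for the ext4 trigger_fs_error case, not
// existing NPD test code. On kernels >= 6.5 the fs is shut down
// (EXT4_FLAGS_SHUTDOWN) and SSH to the VM stops working, so the old
// assertion path can't run.
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// kernelAtLeast reports whether `uname -r` is at least major.minor.
func kernelAtLeast(major, minor int) (bool, error) {
	out, err := exec.Command("uname", "-r").Output()
	if err != nil {
		return false, err
	}
	// e.g. "6.5.0-rc3" -> ["6", "5", "0-rc3"]
	parts := strings.SplitN(strings.TrimSpace(string(out)), ".", 3)
	if len(parts) < 2 {
		return false, fmt.Errorf("unexpected kernel release: %q", out)
	}
	maj, err := strconv.Atoi(parts[0])
	if err != nil {
		return false, err
	}
	min, err := strconv.Atoi(strings.SplitN(parts[1], "-", 2)[0])
	if err != nil {
		return false, err
	}
	return maj > major || (maj == major && min >= minor), nil
}

func main() {
	newKernel, err := kernelAtLeast(6, 5)
	if err != nil {
		fmt.Println("could not determine kernel version:", err)
		return
	}
	if newKernel {
		fmt.Println("kernel >= 6.5: trigger_fs_error shuts the fs down; skip or rework this case")
	}
}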

@hakman
Member

hakman commented Nov 7, 2024

The kernel commit changes EXT4_MF_FS_ABORTED to EXT4_FLAGS_SHUTDOWN when an fs error happens, so although the fs is remounted read-only, files can't be read by anyone and SSH connections fail.

@wangzhen127 I don't think SSH failing after this is an intended behaviour.

@wangzhen127
Member Author

Yeah, that is from the COS team's perspective: because the change is in the upstream kernel, there is not much they can do, so they recommend that we update our tests. Sorry for the confusion.

@hakman
Member

hakman commented Nov 7, 2024

No worries, I just meant that maybe they can configure the SSH server to not fail completely. I agree that the FS should become read-only, but not accepting SSH connections is quite unexpected.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 5, 2025
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 7, 2025
@hakman
Member

hakman commented Mar 8, 2025

/remove-lifecycle rotten
/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Mar 8, 2025
@DigitalVeer
Contributor

I'll take this up. Instead of relying on ProblemMaker for this test, is creating a temporary read-only filesystem and remounting a viable alternative here? If the node won't accept SSH connections after the FS becomes read-only, I'm not quite sure how to proceed with the test assertions.
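
Something like this is what I have in mind; a rough sketch assuming root on a disposable VM, with the paths and size made up for illustration:

// Rough sketch: build a small loopback ext4 image and flip it read-only, so
// the read-only condition lands on a scratch mount instead of the root fs
// (whose shutdown is what kills SSH). Assumes root on a disposable VM; the
// paths and size are illustrative.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func run(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

func main() {
	img, mnt := "/tmp/npd-ext4.img", "/mnt/npd-ext4"

	// 64 MiB image file, formatted as ext4.
	if err := run("dd", "if=/dev/zero", "of="+img, "bs=1M", "count=64"); err != nil {
		panic(err)
	}
	if err := run("mkfs.ext4", "-F", img); err != nil {
		panic(err)
	}
	if err := os.MkdirAll(mnt, 0o755); err != nil {
		panic(err)
	}
	// Mount through a loop device, then remount read-only.
	if err := run("mount", "-o", "loop", img, mnt); err != nil {
		panic(err)
	}
	if err := run("mount", "-o", "remount,ro", mnt); err != nil {
		panic(err)
	}
	fmt.Println("scratch fs is read-only at", mnt)
}

One open question with this approach: a plain remount alone won't emit the EXT4-fs error kernel log lines that trigger_fs_error does (see the dmesg output above), so the test assertions would need to change as well.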
