
dial timeouts causing test flakes #789


Closed
mattcary opened this issue Jun 10, 2021 · 6 comments
Assignees
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@saikat-royc

Starting with one of the basic tests.
Log:
https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/sigs.k8s.io_gcp-compute-persistent-disk-csi-driver/788/pull-gcp-compute-persistent-disk-csi-driver-kubernetes-integration/1403383839518101504/

test: Kubernetes e2e suite: External Storage [Driver: csi-gcepd-sc-standard] [Testpattern: Dynamic PV (ext3)] volumes should store data

Persistent Volume: pvc-bfb47f28-5884-427a-b1a9-095692b9fc74
on node: csi-gce-pd-node-94p2r-gce-pd-driver

First writer pod on node-94p2r (node stage/publish succeeded):

I0611 16:44:43.516018       1 utils.go:67] /csi.v1.Node/NodeStageVolume called with request: volume_id:"projects/gce-up-c1-3-glat-up-clu-n/zones/us-central1-b/disks/pvc-bfb47f28-5884-427a-b1a9-095692b9fc74" staging_target_path:"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-bfb47f28-5884-427a-b1a9-095692b9fc74/globalmount" volume_capability:<mount:<fs_type:"ext3" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1623429450560-8081-pd.csi.storage.gke.io" > 
I0611 16:44:44.018461       1 node.go:287] Successfully found attached GCE PD "pvc-bfb47f28-5884-427a-b1a9-095692b9fc74" at device path /dev/disk/by-id/google-pvc-bfb47f28-5884-427a-b1a9-095692b9fc74.
I0611 16:44:44.018602       1 node.go:74] NodePublishVolume check volume path /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-bfb47f28-5884-427a-b1a9-095692b9fc74/globalmount is mounted false: error <nil>
I0611 16:44:44.029781       1 mount_linux.go:390] Disk "/dev/disk/by-id/google-pvc-bfb47f28-5884-427a-b1a9-095692b9fc74" appears to be unformatted, attempting to format as type: "ext3" with options: [-F -m0 /dev/disk/by-id/google-pvc-bfb47f28-5884-427a-b1a9-095692b9fc74]
I0611 16:44:44.473564       1 mount_linux.go:400] Disk successfully formatted (mkfs): ext3 - /dev/disk/by-id/google-pvc-bfb47f28-5884-427a-b1a9-095692b9fc74 /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-bfb47f28-5884-427a-b1a9-095692b9fc74/globalmount
I0611 16:44:44.531676       1 node.go:321] NodeStageVolume succeeded on projects/gce-up-c1-3-glat-up-clu-n/zones/us-central1-b/disks/pvc-bfb47f28-5884-427a-b1a9-095692b9fc74 to /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-bfb47f28-5884-427a-b1a9-095692b9fc74/globalmount
I0611 16:44:44.531704       1 utils.go:72] /csi.v1.Node/NodeStageVolume returned with response: 
I0611 16:44:44.543615       1 utils.go:67] /csi.v1.Node/NodeGetCapabilities called with request: 
I0611 16:44:44.543676       1 utils.go:72] /csi.v1.Node/NodeGetCapabilities returned with response: capabilities:<rpc:<type:STAGE_UNSTAGE_VOLUME > > capabilities:<rpc:<type:EXPAND_VOLUME > > capabilities:<rpc:<type:GET_VOLUME_STATS > > 
I0611 16:44:44.554693       1 utils.go:67] /csi.v1.Node/NodePublishVolume called with request: volume_id:"projects/gce-up-c1-3-glat-up-clu-n/zones/us-central1-b/disks/pvc-bfb47f28-5884-427a-b1a9-095692b9fc74" staging_target_path:"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-bfb47f28-5884-427a-b1a9-095692b9fc74/globalmount" target_path:"/var/lib/kubelet/pods/ebc9bc9b-125d-47a3-a1ea-779a6f96e920/volumes/kubernetes.io~csi/pvc-bfb47f28-5884-427a-b1a9-095692b9fc74/mount" volume_capability:<mount:<fs_type:"ext3" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1623429450560-8081-pd.csi.storage.gke.io" > 

Second validator pod (node stage/publish succeeded):

I0611 16:45:31.512322       1 utils.go:67] /csi.v1.Node/NodeStageVolume called with request: volume_id:"projects/gce-up-c1-3-glat-up-clu-n/zones/us-central1-b/disks/pvc-bfb47f28-5884-427a-b1a9-095692b9fc74" staging_target_path:"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-bfb47f28-5884-427a-b1a9-095692b9fc74/globalmount" volume_capability:<mount:<fs_type:"ext3" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1623429450560-8081-pd.csi.storage.gke.io" > 
I0611 16:45:32.014601       1 node.go:287] Successfully found attached GCE PD "pvc-bfb47f28-5884-427a-b1a9-095692b9fc74" at device path /dev/disk/by-id/google-pvc-bfb47f28-5884-427a-b1a9-095692b9fc74.
I0611 16:45:32.014660       1 node.go:74] NodePublishVolume check volume path /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-bfb47f28-5884-427a-b1a9-095692b9fc74/globalmount is mounted false: error <nil>
I0611 16:45:32.289570       1 node.go:321] NodeStageVolume succeeded on projects/gce-up-c1-3-glat-up-clu-n/zones/us-central1-b/disks/pvc-bfb47f28-5884-427a-b1a9-095692b9fc74 to /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-bfb47f28-5884-427a-b1a9-095692b9fc74/globalmount
I0611 16:45:32.289608       1 utils.go:72] /csi.v1.Node/NodeStageVolume returned with response: 
I0611 16:45:32.300365       1 utils.go:67] /csi.v1.Node/NodeGetCapabilities called with request: 
I0611 16:45:32.300445       1 utils.go:72] /csi.v1.Node/NodeGetCapabilities returned with response: capabilities:<rpc:<type:STAGE_UNSTAGE_VOLUME > > capabilities:<rpc:<type:EXPAND_VOLUME > > capabilities:<rpc:<type:GET_VOLUME_STATS > > 
I0611 16:45:32.306553       1 utils.go:67] /csi.v1.Node/NodePublishVolume called with request: volume_id:"projects/gce-up-c1-3-glat-up-clu-n/zones/us-central1-b/disks/pvc-bfb47f28-5884-427a-b1a9-095692b9fc74" staging_target_path:"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-bfb47f28-5884-427a-b1a9-095692b9fc74/globalmount" target_path:"/var/lib/kubelet/pods/2e19bbb8-1e74-4d7e-802e-4aaaf64c0a9e/volumes/kubernetes.io~csi/pvc-bfb47f28-5884-427a-b1a9-095692b9fc74/mount" volume_capability:<mount:<fs_type:"ext3" > access_mode:<mode:SINGLE_NODE_WRITER > > volume_context:<key:"storage.kubernetes.io/csiProvisionerIdentity" value:"1623429450560-8081-pd.csi.storage.gke.io" > 

Right after that, the kubectl exec reported a dial timeout:

Jun 11 16:45:35.790: INFO: ExecWithOptions {Command:[/bin/sh -c test -d /opt/0] Namespace:volume-9719 PodName:external-client ContainerName:external-client Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:false Quiet:false}
Jun 11 16:45:35.790: INFO: >>> kubeConfig: /root/.kube/config
Jun 11 16:46:05.808: FAIL: "test -d /opt/0" should succeed, but failed with error message "error dialing backend: dial timeout"
stdout: 
stderr: 
Unexpected error:
    <*errors.StatusError | 0xc000c36640>: {
        ErrStatus: {
            TypeMeta: {Kind: "", APIVersion: ""},
            ListMeta: {
                SelfLink: "",
                ResourceVersion: "",
                Continue: "",
                RemainingItemCount: nil,
            },
            Status: "Failure",
            Message: "error dialing backend: dial timeout",
            Reason: "",
            Details: nil,
            Code: 500,
        },
    }
    error dialing backend: dial timeout
occurred

The persistent volume workflows are working as expected, so this looks like a real dial timeout.
The suggestion from the GKE master team is to use verbosity level -v=6 on the kubectl exec and also to investigate the konnectivity server.

It could be one of two things:
1. There is sporadically heightened latency in konnectivity's setup (the konnectivity server and clients have not yet established connections when the test runs, so there are no "backends" to connect to from the konnectivity server's point of view).
2. The konnectivity server/client connection isn't robust.

Next steps are to try the test in my local setup and dump the konnectivity server logs.
FYI @mattcary
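As a concrete starting point, the higher-verbosity exec can be scripted like this (a sketch only: the namespace and pod name are taken from the failing e2e log above, and the `KUBECTL` override hook is something I added for dry runs, not part of the test suite):

```shell
# run_exec_v6 re-runs the failing "test -d /opt/0" check with kubectl client
# verbosity 6, so each API-server round trip (including the proxied
# konnectivity dial) is printed to stderr.
# ${KUBECTL:-kubectl} allows substituting e.g. `echo` for a dry run.
run_exec_v6() {
  ns=volume-9719        # namespace from the failing e2e log above
  pod=external-client   # client pod from the same log
  "${KUBECTL:-kubectl}" -v=6 exec -n "$ns" "$pod" -- /bin/sh -c 'test -d /opt/0'
}
```

With -v=6, kubectl logs each HTTP request URL and response code with latency, which should show whether the 500 comes back immediately or only after the ~30s window seen in the failure above.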

saikat-royc commented Jun 14, 2021

Running the tests locally, I do see a lot of failures in the konnectivity server.

The pdcsi tests were running with ginkgo parallelism of 10.

First I tried the kubectl logs command, which itself timed out:

saikatroyc@saikatroyc:~$ k logs konnectivity-server-e2e-test-saikatroyc-master  -n kube-system
Unable to connect to the server: dial tcp 34.123.40.157:443: connect: connection timed out

Then I logged into the node and checked the logs (sudo cat /var/log/pods/kube-system_konnectivity-server-e2e-test-saikatroyc-master_4e890bd5a46419844f64db6c409b0337/konnectivity-server-container/0.log). The logs reported many of the following errors:

2021-06-14T20:38:20.445052833Z stderr F E0614 20:38:20.444895       1 server.go:355] "Stream read from frontend failure" err="rpc error: code = Canceled desc = context canceled"
2021-06-14T20:38:20.445158169Z stderr F E0614 20:38:20.445044       1 server.go:711] "could not get frontend client" err="can't find connID 2 in the frontends[2ba47144-02bc-420a-b114-62f08c3593f8]" connectionID=2
2021-06-14T20:38:20.447858494Z stderr F E0614 20:38:20.447546       1 server.go:711] "could not get frontend client" err="can't find agentID d86c3bc3-cf02-4501-b9bb-b42d0f6aebff in the frontends" connectionID=1
2021-06-14T20:38:20.469056106Z stderr F E0614 20:38:20.468419       1 server.go:355] "Stream read from frontend failure" err="rpc error: code = Canceled desc = context canceled"
2021-06-14T20:38:20.473663156Z stderr F E0614 20:38:20.473491       1 server.go:711] "could not get frontend client" err="can't find connID 1 in the frontends[01aca44a-53f6-4816-abf3-6b1bf897b493]" connectionID=1
2021-06-14T20:38:20.475114051Z stderr F E0614 20:38:20.474933       1 server.go:355] "Stream read from frontend failure" err="rpc error: code = Canceled desc = context canceled"
2021-06-14T20:38:20.47585891Z stderr F E0614 20:38:20.475763       1 server.go:711] "could not get frontend client" err="can't find agentID 01aca44a-53f6-4816-abf3-6b1bf897b493 in the frontends" connectionID=2
2021-06-14T20:38:39.373851054Z stderr F E0614 20:38:39.373557       1 server.go:672] "DIAL_RSP contains failure" err="dial tcp 10.64.3.4:443: connect: connection refused"
2021-06-14T20:38:39.374044045Z stderr F E0614 20:38:39.373946       1 server.go:672] "DIAL_RSP contains failure" err="dial tcp 10.64.3.4:443: connect: connection refused"
2021-06-14T20:38:39.374364357Z stderr F E0614 20:38:39.374281       1 server.go:355] "Stream read from frontend failure" err="rpc error: code = Canceled desc = context canceled"
2021-06-14T20:38:39.375002577Z stderr F E0614 20:38:39.374913       1 server.go:672] "DIAL_RSP contains failure" err="dial tcp 10.64.3.4:443: connect: connection refused"
2021-06-14T20:38:39.375213934Z stderr F E0614 20:38:39.375136       1 server.go:672] "DIAL_RSP contains failure" err="dial tcp 10.64.3.4:443: connect: connection refused"
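For triage, the repeated error kinds in a dump like the one above can be bucketed with a small filter (a sketch; the quoted-message pattern is an assumption based on the structured log lines shown here):

```shell
# count_errors reads konnectivity-server log text on stdin and prints a
# ranked count per structured error message (the quoted string after the
# source-line reference, e.g. "DIAL_RSP contains failure").
count_errors() {
  grep -o '"[A-Za-z_ ]*"' |   # extract the quoted message field
    sort | uniq -c | sort -rn # count occurrences, most frequent first
}
```

Run it against the node-side log file, e.g. `count_errors < 0.log`, to see whether "DIAL_RSP contains failure" or the frontend-lookup errors dominate.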

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 12, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 12, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot

@k8s-triage-robot: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
