CI tests aren't using the e2e.test binary compiled from the latest kubernetes master codebase #892

Closed
mauriciopoppe opened this issue Jan 11, 2022 · 10 comments · Fixed by #899

@mauriciopoppe
Member

mauriciopoppe commented Jan 11, 2022

We're trying to fix a flaky VolumeSnapshot test. After we reordered some of the test statements in kubernetes/kubernetes#107173, we checked testgrid and saw that the test was still flaky (for example this run). Moreover, after analyzing the logs we saw that the test was logging its statements in a different order than the latest code in Kubernetes master. We found out the following:

  • the k8s-integration binary downloads the latest code of Kubernetes master and calls make quick-release (ref); in the logs we saw that e2e.test was generated.
  • kube-up is called with these binaries.
  • after the CSI driver is installed, it looks like another release of Kubernetes is downloaded; we suspect that the e2e.test binary used is taken from this release and not from the one that was just compiled.

Logs for the last point:

I0111 05:09:49.697] Running Tests
I0111 05:09:49.697] [kubetest2 gce --test=ginkgo --legacy-mode --repo-root=/tmp/gcp-pd-driver-tmp2860208485/kubernetes --artifacts=/workspace/_artifacts/sc-balanced -- --focus-regex=External.Storage --skip-regex=\[Disruptive\]|\[Serial\] --parallel=4 --test-args=--storage.testdriver=/go/src/sigs.k8s.io/gcp-compute-persistent-disk-csi-driver/test/k8s-integration/config/test-config.yaml ]
W0111 05:09:51.890] Copying gs://kubernetes-release/release/v1.24.0-alpha.1/kubernetes-test-linux-amd64.tar.gz...
W0111 05:09:51.892] / [0 files][    0.0 B/250.5 MiB]                                                
==> NOTE: You are downloading one or more large file(s), which would
W0111 05:09:51.892] run significantly faster if you enabled sliced object downloads. This
W0111 05:09:51.892] feature is enabled by default but requires that compiled crcmod be
W0111 05:09:51.892] installed (see "gsutil help crcmod").
W0111 05:09:51.893] 
W0111 05:09:55.154] -
- [0 files][111.9 MiB/250.5 MiB]
| [0 files][215.8 MiB/250.5 MiB]
| [1 files][250.5 MiB/250.5 MiB]
W0111 05:09:55.154] Operation completed over 1 objects/250.5 MiB.
W0111 05:10:01.477] Copying gs://kubernetes-release/release/v1.24.0-alpha.1/bin/linux/amd64/kubectl...
W0111 05:10:02.124] / [0 files][    0.0 B/ 44.4 MiB]
/ [1 files][ 44.4 MiB/ 44.4 MiB]
W0111 05:10:02.125] Operation completed over 1 objects/44.4 MiB.
W0111 05:10:02.337] I0111 05:10:02.336492   91477 ginkgo.go:90] Running ginkgo test as /workspace/_artifacts/sc-balanced/83ff6ea8-7298-11ec-8d10-025796222358/ginkgo [--nodes=4 /workspace/_artifacts/sc-balanced/83ff6ea8-7298-11ec-8d10-025796222358/e2e.test -- --kubeconfig=/root/.kube/config --kubectl-path=/workspace/_artifacts/sc-balanced/83ff6ea8-7298-11ec-8d10-025796222358/kubectl --ginkgo.flakeAttempts=1 --ginkgo.skip=\[Disruptive\]|\[Serial\] --ginkgo.focus=External.Storage --report-dir=/workspace/_artifacts/sc-balanced/83ff6ea8-7298-11ec-8d10-025796222358 --storage.testdriver=/go/src/sigs.k8s.io/gcp-compute-persistent-disk-csi-driver/test/k8s-integration/config/test-config.yaml]
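
One way to confirm the suspicion from the job log alone is to scan it for gsutil copies out of the kubernetes-release bucket. The following is a minimal sketch in Python (not part of k8s-integration, which is written in Go); the function name `downloaded_release_artifacts` is hypothetical, and the regular expression assumes the log line shape shown above.

```python
import re

# Matches gsutil lines of the form shown in the log above:
#   Copying gs://kubernetes-release/release/<version>/<object>...
_COPY_RE = re.compile(
    r"Copying gs://kubernetes-release/release/(?P<version>[^/]+)/(?P<object>\S+?)\.\.\."
)


def downloaded_release_artifacts(log_text):
    """Return (version, object) pairs for each release artifact the log shows being copied."""
    return [(m.group("version"), m.group("object")) for m in _COPY_RE.finditer(log_text)]


if __name__ == "__main__":
    sample = (
        "W0111 05:09:51.890] Copying gs://kubernetes-release/release/"
        "v1.24.0-alpha.1/kubernetes-test-linux-amd64.tar.gz...\n"
        "W0111 05:10:01.477] Copying gs://kubernetes-release/release/"
        "v1.24.0-alpha.1/bin/linux/amd64/kubectl...\n"
    )
    for version, obj in downloaded_release_artifacts(sample):
        print(version, obj)
```

If the list is non-empty for a job that is supposed to test a locally compiled master, the test tarball (and therefore probably e2e.test) came from a published release instead.
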
@mauriciopoppe
Member Author

mauriciopoppe commented Jan 13, 2022

Looks like I broke Windows with #893. The Windows Prow job does the following:

  • the Prow job is different: instead of calling the k8s-integration binary directly, it provisions a Kubernetes cluster with kubetest
  • after the cluster is provisioned, k8s-integration doesn't download a copy of Kubernetes (because --kube-version is not set in run-windows-k8s-integration.sh); however, --test-version=master is set
  • when the kubetest2 args are built, it sees that we're trying to test master, so it assumes that Kubernetes was downloaded and compiled; however, this didn't happen
  • when the program tries to copy the binaries, they don't exist at the expected location: when kubetest downloads the compiled binaries, it writes them to kubernetes/platforms/<os>/<arch> instead of the usual kubernetes/_output/dockerized/bin/<os>/<arch>
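
The last bullet suggests a lookup that tolerates both layouts. Below is a minimal Python sketch of that idea, not the actual k8s-integration code (which is Go); `find_binary` and `CANDIDATE_DIRS` are hypothetical names, and the two directory templates are the ones quoted above.

```python
from pathlib import Path

# Candidate layouts relative to the Kubernetes checkout, as described above.
CANDIDATE_DIRS = (
    "_output/dockerized/bin/{os}/{arch}",  # where a local build puts binaries
    "platforms/{os}/{arch}",               # where kubetest writes downloaded binaries
)


def find_binary(k8s_root, name, os_name="linux", arch="amd64"):
    """Return the first existing path to `name` under a known layout, or None."""
    for template in CANDIDATE_DIRS:
        candidate = Path(k8s_root) / template.format(os=os_name, arch=arch) / name
        if candidate.is_file():
            return candidate
    return None
```

Preferring the `_output` directory keeps freshly compiled binaries ahead of downloaded ones; whether that ordering is right for the Windows job is an assumption.
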

@mattcary
Contributor

Ah. We only recently updated k8s-integration to use kubetest2; we must have missed Windows. Would it be easier to update the Windows test to use kubetest2 so that we're consistent?

@mauriciopoppe
Member Author

/reopen

There's one startup error in the Windows nodes:

2022/01/25 20:03:33 GCEMetadataScripts: windows-startup-script-ps1: C:\flb-exporter\flb-exporter.exe corrupted, SHA256 @{Algorithm=SHA256; Hash=C808C9645D84B06B89932BD707D51A9D1D0B451B5A702A5F9B2B4462C8BE6502; Path=C:\flb-exporter\flb-exporter.exe} doesn't match expected f84bc732c9078421930cf9791c52066a56e692836e13857133d9927a35663a6b
2022/01/25 20:03:33 GCEMetadataScripts: windows-startup-script-ps1: Hash validation of https://storage.googleapis.com/gke-release/winnode/fluentbit-exporter/v0.17.0/flb-exporter-v0.17.0.exe failed. Will retry. Error: System.Management.Automation.RuntimeException: C:\flb-exporter\flb-exporter.exe corrupted, SHA256 @{Algorithm=SHA256; Hash=C808C9645D84B06B89932BD707D51A9D1D0B451B5A702A5F9B2B4462C8BE6502; Path=C:\flb-exporter\flb-exporter.exe} doesn't match expected f84bc732c9078421930cf9791c52066a56e692836e13857133d9927a35663a6b
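
For reference, the check that fails here is a plain SHA256 comparison between the downloaded file and an expected digest. A minimal Python sketch of such a check follows; it is not the PowerShell startup script itself, and `sha256_matches` is a hypothetical name. The comparison is case-insensitive because the log prints the computed hash in upper case and the expected one in lower case.

```python
import hashlib


def sha256_matches(path, expected_hex):
    """Return True if the SHA256 of the file at `path` equals `expected_hex` (case-insensitive)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large binaries are not loaded into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest().lower() == expected_hex.strip().lower()
```

In the log above the two digests differ in content, not just in case, so a check like this would also report a mismatch.
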

@k8s-ci-robot
Contributor

@mauriciopoppe: Reopened this issue.


@mauriciopoppe
Member Author

The startup error was fixed in kubernetes/kubernetes#107769

@mauriciopoppe
Member Author

The two Windows CI tests for ltsc2019 and 20H2 are now passing; the ones still pending are the migration tests.

@mauriciopoppe
Member Author

mauriciopoppe commented Apr 27, 2022

Current status:

W0422 19:14:30.779] F0422 19:14:30.779148    6551 main.go:195] Failed to run integration test: failed to prepull images: + readonly prepull_daemonset=prepull-test-containers
W0422 19:14:30.779] + prepull_daemonset=prepull-test-containers
W0422 19:14:30.783] + [[ -z /go/src/sigs.k8s.io/gcp-compute-persistent-disk-csi-driver/test/k8s-integration/prepull.yaml ]]
W0422 19:14:30.784] + kubectl create -f /go/src/sigs.k8s.io/gcp-compute-persistent-disk-csi-driver/test/k8s-integration/prepull.yaml
W0422 19:14:30.784] daemonset.apps/prepull-test-containers created
W0422 19:14:30.784] + wait_on_prepull
W0422 19:14:30.784] + retries=90
W0422 19:14:30.784] + [[ 90 -ge 0 ]]
W0422 19:14:30.785] ++ kubectl get daemonset prepull-test-containers -o 'jsonpath={.status.numberReady}'
W0422 19:14:30.785] + ready=0
W0422 19:14:30.785] ++ kubectl get daemonset prepull-test-containers -o 'jsonpath={.status.desiredNumberScheduled}'
W0422 19:14:30.785] + required=3
W0422 19:14:30.785] + [[ 0 -eq 3 ]]
W0422 19:14:30.785] + (( retries-- ))
W0422 19:14:30.785] + sleep 10s
W0422 19:14:30.785] + [[ 89 -ge 0 ]]
...
W0427 19:36:13.593] ++ kubectl get daemonset prepull-test-containers -o 'jsonpath={.status.numberReady}'
W0427 19:36:13.593] + ready=0
W0427 19:36:13.593] ++ kubectl get daemonset prepull-test-containers -o 'jsonpath={.status.desiredNumberScheduled}'
W0427 19:36:13.593] + required=3
W0427 19:36:13.593] + [[ 0 -eq 3 ]]
W0427 19:36:13.593] + (( retries-- ))
W0427 19:36:13.593] + sleep 10s
W0427 19:36:13.593] + [[ -1 -ge 0 ]]
W0427 19:36:13.593] + echo 'Timeout waiting for daemonset prepull-test-containers'
W0427 19:36:13.594] Timeout waiting for daemonset prepull-test-containers

  • Migration tests are failing with this error:

W0427 19:46:26.713] + readonly 'GCE_PD_TEST_FOCUS=PersistentVolumes\sGCEPD|[V|v]olume\sexpand|\[sig-storage\]\sIn-tree\sVolumes\s\[Driver:\swindows-gcepd\]|allowedTopologies|Pod\sDisks|PersistentVolumes\sDefault'
W0427 19:46:26.713] + GCE_PD_TEST_FOCUS='PersistentVolumes\sGCEPD|[V|v]olume\sexpand|\[sig-storage\]\sIn-tree\sVolumes\s\[Driver:\swindows-gcepd\]|allowedTopologies|Pod\sDisks|PersistentVolumes\sDefault'
W0427 19:46:26.714] + /go/src/sigs.k8s.io/gcp-compute-persistent-disk-csi-driver/bin/k8s-integration-test --run-in-prow=true --service-account-file=/etc/service-account/service-account.json --boskos-resource-type=gce-project --deployment-strategy=gce --gce-zone=us-central1-b --platform=windows --bringup-cluster=true --teardown-cluster=true --num-nodes=1 --migration-test=true --num-windows-nodes=3 --teardown-driver=true --do-driver-build=true --deploy-overlay-name=stable-master --test-version=master --kube-version=master --kube-feature-gates=CSIMigration=true,CSIMigrationGCE=true,ExpandCSIVolumes=true --storageclass-files=sc-windows.yaml --snapshotclass-files=pd-volumesnapshotclass.yaml '--test-focus=PersistentVolumes\sGCEPD|[V|v]olume\sexpand|\[sig-storage\]\sIn-tree\sVolumes\s\[Driver:\swindows-gcepd\]|allowedTopologies|Pod\sDisks|PersistentVolumes\sDefault' --use-kubetest2=false
W0427 19:46:26.714] F0427 19:46:26.702990    3647 utils.go:51] storage-class-file and migration-test cannot both be set
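
The fatal message says the two options are mutually exclusive, and the command line above sets both --storageclass-files=sc-windows.yaml and --migration-test=true. A minimal Python sketch of that kind of validation is shown below; the real check lives in utils.go and may look different, and `validate_flags` is a hypothetical name.

```python
def validate_flags(storageclass_files, migration_test):
    """Reject the flag combination reported in the log above.

    storageclass_files: value of --storageclass-files ("" when unset)
    migration_test:     value of --migration-test as a bool
    """
    if migration_test and storageclass_files:
        raise ValueError("storage-class-file and migration-test cannot both be set")


if __name__ == "__main__":
    try:
        # Same combination as the failing migration job.
        validate_flags("sc-windows.yaml", True)
    except ValueError as err:
        print(err)
```

Under this reading, the migration job would need to drop --storageclass-files (or the non-migration job keeps it and leaves --migration-test unset).
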

@mattcary
Contributor

I'm working on the timeout error at #966.

Since this is blocking, let me make a quick change to remove the snapshot disk image from testing.

@mauriciopoppe
Member Author

mauriciopoppe commented May 4, 2022

The timeout error in https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-gce-pd-csi-driver-latest-k8s-master-windows-2019/1521250789048193024 is different. The script https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/test/k8s-integration/prepull-image.sh attempts to deploy the manifest https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/test/k8s-integration/prepull.yaml, but it times out because the replicas are never ready.

The error in https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver/blob/master/test/k8s-integration/main.go#L385 is:

W0502 23:01:21.176] I0502 23:01:21.175671    6503 main.go:377] Prepulling test images.
W0502 23:16:46.401] I0502 23:16:46.399573    6503 utils.go:16] Bringing Down E2E Cluster on GCE

...

W0502 23:23:02.210] F0502 23:23:02.210329    6503 main.go:195] Failed to run integration test: failed to prepull images: + readonly prepull_daemonset=prepull-test-containers
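
The wait loop visible in the earlier shell trace (retries=90, sleep 10s, compare numberReady with desiredNumberScheduled) can be sketched as follows. This is a Python rendering for illustration only; the real logic is the bash in prepull-image.sh, and `get_counts` is a hypothetical callable standing in for the two `kubectl get daemonset ... -o jsonpath=...` calls.

```python
import time


def wait_on_prepull(get_counts, retries=90, interval_s=10, sleep=time.sleep):
    """Poll until the DaemonSet is fully ready.

    get_counts: hypothetical callable returning (ready, required), i.e.
                .status.numberReady and .status.desiredNumberScheduled.
    Returns True once ready == required, False after the retries run out.
    """
    while retries >= 0:
        ready, required = get_counts()
        if ready == required:
            return True
        retries -= 1
        sleep(interval_s)
    return False
```

With the defaults the loop polls 91 times and sleeps about 910 seconds in total, which is roughly consistent with the gap between the 23:01:21 and 23:16:46 lines above. If the replicas never become ready, raising the retry count would only delay the same failure.
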

@mauriciopoppe
Member Author

We finally have green runs for both ltsc2019 and 20H2. Thanks for the help, @mattcary!
