
Update gce operation timeout #662

Closed
jingxu97 wants to merge 1 commit from the nov/timeout branch

Conversation

@jingxu97 (Contributor) commented Nov 9, 2020

Update the GCE operation timeout to 2 minutes instead of 5 minutes. 90% of attach/detach disk operations take less than 15 seconds, and create volume takes less than 1 minute.

Reduce the volume operation timeout so that the driver can recover from a previous operation more quickly in case that operation is dropped.

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Update gce volume operation timeout.

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 9, 2020
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jingxu97

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 9, 2020
@@ -610,6 +610,8 @@ func (cloud *CloudProvider) AttachDisk(ctx context.Context, volKey *meta.Key, re
 	op, err := cloud.service.Instances.AttachDisk(cloud.project, instanceZone, instanceName, attachedDiskV1).Context(ctx).Do()
 	if err != nil {
 		return fmt.Errorf("failed cloud service attach disk call: %v", err)
+	} else {
+		klog.V(5).Infof("Attaching disk %v, operation id is %s", volKey, op.Name)
Contributor:

Also log op.Name at line 618? There may be lots of logging between these two lines, so it would be easier to debug a failed waitForZonalOp if we had the op name.

Contributor Author:

added, thanks!
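
For context, a minimal sketch of what the requested extra logging could look like just before waiting on the operation. The waitForZonalOp call, its zone argument, and the error message are assumptions pieced together from the surrounding diffs, not the PR's exact code:

```go
// Log the operation name before waiting, so a failed waitForZonalOp can be
// correlated with the AttachDisk call that created the operation.
klog.V(5).Infof("Waiting for attach operation %s on disk %v", op.Name, volKey)
err = cloud.waitForZonalOp(ctx, op.Name, instanceZone)
if err != nil {
	return fmt.Errorf("failed when waiting for zonal attach op %s: %v", op.Name, err)
}
```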

@jingxu97 (Contributor, Author) commented Nov 9, 2020

cc @verult @saad-ali

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 9, 2020
@mattcary (Contributor) commented Nov 9, 2020

Thanks!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 9, 2020
@@ -681,7 +683,7 @@ func (cloud *CloudProvider) waitForZonalOp(ctx context.Context, opName string, z
 	// The v1 API can query for v1, alpha, or beta operations.
 	svc := cloud.service
 	project := cloud.project
-	return wait.Poll(3*time.Second, 5*time.Minute, func() (bool, error) {
+	return wait.Poll(3*time.Second, 2*time.Minute, func() (bool, error) {
Member:

Can we create constants for these?

Contributor Author:

sure, updated. thanks!

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 9, 2020
@k8s-ci-robot (Contributor)

New changes are detected. LGTM label has been removed.

@@ -681,7 +687,7 @@ func (cloud *CloudProvider) waitForZonalOp(ctx context.Context, opName string, z
 	// The v1 API can query for v1, alpha, or beta operations.
 	svc := cloud.service
 	project := cloud.project
-	return wait.Poll(3*time.Second, 5*time.Minute, func() (bool, error) {
+	return wait.Poll(OpInterval, OpTimeout, func() (bool, error) {
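
For reference, a minimal sketch of the named constants this updated poll call relies on. The names OpInterval and OpTimeout come from the diff above; the exact values and comments are assumptions matching the proposed 3-second interval and 2-minute timeout:

```go
const (
	// OpInterval is how often to poll the status of a pending GCE operation.
	OpInterval = 3 * time.Second
	// OpTimeout bounds the total time spent waiting for a GCE operation.
	OpTimeout = 2 * time.Minute
)
```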
Contributor:

Should we honor the client context.Timeout value instead?

Contributor Author:

The attacher side sets the context timeout to 15 seconds: https://github.com/kubernetes-csi/external-attacher/blob/298cb965f909f76f2d2dfa2062cb5e142f14b9d5/cmd/csi-attacher/main.go#L58

My understanding is that this context timeout applies to each individual GCE API call, so we need a separate timeout for the polling loop:

op, err := cloud.service.Instances.AttachDisk(cloud.project, instanceZone, instanceName, attachedDiskV1).Context(ctx).Do()
pollOp, err := cloud.service.RegionOperations.Get(cloud.project, region, opName).Context(ctx).Do()

@jingxu97 (Contributor, Author) commented Nov 10, 2020:

Ah, that makes sense now. In the test failure I am checking, the first AttachDisk call timed out after 250 seconds; the operation was never created successfully at all (I am not sure why; if it is caused by rate limiting, would it error back immediately?).

So changing it here alone is not enough; we also need to reduce the timeout defined in controller.yaml. I think 250s is too long given normal attach/detach volume latency. I propose using 60 seconds or even less.

Contributor:

I think adjusting the sidecar timeout based on the metrics we have is fine.

But I think the waitForOp method here still needs to use the CSI context's timeout instead of a hardcoded one.

@jingxu97 (Contributor, Author) commented Nov 10, 2020:

I changed it to use the CSI context timeout, but the poll interval is still hardcoded for now.
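
A minimal sketch of what honoring the CSI context in the poll loop could look like. wait.PollUntil from k8s.io/apimachinery stops when the supplied channel fires; opIsDone is a hypothetical helper, and the surrounding names come from the diffs above:

```go
// Poll at a fixed interval, but let the CSI sidecar's context deadline
// bound the overall wait instead of a hardcoded timeout.
return wait.PollUntil(OpInterval, func() (bool, error) {
	pollOp, err := svc.ZoneOperations.Get(project, zone, opName).Context(ctx).Do()
	if err != nil {
		return false, err
	}
	// opIsDone is a hypothetical helper: true when pollOp.Status == "DONE",
	// surfacing an error from pollOp.Error if the operation failed.
	return opIsDone(pollOp)
}, ctx.Done())
```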

@jingxu97 jingxu97 force-pushed the nov/timeout branch 2 times, most recently from 8b0a9cf to 158297e Compare November 10, 2020 06:49
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Nov 10, 2020
@jingxu97 (Contributor, Author)

Comments were addressed. PTAL.

@@ -52,7 +52,7 @@ spec:
         - "--metrics-address=:22012"
         - "--leader-election"
         - "--leader-election-namespace=$(PDCSI_NAMESPACE)"
-        - "--timeout=250s"
+        - "--timeout=60s"
Contributor:

I think we should keep the value at the 99th percentile.

Contributor Author:

95th percentile: 25s
99th percentile: 50-100s

I can change it to 90s, which I think is sufficient. On the other hand, I also want to make recovery quicker in case of a timeout.

Contributor:

How does a smaller timeout improve recovery if the underlying operation doesn't complete? The main thing I see is that we will create an error event more frequently.

Contributor Author:

The problem is not that the operation does not complete. The issue I am seeing (guessing) from the logs is that the first operation is not created successfully, but the call does not return anything until the timeout. After that, a second attach is called, and this time the operation is created successfully and finishes in a few seconds. I suspect it is due to rate limiting and the operation being dropped, but in that case I am not sure whether operation creation returns an error immediately.

Having a shorter timeout means we have a chance to create another operation sooner.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 11, 2020
@Jiawei0227 (Contributor) commented Nov 12, 2020

Oh yeah, you are right. 5 min should not be used here; instead we should use the context time to indicate the deadline. Good finding!
I don't think this has anything to do with the proposal of separating the get operation call into a separate wait loop, does it? Or do I misunderstand something?

@jingxu97 (Contributor, Author)

> Oh yeah, you are right. 5 min should not be used here; instead we should use the context time to indicate the deadline. Good finding!
> I don't think this has anything to do with the proposal of separating the get operation call into a separate wait loop, does it? Or do I misunderstand something?

You mean putting the create operation call in a separate wait loop? The get operation call is already in its own loop in waitForZonalOp().

@Jiawei0227 (Contributor)

I think if we put this call:

pollOp, err := svc.ZoneOperations.Get(project, zone, opName).Context(ctx).Do()

into a separate wait loop with a smaller timeout like 30s, then we can return an error early if for any reason the operation creation gets dropped, so that the external-attacher can issue an attach again. A sketch of this is shown below.

Putting the create operation call in a wait loop until it makes sure the operation is created in GCP would also be okay. I am fine with either implementation.
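
A minimal sketch of the first option, the separate short wait loop around the status poll. The OpInterval constant comes from the earlier diff, the 30-second bound is the value suggested above, and the status check is a simplification that ignores pollOp.Error:

```go
// Give the status polling its own short bound so a dropped operation
// surfaces as a timeout error quickly and the sidecar can retry the attach.
return wait.Poll(OpInterval, 30*time.Second, func() (bool, error) {
	pollOp, err := svc.ZoneOperations.Get(project, zone, opName).Context(ctx).Do()
	if err != nil {
		return false, err
	}
	return pollOp.Status == "DONE", nil
})
```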

@saikat-royc (Member)

> this call into a separate wait loop with a smaller timeout like 30s, then we can return an error early if for any reason the operation creation gets dropped, so that the external-attacher can issue an attach again.

Agree.
Actually, all the calls which return an operation, and the op.Get() calls, should have a short timeout, and the overall wait for the op to reach a done status (success/fail) should honor the CSI timeout provided by the sidecars.

e.g.
wait.Poll(3*time.Second, 5*time.Minute, func() (bool, error) — this I think should be the same as the CSI timeout minus some delta.
Things like
op, err := cloud.service.Instances.AttachDisk(cloud.project, instanceZone, instanceName, attachedDiskV1).Context(ctx).Do()
and
pollOp, err := svc.ZoneOperations.Get(project, zone, opName).Context(ctx).Do()
should have a fixed short timeout in order to react faster.
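
A minimal sketch of the per-call short timeout described here. The derived-context pattern and the 30-second value are assumptions; the API call itself comes from the comments above:

```go
// Derive a short-lived context for the individual GCE API call so a dropped
// or hung call fails fast, while the parent ctx still bounds the overall wait.
callCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
defer cancel() // if used inside a poll loop, call cancel at the end of each iteration instead
pollOp, err := svc.ZoneOperations.Get(project, zone, opName).Context(callCtx).Do()
```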

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 15, 2020
@jingxu97 (Contributor, Author)

/test pull-gcp-compute-persistent-disk-csi-driver-e2e-win2019

@jingxu97 jingxu97 force-pushed the nov/timeout branch 3 times, most recently from 7707505 to 93ff983 Compare November 15, 2020 22:21
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 15, 2020
@jingxu97 jingxu97 force-pushed the nov/timeout branch 2 times, most recently from 106be3f to 1f35fd3 Compare November 16, 2020 03:05
@k8s-ci-robot (Contributor) commented Nov 16, 2020

@jingxu97: The following test failed, say /retest to rerun all failed tests:

Test name: pull-gcp-compute-persistent-disk-csi-driver-e2e-win2019
Commit: 28d5cc1
Details: link
Rerun command: /test pull-gcp-compute-persistent-disk-csi-driver-e2e-win2019

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jingxu97 (Contributor, Author)

/test pull-gcp-compute-persistent-disk-csi-driver-e2e-win1909

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 21, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 21, 2021
@k8s-ci-robot (Contributor)

@jingxu97: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 2, 2021
@mattcary (Contributor)

Is this PR still relevant?

@k8s-triage-robot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closed this PR.

In response to this:

> Rotten issues close after 30d of inactivity.
> Reopen the issue with /reopen.
> Mark the issue as fresh with /remove-lifecycle rotten.
>
> Send feedback to sig-contributor-experience at kubernetes/community.
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
