Update gce operation timeout #662
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jingxu97. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@@ -610,6 +610,8 @@ func (cloud *CloudProvider) AttachDisk(ctx context.Context, volKey *meta.Key, re
 	op, err := cloud.service.Instances.AttachDisk(cloud.project, instanceZone, instanceName, attachedDiskV1).Context(ctx).Do()
 	if err != nil {
 		return fmt.Errorf("failed cloud service attach disk call: %v", err)
+	} else {
+		klog.V(5).Infof("Attaching disk %v, operation id is %s", volKey, op.Name)
Also log op.Name at line 618? There may be lots of logging between these two lines, so it would be easier to debug a failed waitForZonalOp if we had the op name.
added, thanks!
Thanks! /lgtm
@@ -681,7 +683,7 @@ func (cloud *CloudProvider) waitForZonalOp(ctx context.Context, opName string, z
 	// The v1 API can query for v1, alpha, or beta operations.
 	svc := cloud.service
 	project := cloud.project
-	return wait.Poll(3*time.Second, 5*time.Minute, func() (bool, error) {
+	return wait.Poll(3*time.Second, 2*time.Minute, func() (bool, error) {
Can we create constants for these?
sure, updated. thanks!
New changes are detected. LGTM label has been removed.
@@ -681,7 +687,7 @@ func (cloud *CloudProvider) waitForZonalOp(ctx context.Context, opName string, z
 	// The v1 API can query for v1, alpha, or beta operations.
 	svc := cloud.service
 	project := cloud.project
-	return wait.Poll(3*time.Second, 5*time.Minute, func() (bool, error) {
+	return wait.Poll(OpInterval, OpTimeout, func() (bool, error) {
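For reference, a minimal sketch of how those constants might be defined (the names come from the diff above; the values are assumptions based on the old interval and the PR description, not confirmed from the final code):

// Assumed definitions; requires the "time" import.
const (
	OpInterval = 3 * time.Second // polling interval between operation status checks
	OpTimeout  = 2 * time.Minute // overall deadline for a zonal operation to finish
)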
Should we honor the client context.Timeout value instead?
The attacher side sets the context timeout to 15 seconds: https://github.com/kubernetes-csi/external-attacher/blob/298cb965f909f76f2d2dfa2062cb5e142f14b9d5/cmd/csi-attacher/main.go#L58
My understanding is that this context timeout applies to each individual GCE API call, so we need a separate timeout for the polling loop as a whole. For example, both of these calls use the context:
op, err := cloud.service.Instances.AttachDisk(cloud.project, instanceZone, instanceName, attachedDiskV1).Context(ctx).Do()
pollOp, err := cloud.service.RegionOperations.Get(cloud.project, region, opName).Context(ctx).Do()
Ah, that makes sense now. In the test failure I am checking, the first AttachDisk operation timed out after 250 seconds; the operation was not created successfully at all (I am not sure why; if it was caused by rate limiting, would the call error back immediately?).
So changing it here is not enough; we also need to reduce the timeout defined in controller.yaml. I think 250s is too long for a normal attach/detach volume operation. I propose 60 seconds or even less.
I think adjusting the sidecar timeout based on the metrics we have is fine.
But I think the waitForOp method here still needs to use the CSI context's timeout instead of a hardcoded one.
I changed it to use the CSI context timeout, but the poll interval is still hardcoded for now.
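A minimal sketch of what using the CSI context's timeout could look like (the 2-minute fallback and the structure are assumptions; the actual PR code may differ):

// Derive the poll timeout from the caller's context deadline; fall back to a
// fixed value (an assumption) when the context has no deadline set.
timeout := 2 * time.Minute
if deadline, ok := ctx.Deadline(); ok {
	timeout = time.Until(deadline)
}
return wait.Poll(3*time.Second, timeout, func() (bool, error) {
	pollOp, err := svc.ZoneOperations.Get(project, zone, opName).Context(ctx).Do()
	if err != nil {
		return false, err
	}
	// GCE operations report Status as PENDING, RUNNING, or DONE.
	return pollOp.Status == "DONE", nil
})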
Comments were addressed. PTAL
@@ -52,7 +52,7 @@ spec:
         - "--metrics-address=:22012"
         - "--leader-election"
         - "--leader-election-namespace=$(PDCSI_NAMESPACE)"
-        - "--timeout=250s"
+        - "--timeout=60s"
I think we should keep the value at the 99th percentile.
95th percentile: 25s
99th percentile: 50-100s
I can change it to 90s, which I think is sufficient. On the other hand, I also want to make recovery quicker in case of a timeout.
How does a smaller timeout improve recovery if the underlying operation doesn't complete? The main thing I see is that we will create an error event more frequently.
The problem is not that the operation does not complete. The issue I am seeing (guessing) from the logs is that the first operation is not created successfully, but the call does not return anything until the timeout. After that, a second attach is called, and this time the operation is created successfully and finishes in a few seconds. I suspect it is due to rate limiting and the operation being dropped, but in that case, would operation creation return an error immediately? I am not sure.
Having a shorter timeout means we have a chance to create another operation sooner.
Oh yeah, you are right. 5 minutes should not be used here. Instead we should use the context timeout to indicate the deadline. Good finding!
You mean putting the create-operation call in a separate wait loop? The get-operation call is already in a separate loop now in waitForZonalOp().
I think if we put the pollOp, err := svc.ZoneOperations.Get(project, zone, opName).Context(ctx).Do() call into a separate wait loop with a smaller timeout like 30s, then we can return an error early if the operation creation gets dropped for any reason, so that the external-attacher can issue an attach again. Putting the create-operation call in a wait loop until it confirms the operation is created in GCP would also be okay. I am fine with either implementation. A sketch of the first option follows below.
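A rough sketch of that proposal (the 30s value comes from the comment above; the error wrapping is an assumption, not the PR's final code):

// Poll the newly created operation in a short inner loop so that a dropped
// operation surfaces as an error quickly and the external-attacher can retry.
if err := wait.Poll(3*time.Second, 30*time.Second, func() (bool, error) {
	pollOp, err := svc.ZoneOperations.Get(project, zone, opName).Context(ctx).Do()
	if err != nil {
		return false, err
	}
	return pollOp.Status == "DONE", nil
}); err != nil {
	return fmt.Errorf("waiting for operation %s: %v", opName, err)
}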
Agree. e.g.
/test pull-gcp-compute-persistent-disk-csi-driver-e2e-win2019
@jingxu97: The following test failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Update GCE operation timeout to 2 minutes instead of 5 minutes. The 90th percentile of attach/detach disk operations takes less than 15 seconds, and create volume takes less than 1 minute. Reduce the volume operation timeout so that the driver can recover from a dropped operation more quickly.
/test pull-gcp-compute-persistent-disk-csi-driver-e2e-win1909
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
@jingxu97: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Is this PR still relevant?
Rotten issues close after 30d of inactivity. Send feedback to sig-contributor-experience at kubernetes/community. |
@k8s-triage-robot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Update GCE operation timeout to 2 minutes instead of 5 minutes. The 90th percentile of attach/detach disk operations takes less than 15 seconds, and create volume takes less than 1 minute.
Reduce the volume operation timeout so that the driver can recover from a dropped operation more quickly.
What type of PR is this?
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: