
Update gce operation timeout #662

Closed
jingxu97 wants to merge 1 commit from the nov/timeout branch

Conversation

@jingxu97 (Contributor) commented Nov 9, 2020

Update the GCE operation timeout to 2 minutes instead of 5 minutes. 90% of attach/detach disk operations take less than 15 seconds, and create volume takes less than 1 minute.

Reduce the volume operation timeout so that the driver can recover from a previous operation more quickly in case that operation is dropped.

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Update gce volume operation timeout.

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 9, 2020
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jingxu97

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 9, 2020
@@ -610,6 +610,8 @@ func (cloud *CloudProvider) AttachDisk(ctx context.Context, volKey *meta.Key, re
 	op, err := cloud.service.Instances.AttachDisk(cloud.project, instanceZone, instanceName, attachedDiskV1).Context(ctx).Do()
 	if err != nil {
 		return fmt.Errorf("failed cloud service attach disk call: %v", err)
+	} else {
+		klog.V(5).Infof("Attaching disk %v, operation id is %s", volKey, op.Name)
Contributor:

Also log op.Name at line 618? There may be lots of logging between these two lines, so it would be easier to debug a failed waitForZonalOp if we had the op name.

Contributor Author:

added, thanks!
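
For context, a minimal sketch of what the requested extra logging could look like just before waiting on the operation. The waitForZonalOp call, its zone argument, and the error message are assumptions pieced together from the surrounding diffs, not the PR's exact code:

```go
// Log the operation name before waiting, so a failed waitForZonalOp can be
// correlated with the AttachDisk call that created the operation.
klog.V(5).Infof("Waiting for attach operation %s on disk %v", op.Name, volKey)
err = cloud.waitForZonalOp(ctx, op.Name, instanceZone)
if err != nil {
	return fmt.Errorf("failed when waiting for zonal attach op %s: %v", op.Name, err)
}
```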

@jingxu97 (Contributor, Author) commented Nov 9, 2020

cc @verult @saad-ali

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 9, 2020
@mattcary (Contributor) commented Nov 9, 2020

Thanks!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 9, 2020
@@ -681,7 +683,7 @@ func (cloud *CloudProvider) waitForZonalOp(ctx context.Context, opName string, z
 	// The v1 API can query for v1, alpha, or beta operations.
 	svc := cloud.service
 	project := cloud.project
-	return wait.Poll(3*time.Second, 5*time.Minute, func() (bool, error) {
+	return wait.Poll(3*time.Second, 2*time.Minute, func() (bool, error) {
Member:

Can we create constants for these?

Contributor Author:

sure, updated. thanks!

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 9, 2020
@k8s-ci-robot (Contributor)

New changes are detected. LGTM label has been removed.

@@ -681,7 +687,7 @@ func (cloud *CloudProvider) waitForZonalOp(ctx context.Context, opName string, z
 	// The v1 API can query for v1, alpha, or beta operations.
 	svc := cloud.service
 	project := cloud.project
-	return wait.Poll(3*time.Second, 5*time.Minute, func() (bool, error) {
+	return wait.Poll(OpInterval, OpTimeout, func() (bool, error) {
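
For reference, a minimal sketch of the named constants this updated poll call relies on. The names OpInterval and OpTimeout come from the diff above; the exact values and comments are assumptions matching the proposed 3-second interval and 2-minute timeout:

```go
const (
	// OpInterval is how often to poll the status of a pending GCE operation.
	OpInterval = 3 * time.Second
	// OpTimeout bounds the total time spent waiting for a GCE operation.
	OpTimeout = 2 * time.Minute
)
```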
Contributor:

Should we honor the client context.Timeout value instead?

Contributor Author:

The attacher side sets the context timeout to 15 seconds: https://github.com/kubernetes-csi/external-attacher/blob/298cb965f909f76f2d2dfa2062cb5e142f14b9d5/cmd/csi-attacher/main.go#L58

My understanding is that this context timeout applies to each individual GCE API call, so we need a separate timeout for the polling loop:

op, err := cloud.service.Instances.AttachDisk(cloud.project, instanceZone, instanceName, attachedDiskV1).Context(ctx).Do()
pollOp, err := cloud.service.RegionOperations.Get(cloud.project, region, opName).Context(ctx).Do()

@jingxu97 (Contributor, Author) commented Nov 10, 2020:

Ah, that makes sense now. In the test failure I am checking, the first AttachDisk call timed out after 250 seconds; the operation was never created successfully at all (I am not sure why; if it is caused by rate limiting, would it error back immediately?).

So changing it here alone is not enough; we also need to reduce the timeout defined in controller.yaml. I think 250s is too long given normal attach/detach volume latency. I propose using 60 seconds or even less.

Contributor:

I think adjusting the sidecar timeout based on the metrics we have is fine.

But I think the waitForOp method here still needs to use the CSI context's timeout instead of a hardcoded one.

@jingxu97 (Contributor, Author) commented Nov 10, 2020:

I changed it to use the CSI context timeout, but the poll interval is still hardcoded for now.
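
A minimal sketch of what honoring the CSI context in the poll loop could look like. wait.PollUntil from k8s.io/apimachinery stops when the supplied channel fires; opIsDone is a hypothetical helper, and the surrounding names come from the diffs above:

```go
// Poll at a fixed interval, but let the CSI sidecar's context deadline
// bound the overall wait instead of a hardcoded timeout.
return wait.PollUntil(OpInterval, func() (bool, error) {
	pollOp, err := svc.ZoneOperations.Get(project, zone, opName).Context(ctx).Do()
	if err != nil {
		return false, err
	}
	// opIsDone is a hypothetical helper: true when pollOp.Status == "DONE",
	// surfacing an error from pollOp.Error if the operation failed.
	return opIsDone(pollOp)
}, ctx.Done())
```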

@jingxu97 jingxu97 force-pushed the nov/timeout branch 2 times, most recently from 8b0a9cf to 158297e Compare November 10, 2020 06:49
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Nov 10, 2020
@jingxu97 (Contributor, Author)

Comments were addressed. PTAL.

@@ -52,7 +52,7 @@ spec:
         - "--metrics-address=:22012"
         - "--leader-election"
         - "--leader-election-namespace=$(PDCSI_NAMESPACE)"
-        - "--timeout=250s"
+        - "--timeout=60s"
Contributor:

I think we should keep the value at the 99th percentile.

Contributor Author:

95th percentile: 25s
99th percentile: 50-100s

I can change it to 90s, which I think is sufficient. On the other hand, I also want to make recovery quicker in case of a timeout.

Contributor:

How does a smaller timeout improve recovery if the underlying operation doesn't complete? The main thing I see is that we will create an error event more frequently.

Contributor Author:

The problem is not that the operation does not complete. The issue I am seeing (guessing) from the logs is that the first operation is not created successfully, but the call does not return anything until the timeout. After that, a second attach is called, and this time the operation is created successfully and finishes in a few seconds. I suspect it is due to rate limiting and the operation being dropped, but in that case I am not sure whether operation creation returns an error immediately.

Having a shorter timeout means we have a chance to create another operation sooner.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 11, 2020
@Jiawei0227 (Contributor) commented Nov 12, 2020

Oh yeah, you are right. 5 min should not be used here; instead we should use the context time to indicate the deadline. Good finding!
I don't think this has anything to do with the proposal of separating the get operation call into a separate wait loop, does it? Or do I misunderstand something?

@jingxu97 (Contributor, Author)

> Oh yeah, you are right. 5 min should not be used here; instead we should use the context time to indicate the deadline. Good finding!
> I don't think this has anything to do with the proposal of separating the get operation call into a separate wait loop, does it? Or do I misunderstand something?

You mean putting the create operation call in a separate wait loop? The get operation call is already in its own loop in waitForZonalOp().

@Jiawei0227 (Contributor)

I think if we put this call:

pollOp, err := svc.ZoneOperations.Get(project, zone, opName).Context(ctx).Do()

into a separate wait loop with a smaller timeout like 30s, then we can return an error early if for any reason the operation creation gets dropped, so that the external-attacher can issue an attach again. A sketch of this is shown below.

Putting the create operation call in a wait loop until it makes sure the operation is created in GCP would also be okay. I am fine with either implementation.
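
A minimal sketch of the first option, the separate short wait loop around the status poll. The OpInterval constant comes from the earlier diff, the 30-second bound is the value suggested above, and the status check is a simplification that ignores pollOp.Error:

```go
// Give the status polling its own short bound so a dropped operation
// surfaces as a timeout error quickly and the sidecar can retry the attach.
return wait.Poll(OpInterval, 30*time.Second, func() (bool, error) {
	pollOp, err := svc.ZoneOperations.Get(project, zone, opName).Context(ctx).Do()
	if err != nil {
		return false, err
	}
	return pollOp.Status == "DONE", nil
})
```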

@saikat-royc (Member)

> this call into a separate wait loop with a smaller timeout like 30s, then we can return an error early if for any reason the operation creation gets dropped, so that the external-attacher can issue an attach again.

Agree.
Actually, all the calls which return an operation, and the op.Get() calls, should have a short timeout, and the overall wait for the op to reach a done status (success/fail) should honor the CSI timeout provided by the sidecars.

e.g.
wait.Poll(3*time.Second, 5*time.Minute, func() (bool, error) — this I think should be the same as the CSI timeout minus some delta.
Things like
op, err := cloud.service.Instances.AttachDisk(cloud.project, instanceZone, instanceName, attachedDiskV1).Context(ctx).Do()
and
pollOp, err := svc.ZoneOperations.Get(project, zone, opName).Context(ctx).Do()
should have a fixed short timeout in order to react faster.
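
A minimal sketch of the per-call short timeout described here. The derived-context pattern and the 30-second value are assumptions; the API call itself comes from the comments above:

```go
// Derive a short-lived context for the individual GCE API call so a dropped
// or hung call fails fast, while the parent ctx still bounds the overall wait.
callCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
defer cancel() // if used inside a poll loop, call cancel at the end of each iteration instead
pollOp, err := svc.ZoneOperations.Get(project, zone, opName).Context(callCtx).Do()
```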

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 15, 2020
@jingxu97 (Contributor, Author)

/test pull-gcp-compute-persistent-disk-csi-driver-e2e-win2019

@jingxu97 jingxu97 force-pushed the nov/timeout branch 3 times, most recently from 7707505 to 93ff983 Compare November 15, 2020 22:21
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 15, 2020
@jingxu97 jingxu97 force-pushed the nov/timeout branch 2 times, most recently from 106be3f to 1f35fd3 Compare November 16, 2020 03:05
@k8s-ci-robot (Contributor) commented Nov 16, 2020

@jingxu97: The following test failed, say /retest to rerun all failed tests:

Test name: pull-gcp-compute-persistent-disk-csi-driver-e2e-win2019
Commit: 28d5cc1
Details: link
Rerun command: /test pull-gcp-compute-persistent-disk-csi-driver-e2e-win2019

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jingxu97 (Contributor, Author)

/test pull-gcp-compute-persistent-disk-csi-driver-e2e-win1909

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 21, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 21, 2021
@k8s-ci-robot (Contributor)

@jingxu97: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 2, 2021
@mattcary (Contributor)

Is this PR still relevant?

@k8s-triage-robot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot (Contributor)

@k8s-triage-robot: Closed this PR.

In response to this:

> Rotten issues close after 30d of inactivity.
> Reopen the issue with /reopen.
> Mark the issue as fresh with /remove-lifecycle rotten.
>
> Send feedback to sig-contributor-experience at kubernetes/community.
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
