Retry WebStorage operations #2879
@@ -17,8 +17,12 @@
 import { debugAssert, fail } from './assert';
 import { Code, FirestoreError } from './error';
-import { logError } from './log';
+import { logDebug, logError } from './log';
+import { CancelablePromise, Deferred } from './promise';
+import { ExponentialBackoff } from '../remote/backoff';
+import { PlatformSupport } from '../platform/platform';

 const LOG_TAG = 'AsyncQueue';

 // Accept any return type from setTimeout().
 // eslint-disable-next-line @typescript-eslint/no-explicit-any
@@ -66,7 +70,13 @@ export const enum TimerId {
    * A timer used to retry transactions. Since there can be multiple concurrent
    * transactions, multiple of these may be in the queue at a given time.
    */
-  RetryTransaction = 'retry_transaction'
+  RetryTransaction = 'retry_transaction',
+
+  /**
+   * A timer used to retry operations scheduled via retryable AsyncQueue
+   * operations.
+   */
+  AsyncQueueRetry = 'async_queue_retry'
 }

Review comment: nit: the inconsistency between "retry" at the beginning vs. the end of the enumeration entry is a little glaring.

Reply: Renamed.
@@ -213,6 +223,29 @@ export class AsyncQueue {
   // List of TimerIds to fast-forward delays for.
   private timerIdsToSkip: TimerId[] = [];

+  // Backoff timer used to schedule retries for retryable operations.
+  private backoff = new ExponentialBackoff(this, TimerId.AsyncQueueRetry);
+
+  // If set, points to the first retryable operation. The first retryable
+  // operation is the only operation that is retried with backoff. If this
+  // operation succeeds, all other retryable operations are run right away.
+  private firstRetryableOperation?: Promise<void>;
+
+  // Visibility handler that triggers an immediate retry of all retryable
+  // operations. Meant to speed up recovery when we regain file system access
+  // after the page comes into the foreground.
+  private visibilityHandler = (): void => {
+    // eslint-disable-next-line @typescript-eslint/no-floating-promises
+    this.runDelayedOperationsEarly(TimerId.AsyncQueueRetry);
+  };
+
+  constructor() {
+    const window = PlatformSupport.getPlatform().window;
+    if (window) {
+      window.addEventListener('visibilitychange', this.visibilityHandler);
+    }
+  }

   // Is this AsyncQueue being shut down? If true, this instance will not enqueue
   // any new operations, and Promises from enqueue requests will not resolve.
   get isShuttingDown(): boolean {
@@ -262,6 +295,10 @@
     this.verifyNotFailed();
     if (!this._isShuttingDown) {
       this._isShuttingDown = true;
+      const window = PlatformSupport.getPlatform().window;
+      if (window) {
+        window.removeEventListener('visibilitychange', this.visibilityHandler);
+      }
       await this.enqueueEvenAfterShutdown(op);
     }
   }
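The constructor and shutdown hunks above register and later remove a visibilitychange listener so that pending retries run as soon as the page returns to the foreground. A minimal sketch of that subscribe/unsubscribe pairing follows; `EventTargetLike` and `onForegroundRetry` are hypothetical names used for illustration only, and the injectable target stands in for the SDK's platform-guarded `window`:

```typescript
type Listener = () => void;

// Hypothetical stand-in for the subset of the DOM EventTarget interface
// that the pattern needs; in the browser this would be `window`.
interface EventTargetLike {
  addEventListener(type: string, listener: Listener): void;
  removeEventListener(type: string, listener: Listener): void;
}

// Registers a visibilitychange listener that triggers an immediate retry,
// and returns an unsubscribe function mirroring the shutdown path.
function onForegroundRetry(
  target: EventTargetLike | undefined,
  runRetriesNow: Listener
): () => void {
  if (!target) {
    // e.g. Node environments without a window object: nothing to register.
    return () => {};
  }
  target.addEventListener('visibilitychange', runRetriesNow);
  return () => target.removeEventListener('visibilitychange', runRetriesNow);
}
```

Returning the unsubscribe function keeps registration and removal symmetric, which is the property the shutdown hunk above preserves.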
@@ -279,6 +316,41 @@ export class AsyncQueue {
     return this.enqueueInternal(op);
   }

+  enqueueRetryable(op: () => Promise<void>): void {
+    this.verifyNotFailed();
+
+    if (this._isShuttingDown) {
+      return;
+    }
+
+    if (this.firstRetryableOperation) {
+      // If there is already a retryable operation, enqueue the current
+      // operation right after. We want to successfully run the first retryable
+      // operation before running all others.
+      // eslint-disable-next-line @typescript-eslint/no-floating-promises
+      this.firstRetryableOperation.then(() => {
+        this.enqueueRetryable(op);
+      });
+    } else {
+      const deferred = new Deferred<void>();
+      this.firstRetryableOperation = deferred.promise;
+
+      const retryingOp = async (): Promise<void> => {
+        try {
+          await op();
+          deferred.resolve();
+          this.backoff.reset();
+          this.firstRetryableOperation = undefined;
+        } catch (e) {
+          logDebug(LOG_TAG, 'Retryable operation failed: ' + e.message);
+          this.backoff.backoffAndRun(retryingOp);
+        }
+      };
+
+      this.backoff.backoffAndRun(retryingOp);
+    }
+  }

   private enqueueInternal<T extends unknown>(op: () => Promise<T>): Promise<T> {
     const newTail = this.tail.then(() => {
       this.operationInProgress = true;

Review comment (on the chained retries): This chain from the first retryable operation creates a weird circumstance where we could have a train of retryable operations where we wake up, succeed in performing some operation, fail in some later one, and lose the rest of the chain. It seems like maintaining an explicit queue of these (with a head and a tail) would make this more resistant to arbitrary hiccups like this. Or maybe I'm not understanding what this code is doing exactly, in which case please add comments about how this is supposed to work.

Reply: I rewrote this code to use the same [...]. I do think that the previous code did not drop any operations: even if we reset the beginning of a promise chain, all other operations are still evaluated in the order they were scheduled. The optimization here was that we were able to reset [...].

Review comment (on backoffAndRun): The streams try to perform their initial operation and only call [...].

Reply: That would simplify my implementation, but it seems like that is not the correct behavior. The first [...]. FWIW [...]. Do you know what the desired behavior here is? I think the easy fix for my usage and the usage in the streams is to back off on the first attempt.

Review comment: We made it [...]. The backoff delay of zero for the first iteration is intentional, at least in the stream case. We found that e.g. connections could get into a stale state, or watch could return spurious errors, after a connection or database had been idle for a while. TransactionRunner seems like it's using the API wrong, at least according to the design intent. We have the same problem there where connections can be stale and fail, so it's worth avoiding delay on the first failure, even for transactions. In the case of IndexedDB failures, the initial zero delay doesn't seem as useful, but it also doesn't seem harmful. If you wanted to make the zero delay on first retry optional, that would be fine with me too.

Reply: I think for IndexedDB failures, it would "feel better" if we retried with backoff right away. We already retry aborted transactions immediately, since these are likely to succeed even without backoff. Thus, I would prefer if the secondary, error-code-unaware retry mechanism was different and used some sort of backoff.
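The serial-chaining behavior discussed above can be sketched with a self-contained toy. All names here are illustrative, not the SDK's (the real implementation uses ExponentialBackoff and the AsyncQueue): retryable operations run strictly in enqueue order, a failing operation is retried with a growing delay, and a success resets the delay before later operations run.

```typescript
type RetryableOp = () => Promise<void>;

// Toy retry queue: operations are chained on a single tail promise, so they
// execute in the order they were enqueued. A failing operation blocks the
// chain and is retried with a (simplified) doubling delay; success resets
// the delay to zero.
class ToyRetryQueue {
  private tail: Promise<void> = Promise.resolve();
  private delayMs = 0;

  enqueueRetryable(op: RetryableOp): void {
    this.tail = this.tail.then(() => this.runWithRetry(op));
  }

  // Resolves only once `op` has succeeded; later operations wait behind it,
  // so no operation is ever dropped by a failure earlier in the chain.
  private async runWithRetry(op: RetryableOp): Promise<void> {
    for (;;) {
      await new Promise(resolve => setTimeout(resolve, this.delayMs));
      try {
        await op();
        this.delayMs = 0; // success: reset backoff
        return;
      } catch {
        // Simplified backoff: 10 ms, 20 ms, 40 ms, ... capped at 1 s.
        this.delayMs = Math.min(Math.max(this.delayMs * 2, 10), 1000);
      }
    }
  }

  // Exposed so callers (and tests) can await completion of all enqueued ops.
  drain(): Promise<void> {
    return this.tail;
  }
}
```

Note the deliberate simplifications relative to the PR: no jitter, no visibility-triggered fast-forward, and the first attempt runs with zero delay.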
@@ -399,12 +471,6 @@ export class AsyncQueue {
   runDelayedOperationsEarly(lastTimerId: TimerId): Promise<void> {
     // Note that draining may generate more delayed ops, so we do that first.
     return this.drain().then(() => {
-      debugAssert(
-        lastTimerId === TimerId.All ||
-          this.containsDelayedOperation(lastTimerId),
-        `Attempted to drain to missing operation ${lastTimerId}`
-      );
-
       // Run ops in the same order they'd run if they ran naturally.
       this.delayedOperations.sort((a, b) => a.targetTimeMs - b.targetTimeMs);
@@ -0,0 +1,71 @@
/**
 * @license
 * Copyright 2020 Google LLC
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import { describeSpec, specTest } from './describe_spec';
import { client } from './spec_builder';
import { TimerId } from '../../../src/util/async_queue';
import { Query } from '../../../src/core/query';
import { path } from '../../util/helpers';

describeSpec(
  'Persistence Recovery',
  ['durable-persistence', 'no-ios', 'no-android'],
  () => {
    specTest(
      'Write is acknowledged by primary client (with recovery)',
      ['multi-client'],
      () => {
        return client(0)
          .expectPrimaryState(true)
          .client(1)
          .expectPrimaryState(false)
          .userSets('collection/a', { v: 1 })
          .failDatabase()

Review comment: Why are the fail and recover instructions issued from client 1? Does it matter?

Reply: Client 0 is the primary client and writes into WebStorage. Client 1 is the secondary client; it reads from WebStorage and fails to apply the changes. The code in this PR only deals with the behavior in secondary clients.

          .client(0)
          .writeAcks('collection/a', 1, { expectUserCallback: false })
          .client(1)
          .recoverDatabase()
          .runTimer(TimerId.AsyncQueueRetry)

Review comment: It would be useful to see that the retry can happen multiple times (i.e. that running the timer before recovery does not raise an event but still raises an event after recovery).

          .expectUserCallbacks({
            acknowledged: ['collection/a']
          });
      }
    );

    specTest(
      'Query raises events in secondary client (with recovery)',
      ['multi-client'],
      () => {
        const query = Query.atPath(path('collection'));

        return client(0)
          .expectPrimaryState(true)
          .client(1)
          .expectPrimaryState(false)
          .userListens(query)
          .failDatabase()
          .client(0)
          .expectListen(query)
          .watchAcksFull(query, 1000)
          .client(1)
          .recoverDatabase()
          .runTimer(TimerId.AsyncQueueRetry)
          .expectEvents(query, {});
      }
    );
  }
);
@@ -418,6 +418,22 @@ export class SpecBuilder {
     return this;
   }

+  failDatabase(): this {
+    this.nextStep();
+    this.currentStep = {
+      failDatabase: true
+    };
+    return this;
+  }
+
+  recoverDatabase(): this {
+    this.nextStep();
+    this.currentStep = {
+      failDatabase: false
+    };
+    return this;
+  }

   expectIsShutdown(): this {
     this.assertStep('Active target expectation requires previous step');
     const currentStep = this.currentStep!;

Review comment: Could you add comments here about how this works? I suspect this works by modifying all steps going forward until recoverDatabase is called, but the code here suggests this merely modifies the current step only.

Reply: This ends up calling [...].
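As the review question above points out, failDatabase() only records a flag on a single step. A hedged sketch of how a step runner could interpret that flag as persistent state follows; `MiniStep` and `simulateRun` are hypothetical names for illustration, and the actual spec runner may differ in the details:

```typescript
interface MiniStep {
  // When present, toggles whether the simulated database is failing.
  failDatabase?: boolean;
  // Placeholder for the step's actual action (userSets, writeAcks, ...).
  action?: string;
}

// Interprets failDatabase as sticky state: once a step sets it to true,
// every later step runs against a failing database until another step
// sets it back to false.
function simulateRun(steps: MiniStep[]): boolean[] {
  let dbFailing = false;
  return steps.map(step => {
    if (step.failDatabase !== undefined) {
      dbFailing = step.failDatabase;
    }
    return dbFailing;
  });
}
```

Under this interpretation a single-step flag does affect all subsequent steps, which would reconcile the builder code with the intended behavior.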
@@ -716,19 +732,14 @@ export class SpecBuilder {
     return this;
   }

-  watchStreamCloses(error: Code, opts?: { runBackoffTimer: boolean }): this {
-    if (!opts) {
-      opts = { runBackoffTimer: true };
-    }
-
+  watchStreamCloses(error: Code): this {
     this.nextStep();
     this.currentStep = {
       watchStreamClose: {
         error: {
           code: mapRpcCodeFromCode(error),
           message: 'Simulated Backend Error'
-        },
-        runBackoffTimer: opts.runBackoffTimer
+        }
       }
     };
     return this;
Review comment: This is a subtly different behavior from before that seems like it has the chance to break cancellation. The issue is that previously, enqueueAfterDelay would schedule a callback at the desired time which would submit the operation to the queue wrapped in a check that the timer hasn't been canceled (see https://github.com/firebase/firebase-js-sdk/blob/master/packages/firestore/src/util/async_queue.ts#L173). This means that any operation still in the queue has a chance to cancel the timer before it actually runs. I'm not sure this ends up being a problem in practice, but it seems like a more straightforward way to handle this would be to change AsyncQueue.drain to include those delayed operations that have a zero delay.

Reply: I tried different approaches here, and they all failed. If drain() waits for retryable operations, then the WebStorage tests pass with flying colors; I, however, lose the ability to inject failures in the spec tests without waiting for them. I solved this by scheduling the first operation with enqueueAndForget. Note that this means that the first retry is not scheduled with backoff, but at least it is pushed to the end of the queue.