Retry WebStorage operations #2879

Merged · 8 commits · Apr 15, 2020
Changes from 3 commits
4 changes: 4 additions & 0 deletions packages/firestore/CHANGELOG.md
@@ -1,4 +1,8 @@
# Unreleased
- [fixed] Fixed a source of IndexedDB-related crashes for tabs that receive
multi-tab notifications while the file system is locked.

# 1.10.2
- [fixed] Temporarily reverted the use of window.crypto to generate document
IDs to address compatibility issues with IE 11, WebWorkers, and React Native.
- [changed] Firestore now limits the number of concurrent document lookups it
18 changes: 0 additions & 18 deletions packages/firestore/src/core/component_provider.ts
@@ -40,7 +40,6 @@ import { IndexFreeQueryEngine } from '../local/index_free_query_engine';
import { IndexedDbPersistence } from '../local/indexeddb_persistence';
import {
MemoryEagerDelegate,
MemoryLruDelegate,
MemoryPersistence
} from '../local/memory_persistence';

@@ -189,23 +188,6 @@ export class MemoryComponentProvider {
}
}

/**
* Provides all components needed for Firestore with in-memory persistence.
* Uses LRU garbage collection.
*/
export class MemoryLruComponentProvider extends MemoryComponentProvider {
createPersistence(cfg: ComponentConfiguration): Persistence {
debugAssert(
!cfg.persistenceSettings.durable,
'Can only start memory persistence'
);
return new MemoryPersistence(
cfg.clientId,
p => new MemoryLruDelegate(p, LruParams.DEFAULT)
);
}
}

/**
* Provides all components needed for Firestore with IndexedDB persistence.
*/
2 changes: 1 addition & 1 deletion packages/firestore/src/local/shared_client_state.ts
@@ -755,7 +755,7 @@ export class WebStorageSharedClientState implements SharedClientState {
return;
}

this.queue.enqueueAndForget(async () => {
this.queue.enqueueRetryable(async () => {
if (!this.started) {
this.earlyEvents.push(event);
return;
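The one-line change above swaps enqueueAndForget for enqueueRetryable in the WebStorage 'storage' event handler. A minimal usage sketch of the idea, with an assumed queue interface rather than the SDK's actual types:

```ts
// Sketch only: RetryableQueue and applyEvent are assumed stand-ins. The point
// is that the handler body is enqueued as a retryable operation, so a
// transient IndexedDB failure while applying a multi-tab notification is
// retried with backoff instead of surfacing as a crash.
interface RetryableQueue {
  enqueueRetryable(op: () => Promise<void>): void;
}

function handleWebStorageEvent(
  queue: RetryableQueue,
  applyEvent: () => Promise<void>
): void {
  queue.enqueueRetryable(async () => {
    // May reject while the file system is locked; the queue retries it later.
    await applyEvent();
  });
}
```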
22 changes: 13 additions & 9 deletions packages/firestore/src/remote/backoff.ts
@@ -119,24 +119,28 @@ export class ExponentialBackoff {
desiredDelayWithJitterMs - delaySoFarMs
);

if (this.currentBaseMs > 0) {
if (remainingDelayMs > 0) {
logDebug(
LOG_TAG,
`Backing off for ${remainingDelayMs} ms ` +
`(base delay: ${this.currentBaseMs} ms, ` +
`delay with jitter: ${desiredDelayWithJitterMs} ms, ` +
`last attempt: ${delaySoFarMs} ms ago)`
);
}

this.timerPromise = this.queue.enqueueAfterDelay(
this.timerId,
remainingDelayMs,
() => {
this.timerPromise = this.queue.enqueueAfterDelay(
this.timerId,
remainingDelayMs,
() => {
this.lastAttemptTime = Date.now();
return op();
}
);
} else {
this.queue.enqueueAndForget(() => {
Contributor:

This is a subtly different behavior from before that seems like it has the chance to break cancellation.

The issue is that previously, enqueueAfterDelay would schedule a callback at the desired time which would submit the operation to the queue wrapped in a check that the timer hasn't been canceled (see https://github.com/firebase/firebase-js-sdk/blob/master/packages/firestore/src/util/async_queue.ts#L173).

This means that any operation still in the queue has a chance to cancel the timer before it actually runs.

I'm not sure this ends up being a problem in practice, but it seems like a more straightforward way to handle this would be to change AsyncQueue.drain to include those delayed operations that have a zero delay.

Contributor Author:

I tried different approaches here, and they all failed. If drain() waits for retryable operations, the WebStorage tests pass with flying colors; however, I then lose the ability to inject failures in the spec tests without waiting for them.

I solved this by scheduling the first operation with enqueueAndForget. Note that this means that the first retry is not scheduled with backoff, but at least it is pushed to the end of the queue.

this.lastAttemptTime = Date.now();
return op();
}
);
});
}

// Apply backoff factor to determine next delay and ensure it is within
// bounds.
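For context on the cancellation concern raised in the thread above, here is a rough sketch of the pattern the reviewer describes for enqueueAfterDelay; the class and member names are assumptions, not the SDK's actual code:

```ts
// A delayed operation only reaches the queue if nobody cancelled it while the
// delay was pending; earlier queue operations therefore get a chance to
// cancel the timer before it runs.
class DelayedOperationSketch<T> {
  private cancelled = false;

  constructor(
    private readonly queue: {
      enqueueAndForget(op: () => Promise<unknown>): void;
    },
    private readonly op: () => Promise<T>,
    delayMs: number
  ) {
    setTimeout(() => this.handleDelayElapsed(), delayMs);
  }

  cancel(): void {
    this.cancelled = true;
  }

  private handleDelayElapsed(): void {
    // Re-check cancellation before handing the operation to the queue.
    if (!this.cancelled) {
      this.queue.enqueueAndForget(this.op);
    }
  }
}
```

Scheduling directly with enqueueAndForget skips this check entirely, which is the subtle behavioral difference being pointed out.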
82 changes: 74 additions & 8 deletions packages/firestore/src/util/async_queue.ts
@@ -17,8 +17,12 @@

import { debugAssert, fail } from './assert';
import { Code, FirestoreError } from './error';
import { logError } from './log';
import { logDebug, logError } from './log';
import { CancelablePromise, Deferred } from './promise';
import { ExponentialBackoff } from '../remote/backoff';
import { PlatformSupport } from '../platform/platform';

const LOG_TAG = 'AsyncQueue';

// Accept any return type from setTimeout().
// eslint-disable-next-line @typescript-eslint/no-explicit-any
@@ -66,7 +70,13 @@ export const enum TimerId {
* A timer used to retry transactions. Since there can be multiple concurrent
* transactions, multiple of these may be in the queue at a given time.
*/
RetryTransaction = 'retry_transaction'
RetryTransaction = 'retry_transaction',

/**
* A timer used to retry operations scheduled via retryable AsyncQueue
* operations.
*/
AsyncQueueRetry = 'async_queue_retry'
Contributor:

nit: the inconsistency between "retry" at the beginning vs end of the enumeration entry is a little glaring. RetryAsyncQueue sounds a little weird. Maybe rename RetryTransaction to TransactionRetry to match?

Contributor Author:

Renamed RetryTransaction.

}

/**
@@ -213,6 +223,29 @@ export class AsyncQueue {
// List of TimerIds to fast-forward delays for.
private timerIdsToSkip: TimerId[] = [];

// Backoff timer used to schedule retries for retryable operations
private backoff = new ExponentialBackoff(this, TimerId.AsyncQueueRetry);

// If set, points to the first retryable operation. The first retryable
// operation is the only operation that is retried with backoff. If this
// operation succeeds, all other retryable operations are run right away.
private firstRetryableOperation?: Promise<void>;

// Visibility handler that triggers an immediate retry of all retryable
// operations. Meant to speed up recovery when we regain file system access
// after page comes into foreground.
private visibilityHandler = (): void => {
// eslint-disable-next-line @typescript-eslint/no-floating-promises
this.runDelayedOperationsEarly(TimerId.AsyncQueueRetry);
};

constructor() {
const window = PlatformSupport.getPlatform().window;
if (window) {
window.addEventListener('visibilitychange', this.visibilityHandler);
}
}

// Is this AsyncQueue being shut down? If true, this instance will not enqueue
// any new operations, Promises from enqueue requests will not resolve.
get isShuttingDown(): boolean {
@@ -262,6 +295,10 @@
this.verifyNotFailed();
if (!this._isShuttingDown) {
this._isShuttingDown = true;
const window = PlatformSupport.getPlatform().window;
if (window) {
window.removeEventListener('visibilitychange', this.visibilityHandler);
}
await this.enqueueEvenAfterShutdown(op);
}
}
@@ -279,6 +316,41 @@
return this.enqueueInternal(op);
}

enqueueRetryable(op: () => Promise<void>): void {
this.verifyNotFailed();

if (this._isShuttingDown) {
return;
}

if (this.firstRetryableOperation) {
// If there is already a retryable operation, enqueue the current
// operation right after. We want to successfully run the first retryable
// operation before running all others.
// eslint-disable-next-line @typescript-eslint/no-floating-promises
this.firstRetryableOperation.then(() => {
this.enqueueRetryable(op);
});
} else {
const deferred = new Deferred<void>();
this.firstRetryableOperation = deferred.promise;

const retryingOp = async (): Promise<void> => {
try {
await op();
deferred.resolve();
this.backoff.reset();
this.firstRetryableOperation = undefined;
Contributor:

This chain from the first retryable operation creates a weird circumstance where we could have a train of retryable operations where we wake up, succeed in performing some operation, fail in some later one, and lose the rest of the chain.

It seems like maintaining an explicit queue of these (with a head and a tail) would make this more resistant to arbitrary hiccups like this.

Or maybe I'm not understanding what this code is doing exactly, in which case please add comments about how this is supposed to work.

Contributor Author:

I rewrote this code to use the same tail logic that we already have in the AsyncQueue. This simplifies the logic here a bit, which I hope is better than adding a comment.

I do think that the previous code did not drop any operations: even if we reset the beginning of a promise chain, all other operations are still evaluated in the order they were scheduled. The only gain was being able to reset firstRetryableOperation to its initial state, and that reset is what unnecessarily complicated this logic.

} catch (e) {
logDebug(LOG_TAG, 'Retryable operation failed: ' + e.message);
this.backoff.backoffAndRun(retryingOp);
}
};

this.backoff.backoffAndRun(retryingOp);
Contributor:

The streams try to perform their initial operation and only call backoffAndRun if there's been an initial failure. I wonder if your testing difficulties could be resolved if you called this.enqueueAndForget or similar here for the first iteration? That way you'd only be going through the backoff after a failure.

Contributor Author:

That would simplify my implementation, but it seems like that is not the correct behavior. The first backoff.backoffAndRun() doesn't incur any delays, so the first backoff in case of a stream failure doesn't perform any backoff.

FWIW, transaction_runner seems to do the correct thing and uses backoffAndRun() even for the first attempt.

Do you know what the desired behavior here is? I think the easy fix for my usage and the usage in the streams is to backoff on the first attempt.

Contributor:

We made it backoffAndRun rather than runWithBackoff specifically because we were intending that the first invocation is after a failure, and the action taken is to backoff then run.

The backoff delay of zero for the first iteration is intentional, at least in the stream case. We found that e.g. connections could get in a stale state or watch could return spurious errors after a connection or database had been idle for a while.

TransactionRunner seems like it's using the API wrong, at least according to the design intent. We have the same problem there where connections can be stale and fail so it's worth avoiding delay on the first failure, even for transactions.

In the case of IndexedDB failures, the initial zero delay doesn't seem as useful, but it also doesn't seem harmful. If you wanted to make the zero delay on first retry optional that would be fine with me too.

Contributor Author:

I think that for IndexedDB failures, it would "feel better" if we retried with backoff right away. We already retry aborted transactions immediately, since these are likely to succeed even without backoff. Thus, I would prefer if the secondary, error-code-unaware retry mechanism was different and used some sort of backoff.
It does seem like this unnecessarily complicates our tests though. I would suggest we defer this behavior change until we have actual evidence that backing off is an improvement.

}
}
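A hedged sketch of the "tail" approach the author mentions adopting in the thread above (simplified, with assumed names; not the final implementation): every retryable operation is appended to a single promise chain and only releases its link once it eventually succeeds, so a failure delays, but never drops, the operations queued behind it.

```ts
// Sketch: retryable ops form one chain; a failing op retries itself with
// backoff and keeps later ops waiting instead of losing them.
class RetryableTailSketch {
  private retryableTail: Promise<void> = Promise.resolve();

  constructor(
    private readonly backoffAndRun: (op: () => Promise<void>) => void
  ) {}

  enqueueRetryable(op: () => Promise<void>): void {
    this.retryableTail = this.retryableTail.then(
      () =>
        new Promise<void>(resolve => {
          const retryingOp = async (): Promise<void> => {
            try {
              await op();
              resolve(); // unblocks the next retryable op in the chain
            } catch {
              this.backoffAndRun(retryingOp); // retry; the chain stays intact
            }
          };
          this.backoffAndRun(retryingOp);
        })
    );
  }
}
```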

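For the backoff discussion above, a hedged sketch of the backoffAndRun contract as the reviewer describes it (parameter values and structure are assumptions, not the SDK's ExponentialBackoff): the first invocation runs with roughly zero delay, and each later invocation waits an exponentially growing, jittered delay until reset() is called.

```ts
class BackoffSketch {
  // Zero means the first attempt is scheduled (almost) immediately.
  private currentBaseMs = 0;

  constructor(
    private readonly initialDelayMs = 1000,
    private readonly backoffFactor = 1.5,
    private readonly maxDelayMs = 60000
  ) {}

  backoffAndRun(op: () => Promise<void>): void {
    // Jitter the delay by +/- 50% of the current base.
    const jitterMs = (Math.random() - 0.5) * this.currentBaseMs;
    const delayMs = Math.max(0, this.currentBaseMs + jitterMs);
    setTimeout(() => {
      op().catch(() => {
        // The caller decides whether to call backoffAndRun() again.
      });
    }, delayMs);

    // Grow the base delay for the next attempt, clamped to the maximum.
    this.currentBaseMs =
      this.currentBaseMs === 0
        ? this.initialDelayMs
        : Math.min(this.maxDelayMs, this.currentBaseMs * this.backoffFactor);
  }

  reset(): void {
    this.currentBaseMs = 0;
  }
}
```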
private enqueueInternal<T extends unknown>(op: () => Promise<T>): Promise<T> {
const newTail = this.tail.then(() => {
this.operationInProgress = true;
@@ -399,12 +471,6 @@
runDelayedOperationsEarly(lastTimerId: TimerId): Promise<void> {
// Note that draining may generate more delayed ops, so we do that first.
return this.drain().then(() => {
debugAssert(
lastTimerId === TimerId.All ||
this.containsDelayedOperation(lastTimerId),
`Attempted to drain to missing operation ${lastTimerId}`
);

// Run ops in the same order they'd run if they ran naturally.
this.delayedOperations.sort((a, b) => a.targetTimeMs - b.targetTimeMs);

71 changes: 71 additions & 0 deletions packages/firestore/test/unit/specs/recovery_spec.test.ts
@@ -0,0 +1,71 @@
/**
* @license
* Copyright 2020 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

import { describeSpec, specTest } from './describe_spec';
import { client } from './spec_builder';
import { TimerId } from '../../../src/util/async_queue';
import { Query } from '../../../src/core/query';
import { path } from '../../util/helpers';

describeSpec(
'Persistence Recovery',
['durable-persistence', 'no-ios', 'no-android'],
() => {
specTest(
'Write is acknowledged by primary client (with recovery)',
['multi-client'],
() => {
return client(0)
.expectPrimaryState(true)
.client(1)
.expectPrimaryState(false)
.userSets('collection/a', { v: 1 })
.failDatabase()
Contributor:

Why are fail and recover instructions issued from client 1? Does it matter?

Contributor Author:

Client 0 is the primary client and writes into WebStorage. Client 1 is the secondary client; it reads from WebStorage and fails to apply the changes. The code in this PR only deals with the behavior in secondary clients.

.client(0)
.writeAcks('collection/a', 1, { expectUserCallback: false })
.client(1)
.recoverDatabase()
.runTimer(TimerId.AsyncQueueRetry)
Contributor:

It would be useful to see that retry can happen multiple times (i.e. that running the timer before recovery does not raise an event but still does raise an event after recovery).

.expectUserCallbacks({
acknowledged: ['collection/a']
});
}
);
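Picking up the reviewer's suggestion above that the retry should be observable more than once, here is a hedged sketch of such a test in the same spec DSL; the assumption that an intermediate timer run while the database still fails produces no user callback is mine, not verified against the spec runner:

```ts
specTest(
  'Write is acknowledged only after recovery (sketch)',
  ['multi-client'],
  () => {
    return (
      client(0)
        .expectPrimaryState(true)
        .client(1)
        .expectPrimaryState(false)
        .userSets('collection/a', { v: 1 })
        .failDatabase()
        .client(0)
        .writeAcks('collection/a', 1, { expectUserCallback: false })
        .client(1)
        // First retry fires while the database still fails: nothing expected.
        .runTimer(TimerId.AsyncQueueRetry)
        .recoverDatabase()
        // A later retry after recovery surfaces the acknowledgement.
        .runTimer(TimerId.AsyncQueueRetry)
        .expectUserCallbacks({
          acknowledged: ['collection/a']
        })
    );
  }
);
```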

specTest(
'Query raises events in secondary client (with recovery)',
['multi-client'],
() => {
const query = Query.atPath(path('collection'));

return client(0)
.expectPrimaryState(true)
.client(1)
.expectPrimaryState(false)
.userListens(query)
.failDatabase()
.client(0)
.expectListen(query)
.watchAcksFull(query, 1000)
.client(1)
.recoverDatabase()
.runTimer(TimerId.AsyncQueueRetry)
.expectEvents(query, {});
}
);
}
);
34 changes: 1 addition & 33 deletions packages/firestore/test/unit/specs/remote_store_spec.test.ts
@@ -1,6 +1,6 @@
/**
* @license
* Copyright 2017 Google Inc.
* Copyright 2017 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -84,36 +84,4 @@ describeSpec('Remote store:', [], () => {
.expectEvents(query, { added: [doc1] })
);
});

// TODO(b/72313632): This test is web-only because the Android / iOS spec
// tests exclude backoff entirely.
specTest(
'Handles user changes while offline (b/74749605).',
['no-android', 'no-ios'],
() => {
const query = Query.atPath(path('collection'));
return (
spec()
.userListens(query)

// close the stream (this should trigger retry with backoff; but don't
// run it in an attempt to reproduce b/74749605).
.watchStreamCloses(Code.UNAVAILABLE, { runBackoffTimer: false })
.expectEvents(query, { fromCache: true })

// Because we didn't let the backoff timer run and restart the watch
// stream, there will be no active targets.
.expectActiveTargets()

// Change user (will shut down existing streams and start new ones).
.changeUser('abc')
// Our query should be sent to the new stream.
.expectActiveTargets({ query, resumeToken: '' })

// Close the (newly-created) stream as if it too failed (should trigger
// retry with backoff, potentially reproducing the crash in b/74749605).
.watchStreamCloses(Code.UNAVAILABLE)
);
}
);
});
25 changes: 18 additions & 7 deletions packages/firestore/test/unit/specs/spec_builder.ts
@@ -418,6 +418,22 @@ export class SpecBuilder {
return this;
}

failDatabase(): this {
Contributor:

Could you add comments here about how this works? I suspect this works by modifying all steps going forward until recoverDatabase is called, but the code here suggests it modifies only the current step.

Contributor Author:

This ends up calling doFailDatabase() which flips injectFailures. injectFailures is long-lived and tied to the lifetime of the persistence implementation. I added a short comment.

this.nextStep();
this.currentStep = {
failDatabase: true
};
return this;
}

recoverDatabase(): this {
this.nextStep();
this.currentStep = {
failDatabase: false
};
return this;
}
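A hedged sketch of the long-lived injectFailures flag the author describes in the thread above; the class and method names here are assumptions chosen to illustrate the behavior, not the real test harness:

```ts
// failDatabase() flips the flag on, recoverDatabase() flips it off, and every
// simulated IndexedDB transaction consults it for the lifetime of the mock
// persistence layer.
class MockPersistenceSketch {
  injectFailures = false;

  runTransaction<T>(action: () => Promise<T>): Promise<T> {
    if (this.injectFailures) {
      return Promise.reject(new Error('Simulated IndexedDB failure'));
    }
    return action();
  }
}
```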

expectIsShutdown(): this {
this.assertStep('Active target expectation requires previous step');
const currentStep = this.currentStep!;
@@ -716,19 +732,14 @@
return this;
}

watchStreamCloses(error: Code, opts?: { runBackoffTimer: boolean }): this {
if (!opts) {
opts = { runBackoffTimer: true };
}

watchStreamCloses(error: Code): this {
this.nextStep();
this.currentStep = {
watchStreamClose: {
error: {
code: mapRpcCodeFromCode(error),
message: 'Simulated Backend Error'
},
runBackoffTimer: opts.runBackoffTimer
}
}
};
return this;