
Retry WebStorage operations #2879


Merged: 8 commits merged into master from mrschmidt/runretryable on Apr 15, 2020

Conversation

@schmidt-sebastian (Contributor) commented on Apr 7, 2020

This PR allows us to recover from IndexedDB failures when a Multi-Tab operation scheduled via a WebStorage notification fails.

To do this:

  • It introduces the concept of "retryable" operations. These are retried directly on the AsyncQueue (see the sketch after this list).
  • All retryable operations use the same retry mechanism. The assumption is that if the first retryable operation fails, the second would fail too.
  • Retries use the same exponential backoff that we use for our streams (down to the delays).
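In rough terms, the retry loop behaves like the sketch below. This is only a simplified, self-contained illustration of the idea, not the code in this PR: the real implementation reuses the existing AsyncQueue and ExponentialBackoff machinery, setTimeout merely stands in for scheduling on the AsyncQueue, and the delay constants are assumptions mirroring the stream backoff.

```ts
// Simplified sketch: a retryable operation reschedules itself with
// exponential backoff until it succeeds.
type RetryableOp = () => Promise<void>;

function runWithRetries(
  op: RetryableOp,
  initialDelayMs = 1000, // assumed to mirror the stream backoff
  backoffFactor = 1.5,
  maxDelayMs = 60000
): void {
  let currentDelayMs = 0; // the first attempt (and first retry) has no delay

  const attempt = async (): Promise<void> => {
    try {
      await op();
    } catch {
      // Failure: try again after the current backoff delay, then grow it.
      setTimeout(() => void attempt(), currentDelayMs);
      currentDelayMs = Math.min(
        maxDelayMs,
        currentDelayMs === 0 ? initialDelayMs : currentDelayMs * backoffFactor
      );
    }
  };

  void attempt();
}
```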

Since an operation scheduled via a DelayedOperation is always run after all other operations, the PR also changes the backoff handler to use AsyncQueue.enqueue directly if the backoff is zero. This is not strictly required, but not doing it breaks all tests that rely on WebStorage, since AsyncQueue.drain() does not drain delayed operations.
I tried to make this change in the AsyncQueue directly, but it would have required us to add a new way to create CancellableOperations from enqueue().
A side effect of this is that the spec tests' way of fast-forwarding the backoff timer doesn't work as it did before: the first backoff timer cannot be fast-forwarded, and subsequent attempts are not always scheduled. This causes some tests to time out, as we sometimes wait for the backoff to expire. I changed the spec tests to always skip the backoff delays, which speeds up all tests considerably (they now run faster than on master).
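As a minimal sketch of that scheduling decision (the Queue interface below is a hypothetical stand-in for AsyncQueue; only the branch on a zero delay is the point):

```ts
type Op = () => Promise<void>;

// Hypothetical stand-in for the relevant parts of AsyncQueue.
interface Queue {
  // Appends the operation to the main queue; covered by drain().
  enqueueAndForget(op: Op): void;
  // Schedules the operation as a delayed (cancellable) operation;
  // not covered by drain().
  enqueueAfterDelay(delayMs: number, op: Op): void;
}

function scheduleBackoffAttempt(queue: Queue, delayMs: number, op: Op): void {
  if (delayMs === 0) {
    // Zero backoff: enqueue directly so tests that drain the queue see it.
    queue.enqueueAndForget(op);
  } else {
    queue.enqueueAfterDelay(delayMs, op);
  }
}
```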

Note: I wasn't able to figure out how to make the test "handles user changes while offline (b/74749605)" work, since the test relies on a backoff for the first attempt.

@google-oss-bot (Contributor) commented on Apr 7, 2020

Binary Size Report

Affected SDKs

  • @firebase/auth

    | Type | Base (ce3addb) | Head (b2cfbf2) | Diff |
    | --- | --- | --- | --- |
    | main | 177 kB | 177 kB | +63 B (+0.0%) |
    | module | 177 kB | 177 kB | +63 B (+0.0%) |

  • @firebase/database

    | Type | Base (ce3addb) | Head (b2cfbf2) | Diff |
    | --- | --- | --- | --- |
    | browser | 268 kB | 268 kB | -1 B (-0.0%) |
    | esm2017 | 235 kB | 235 kB | -1 B (-0.0%) |
    | main | 269 kB | 269 kB | -1 B (-0.0%) |
    | module | 266 kB | 266 kB | -1 B (-0.0%) |

  • @firebase/firestore

    | Type | Base (ce3addb) | Head (b2cfbf2) | Diff |
    | --- | --- | --- | --- |
    | browser | 252 kB | 253 kB | +739 B (+0.3%) |
    | esm2017 | 199 kB | 200 kB | +491 B (+0.2%) |
    | main | 486 kB | 487 kB | +1.05 kB (+0.2%) |
    | module | 250 kB | 251 kB | +717 B (+0.3%) |

  • @firebase/firestore/memory

    | Type | Base (ce3addb) | Head (b2cfbf2) | Diff |
    | --- | --- | --- | --- |
    | browser | 199 kB | 200 kB | +739 B (+0.4%) |
    | esm2017 | 156 kB | 156 kB | +491 B (+0.3%) |
    | main | 376 kB | 377 kB | +1.05 kB (+0.3%) |
    | module | 197 kB | 198 kB | +717 B (+0.4%) |

  • @firebase/performance

    | Type | Base (ce3addb) | Head (b2cfbf2) | Diff |
    | --- | --- | --- | --- |
    | browser | 28.1 kB | 28.1 kB | -1 B (-0.0%) |
    | esm2017 | 26.1 kB | 26.1 kB | -1 B (-0.0%) |
    | main | 28.1 kB | 28.1 kB | -1 B (-0.0%) |
    | module | 27.9 kB | 27.9 kB | -1 B (-0.0%) |

  • @firebase/storage

    | Type | Base (ce3addb) | Head (b2cfbf2) | Diff |
    | --- | --- | --- | --- |
    | esm2017 | 56.6 kB | 56.4 kB | -222 B (-0.4%) |
    | main | 62.9 kB | 62.7 kB | -220 B (-0.3%) |
    | module | 62.7 kB | 62.5 kB | -220 B (-0.4%) |

  • @firebase/testing

    | Type | Base (ce3addb) | Head (b2cfbf2) | Diff |
    | --- | --- | --- | --- |
    | main | 6.61 kB | 6.67 kB | +67 B (+1.0%) |

  • firebase

    | Type | Base (ce3addb) | Head (b2cfbf2) | Diff |
    | --- | --- | --- | --- |
    | firebase-auth.js | 173 kB | 173 kB | +63 B (+0.0%) |
    | firebase-database.js | 187 kB | 187 kB | -1 B (-0.0%) |
    | firebase-firestore.js | 294 kB | 295 kB | +712 B (+0.2%) |
    | firebase-firestore.memory.js | 242 kB | 243 kB | +712 B (+0.3%) |
    | firebase-performance-standalone.es2017.js | 72.7 kB | 72.7 kB | -3 B (-0.0%) |
    | firebase-performance-standalone.js | 48.1 kB | 48.1 kB | -2 B (-0.0%) |
    | firebase-performance.js | 38.5 kB | 38.5 kB | -2 B (-0.0%) |
    | firebase-storage.js | 41.0 kB | 40.9 kB | -117 B (-0.3%) |
    | firebase.js | 828 kB | 828 kB | +680 B (+0.1%) |

Test Logs

@schmidt-sebastian changed the title from Retry WebStorage operations to WIP Retry WebStorage operations on Apr 9, 2020
}
);
} else {
this.queue.enqueueAndForget(() => {
Contributor:

This is a subtly different behavior from before that seems like it has the chance to break cancellation.

The issue is that previously, enqueueAfterDelay would schedule a callback at the desired time which would submit the operation to the queue wrapped in a check that the timer hasn't been canceled (see https://github.com/firebase/firebase-js-sdk/blob/master/packages/firestore/src/util/async_queue.ts#L173).

This means that any operation still in the queue has a chance to cancel the timer before it actually runs.

I'm not sure this ends up being a problem in practice, but it seems like a more straightforward way to handle this would be to change AsyncQueue.drain to include those delayed operations that have a zero delay.
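A hypothetical sketch of that alternative (the field and method names below are illustrative, not the real AsyncQueue API):

```ts
// Hypothetical: a drain() variant that first fast-forwards delayed
// operations whose delay is zero, then waits for the queue's tail as usual.
interface DelayedOp {
  delayMs: number;
  skipDelay(): void; // submits the operation now, keeping the cancellation check
}

async function drainIncludingZeroDelays(
  delayedOperations: DelayedOp[],
  getTail: () => Promise<unknown>
): Promise<void> {
  let tail: Promise<unknown>;
  do {
    // Zero-delay timers go through the queue before we wait, so the
    // cancellation wrapper described above still gets a chance to run.
    delayedOperations
      .filter(op => op.delayMs === 0)
      .forEach(op => op.skipDelay());
    tail = getTail();
    await tail;
  } while (tail !== getTail());
}
```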

Contributor Author:

I tried different approaches here, and they all failed. If drain() waits for retryable operations, the WebStorage tests pass with flying colors; however, I then lose the ability to inject failures in the spec tests without waiting for them.

I solved this by scheduling the first operation with enqueueAndForget. Note that this means that the first retry is not scheduled with backoff, but at least it is pushed to the end of the queue.

await op();
deferred.resolve();
this.backoff.reset();
this.firstRetryableOperation = undefined;
Contributor:

This chain from the first retryable operation creates a weird circumstance where we could have a train of retryable operations where we wake up, succeed in performing some operation, fail in some later one, and lose the rest of the chain.

It seems like maintaining an explicit queue of these (with a head and a tail) would make this more resistant to arbitrary hiccups like this.

Or maybe I'm not understanding what this code is doing exactly, in which case please add comments about how it is supposed to work.

Contributor Author:

I rewrote this code to use the same tail logic that we already have in the AsyncQueue. This simplifies the logic here a bit, which I hope is better than adding a comment.

I do think that the previous code did not drop any operations: even if we reset the beginning of a promise chain, all other operations are still evaluated in the order they were scheduled. The optimization was that we could reset firstRetryableOperation to its initial state, but that unnecessarily complicated the logic.
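For reference, the tail-chaining pattern being referred to looks roughly like this (simplified; a real implementation would back off between attempts rather than retrying immediately):

```ts
// Every new retryable operation is chained onto the previous one, so
// operations run strictly in the order they were enqueued; a failing
// operation keeps retrying and simply holds back the ones queued behind it.
class RetryableChain {
  private retryableTail: Promise<void> = Promise.resolve();

  enqueueRetryable(op: () => Promise<void>): void {
    this.retryableTail = this.retryableTail.then(async () => {
      for (;;) {
        try {
          await op();
          return;
        } catch {
          // Swallow the error and retry; nothing in the chain is dropped.
        }
      }
    });
  }
}
```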

}
};

this.backoff.backoffAndRun(retryingOp);
Contributor:

The streams try to perform their initial operation and only call backoffAndRun if there's been an initial failure. I wonder if your testing difficulties could be resolved if you called this.enqueueAndForget or similar here for the first iteration? That way you'd only be going through the backoff after a failure.

Contributor Author:

That would simplify my implementation, but it doesn't seem to be the correct behavior: the first backoff.backoffAndRun() doesn't incur any delay, so the first backoff after a stream failure doesn't actually back off.

FWIW, transaction_runner seems to do the correct thing and uses backoffAndRun() even for the first attempt.

Do you know what the desired behavior is here? I think the easy fix for my usage and for the streams is to back off on the first attempt.

Contributor:

We made it backoffAndRun rather than runWithBackoff specifically because we were intending that the first invocation is after a failure, and the action taken is to backoff then run.

The backoff delay of zero for the first iteration is intentional, at least in the stream case. We found that e.g. connections could get in a stale state or watch could return spurious errors after a connection or database had been idle for a while.

TransactionRunner seems like it's using the API wrong, at least according to the design intent. We have the same problem there where connections can be stale and fail so it's worth avoiding delay on the first failure, even for transactions.

In the case of IndexedDB failures, the initial zero delay doesn't seem as useful, but it also doesn't seem harmful. If you wanted to make the zero delay on first retry optional that would be fine with me too.

Contributor Author:

I think that for IndexedDB failures, it would "feel better" if we retried with backoff right away. We already retry aborted transactions immediately, since those are likely to succeed even without backoff. Thus, I would prefer it if the secondary, error-code-unaware retry mechanism were different and used some sort of backoff.
It does seem like this unnecessarily complicates our tests though. I would suggest we defer this behavior change until we have actual evidence that backing off is an improvement.

@wilhuff assigned schmidt-sebastian and unassigned themselves on Apr 14, 2020
* A timer used to retry operations scheduled via retryable AsyncQueue
* operations.
*/
AsyncQueueRetry = 'async_queue_retry'
Contributor:

nit: the inconsistency between "retry" at the beginning vs end of the enumeration entry is a little glaring. RetryAsyncQueue sounds a little weird. Maybe rename RetryTransaction to TransactionRetry to match?

Contributor Author:

Done. Renamed RetryTransaction to TransactionRetry.

@@ -195,6 +205,10 @@ export class AsyncQueue {
// The last promise in the queue.
private tail: Promise<unknown> = Promise.resolve();

// The last retryable operation. Retryable operations are run in order and
// retried with backoff.
private retryableTail = Promise.resolve();
Contributor:

Does this need a type of Promise<unknown>, as on tail, above?

Contributor Author:

All retryable operations return void (to make sure users don't wait for results). Promise.resolve() implicitly returns Promise<void>, but I made it explicit by adding the type here.

queue.enqueueRetryable(async () => {
doStep(1);
if (completedSteps.length === 1) {
throw new Error("Let's retry");
Contributor:

When combing through test logs, it's useful for simulated errors to clearly label themselves so they can be easily ignored. Consider prefixing this message with "Simulated error" or similar.

Contributor Author:

Done

.client(1)
.expectPrimaryState(false)
.userSets('collection/a', { v: 1 })
.failDatabase()
Contributor:

Why are fail and recover instructions issued from client 1? Does it matter?

Contributor Author:

Client 0 is the primary client and writes into WebStorage. Client 1 is the secondary client; it reads from WebStorage and fails to apply the changes. The code in this PR only deals with the behavior of secondary clients.

.writeAcks('collection/a', 1, { expectUserCallback: false })
.client(1)
.recoverDatabase()
.runTimer(TimerId.AsyncQueueRetry)
Contributor:

It would be useful to see that retry can happen multiple times (i.e. that running the timer before recovery does not raise an event but still does raise an event after recovery).
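A hypothetical sketch of that extension, using only builder calls that already appear in this PR's diffs (the declarations below are placeholders for the spec-test framework):

```ts
// Hypothetical: run the retry timer once while the database is still
// failing (the queued WebStorage change should not surface), then recover
// and run it again (the change should now apply).
interface StepsLike {
  client(index: number): StepsLike;
  failDatabase(): StepsLike;
  recoverDatabase(): StepsLike;
  runTimer(timerId: string): StepsLike;
}
declare const steps: StepsLike;
declare const TimerId: { AsyncQueueRetry: string };

steps
  .client(1)
  .failDatabase()
  .runTimer(TimerId.AsyncQueueRetry) // still failing: the retry is a no-op
  .recoverDatabase()
  .runTimer(TimerId.AsyncQueueRetry); // after recovery the retry succeeds
```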

@@ -418,6 +418,22 @@ export class SpecBuilder {
return this;
}

failDatabase(): this {
Contributor:

Could you add comments here about how this works? I suspect this works by modifying all steps going forward until recoverDatabase is called, but the code here suggests this merely modifies the current step only.

Contributor Author:

This ends up calling doFailDatabase(), which flips injectFailures. injectFailures is long-lived and tied to the lifetime of the persistence implementation. I added a short comment.
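Conceptually, the failure injection works along these lines (assumed shape; only injectFailures is taken from the PR, the other names are illustrative):

```ts
// While injectFailures is set, every simulated IndexedDB transaction
// rejects, mimicking a failing database; clearing it "recovers" the
// database for all subsequent steps.
class FailableTestPersistence {
  injectFailures = false;

  runTransaction<T>(action: () => Promise<T>): Promise<T> {
    if (this.injectFailures) {
      return Promise.reject(new Error('Simulated retryable IndexedDB error'));
    }
    return action();
  }
}
```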

@wilhuff assigned schmidt-sebastian and unassigned themselves on Apr 15, 2020
@wilhuff (Contributor) left a comment:

LGTM

@wilhuff removed their assignment on Apr 15, 2020
@schmidt-sebastian merged commit a82af58 into master on Apr 15, 2020
firebase locked and limited conversation to collaborators on May 16, 2020
@schmidt-sebastian deleted the mrschmidt/runretryable branch on July 9, 2020, 17:45