
Retry WebStorage operations #2879


Merged: 8 commits merged into master from mrschmidt/runretryable on Apr 15, 2020

Conversation

@schmidt-sebastian (Contributor) commented on Apr 7, 2020

This PR allows us to recover from IndexedDB failures when a Multi-Tab operation scheduled via a WebStorage notification fails.

To do this:

  • It introduces the concept of "retryable" operations. These are retried directly on the AsyncQueue (see the sketch after this list).
  • All retryable operations use the same retry mechanism. The assumption is that if the first retryable operation fails, the second would fail too.
  • Retries use the same exponential backoff that we use for our streams (down to the delays).
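In rough terms, the retry loop behaves like the sketch below. This is only a simplified, self-contained illustration of the idea, not the code in this PR: the real implementation reuses the existing AsyncQueue and ExponentialBackoff machinery, setTimeout merely stands in for scheduling on the AsyncQueue, and the delay constants are assumptions mirroring the stream backoff.

```ts
// Simplified sketch: a retryable operation reschedules itself with
// exponential backoff until it succeeds.
type RetryableOp = () => Promise<void>;

function runWithRetries(
  op: RetryableOp,
  initialDelayMs = 1000, // assumed to mirror the stream backoff
  backoffFactor = 1.5,
  maxDelayMs = 60000
): void {
  let currentDelayMs = 0; // the first attempt (and first retry) has no delay

  const attempt = async (): Promise<void> => {
    try {
      await op();
    } catch {
      // Failure: try again after the current backoff delay, then grow it.
      setTimeout(() => void attempt(), currentDelayMs);
      currentDelayMs = Math.min(
        maxDelayMs,
        currentDelayMs === 0 ? initialDelayMs : currentDelayMs * backoffFactor
      );
    }
  };

  void attempt();
}
```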

Since an operation scheduled via a DelayedOperation is always run after all other operations, the PR also changes the backoff handler to use AsyncQueue.enqueue directly if the backoff is zero. This is not strictly required, but not doing it breaks all tests that rely on WebStorage, since AsyncQueue.drain() does not drain delayed operations.
I tried to make this change in the AsyncQueue directly, but it would have required us to add a new way to create CancellableOperations from enqueue().
A side effect of this is that the spec tests' way of fast-forwarding the backoff timer doesn't work as it did before: the first backoff timer cannot be fast-forwarded, and subsequent attempts are not always scheduled. This causes some tests to time out, as we sometimes wait for the backoff to expire. I changed the spec tests to always skip the backoff delays, which speeds up all tests considerably (they now run faster than on master).
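As a minimal sketch of that scheduling decision (the Queue interface below is a hypothetical stand-in for AsyncQueue; only the branch on a zero delay is the point):

```ts
type Op = () => Promise<void>;

// Hypothetical stand-in for the relevant parts of AsyncQueue.
interface Queue {
  // Appends the operation to the main queue; covered by drain().
  enqueueAndForget(op: Op): void;
  // Schedules the operation as a delayed (cancellable) operation;
  // not covered by drain().
  enqueueAfterDelay(delayMs: number, op: Op): void;
}

function scheduleBackoffAttempt(queue: Queue, delayMs: number, op: Op): void {
  if (delayMs === 0) {
    // Zero backoff: enqueue directly so tests that drain the queue see it.
    queue.enqueueAndForget(op);
  } else {
    queue.enqueueAfterDelay(delayMs, op);
  }
}
```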

Note: I wasn't able to figure out how to make the test "handles user changes while offline (b/74749605)" work, since the test relies on a backoff for the first attempt.

@google-oss-bot (Contributor) commented on Apr 7, 2020

Binary Size Report

Affected SDKs

  • @firebase/auth

    | Type | Base (ce3addb) | Head (b2cfbf2) | Diff |
    | --- | --- | --- | --- |
    | main | 177 kB | 177 kB | +63 B (+0.0%) |
    | module | 177 kB | 177 kB | +63 B (+0.0%) |

  • @firebase/database

    | Type | Base (ce3addb) | Head (b2cfbf2) | Diff |
    | --- | --- | --- | --- |
    | browser | 268 kB | 268 kB | -1 B (-0.0%) |
    | esm2017 | 235 kB | 235 kB | -1 B (-0.0%) |
    | main | 269 kB | 269 kB | -1 B (-0.0%) |
    | module | 266 kB | 266 kB | -1 B (-0.0%) |

  • @firebase/firestore

    | Type | Base (ce3addb) | Head (b2cfbf2) | Diff |
    | --- | --- | --- | --- |
    | browser | 252 kB | 253 kB | +739 B (+0.3%) |
    | esm2017 | 199 kB | 200 kB | +491 B (+0.2%) |
    | main | 486 kB | 487 kB | +1.05 kB (+0.2%) |
    | module | 250 kB | 251 kB | +717 B (+0.3%) |

  • @firebase/firestore/memory

    | Type | Base (ce3addb) | Head (b2cfbf2) | Diff |
    | --- | --- | --- | --- |
    | browser | 199 kB | 200 kB | +739 B (+0.4%) |
    | esm2017 | 156 kB | 156 kB | +491 B (+0.3%) |
    | main | 376 kB | 377 kB | +1.05 kB (+0.3%) |
    | module | 197 kB | 198 kB | +717 B (+0.4%) |

  • @firebase/performance

    | Type | Base (ce3addb) | Head (b2cfbf2) | Diff |
    | --- | --- | --- | --- |
    | browser | 28.1 kB | 28.1 kB | -1 B (-0.0%) |
    | esm2017 | 26.1 kB | 26.1 kB | -1 B (-0.0%) |
    | main | 28.1 kB | 28.1 kB | -1 B (-0.0%) |
    | module | 27.9 kB | 27.9 kB | -1 B (-0.0%) |

  • @firebase/storage

    | Type | Base (ce3addb) | Head (b2cfbf2) | Diff |
    | --- | --- | --- | --- |
    | esm2017 | 56.6 kB | 56.4 kB | -222 B (-0.4%) |
    | main | 62.9 kB | 62.7 kB | -220 B (-0.3%) |
    | module | 62.7 kB | 62.5 kB | -220 B (-0.4%) |

  • @firebase/testing

    | Type | Base (ce3addb) | Head (b2cfbf2) | Diff |
    | --- | --- | --- | --- |
    | main | 6.61 kB | 6.67 kB | +67 B (+1.0%) |

  • firebase

    | Type | Base (ce3addb) | Head (b2cfbf2) | Diff |
    | --- | --- | --- | --- |
    | firebase-auth.js | 173 kB | 173 kB | +63 B (+0.0%) |
    | firebase-database.js | 187 kB | 187 kB | -1 B (-0.0%) |
    | firebase-firestore.js | 294 kB | 295 kB | +712 B (+0.2%) |
    | firebase-firestore.memory.js | 242 kB | 243 kB | +712 B (+0.3%) |
    | firebase-performance-standalone.es2017.js | 72.7 kB | 72.7 kB | -3 B (-0.0%) |
    | firebase-performance-standalone.js | 48.1 kB | 48.1 kB | -2 B (-0.0%) |
    | firebase-performance.js | 38.5 kB | 38.5 kB | -2 B (-0.0%) |
    | firebase-storage.js | 41.0 kB | 40.9 kB | -117 B (-0.3%) |
    | firebase.js | 828 kB | 828 kB | +680 B (+0.1%) |

Test Logs

@schmidt-sebastian changed the title from Retry WebStorage operations to WIP Retry WebStorage operations on Apr 9, 2020
}
);
} else {
this.queue.enqueueAndForget(() => {
Contributor:

This is a subtly different behavior from before that seems like it has the chance to break cancellation.

The issue is that previously, enqueueAfterDelay would schedule a callback at the desired time which would submit the operation to the queue wrapped in a check that the timer hasn't been canceled (see https://github.com/firebase/firebase-js-sdk/blob/master/packages/firestore/src/util/async_queue.ts#L173).

This means that any operation still in the queue has a chance to cancel the timer before it actually runs.

I'm not sure this ends up being a problem in practice, but it seems like a more straightforward way to handle this would be to change AsyncQueue.drain to include those delayed operations that have a zero delay.
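A hypothetical sketch of that alternative (the field and method names below are illustrative, not the real AsyncQueue API):

```ts
// Hypothetical: a drain() variant that first fast-forwards delayed
// operations whose delay is zero, then waits for the queue's tail as usual.
interface DelayedOp {
  delayMs: number;
  skipDelay(): void; // submits the operation now, keeping the cancellation check
}

async function drainIncludingZeroDelays(
  delayedOperations: DelayedOp[],
  getTail: () => Promise<unknown>
): Promise<void> {
  let tail: Promise<unknown>;
  do {
    // Zero-delay timers go through the queue before we wait, so the
    // cancellation wrapper described above still gets a chance to run.
    delayedOperations
      .filter(op => op.delayMs === 0)
      .forEach(op => op.skipDelay());
    tail = getTail();
    await tail;
  } while (tail !== getTail());
}
```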

Contributor Author:

I tried different approaches here, and they all failed. If drain() waits for retryable operations, the WebStorage tests pass with flying colors; however, I then lose the ability to inject failures in the spec tests without waiting for them.

I solved this by scheduling the first operation with enqueueAndForget. Note that this means that the first retry is not scheduled with backoff, but at least it is pushed to the end of the queue.

await op();
deferred.resolve();
this.backoff.reset();
this.firstRetryableOperation = undefined;
Contributor:

This chain from the first retryable operation creates a weird circumstance where we could have a train of retryable operations where we wake up, succeed in performing some operation, fail in some later one, and lose the rest of the chain.

It seems like maintaining an explicit queue of these (with a head and a tail) would make this more resistant to arbitrary hiccups like this.

Or maybe I'm not understanding what this code is doing exactly, in which case please add comments about how it is supposed to work.

Contributor Author:

I rewrote this code to use the same tail logic that we already have in the AsyncQueue. This simplifies the logic here a bit, which I hope is better than adding a comment.

I do think that the previous code did not drop any operations: even if we reset the beginning of a promise chain, all other operations are still evaluated in the order they were scheduled. The optimization was that we could reset firstRetryableOperation to its initial state, but that unnecessarily complicated the logic.
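For reference, the tail-chaining pattern being referred to looks roughly like this (simplified; a real implementation would back off between attempts rather than retrying immediately):

```ts
// Every new retryable operation is chained onto the previous one, so
// operations run strictly in the order they were enqueued; a failing
// operation keeps retrying and simply holds back the ones queued behind it.
class RetryableChain {
  private retryableTail: Promise<void> = Promise.resolve();

  enqueueRetryable(op: () => Promise<void>): void {
    this.retryableTail = this.retryableTail.then(async () => {
      for (;;) {
        try {
          await op();
          return;
        } catch {
          // Swallow the error and retry; nothing in the chain is dropped.
        }
      }
    });
  }
}
```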

}
};

this.backoff.backoffAndRun(retryingOp);
Contributor:

The streams try to perform their initial operation and only call backoffAndRun if there's been an initial failure. I wonder if your testing difficulties could be resolved if you called this.enqueueAndForget or similar here for the first iteration? That way you'd only be going through the backoff after a failure.

Contributor Author:

That would simplify my implementation, but it doesn't seem to be the correct behavior: the first backoff.backoffAndRun() doesn't incur any delay, so the first backoff after a stream failure doesn't actually back off.

FWIW, transaction_runner seems to do the correct thing and uses backoffAndRun() even for the first attempt.

Do you know what the desired behavior is here? I think the easy fix for my usage and for the streams is to back off on the first attempt.

Contributor:

We made it backoffAndRun rather than runWithBackoff specifically because we were intending that the first invocation is after a failure, and the action taken is to backoff then run.

The backoff delay of zero for the first iteration is intentional, at least in the stream case. We found that e.g. connections could get in a stale state or watch could return spurious errors after a connection or database had been idle for a while.

TransactionRunner seems like it's using the API wrong, at least according to the design intent. We have the same problem there where connections can be stale and fail so it's worth avoiding delay on the first failure, even for transactions.

In the case of IndexedDB failures, the initial zero delay doesn't seem as useful, but it also doesn't seem harmful. If you wanted to make the zero delay on first retry optional that would be fine with me too.

Contributor Author:

I think that for IndexedDB failures, it would "feel better" if we retried with backoff right away. We already retry aborted transactions immediately, since those are likely to succeed even without backoff. Thus, I would prefer it if the secondary, error-code-unaware retry mechanism were different and used some sort of backoff.
It does seem like this unnecessarily complicates our tests though. I would suggest we defer this behavior change until we have actual evidence that backing off is an improvement.

@wilhuff assigned schmidt-sebastian and unassigned themselves on Apr 14, 2020
* A timer used to retry operations scheduled via retryable AsyncQueue
* operations.
*/
AsyncQueueRetry = 'async_queue_retry'
Contributor:

nit: the inconsistency between "retry" at the beginning vs end of the enumeration entry is a little glaring. RetryAsyncQueue sounds a little weird. Maybe rename RetryTransaction to TransactionRetry to match?

Contributor Author:

Done. Renamed RetryTransaction to TransactionRetry.

@@ -195,6 +205,10 @@ export class AsyncQueue {
// The last promise in the queue.
private tail: Promise<unknown> = Promise.resolve();

// The last retryable operation. Retryable operations are run in order and
// retried with backoff.
private retryableTail = Promise.resolve();
Contributor:

Does this need a type of Promise<unknown>, as on tail, above?

Contributor Author:

All retryable operations return void (to make sure users don't wait for results). Promise.resolve() implicitly returns Promise<void>, but I made it explicit by adding the type here.

queue.enqueueRetryable(async () => {
doStep(1);
if (completedSteps.length === 1) {
throw new Error("Let's retry");
Contributor:

When combing through test logs, it's useful for simulated errors to clearly label themselves so they can be easily ignored. Consider prefixing this message with "Simulated error" or similar.

Contributor Author:

Done

.client(1)
.expectPrimaryState(false)
.userSets('collection/a', { v: 1 })
.failDatabase()
Contributor:

Why are fail and recover instructions issued from client 1? Does it matter?

Contributor Author:

Client 0 is the primary client and writes into WebStorage. Client 1 is the secondary client; it reads from WebStorage and fails to apply the changes. The code in this PR only deals with the behavior of secondary clients.

.writeAcks('collection/a', 1, { expectUserCallback: false })
.client(1)
.recoverDatabase()
.runTimer(TimerId.AsyncQueueRetry)
Contributor:

It would be useful to see that retry can happen multiple times (i.e. that running the timer before recovery does not raise an event but still does raise an event after recovery).
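A hypothetical sketch of that extension, using only builder calls that already appear in this PR's diffs (the declarations below are placeholders for the spec-test framework):

```ts
// Hypothetical: run the retry timer once while the database is still
// failing (the queued WebStorage change should not surface), then recover
// and run it again (the change should now apply).
interface StepsLike {
  client(index: number): StepsLike;
  failDatabase(): StepsLike;
  recoverDatabase(): StepsLike;
  runTimer(timerId: string): StepsLike;
}
declare const steps: StepsLike;
declare const TimerId: { AsyncQueueRetry: string };

steps
  .client(1)
  .failDatabase()
  .runTimer(TimerId.AsyncQueueRetry) // still failing: the retry is a no-op
  .recoverDatabase()
  .runTimer(TimerId.AsyncQueueRetry); // after recovery the retry succeeds
```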

@@ -418,6 +418,22 @@ export class SpecBuilder {
return this;
}

failDatabase(): this {
Contributor:

Could you add comments here about how this works? I suspect this works by modifying all steps going forward until recoverDatabase is called, but the code here suggests this merely modifies the current step only.

Contributor Author:

This ends up calling doFailDatabase(), which flips injectFailures. injectFailures is long-lived and tied to the lifetime of the persistence implementation. I added a short comment.
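Conceptually, the failure injection works along these lines (assumed shape; only injectFailures is taken from the PR, the other names are illustrative):

```ts
// While injectFailures is set, every simulated IndexedDB transaction
// rejects, mimicking a failing database; clearing it "recovers" the
// database for all subsequent steps.
class FailableTestPersistence {
  injectFailures = false;

  runTransaction<T>(action: () => Promise<T>): Promise<T> {
    if (this.injectFailures) {
      return Promise.reject(new Error('Simulated retryable IndexedDB error'));
    }
    return action();
  }
}
```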

@wilhuff assigned schmidt-sebastian and unassigned themselves on Apr 15, 2020
@wilhuff (Contributor) left a comment:

LGTM

@wilhuff removed their assignment on Apr 15, 2020
@schmidt-sebastian merged commit a82af58 into master on Apr 15, 2020
firebase locked and limited conversation to collaborators on May 16, 2020
@schmidt-sebastian deleted the mrschmidt/runretryable branch on July 9, 2020, 17:45