Retry WebStorage operations #2879
Conversation
}
);
} else {
this.queue.enqueueAndForget(() => {
This is a subtly different behavior from before that seems like it has the chance to break cancellation.
The issue is that previously, enqueueAfterDelay would schedule a callback at the desired time which would submit the operation to the queue wrapped in a check that the timer hasn't been canceled (see https://github.com/firebase/firebase-js-sdk/blob/master/packages/firestore/src/util/async_queue.ts#L173). This means that any operation still in the queue has a chance to cancel the timer before it actually runs.
I'm not sure this ends up being a problem in practice, but it seems like a more straightforward way to handle this would be to change AsyncQueue.drain to include those delayed operations that have a zero delay.
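For illustration, a minimal sketch of what that suggestion might look like, assuming simplified internals (DelayedOperationSketch, delayedOperations, and skipDelay are illustrative names, not the SDK's actual fields):

```typescript
// A minimal sketch of the suggestion above; not the SDK's actual AsyncQueue.
interface DelayedOperationSketch {
  delayMs: number;
  // Runs the operation immediately, bypassing the remaining delay.
  skipDelay(): Promise<unknown>;
}

class AsyncQueueSketch {
  private tail: Promise<unknown> = Promise.resolve();
  private delayedOperations: DelayedOperationSketch[] = [];

  enqueue<T>(op: () => Promise<T>): Promise<T> {
    const result = this.tail.then(op);
    this.tail = result;
    return result;
  }

  // Suggested change: drain() also runs delayed operations whose backoff
  // delay is zero, so retryable work is not silently skipped.
  async drain(): Promise<void> {
    const zeroDelayOps = this.delayedOperations.filter(op => op.delayMs === 0);
    await Promise.all(zeroDelayOps.map(op => op.skipDelay()));
    await this.tail;
  }
}
```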
I tried different approaches here, and they all failed. If drain() waits for retryable operations, then the WebStorageTests pass with flying colors. I do, however, lose the ability to inject failures in the spec tests without waiting for them.
I solved this by scheduling the first operation with enqueueAndForget. Note that this means that the first retry is not scheduled with backoff, but at least it is pushed to the end of the queue.
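A rough sketch of the approach described here, with simplified queue and backoff plumbing (the real AsyncQueue and ExponentialBackoff APIs differ): the first attempt is a plain enqueue, and only failed attempts are rescheduled with a growing delay.

```typescript
// Rough sketch only; the real AsyncQueue/ExponentialBackoff APIs differ.
type AsyncOp = () => Promise<void>;

class RetryingQueueSketch {
  private tail: Promise<void> = Promise.resolve();

  // Append an operation to the end of the queue without waiting for it.
  enqueueAndForget(op: AsyncOp): void {
    this.tail = this.tail.then(op).catch(() => {});
  }

  // The first attempt is a plain enqueue (no backoff); each failure
  // reschedules the same operation after an (illustrative) growing delay.
  enqueueRetryable(op: AsyncOp): void {
    let delayMs = 0;
    const retryingOp: AsyncOp = async () => {
      try {
        await op();
      } catch (e) {
        delayMs = delayMs === 0 ? 1000 : Math.min(delayMs * 2, 60000);
        setTimeout(() => this.enqueueAndForget(retryingOp), delayMs);
      }
    };
    this.enqueueAndForget(retryingOp);
  }
}
```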
await op();
deferred.resolve();
this.backoff.reset();
this.firstRetryableOperation = undefined;
This chain from the first retryable operation creates a weird circumstance where we could have a train of retryable operations where we wake up, succeed in performing some operation, fail in some later one, and lose the rest of the chain.
It seems like maintaining an explicit queue of these (with a head and a tail) would make this more resistant to arbitrary hiccups like this.
Or maybe I'm not understanding what this code is doing exactly, in which case please add comments about how this is supposed to work.
I rewrote this code to use the same tail logic that we already have in the AsyncQueue. This simplifies the logic here a bit, which I hope is better than adding a comment.
I do think that the previous code did not drop any operations - even if we reset the beginning of a promise chain, all other operations are still evaluated in the order they were scheduled. The optimization here was that we were able to reset firstRetryableOperation to its initial state, which unnecessarily complicated this logic.
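Roughly, the tail pattern being described, in heavily simplified form (not the actual SDK code): every retryable operation chains onto a single promise, so later operations always wait for earlier ones, including their retries.

```typescript
// Heavily simplified illustration of the tail pattern; not the actual SDK code.
type RetryableOp = () => Promise<void>;

class RetryableChainSketch {
  // The last retryable operation; new operations chain onto it.
  private retryableTail: Promise<void> = Promise.resolve();

  enqueueRetryable(op: RetryableOp, maxAttempts = 3): void {
    this.retryableTail = this.retryableTail.then(async () => {
      // Retry in place until the operation succeeds (bounded here for the
      // sketch; the real code retries with backoff and no attempt limit).
      for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
          await op();
          return;
        } catch (e) {
          // Swallow the error and retry; operations queued after this one
          // stay behind it in the chain, so nothing is dropped or reordered.
        }
      }
    });
  }
}
```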
}
};

this.backoff.backoffAndRun(retryingOp);
The streams try to perform their initial operation and only call backoffAndRun if there's been an initial failure. I wonder if your testing difficulties could be resolved if you called this.enqueueAndForget or similar here for the first iteration? That way you'd only be going through the backoff after a failure.
That would simplify my implementation, but it seems like that is not the correct behavior. The first backoff.backoffAndRun() doesn't incur any delay, so the first backoff in case of a stream failure doesn't perform any backoff.
FWIW, transaction_runner seems to do the correct thing and uses backoffAndRun() even for the first attempt.
Do you know what the desired behavior here is? I think the easy fix for my usage and the usage in the streams is to back off on the first attempt.
We made it backoffAndRun rather than runWithBackoff specifically because we were intending that the first invocation is after a failure, and the action taken is to back off and then run.
The backoff delay of zero for the first iteration is intentional, at least in the stream case. We found that e.g. connections could get in a stale state or watch could return spurious errors after a connection or database had been idle for a while.
TransactionRunner seems like it's using the API wrong, at least according to the design intent. We have the same problem there where connections can be stale and fail, so it's worth avoiding delay on the first failure, even for transactions.
In the case of IndexedDB failures, the initial zero delay doesn't seem as useful, but it also doesn't seem harmful. If you wanted to make the zero delay on the first retry optional, that would be fine with me too.
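A sketch of the backoffAndRun semantics described here, with made-up numbers and a simplified shape (the real ExponentialBackoff class differs): the first call after a failure runs with zero delay, and each subsequent call grows the delay until reset() is called.

```typescript
// Simplified illustration of the semantics described above; not the real
// ExponentialBackoff implementation.
class BackoffSketch {
  private currentDelayMs = 0;

  constructor(
    private readonly initialDelayMs = 1000,
    private readonly backoffFactor = 1.5,
    private readonly maxDelayMs = 60000
  ) {}

  // Intended to be called after a failure: wait the current delay (zero on
  // the first call), run the operation, and grow the delay for next time.
  backoffAndRun(op: () => Promise<void>): void {
    setTimeout(() => {
      op().catch(() => {
        // Failures are expected to call backoffAndRun again.
      });
    }, this.currentDelayMs);
    this.currentDelayMs =
      this.currentDelayMs === 0
        ? this.initialDelayMs
        : Math.min(this.currentDelayMs * this.backoffFactor, this.maxDelayMs);
  }

  // A successful operation resets the delay back to zero.
  reset(): void {
    this.currentDelayMs = 0;
  }
}
```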
I think for IndexedDB failures, it would "feel better" if we retried with backoff right away. We already retry aborted transactions immediately, since these are likely to succeed even without backoff. Thus, I would prefer if the secondary, error-code-unaware retry mechanism was different and used some sort of backoff.
It does seem like this unnecessarily complicates our tests, though. I would suggest we defer this behavior change until we have actual evidence that backing off is an improvement.
* A timer used to retry operations scheduled via retryable AsyncQueue
* operations.
*/
AsyncQueueRetry = 'async_queue_retry'
nit: the inconsistency between "retry" at the beginning vs. the end of the enumeration entry is a little glaring. RetryAsyncQueue sounds a little weird. Maybe rename RetryTransaction to TransactionRetry to match?
Renamed RetryTransaction.
@@ -195,6 +205,10 @@ export class AsyncQueue {
// The last promise in the queue.
private tail: Promise<unknown> = Promise.resolve();

// The last retryable operation. Retryable operations are run in order and
// retried with backoff.
private retryableTail = Promise.resolve();
Does this need a type of Promise<unknown>, as on tail, above?
All retryable operations return void (to make sure users don't wait for results). Promise.resolve() returns Promise<void> as a type implicitly, but I made it explicit by adding the type here.
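For concreteness, the typing point above in isolation (illustrative only, not the SDK code):

```typescript
// Illustrative only: Promise.resolve() with no argument is already inferred
// as Promise<void>, but the explicit annotation documents the intent that
// retryable operations never expose a result.
class QueueTypingSketch {
  // General operations may resolve with any value.
  tail: Promise<unknown> = Promise.resolve();
  // Retryable operations resolve with void so callers cannot rely on results.
  retryableTail: Promise<void> = Promise.resolve();
}
```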
queue.enqueueRetryable(async () => {
doStep(1);
if (completedSteps.length === 1) {
throw new Error("Let's retry");
When combing through test logs, it's useful for simulated errors to clearly label themselves so they can be easily ignored. Consider prefixing this message with "Simulated error" or similar.
Done
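The updated test step presumably looks something like the following sketch (reconstructed from the quoted diff; the tiny stand-ins for queue, doStep, and completedSteps are only here so the example is self-contained, the real harness differs):

```typescript
// Minimal stand-ins for the test harness, for illustration only.
const completedSteps: number[] = [];
const doStep = (n: number): void => {
  completedSteps.push(n);
};
const queue = {
  enqueueRetryable(op: () => Promise<void>): void {
    // Stand-in: run once and swallow the simulated failure.
    op().catch(() => {});
  }
};

queue.enqueueRetryable(async () => {
  doStep(1);
  if (completedSteps.length === 1) {
    // The prefix makes the intentional failure easy to spot and ignore in logs.
    throw new Error("Simulated error: Let's retry");
  }
});
```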
.client(1)
.expectPrimaryState(false)
.userSets('collection/a', { v: 1 })
.failDatabase()
Why are fail and recover instructions issued from client 1? Does it matter?
Client 0 is the primary client and writes into WebStorage. Client 1 is the secondary client; it reads from WebStorage and fails to apply the changes. The code in this PR only deals with the behavior in secondary clients.
.writeAcks('collection/a', 1, { expectUserCallback: false })
.client(1)
.recoverDatabase()
.runTimer(TimerId.AsyncQueueRetry)
It would be useful to see that retry can happen multiple times (i.e. that running the timer before recovery does not raise an event but still does raise an event after recovery).
@@ -418,6 +418,22 @@ export class SpecBuilder {
return this;
}

failDatabase(): this {
Could you add comments here about how this works? I suspect this works by modifying all steps going forward until recoverDatabase is called, but the code here suggests this modifies the current step only.
This ends up calling doFailDatabase(), which flips injectFailures. injectFailures is long-lived and tied to the lifetime of the persistence implementation. I added a short comment.
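A minimal sketch of the mechanism as described (names taken from the discussion; the real mock persistence layer in the spec tests differs):

```typescript
// Minimal sketch of the described mechanism; not the actual test code.
// failDatabase()/recoverDatabase() just flip a long-lived flag, and every
// subsequent operation checks it.
class FailablePersistenceSketch {
  private injectFailures = false;

  failDatabase(): void {
    this.injectFailures = true;
  }

  recoverDatabase(): void {
    this.injectFailures = false;
  }

  async runTransaction<T>(op: () => Promise<T>): Promise<T> {
    if (this.injectFailures) {
      throw new Error('Simulated IndexedDB failure');
    }
    return op();
  }
}
```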
LGTM
This PR allows us to recover from IndexedDB failures when a Multi-Tab operation scheduled via a WebStorage notification fails.
To do this:
Since an operation that is scheduled via a DelayedOperation is always run after all other operations, the PR also changes the backoff handler to use AsyncQueue.enqueue directly if the backoff is zero. This is not required, but not doing this breaks all tests that rely on WebStorage, since AsyncQueue.drain() does not drain delayed operations. I tried to make this change in the AsyncQueue directly, but it would require us to create a new way to create CancellableOperations from enqueue().
A side effect of this is that the spec tests' way to fast-forward the backoff timer doesn't work as it did before - the first backoff timer cannot be fast-forwarded and any subsequent attempts are not always scheduled. This causes some tests to time out as we sometimes wait for the backoff to expire. I changed the SpecTests to always skip the backoff delays, which speeds up all tests considerably (faster than on master).
Note: I wasn't able to figure out how to make the test "handles user changes while offline (b/74749605)" work, since the test relies on a backoff for the first attempt.
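The zero-delay shortcut mentioned above could look roughly like this (assumed shapes only; the SDK's actual backoff handler and queue signatures differ):

```typescript
// Assumed shapes only; the SDK's actual backoff handler and queue differ.
interface QueueLike {
  enqueueAndForget(op: () => Promise<void>): void;
  enqueueAfterDelay(delayMs: number, op: () => Promise<void>): void;
}

// When the computed backoff delay is zero, enqueue directly so that
// AsyncQueue.drain() (which ignores delayed operations) still observes and
// awaits the retry.
function scheduleWithBackoff(
  queue: QueueLike,
  delayMs: number,
  op: () => Promise<void>
): void {
  if (delayMs === 0) {
    queue.enqueueAndForget(op);
  } else {
    queue.enqueueAfterDelay(delayMs, op);
  }
}
```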