[Multi-Tab] Recover from unexpected primary lease loss #984
Conversation
}
},
err => {
  if (err.code === Code.FAILED_PRECONDITION) {
FYI: Instead of relying on the error code, I tried to make this more specific and added a new error class at one point. I wanted to do an instanceof CustomPrimaryLeaseLostError
check here. It turns out that instanceof checks don't work with errors though, and so I abandoned this approach.
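For context, a minimal sketch of the issue and the usual workaround (these are illustrative classes, not the SDK's actual ones): when TypeScript targets ES5, `super(Error)` resets the prototype chain, so `instanceof` checks against `Error` subclasses fail unless the prototype is restored. Checking a `code` property works regardless of the compilation target.

```typescript
class FirestoreError extends Error {
  constructor(readonly code: string, message: string) {
    super(message);
    // Restore the prototype chain. When targeting ES5, Error's constructor
    // resets it, which is why `instanceof` checks against Error subclasses
    // fail there without this line.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

const err = new FirestoreError('failed-precondition', 'primary lease lost');
// The code-based check works regardless of compilation target:
console.log(err.code === 'failed-precondition'); // true
```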
Nice! This mostly LGTM but a few nits / questions / comments.
* Marks the client as secondary if an IndexedDb operation fails because the
* primary lease has been taken by another client. This can happen when the
* client is temporarily CPU throttled and fails to renew its lease in time,
* in which we treat the current client as secondary. We can always revert
in which case
Done
* in which we treat the current client as secondary. We can always revert
* back to primary status via the lease refresh in our persistence layer.
*
* @param err An error returned by an IndexedDb operation.
"IndexedDb operation" seems more specific than reality. It could be any "persistence operation" or even "LocalStore operation", no?
Yes, this can get called with a number of errors. All errors get plumbed through LocalStore, so I changed "IndexedDb" to "LocalStore" as suggested.
* @return A Promise that resolves after we recovered, or the original error.
*/
private async tryRecoverClient(err: FirestoreError): Promise<void> {
  if (err.code === Code.FAILED_PRECONDITION) {
What would you think about adding an IndexedDbPersistence.isLostLeaseError() helper that checks both the code and the message?
Also, it may be useful to add a log message here...
I am not too big of a fan of checking the error string and even if we were to re-use the FAILED_PRECONDITION code somewhere else in our persistence layer, flipping the primary state to secondary seems like a reasonable action. Without this, I think we can just do the check here inline.
We already re-use FAILED_PRECONDITION for the PRIMARY_LEASE_EXCLUSIVE_ERROR_MSG error, and I don't think it makes sense to fall back to secondary in that case.
I'd rather pessimistically assume that this error handling should not apply to future unknown errors and only expand it if/when we have specific cases we want to let it handle.
If we do keep this error handling fairly inclusive, we should definitely include the error in the log message.
I moved the error check to IndexedDb and included the error message. Another option would be to statically define the error and compare against the instance, but that messes with the stacktraces.
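A sketch of the helper shape discussed above (names and the message text are hypothetical, not the SDK's actual constants): a generic FAILED_PRECONDITION is narrowed to the specific lease-lost case by comparing the message as well, since the code alone is shared with other errors such as PRIMARY_LEASE_EXCLUSIVE_ERROR_MSG.

```typescript
// Hypothetical message text, defined in one place so the check and the
// throw site stay in sync without comparing error instances (which would
// interfere with stack traces).
const PRIMARY_LEASE_LOST_ERROR_MSG = 'The primary lease was lost.';

interface FirestoreError {
  code: string;
  message: string;
}

function isPrimaryLeaseLostError(err: FirestoreError): boolean {
  return (
    err.code === 'failed-precondition' &&
    err.message === PRIMARY_LEASE_LOST_ERROR_MSG
  );
}
```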
}
this.isPrimary = canActAsPrimary;
// Always call the primary state listener, since SyncEngine may have
// changed the primary state to 'false'.
FWIW this feels a bit hacky to me, but if we're going to do this and rely on it, you should document it on the PrimaryStateListener interface and/or the setPrimaryStateListener() method ("Called when the primary state changes and again every N seconds while primary").
A cleaner (though perhaps more code?) approach would be to aggressively give up our primary lease in tryRecoverClient() so that we are definitely not the primary anymore.
I agree with you here, but I don't think the plumbing to make this work is necessarily a worthwhile investment. For now, I changed the method comment in setPrimaryStateListener.
It might even be possible to remove the primary flag from the persistence layer. It is currently only used at startup and at shutdown.
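The documented contract might look like the following sketch (names mirror the discussion, the details are hypothetical): the listener is invoked when the primary state changes and may be invoked again with an unchanged value after each lease refresh, so implementations must be idempotent.

```typescript
type PrimaryStateListener = (isPrimary: boolean) => Promise<void>;

class LeaseRefresher {
  private listener: PrimaryStateListener | null = null;

  /**
   * Registers a listener that is called when the primary state changes and
   * again after every lease refresh while the client remains primary.
   */
  setPrimaryStateListener(listener: PrimaryStateListener): void {
    this.listener = listener;
  }

  // Invoked by the (omitted) periodic lease refresh; may repeat the
  // current value, so listeners must tolerate duplicate notifications.
  async notifyPrimaryState(isPrimary: boolean): Promise<void> {
    if (this.listener) {
      await this.listener(isPrimary);
    }
  }
}
```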
  /*keepPersistedQueryData=*/ false
);
await this.localStore.collectGarbage();
return this.localStore
await ?
Done
.expectPrimaryState(false)
// TODO(multitab): Suppres the additional `fromCache`, which is raised
// locally because we disable the network when we lose the primary
// lease.
FWIW I'm not sure if this is wrong / harmful. When a primary switch happens we collectively will be without a network connection for a period of time and it seems correct for views to trigger fromCache=true events... but we should return to fromCache=false once the primary switch completes and a new primary establishes the listens...
I am still undecided what the best desired behavior should be. I don't want clients flip-flopping between fromCache states. But on the other hand, the client will be offline at that point, so it is technically correct to raise this event. I removed this todo, which also meant that I didn't have to fix the typo in Suppres. Win win.
.stealPrimaryLease()
.client(0)
// Send a watch update to client 0, who is longer primary (but doesn't
// it yet). The watch update gets ignored.
missing "know"
Added.
options = options || {};

this.currentStep = {
  writeAck: { version, keepInQueue: !!options.keepInQueue }
FWIW it may be worth adding a TODO(multitab) to update the spec schemas on the other clients with the changes you've made... Just so we don't forget to do that before / soon-after merging multitab into master.
The changes that I made in Master are already ported to the other platforms (firebase/firebase-ios-sdk#1498, https://github.com/FirebasePrivate/firebase-android-sdk/pull/95). The rest should hopefully be obvious when I merge this branch.
Please do remind me of this comment when I spend a week on this merge.
if (state.visibility) {
  this.platform.raiseVisibilityEvent(state.visibility!);
}

if (state.primary) {
  await writeOwnerToIndexedDb(this.clientId);
nit: Would it be more realistic to just clear (or expire?) the current owner and then let ClientMetadataRefresh detect that it needs to steal the lease? I don't care strongly though.
Yes, it would be. That takes away the deterministic ordering of these tests though, as any secondary could then become primary.
if (this.expectedActiveTargets) {
  if (!obj.isEmpty(this.expectedActiveTargets)) {
    await this.connection.waitForWatchOpen();
  }
  await this.queue.drain();
Er, why do we need to call/wait-for queue.drain() even when we don't call waitForWatchOpen()? validateActiveTargets() is called by validateStateExpectations() which is called by run(step) which already calls queue.drain()...
Hrm. This was causing test failures for me before when this.expectedActiveTargets was set but empty. It's passing now though and you raised a very good point.
Just to double check... should this be assigned back to me? Or are you still working on review feedback?
Back to you :)
I'm still wary of race conditions... I'm not 100% sure how likely it is that they can actually manifest or how best to guard against them. Feel free to ping me if you want to chat through this together.
@@ -490,6 +493,11 @@ export class SyncEngine implements RemoteSyncer, SharedClientStateSyncer {
  // connection is disabled.
  await this.remoteStore.fillWritePipeline();
} else if (batchState === 'acknowledged' || batchState === 'rejected') {
  // If we receive a notification of an `acknowledged` or `rejected` batch
  // via Web Storage, we are either already secondary or another tab has
  // taken the primary lease.
I'm not sure I follow / agree-with your argument. I don't think we can deterministically "finish these notifications in the state it was supposed to be in".
At the end of the day it's possible for two tabs to send the same write to the backend and broadcast LocalStorage notifications for it. E.g. Tab A is primary, performs a write, gets the response, and gets paused / throttled right before sending the LocalStorage notification. Tab B takes over as primary, re-sends the write, and processes the response (sending a LocalStorage notification). Tab A resumes and its LocalStorage notification (for the same write) now goes out.
I think this means that handling write acknowledgement notifications must be idempotent.
But I also don't understand where the extra query events you mentioned are coming from... It may be worth talking through this or something.
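The idempotency requirement can be sketched like this (a hypothetical shape, not the SDK's actual handler): a duplicate notification for the same batch, e.g. from a paused tab that resumes after another tab already re-sent the write, is processed harmlessly, and the user callback fires only once.

```typescript
class WriteAckTracker {
  private ackedBatchIds = new Set<number>();

  handleAck(batchId: number, userCallback: () => void): void {
    if (this.ackedBatchIds.has(batchId)) {
      return; // duplicate LocalStorage notification; safe to ignore
    }
    this.ackedBatchIds.add(batchId);
    userCallback(); // only the first acknowledgement reaches the user
  }
}
```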
if (this.isPrimary) {
  // If we receive a target state notification via Web Storage, we are
  // either already secondary or another tab has taken the primary lease.
  await this.applyPrimaryState(false);
I'd prefer that too, but I don't think we can guarantee it. The only thing we can guarantee is that only one tab can make IndexedDb writes as primary at a time (via transactions that require the primary lease). But we could have two tabs with WriteStreams open for some period of time.
Ideally our design would allow for this. If it's infeasible, then I guess we should acknowledge that there are race conditions but convince ourselves that it'll be sufficiently unlikely and/or non-catastrophic or something...
// taken the primary lease.
log.debug(
  LOG_TAG,
  'Unexpectedly received mutation batch notification when already primary. Releasing primary lease.'
If we keep this logic, we should rephrase this comment since we're not actually releasing the primary lease. We're just pretending we don't have it. :-)
I updated the PR with the result of our discussion on Tuesday. The tests are the same, but the logic is pretty different. IndexedDb is now driving all primary state transitions and might call the primaryStateListener when it detects that the lease has gone missing.
@@ -490,6 +493,11 @@ export class SyncEngine implements RemoteSyncer, SharedClientStateSyncer {
  // connection is disabled.
  await this.remoteStore.fillWritePipeline();
} else if (batchState === 'acknowledged' || batchState === 'rejected') {
  // If we receive a notification of an `acknowledged` or `rejected` batch
  // via Web Storage, we are either already secondary or another tab has
  // taken the primary lease.
I updated this PR to ignore events that were not supposed to happen. As discussed on Tuesday, I then reconcile the query state when we get the notification that the client is secondary (using the same code that I use to reconcile the query views when we transition to primary).
Write notifications are actually idempotent and it doesn't matter how often we process the same write, with the exception that only the first call triggers the user callback.
Thanks. I think this basically looks good though I'm a little fuzzy on the reset-limbo stuff going on...
.remoteDocumentKeys(targetId)
.then(async remoteKeys => {
  const queryView = this.queryViewsByTarget[targetId];
  assert(!!queryView, 'Cannot reconcile missing view');
FWIW I think it'd be pretty straightforward to just pass in the queryView (as the only argument since it contains the query, targetId, and view) and then we don't need this assert.
Done. This also helped me realize that I am doing one extra IndexedDb lookup for the query when I already have the QueryView (in synchronizeQueryViewsAndRaiseSnapshots).
docs,
remoteKeys,
/* resetLimboDocuments= */ this.isPrimary,
/* resetCurrent= */ this.isPrimary
I'm having trouble understanding what these flags do. I wonder if "reset" should be "recompute" or "update" (reset makes it sound like they'll just be cleared, which I don't think is the case)... And if we pass false does that mean current and limboDocuments will be left 100% as-is? I guess ultimately I'm wondering why we can't just unconditionally recompute these. I also wonder if we could sensibly combine these into a single argument. Since current depends on limboDocuments, it probably never makes sense to recompute current without computing limbo documents?
Anyway, at minimum it will probably take some comments to appease me.
I simplified this a little to unconditionally re-compute the list of limbo documents. In clients that are in-sync this is a no-op, and in clients that are out of sync, this will bring them back into sync. The only bit that I still need to provide is whether to set current to false. I added some documentation to clear this up a little.
Notice that I also added a TODO to mark the queries as non-current in SharedClientState. That's probably a three line change but I wanted to keep this PR from exploding even further.
log.debug(
  LOG_TAG,
  'Unexpectedly lost primary lease, reverting to secondary'
  'Unexpectedly lost primary lease, attempting to recover'
We log "attempting to recover" and then do nothing... Seems like maybe this method should now be ignorePrimaryLeaseLoss() and the log message should say "ignoring"
... alternatively, as-is will there be a long (up to 4s) delay before we officially notice we're not primary? Maybe this should schedule a lease refresh or something.
I removed "attempting to recover" altogether and didn't replace it with anything.
Note that the persistence layer schedules a lease refresh when it throws this error, so the delay should be fairly minimal.
}

/**
 * Reconcile the local query views with those the state from persistence.
those the?
Yesssss. Fixed.
// For queries that never executed on this client, we need to
// allocate the query in LocalStore and initialize a new View.
queryData = await this.localStore.allocateQuery(query);
await this.initializeViewAndComputeSnapshot(queryData, false);
/current=/
Done
  queryData.targetId
);
if (viewChange.snapshot) {
  this.syncEngineListener!.onWatchChange([viewChange.snapshot]);
Should we batch up all the snapshots and call onWatchChange() only once?
In practice I'm not sure how much it matters right now, but since all of the snapshots were the result of a single event, I think we should send them to onWatchChange() as a single batch. In the future we might expose an API like:
firebase.firestore().onGlobalSnapshot(() => { /* I know that all listeners are in a consistent state. */ });
I was doing this at an earlier stage of this PR, but then ultimately decided to remove it to save one line of code. I put it back for now. I am wondering though if there are ever non-current snapshots for active queries that are not globally consistent?
I'm not sure I understand your question, but I'm hoping it's a response to my poor job of explaining the onGlobalSnapshot() thing.
The basic idea is that if you had lots of listeners that were related (e.g. for whose turn it is in a turn-based game and for the board state), you could delay your handling of updates until an onGlobalSnapshot() callback [e.g. so you don't try to take your turn until you know it's your turn and you have the updated board state]. Said differently: if somebody does multiple writes atomically in a batch write, you could use onGlobalSnapshot() to make sure all of your listeners have seen all of the writes.
So that's the motivation for onWatchChange() accepting a list of snapshots rather than just a single one... I think.
Ok, got it. It would allow us to know that we have received all events from the last consistent global snapshot.
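The batching idea can be sketched as follows (types are simplified stand-ins for the SDK's ViewSnapshot, and the function name is hypothetical): all snapshots produced while handling one event are collected and delivered in a single onWatchChange() call, which is what would let a future onGlobalSnapshot()-style API observe a consistent state across listeners.

```typescript
interface ViewSnapshot {
  targetId: number;
}

function synchronizeViewsAndRaiseSnapshots(
  targetIds: number[],
  computeSnapshot: (targetId: number) => ViewSnapshot | null,
  onWatchChange: (snapshots: ViewSnapshot[]) => void
): void {
  const newSnaps: ViewSnapshot[] = [];
  for (const targetId of targetIds) {
    const snap = computeSnapshot(targetId);
    if (snap !== null) {
      newSnaps.push(snap); // only views that actually changed
    }
  }
  if (newSnaps.length > 0) {
    onWatchChange(newSnaps); // one batch per event, not one call per view
  }
}
```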
/**
 * Reconcile the local query views with those the state from persistence.
 * Raises snapshots for any changes that affect the current client.
and "Returns the set of queries for which we have local views." or something?
This method actually returns the query data for the targets that are passed in. I updated the comment to:
* Reconcile the query views of the provided query targets with the state from
* persistence. Raises snapshots for any changes that affect the current
* client and returns the updated state of all target's query data.
packages/firestore/src/util/obj.ts
@@ -39,6 +39,15 @@ export function size<V>(obj: Dict<V>): number {
  return count;
}

/** Extracts the numeric indices from a dictionary. */
export function indices<V>(obj: Dict<V>): number[] {
Would it make sense to use [numberKey: number]: V instead of Dict<V>?
Nice. This even gives us type safety instead of the "isNaN" check below.
Feedback addressed.
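A sketch of the resulting helper under the suggested signature (illustrative, not necessarily the merged code): the numeric index signature makes the compiler reject string keys at the call site, replacing the runtime isNaN check. Object keys are still strings at runtime, hence the Number() conversion when reading them back.

```typescript
/** Extracts the numeric indices from a dictionary with numeric keys. */
function indices<V>(obj: { [numberKey: number]: V }): number[] {
  return Object.keys(obj).map(Number);
}
```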
Some nits and unfortunately I'm still struggling with the resetCurrent logic. Feel free to ping me on chat to expedite the review process.
assert(!!queryView, 'Cannot reconcile missing view');
const viewChange = queryView.view.synchronizeWithPersistedState(
  // We do not update our tracking of limbo documents since
  //`resetCurrent` will set `current` to false for the querie's view
Missing space after `//`, and `querie's` is a typo.
Done
  viewChange.limboChanges
assert(
  !this.isPrimary || viewSnapshot.limboChanges.length === 0,
  'Received limboChanges when query is not CURRENT.'
I stared at the comment and this assert for a while before I fully understood what's going on. One thing that would help is if you changed the message to "Primary received ..."
I removed this assert and am now tracking these documents if I am the primary client. I could also pass in an updateLimboDocuments flag and not update the limbo documents for secondaries.
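The suggested flag could look roughly like this; the function shape, its parameters, and the ViewChange type are hypothetical simplifications of the real view synchronization, shown only to illustrate gating limbo tracking on the caller's role:

```typescript
interface ViewChange {
  snapshot: string;
  limboDocuments: string[];
}

// Sketch of an updateLimboDocuments flag: only the primary client asks
// for limbo tracking; secondaries pass false and get an empty list.
function synchronizeWithPersistedState(
  persistedDocs: string[],
  localDocs: string[],
  updateLimboDocuments: boolean
): ViewChange {
  // Documents we hold locally that the backend no longer confirms.
  const limbo = updateLimboDocuments
    ? localDocs.filter(d => !persistedDocs.includes(d))
    : [];
  return { snapshot: persistedDocs.join(','), limboDocuments: limbo };
}
```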
//`resetCurrent` will set `current` to false for the querie's view
// (and won't return any limbo documents). We also explicitly don't
// track limbo documents as secondary.
const viewSnapshot = queryView.view.synchronizeWithPersistedState(
This comment still doesn't explain why we are resetting 'current' and why we only do it as primary.
I (non-confidently) think maybe the reasons are:
- We reset 'current' as primary because we expect synchronizeViewAndComputeSnapshot() to only be called immediately after we become primary and since we haven't had an active network connection, we should treat the queries as non-current (and raise fromCache=true events) until we get the initial response from Watch.
- Secondaries don't need to set current to false because it'll happen... somewhere else? (I guess where you added the TODO?)
In general I'm fuzzy on what the desired behavior is and how we're implementing it. Perhaps the desired behavior is that at the point of a primary change, all clients (secondary and master) mark all views as non-current? But if that's the case, why can't we just unconditionally pass true here for resetCurrent?
Sorry if I'm just confused. Feel free to ping me on chat to explain what's going on and we can figure out what comments/whatever might help this to make sense to me.
As per our discussion yesterday, I changed this logic to no longer set current to false. That means that we now transition primary state without any user feedback and also that we could theoretically get limbo documents at this point.
Note that I am not quite sure how I can trigger limbo documents at this point (the new statement that I added). Any test case that I can think of has the query already marked Non-CURRENT by the previous primary tab (based on the limbo docs from the previous computation).
LOG_TAG,
'Unexpectedly lost primary lease, attempting to recover'
);
log.debug(LOG_TAG, 'Unexpectedly lost primary lease');
You still need to update the method name and comment... Please mention that when this error is thrown, there will automatically be a lease refresh scheduled so we should very quickly get an applyPrimaryState() call from the persistence layer.
Updated the method and comment. The method is now ignoreIfPrimaryLeaseLoss.
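A minimal sketch of what ignoreIfPrimaryLeaseLoss might look like, assuming (as in the diff at the top of this thread) that a lost lease surfaces as a FAILED_PRECONDITION error; the error shape and code string here are simplified stand-ins for the SDK's internal types:

```typescript
interface FirestoreError {
  code: string;
  message: string;
}

const FAILED_PRECONDITION = 'failed-precondition';

// Swallow the error if it signals a lost primary lease. When this error
// is thrown, a lease refresh is already scheduled, so the persistence
// layer will shortly call applyPrimaryState(false); all we do here is
// log and drop the error. Anything else rethrows.
function ignoreIfPrimaryLeaseLoss(err: FirestoreError): void {
  if (err.code === FAILED_PRECONDITION) {
    console.debug('Unexpectedly lost primary lease');
  } else {
    throw err;
  }
}
```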
@@ -585,10 +585,7 @@ export class RemoteStore implements TargetMetadataProvider {
private tryRecoverFromPrimaryLeaseLoss(err: FirestoreError): void {
Please rename this if you rename the other one.
Done
a6fef04
to
5c56479
Compare
This PR examines all the places where we need a Primary lease in IndexedDb and adds recovery logic. The recovery logic is pretty simple - it just flips the client state to secondary.
I tried to plumb through some checks on whether the client is actually no longer primary, but decided this is not worth it:
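The recovery strategy described here (catch a lease-loss failure and flip the client to secondary instead of surfacing the error) might be wired up roughly as follows; ClientState, runPrimaryOperation, and the error shape are all illustrative names, not the SDK's actual internals:

```typescript
// Sketch: wrap a persistence operation so that a primary-lease-loss
// failure demotes this client to secondary rather than propagating.
class ClientState {
  isPrimary = true;

  runPrimaryOperation<T>(op: () => T): T | undefined {
    try {
      return op();
    } catch (err) {
      if ((err as { code?: string }).code === 'failed-precondition') {
        // Another tab took the lease; continue as a secondary client.
        // The next lease refresh will confirm (or revert) this state.
        this.isPrimary = false;
        return undefined;
      }
      throw err; // Unrelated failures still propagate.
    }
  }
}
```

This matches the "pretty simple" shape of the PR: no attempt to verify whether the lease was really lost, just an optimistic demotion that the regular lease refresh can later revert.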