
[Multi-Tab] Recover from unexpected primary lease loss #984


Merged: schmidt-sebastian merged 18 commits into firestore-multi-tab from multitab-recovery on Jul 25, 2018

Conversation

schmidt-sebastian
Contributor

This PR examines all the places where we need a primary lease in IndexedDb and adds recovery logic. The recovery logic is pretty simple: it just flips the client state to secondary (see the sketch after the list below).

I tried to plumb through some checks on whether the client is actually no longer primary, but decided this is not worth it:

  • Neither Sync Engine nor Remote Store has access to the persistence layer. I would have needed to plumb the primary state through to LocalStore.
  • If the client should be primary, IndexedDb will reset the state in the next lease refresh.
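For illustration, here is a minimal sketch of that recovery pattern. The names (FirestoreError, Code.FAILED_PRECONDITION, tryRecoverClient) mirror the diff context quoted below; the wiring is an assumption, not the SDK's actual implementation:

```typescript
// Illustrative only: the error shape and the demotion callback are
// assumptions based on the diff context in this PR.
interface FirestoreError {
  code: string;
  message: string;
}

const Code = { FAILED_PRECONDITION: 'failed-precondition' };

async function tryRecoverClient(
  err: FirestoreError,
  applyPrimaryState: (isPrimary: boolean) => Promise<void>
): Promise<void> {
  if (err.code === Code.FAILED_PRECONDITION) {
    // Another tab took the primary lease: demote this client to secondary
    // instead of surfacing the error. The lease refresh in the persistence
    // layer can promote the client back to primary later.
    await applyPrimaryState(false);
  } else {
    // Anything else is unexpected; re-throw to the caller.
    throw err;
  }
}
```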

}
},
err => {
if (err.code === Code.FAILED_PRECONDITION) {
Contributor Author

FYI: Instead of relying on the error code, I tried to make this more specific and added a new error class at one point. I wanted to do an instanceof CustomPrimaryLeaseLostError check here. It turns out that instanceof checks don't work with errors though, and so I abandoned this approach.
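For context, a small sketch of the pitfall described here, assuming a TypeScript compile target of ES5: subclassing the built-in Error loses the subclass prototype unless it is restored manually, which defeats instanceof checks.

```typescript
class CustomPrimaryLeaseLostError extends Error {
  constructor(message: string) {
    super(message);
    // On an ES5 target, super(message) produces a plain Error and the
    // subclass prototype chain is lost. The standard workaround (left
    // commented out here to show the failure) is:
    // Object.setPrototypeOf(this, CustomPrimaryLeaseLostError.prototype);
  }
}

try {
  throw new CustomPrimaryLeaseLostError('primary lease lost');
} catch (e) {
  console.log(e instanceof Error); // true
  console.log(e instanceof CustomPrimaryLeaseLostError); // false on ES5
}
```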

@schmidt-sebastian schmidt-sebastian force-pushed the multitab-recovery branch 2 times, most recently from 354e01b to f9f4d9c on July 9, 2018 12:46

@mikelehen mikelehen left a comment


Nice! This mostly LGTM but a few nits / questions / comments.

* Marks the client as secondary if an IndexedDb operation fails because the
* primary lease has been taken by another client. This can happen when the
* client is temporarily CPU throttled and fails to renew its lease in time,
* in which we treat the current client as secondary. We can always revert
Contributor

in which case

Contributor Author

Done

* in which we treat the current client as secondary. We can always revert
* back to primary status via the lease refresh in our persistence layer.
*
* @param err An error returned by an IndexedDb operation.
Contributor

"IndexedDb operation" seems more specific than reality. It could be any "persistence operation" or even "LocalStore operation", no?

Contributor Author

Yes, this can get called with a number of errors. All errors get plumbed through LocalStore, so I changed "IndexedDb" to "LocalStore" as suggested.

* @return A Promise that resolves after we recovered, or the original error.
*/
private async tryRecoverClient(err: FirestoreError): Promise<void> {
if (err.code === Code.FAILED_PRECONDITION) {
Contributor

What would you think about adding an IndexedDbPersistence.isLostLeaseError() helper that checks both the code and the message?

Also, it may be useful to add a log message here...

Contributor Author

I am not too big of a fan of checking the error string, and even if we were to re-use the FAILED_PRECONDITION code somewhere else in our persistence layer, flipping the primary state to secondary seems like a reasonable action. Without the helper, I think we can just do the check here inline.

Contributor

We already re-use FAILED_PRECONDITION for the PRIMARY_LEASE_EXCLUSIVE_ERROR_MSG error, and I don't think it makes sense to fall back to secondary in that case.

I'd rather pessimistically assume that this error handling should not apply to future unknown errors and only expand it if/when we have specific cases we want to let it handle.

If we do keep this error handling fairly inclusive, we should definitely include the error in the log message.

Contributor Author

I moved the error check to IndexedDb and included the error message. Another option would be to statically define the error and compare against the instance, but that messes with the stacktraces.
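A hypothetical shape of that check, reusing the FirestoreError and Code stand-ins from the earlier sketch. The message constant is invented here; only the code-plus-message matching strategy reflects the thread:

```typescript
// Placeholder message text; the real constant lives in the persistence layer.
const PRIMARY_LEASE_LOST_ERROR_MSG =
  'The current tab is not in the required state to perform this operation.';

function isPrimaryLeaseLostError(err: FirestoreError): boolean {
  // Match the error code and the exact message, so that other
  // FAILED_PRECONDITION errors (e.g. PRIMARY_LEASE_EXCLUSIVE_ERROR_MSG)
  // do not trigger the fallback to secondary.
  return (
    err.code === Code.FAILED_PRECONDITION &&
    err.message === PRIMARY_LEASE_LOST_ERROR_MSG
  );
}
```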

}
this.isPrimary = canActAsPrimary;
// Always call the primary state listener, since SyncEngine may have
// changed the primary state to 'false'.
Contributor

FWIW this feels a bit hacky to me, but if we're going to do this and rely on it, you should document it on the PrimaryStateListener interface and/or the setPrimaryStateListener() method ("Called when the primary state changes and again every N seconds while primary").

A cleaner (though perhaps more code?) approach would be to aggressively give up our primary lease in tryRecoverClient() so that we are definitely not the primary anymore.

Contributor Author

I agree with you here, but I don't think the plumbing to make this work is necessarily a worthwhile investment. For now, I changed the method comment in setPrimaryStateListener.

It might even be possible to remove the primary flag from the persistence layer. It is currently only used at startup and at shutdown.

/*keepPersistedQueryData=*/ false
);
await this.localStore.collectGarbage();
return this.localStore
Contributor

await ?

Contributor Author

Done

.expectPrimaryState(false)
// TODO(multitab): Suppres the additional `fromCache`, which is raised
// locally because we disable the network when we lose the primary
// lease.
Contributor

FWIW I'm not sure if this is wrong / harmful. When a primary switch happens we collectively will be without a network connection for a period of time and it seems correct for views to trigger fromCache=true events... but we should return to fromCache=false once the primary switch completes and a new primary establishes the listens...

Contributor Author

I am still undecided what the best desired behavior should be. I don't want clients flip-flopping between fromCache states. But on the other hand, the client will be offline at that point, so it is technically correct to raise this event. I removed this TODO, which also meant that I didn't have to fix the typo in Suppres. Win-win.

.stealPrimaryLease()
.client(0)
// Send a watch update to client 0, who is longer primary (but doesn't
// it yet). The watch update gets ignored.
Contributor

missing "know"

Contributor Author

Added.

options = options || {};

this.currentStep = {
writeAck: { version, keepInQueue: !!options.keepInQueue }
Contributor

FWIW it may be worth adding a TODO(multitab) to update the spec schemas on the other clients with the changes you've made... Just so we don't forget to do that before / soon-after merging multitab into master.

Contributor Author

The changes that I made in Master are already ported to the other platforms (firebase/firebase-ios-sdk#1498, https://github.com/FirebasePrivate/firebase-android-sdk/pull/95). The rest should hopefully be obvious when I merge this branch.

Please do remind me of this comment when I spend a week on this merge.

if (state.visibility) {
this.platform.raiseVisibilityEvent(state.visibility!);
}

if (state.primary) {
await writeOwnerToIndexedDb(this.clientId);
Contributor

nit: Would it be more realistic to just clear (or expire?) the current owner and then let ClientMetadataRefresh detect that it needs to steal the lease? I don't care strongly though.

Contributor Author

Yes, it would be. That takes away the deterministic order of these tests though, as any secondary could then become primary.

if (this.expectedActiveTargets) {
if (!obj.isEmpty(this.expectedActiveTargets)) {
await this.connection.waitForWatchOpen();
}
await this.queue.drain();
Contributor

Er, why do we need to call/wait-for queue.drain() even when we don't call waitForWatchOpen()? validateActiveTargets() is called by validateStateExpectations() which is called by run(step) which already calls queue.drain()...

Contributor Author

Hrm. This was causing test failures for me before when this.expectedActiveTargets was set but empty. It's passing now though and you raised a very good point.

@mikelehen
Contributor

Just to double check... should this be assigned back to me? Or are you still working on review feedback?

@schmidt-sebastian
Contributor Author

> Just to double check... should this be assigned back to me? Or are you still working on review feedback?

Back to you :)


@mikelehen mikelehen left a comment


I'm still wary of race conditions... I'm not 100% sure how likely it is that they can actually manifest or how best to guard against them. Feel free to ping me if you want to chat through this together.

@@ -490,6 +493,11 @@ export class SyncEngine implements RemoteSyncer, SharedClientStateSyncer {
// connection is disabled.
await this.remoteStore.fillWritePipeline();
} else if (batchState === 'acknowledged' || batchState === 'rejected') {
// If we receive a notification of an `acknowledged` or `rejected` batch
// via Web Storage, we are either already secondary or another tab has
// taken the primary lease.
Contributor

I'm not sure I follow / agree with your argument. I don't think we can deterministically "finish these notifications in the state it was supposed to be in".

At the end of the day it's possible for two tabs to send the same write to the backend and broadcast LocalStorage notifications for it. E.g. Tab A is primary, performs a write, gets the response, and gets paused / throttled right before sending the LocalStorage notification. Tab B takes over as primary, re-sends the write, and processes the response (sending a LocalStorage notification). Tab A resumes and its LocalStorage notification (for the same write) now goes out.

I think this means that handling write acknowledgement notifications must be idempotent.

But I also don't understand where the extra query events you mentioned are coming from... It may be worth talking through this or something.
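A sketch of idempotent acknowledgement handling under the race described above (batch IDs and the callback are hypothetical; as the author notes later in this thread, only the first acknowledgement should trigger the user callback):

```typescript
// Batches that have already been acknowledged. A duplicate notification
// broadcast by a paused-and-resumed tab becomes a no-op.
const acknowledgedBatchIds = new Set<number>();

function applyWriteAck(batchId: number, fireUserCallback: () => void): void {
  if (acknowledgedBatchIds.has(batchId)) {
    return; // already processed; ignore the duplicate notification
  }
  acknowledgedBatchIds.add(batchId);
  fireUserCallback(); // only the first acknowledgement reaches user code
}
```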

if (this.isPrimary) {
// If we receive a target state notification via Web Storage, we are
// either already secondary or another tab has taken the primary lease.
await this.applyPrimaryState(false);
Contributor

I'd prefer that too, but I don't think we can guarantee it. The only thing we can guarantee is that only one tab can make IndexedDb writes as primary at a time (via transactions that require the primary lease). But we could have two tabs with WriteStreams open for some period of time.

Ideally our design would allow for this. If it's infeasible, then I guess we should acknowledge that there are race conditions but convince ourselves that it'll be sufficiently unlikely and/or non-catastrophic or something...

// taken the primary lease.
log.debug(
LOG_TAG,
'Unexpectedly received mutation batch notification when already primary. Releasing primary lease.'
Contributor

If we keep this logic, we should rephrase this comment since we're not actually releasing the primary lease. We're just pretending we don't have it. :-)



@schmidt-sebastian schmidt-sebastian left a comment


I updated the PR with the result of our discussion on Tuesday. The tests are the same, but the logic is pretty different. IndexedDb is now driving all primary state transitions and might call the primaryStateListener when it detects that the lease has gone missing.

@@ -490,6 +493,11 @@ export class SyncEngine implements RemoteSyncer, SharedClientStateSyncer {
// connection is disabled.
await this.remoteStore.fillWritePipeline();
} else if (batchState === 'acknowledged' || batchState === 'rejected') {
// If we receive a notification of an `acknowledged` or `rejected` batch
// via Web Storage, we are either already secondary or another tab has
// taken the primary lease.
Contributor Author

I updated this PR to ignore events that were not supposed to happen. As discussed on Tuesday, I then reconcile the query state when we get the notification that the client is secondary (using the same code that I use to reconcile the query views when we transition to primary).
Write notifications are actually idempotent and it doesn't matter how often we process the same write, with the exception that only the first call triggers the user callback.



@mikelehen mikelehen left a comment


Thanks. I think this basically looks good though I'm a little fuzzy on the reset-limbo stuff going on...

.remoteDocumentKeys(targetId)
.then(async remoteKeys => {
const queryView = this.queryViewsByTarget[targetId];
assert(!!queryView, 'Cannot reconcile missing view');
Contributor

FWIW I think it'd be pretty straightforward to just pass in the queryView (as the only argument since it contains the query, targetId, and view) and then we don't need this assert.

Contributor Author

Done. This also helped me realize that I am doing one extra IndexedDb lookup for the query when I already have the QueryView (in synchronizeQueryViewsAndRaiseSnapshots).

docs,
remoteKeys,
/* resetLimboDocuments= */ this.isPrimary,
/* resetCurrent= */ this.isPrimary
Contributor

I'm having trouble understanding what these flags do. I wonder if "reset" should be "recompute" or "update" (reset makes it sound like they'll just be cleared, which I don't think is the case)... And if we pass false does that mean current and limboDocuments will be left 100% as-is? I guess ultimately I'm wondering why we can't just unconditionally recompute these. I also wonder if we could sensibly combine these into a single argument. Since current depends on limboDocuments, it probably never makes sense to recompute current without computing limbo documents?

Anyway, at minimum it will probably take some comments to appease me.

Contributor Author

I simplified this a little to unconditionally re-compute the list of limbo documents. In clients that are in-sync this is a no-op, and in clients that are out of sync, this will bring them back into sync. The only bit that I still need to provide is whether to set current to false. I added some documentation to clear this up a little.

Notice that I also added a TODO to mark the queries as non-current in SharedClientState. That's probably a three line change but I wanted to keep this PR from exploding even further.

log.debug(
LOG_TAG,
'Unexpectedly lost primary lease, reverting to secondary'
'Unexpectedly lost primary lease, attempting to recover'
Contributor

We log "attempting to recover" and then do nothing... Seems like maybe this method should now be ignorePrimaryLeaseLoss() and the log message should say "ignoring"

... alternatively, as-is will there be a long (up to 4s) delay before we officially notice we're not primary? Maybe this should schedule a lease refresh or something.

Contributor Author

I removed "attempting to recover" altogether and didn't replace it with anything.

Note that the persistence layer schedules a lease refresh when it throws this error, so the delay should be fairly minimal.

}

/**
* Reconcile the local query views with those the state from persistence.
Contributor

those the?

Contributor Author

Yesssss. Fixed.

// For queries that never executed on this client, we need to
// allocate the query in LocalStore and initialize a new View.
queryData = await this.localStore.allocateQuery(query);
await this.initializeViewAndComputeSnapshot(queryData, false);
Contributor

/*current=*/

Contributor Author

Done

queryData.targetId
);
if (viewChange.snapshot) {
this.syncEngineListener!.onWatchChange([viewChange.snapshot]);
Contributor

Should we batch up all the snapshots and call onWatchChange() only once?

In practice I'm not sure how much it matters right now, but since all of the snapshots were the result of a single event, I think we should send them to onWatchChange() as a single batch. In the future we might expose an API like:

firebase.firestore().onGlobalSnapshot(() => { /* I know that all listeners are in a consistent state. */ });

Contributor Author

I was doing this at an earlier stage of this PR, but then ultimately decided to remove it to save one line of code. I put it back for now. I am wondering though if there are ever non-current snapshots for active queries that are not globally consistent?

Contributor

I'm not sure I understand your question, but I'm hoping it's a response to my poor job of explaining the onGlobalSnapshot() thing.

The basic idea is that if you had lots of listeners that were related (e.g. for whose turn it is in a turn-based game and for the board state), you could delay your handling of updates until an onGlobalSnapshot() callback [e.g. so you don't try to take your turn until you know it's your turn and you have the updated board state]. Said differently: if somebody does multiple writes atomically in a batch write, you could use onGlobalSnapshot() to make sure all of your listeners have seen all of the writes.

So that's the motivation for onWatchChange() accepting a list of snapshots rather than just a single one... I think.

Contributor Author

Ok, got it. It would allow us to know that we have received all events from the last consistent global snapshot.
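A sketch of the batching under discussion: collect every snapshot produced by one underlying event and hand them to onWatchChange() in a single call (the types are simplified stand-ins for the SDK's):

```typescript
interface ViewSnapshot {
  fromCache: boolean;
}

interface SyncEngineListener {
  onWatchChange(snapshots: ViewSnapshot[]): void;
}

// Delivering all snapshots from one event as a single batch keeps every
// listener mutually consistent, which a future onGlobalSnapshot() hook
// could build on.
function raiseSnapshotsAsBatch(
  maybeSnapshots: Array<ViewSnapshot | undefined>,
  listener: SyncEngineListener
): void {
  const batch = maybeSnapshots.filter(
    (s): s is ViewSnapshot => s !== undefined
  );
  if (batch.length > 0) {
    listener.onWatchChange(batch);
  }
}
```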


/**
* Reconcile the local query views with those the state from persistence.
* Raises snapshots for any changes that affect the current client.
Contributor

and "Returns the set of queries for which we have local views." or something?

Contributor Author

This method actually returns the query data for the targets that are passed in. I updated the comment to:

  * Reconcile the query views of the provided query targets with the state from
  * persistence. Raises snapshots for any changes that affect the current
  * client and returns the updated state of all targets' query data.

@@ -39,6 +39,15 @@ export function size<V>(obj: Dict<V>): number {
return count;
}

/** Extracts the numeric indices from a dictionary. */
export function indices<V>(obj: Dict<V>): number[] {
Contributor

Would it make sense to use [numberKey: number]: V instead of Dict<V>?

Contributor Author

Nice. This even gives us type safety instead of the !isNaN check below.
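For illustration, the suggested signature might look like the sketch below; the numeric key type moves the validation to compile time, so the runtime !isNaN filter disappears:

```typescript
/** Extracts the numeric indices from an object keyed by number. */
export function indices<V>(obj: { [numberKey: number]: V }): number[] {
  // Object.keys() still yields strings at runtime, but the signature
  // guarantees that callers only ever assign numeric keys.
  return Object.keys(obj).map(key => Number(key));
}
```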


@schmidt-sebastian schmidt-sebastian left a comment


Feedback addressed.


@schmidt-sebastian schmidt-sebastian removed their assignment Jul 23, 2018

@mikelehen mikelehen left a comment


Some nits and unfortunately I'm still struggling with the resetCurrent logic. Feel free to ping me on chat to expedite the review process.


assert(!!queryView, 'Cannot reconcile missing view');
const viewChange = queryView.view.synchronizeWithPersistedState(
// We do not update our tracking of limbo documents since
//`resetCurrent` will set `current` to false for the querie's view
Contributor

missing space after // and querie's typo

Contributor Author

Done

viewChange.limboChanges
assert(
!this.isPrimary || viewSnapshot.limboChanges.length === 0,
'Received limboChanges when query is not CURRENT.'
Contributor

I stared at the comment and this assert for a while before I fully understood what's going on. One thing that would help is if you changed the message to "Primary received ..."

Contributor Author

I removed this assert and am now tracking these documents if I am the primary client. I could also pass in an updateLimboDocuments flag and not update the limbo documents for secondaries.

//`resetCurrent` will set `current` to false for the querie's view
// (and won't return any limbo documents). We also explicitly don't
// track limbo documents as secondary.
const viewSnapshot = queryView.view.synchronizeWithPersistedState(
Contributor

This comment still doesn't explain why we are resetting 'current' and why we only do it as primary.

I (non-confidently) think maybe the reasons are:

  1. We reset 'current' as primary because we expect synchronizeViewAndComputeSnapshot() to only be called immediately after we become primary and since we haven't had an active network connection, we should treat the queries as non-current (and raise fromCache=true events) until we get the initial response from Watch.
  2. Secondaries don't need to set current to false because it'll happen... somewhere else? (I guess where you added the TODO?)

In general I'm fuzzy on what the desired behavior is and how we're implementing it. Perhaps the desired behavior is that at the point of a primary change, all clients (secondary and master) mark all views as non-current? But if that's the case, why can't we just unconditionally pass true here for resetCurrent?

Sorry if I'm just confused. Feel free to ping me on chat to explain what's going on and we can figure out what comments/whatever might help this to make sense to me.

Contributor Author

As per our discussion yesterday, I changed this logic to no longer set current to false. That means that we now transition primary state without any user feedback and also that we could theoretically get limbo documents at this point.

Note that I am not quite sure how I can trigger limbo documents at this point (the new statement that I added). Any test case that I can think of has the query already marked Non-CURRENT by the previous primary tab (based on the limbo docs from the previous computation).

LOG_TAG,
'Unexpectedly lost primary lease, attempting to recover'
);
log.debug(LOG_TAG, 'Unexpectedly lost primary lease');
Contributor

You still need to update the method name and comment... Please mention that when this error is thrown, there will automatically be a lease refresh scheduled so we should very quickly get an applyPrimaryState() call from the persistence layer.

Contributor Author

Updated the method and comment. The method is now ignoreIfPrimaryLeaseLoss.

@@ -585,10 +585,7 @@ export class RemoteStore implements TargetMetadataProvider {

private tryRecoverFromPrimaryLeaseLoss(err: FirestoreError): void {
Contributor

Please rename this if you rename the other one.

Contributor Author

Done

@schmidt-sebastian schmidt-sebastian merged commit 7bb0b9a into firestore-multi-tab Jul 25, 2018
@schmidt-sebastian schmidt-sebastian deleted the multitab-recovery branch August 3, 2018 17:15
@firebase firebase locked and limited conversation to collaborators Oct 18, 2019