Fix Android Connectivity Monitor (v2) #1045

thebrianchen · 2019-12-06T21:49:46Z

Continuation of #937, but instead of binding connectivity attempts to the WatchStream, this version bases connectivity attempts on gRPC's ConnectivityState.

Note that this implementation also depends on two ManagedChannel APIs that are labeled @ExperimentalApi and require a minimum gRPC version of v1.6.1: ManagedChannel.getState() and ManagedChannel.notifyWhenStateChanged().
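
As a rough illustration of how these two calls fit together, the sketch below re-registers a state callback each time it fires. To stay runnable without the gRPC dependency, it uses a hypothetical `StateSource` interface and a local `ConnectivityState` enum that mirror the shape of `ManagedChannel.getState(boolean)` and `ManagedChannel.notifyWhenStateChanged(ConnectivityState, Runnable)`. `FakeChannel` and `StateWatcher` are invented names for this example and are not code from this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-in for io.grpc.ConnectivityState (same five values).
enum ConnectivityState { CONNECTING, READY, TRANSIENT_FAILURE, IDLE, SHUTDOWN }

// Hypothetical interface mirroring the two ManagedChannel methods named above.
interface StateSource {
  ConnectivityState getState(boolean requestConnection);

  void notifyWhenStateChanged(ConnectivityState source, Runnable callback);
}

// In-memory fake (an assumption for this sketch) so it runs without a network.
class FakeChannel implements StateSource {
  private ConnectivityState state = ConnectivityState.IDLE;
  private List<Runnable> pending = new ArrayList<>();

  @Override
  public ConnectivityState getState(boolean requestConnection) {
    return state;
  }

  @Override
  public void notifyWhenStateChanged(ConnectivityState source, Runnable callback) {
    // Fire right away if the caller's view is already stale; otherwise wait for a change.
    if (state != source) {
      callback.run();
    } else {
      pending.add(callback);
    }
  }

  // Test hook: simulates the channel moving to a new state.
  void setState(ConnectivityState next) {
    if (next == state) {
      return;
    }
    state = next;
    List<Runnable> toRun = pending;
    pending = new ArrayList<>(); // callbacks are one-off, so drop them once taken
    for (Runnable r : toRun) {
      r.run();
    }
  }
}

// Records every state it observes and re-registers after each notification.
class StateWatcher {
  private final StateSource channel;
  final List<ConnectivityState> seen = new ArrayList<>();

  StateWatcher(StateSource channel) {
    this.channel = channel;
  }

  void start() {
    onStateChanged();
  }

  private void onStateChanged() {
    ConnectivityState current = channel.getState(false);
    seen.add(current);
    // The callback is one-off, so register again against the state just read.
    channel.notifyWhenStateChanged(current, this::onStateChanged);
  }
}
```

A connectivity-attempt timer could hang off the same callback, for example by starting it when the watcher sees CONNECTING and clearing it on READY, though how the PR wires that up is not shown here.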

thebrianchen · 2019-12-06T22:13:49Z

/retest

mikelehen

Some (mostly high-level) feedback / questions...

firebase-firestore/src/main/java/com/google/firebase/firestore/remote/GrpcCallProvider.java

firebase-firestore/src/main/java/com/google/firebase/firestore/util/AsyncQueue.java

firebase-firestore/src/main/java/com/google/firebase/firestore/remote/AbstractStream.java

mikelehen · 2019-12-09T00:19:37Z

firebase-firestore/src/main/java/com/google/firebase/firestore/remote/AbstractStream.java

@@ -174,6 +174,9 @@ public void run() {
  /** The time a stream stays open after it is marked idle. */
  private static final long IDLE_TIMEOUT_MS = TimeUnit.MINUTES.toMillis(1);

+  /** Maximum backoff time when reconnecting. */


"when reconnecting" doesn't really do anything to distinguish this from the existing BACKOFF_MAX_DELAY_MS which is also used when reconnecting.

I think the intention is that this max backoff delay is used when we're 100% sure that the connection failed on the client-side and didn't even reach the server. In practice, I think the only case we can be sure of that is DNS failures (see my other comment about ConnectException). If that's the case, then maybe we can call this DNS_FAILURE_BACKOFF_MAX_DELAY_MS or something...

How about CLIENT_NETWORK_FAILURE_BACKOFF_MAX_DELAY_MS?

I just noticed that BACKOFF should maybe be first (to group this with the other backoff-related constants). So: BACKOFF_CLIENT_NETWORK_FAILURE_MAX_DELAY_MS?

FWIW, I don't feel strongly. Just trying to figure out how to make the name parse more sensibly...

Changed to Michael's suggestion.

thebrianchen · 2019-12-10T01:39:48Z

Did another pass through, moved logic exclusively to gRPC side and removed the markChannelIdle() code.

mikelehen

Generally speaking, I think I'm tentatively a fan!

I'd recommend doing a pass over my feedback (if there's anything that seems time-consuming, you can defer it for now) and then running this by Gil to see what he thinks generally. I don't 100% know if he'll be on board with recreating the channel. If we can get timing information on that to prove to ourselves that it isn't too expensive, that would be helpful.

We might finally want to run it by the gRPC team to see if we can get their blessing on this approach (or maybe they can suggest something better). If nothing else, it may be useful feedback to them regarding what our needs are, to better inform their solution for grpc/grpc-java#1943.

firebase-firestore/src/main/java/com/google/firebase/firestore/remote/GrpcCallProvider.java

firebase-firestore/src/main/java/com/google/firebase/firestore/util/AsyncQueue.java

firebase-firestore/src/main/java/com/google/firebase/firestore/remote/OnlineStateTracker.java

mikelehen · 2019-12-10T02:12:18Z

firebase-firestore/src/main/java/com/google/firebase/firestore/remote/GrpcCallProvider.java

+
+  private void initChannelTask() {
+    // We execute network initialization on a separate thread to not block operations that depend on
+    // the AsyncQueue.


One potential concern with recreating the channel is it could be expensive. We're already doing it on a background thread for performance reasons, so it's possible this adds some meaningful delay.

As part of your logging, I'd recommend logging the start/finish of this so that we can get timing data and see how long this takes each time.

It's taking around 15-40ms to reset the channel from what I've seen based on the logs.

Hrm. That is longer than I hoped. I don't know if it's a problem or not. I suggest getting input from Gil and gRPC folks (and perhaps point out this delay to gRPC and see if they have other suggestions for implementing a connection timeout).

FWIW, we're spending an extra 40 ms to bring the reaction time on failed reconnects down from two minutes to 15 seconds. While 40 ms isn't exactly cheap it creates a big enough win that it seems worthwhile.

Also note that this only kicks in when Android's own network transition logic isn't kicking in. That we're not being inundated with requests for this feature suggests that it's going to be fairly rare.
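
For reference, the start/finish logging suggested above could look roughly like the sketch below. `TimedTask` is a hypothetical helper invented for this example; it prints to standard output rather than using the SDK's logger, and it is not the code that produced the 15-40 ms figures quoted in this thread.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper, not part of the PR: logs the start and finish of a task
// and records how long it took.
class TimedTask {
  // Elapsed time of the most recent run in milliseconds, or -1 if nothing ran yet.
  long lastElapsedMs = -1;

  <T> T run(String label, Supplier<T> task) {
    System.out.println(label + ": start");
    long startNs = System.nanoTime(); // monotonic clock, safe for measuring intervals
    try {
      return task.get();
    } finally {
      lastElapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNs);
      System.out.println(label + ": finished in " + lastElapsedMs + " ms");
    }
  }
}
```

Wrapping the channel initialization in a call such as `timed.run("initChannel", () -> createChannel())` (where `createChannel` stands in for whatever builds the channel) would emit one start line and one finish line per reset.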

wilhuff · 2019-12-11T22:32:58Z

firebase-firestore/src/main/java/com/google/firebase/firestore/remote/GrpcCallProvider.java

+
+  private void initChannelTask() {
+    // We execute network initialization on a separate thread to not block operations that depend on
+    // the AsyncQueue.


FWIW, we're spending an extra 40 ms to bring the reaction time on failed reconnects down from two minutes to 15 seconds. While 40 ms isn't exactly cheap it creates a big enough win that it seems worthwhile.

Also note that this only kicks in when Android's own network transition logic isn't kicking in. That we're not being inundated with requests for this feature suggests that it's going to be fairly rare.

firebase-firestore/src/main/java/com/google/firebase/firestore/remote/AbstractStream.java

wilhuff · 2019-12-11T23:17:17Z

firebase-firestore/src/main/java/com/google/firebase/firestore/remote/AbstractStream.java

@@ -174,6 +174,9 @@ public void run() {
  /** The time a stream stays open after it is marked idle. */
  private static final long IDLE_TIMEOUT_MS = TimeUnit.MINUTES.toMillis(1);

+  /** Maximum backoff time when reconnecting. */


How about CLIENT_NETWORK_FAILURE_BACKOFF_MAX_DELAY_MS?

wilhuff · 2019-12-11T23:45:19Z

firebase-firestore/src/main/java/com/google/firebase/firestore/remote/FirestoreChannel.java

@@ -53,6 +55,11 @@
  private static final String X_GOOG_API_CLIENT_VALUE =
      "gl-java/ fire/" + BuildConfig.VERSION_NAME + " grpc/";

+  // This timeout is used when attempting to establish a connection. If a connection attempt
+  // does not succeed in time, we close the stream and restart the connection, rather than having


The timer callback does not seem to "close the stream" in any way that I can tell. Based on comments in markChannelIdle, the first stream continues? Does this actually close the stream, or is this a side effect?

From what I can tell, the connectivityAttemptTimer merely starts a new clientCall (through a ~recursive invocation of runBidiStreamingRpc). What prevents the observer from getting callbacks from the old stream once that fails ~2 minutes later?

I think you're looking at an old version of the PR? 😬 This code is gone in the latest version, right?

Yes, that's the outdated version. The new one can be found here.

firebase-firestore/src/main/java/com/google/firebase/firestore/remote/GrpcCallProvider.java

firebase-firestore/src/main/java/com/google/firebase/firestore/remote/OnlineStateTracker.java

wilhuff · 2019-12-11T23:57:54Z

firebase-firestore/src/main/java/com/google/firebase/firestore/util/ExponentialBackoff.java

@@ -151,6 +161,9 @@ public void backoffAndRun(Runnable task) {
    } else if (currentBaseMs > maxDelayMs) {
      currentBaseMs = maxDelayMs;
    }
+
+    // Reset max delay to the default.
+    maxDelayMs = DEFAULT_BACKOFF_MAX_DELAY_MS;


This ignored the comments I made on the last version of this PR.

The max delay was configured in the constructor and may not actually be DEFAULT_BACKOFF_MAX_DELAY_MS. You can't read this constant here.

Instead, save two max delay fields in the constructor. One should be final, as maxDelayMs is today and it should be used here as the value to which to reset. The second value should be the one you actually bounce around based on which kind of error is in effect.

Done (attempted).
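
A minimal sketch of the two-field approach described in this thread, under stated assumptions: jitter and scheduling are left out, `nextDelayMs()` stands in for the delay that `backoffAndRun` would wait, and the names `BackoffSketch`, `nextMaxDelayMs`, and `setTemporaryMaxDelay` are hypothetical. Only `maxDelayMs`, `currentBaseMs`, and the clamp-then-reset order are taken from the diff above.

```java
// Simplified illustration of a backoff with a permanent and a temporary max delay.
class BackoffSketch {
  private final long initialDelayMs;
  private final double backoffFactor;
  // Configured once in the constructor; this is the value the cap resets to.
  private final long maxDelayMs;
  // Cap applied to the next backoff step only (hypothetical field name).
  private long nextMaxDelayMs;
  private long currentBaseMs = 0;

  BackoffSketch(long initialDelayMs, double backoffFactor, long maxDelayMs) {
    this.initialDelayMs = initialDelayMs;
    this.backoffFactor = backoffFactor;
    this.maxDelayMs = maxDelayMs;
    this.nextMaxDelayMs = maxDelayMs;
  }

  // Overrides the cap for a single backoff step (hypothetical method name).
  void setTemporaryMaxDelay(long newMaxMs) {
    nextMaxDelayMs = newMaxMs;
  }

  // Returns the delay for this attempt, then grows the base for the next one.
  long nextDelayMs() {
    long delay = currentBaseMs;

    currentBaseMs = (long) (currentBaseMs * backoffFactor);
    if (currentBaseMs < initialDelayMs) {
      currentBaseMs = initialDelayMs;
    } else if (currentBaseMs > nextMaxDelayMs) {
      currentBaseMs = nextMaxDelayMs;
    }

    // Reset to the constructor-supplied max, not to a global default constant.
    nextMaxDelayMs = maxDelayMs;
    return delay;
  }
}
```

Because the reset reads the final `maxDelayMs` field, a caller that constructed the backoff with a non-default maximum gets that same maximum back after the temporary cap has been used once.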

wilhuff

LGTM with a final suggestion.

wilhuff · 2019-12-20T17:58:16Z

firebase-firestore/src/main/java/com/google/firebase/firestore/util/ExponentialBackoff.java

+
+  /**
+   * The maximum backoff time used when calculating the next backoff. This value can be changed for
+   * a single backoffAndRun call, after which it resets to maxDelayMs.


firebase-firestore/src/main/java/com/google/firebase/firestore/remote/GrpcCallProvider.java

mikelehen

LGTM with one remaining concern.

I still think it may be worth running this by the gRPC folks (on the existing email thread) before we submit it so that they have a chance to sanity-check our approach and so that they're aware that we're going to these lengths to work around gRPC's current behavior. Thanks!

firebase-firestore/src/main/java/com/google/firebase/firestore/remote/GrpcCallProvider.java

mikelehen · 2019-12-20T18:51:04Z

firebase-firestore/src/main/java/com/google/firebase/firestore/remote/GrpcCallProvider.java

+
+  private void initChannelTask() {
+    // We execute network initialization on a separate thread to not block operations that depend on
+    // the AsyncQueue.


Brian Chen added 17 commits October 23, 2019 11:37

did the thing

11ca8a5

comment fixes

448311d

Merge branch 'master' into bc/reconnect

403f91f

resolved comments

00568be

Merge branch 'bc/reconnect' of github.com:firebase/firebase-android-s…

c865827

…dk into bc/reconnect

just kidding, had to update more comments and remove unused vars

d714378

fix onlinestatetracker constructor

55e8f46

continue, make spec tests pass

d202893

resolve comments: comments, code ordering, rename to connectivity_att…

411ed9d

…empt_timer

Merge branch 'master' into bc/reconnect

1519a40

separate online_state_timeout from connectivity_attempt_timeout

5bce0a0

update comments

ba515b7

Merge branch 'master' into bc/reconnect

110fdc3

Merge branch 'master' into bc/reconnect

c9cafd5

update comments

b77e02e

working with logging comments for future debugging

31d0ad8

ready for review

8b2ad17

thebrianchen added the api: firestore label Dec 6, 2019

thebrianchen self-assigned this Dec 6, 2019

google-oss-bot added the size/L label Dec 6, 2019

googlebot added the cla: yes Override cla label Dec 6, 2019

Merge branch 'master' into bc/reconnect-grpc

af9ed48

thebrianchen requested a review from mikelehen December 6, 2019 22:06

thebrianchen assigned mikelehen and unassigned thebrianchen Dec 6, 2019

mikelehen suggested changes Dec 9, 2019

mikelehen assigned thebrianchen and unassigned mikelehen Dec 9, 2019

resolve michael comments with runBidi, has comments

1cca6a4

change close() from protected to private

a5dba09

thebrianchen requested a review from mikelehen December 10, 2019 01:40

thebrianchen assigned mikelehen and unassigned thebrianchen Dec 10, 2019

mikelehen suggested changes Dec 10, 2019

mikelehen assigned thebrianchen and unassigned mikelehen Dec 10, 2019

added logging, fixed comments

51fac1c

thebrianchen requested a review from wilhuff December 10, 2019 21:21

thebrianchen assigned wilhuff and unassigned thebrianchen Dec 10, 2019

wilhuff suggested changes Dec 11, 2019

wilhuff assigned thebrianchen and unassigned wilhuff Dec 12, 2019

fix backoff maxDelay, add comments, some renaming

595046f

thebrianchen requested a review from wilhuff December 12, 2019 21:05

thebrianchen assigned wilhuff and unassigned thebrianchen Dec 12, 2019

wilhuff approved these changes Dec 20, 2019

wilhuff assigned mikelehen and unassigned wilhuff Dec 20, 2019

mikelehen approved these changes Dec 20, 2019

mikelehen assigned thebrianchen and unassigned mikelehen Dec 20, 2019

Brian Chen added 2 commits January 8, 2020 05:39

Merge branch 'master' into bc/reconnect-grpc

a3fc304

comment fixes and always clear connectivity timer

36e448f

thebrianchen merged commit fa9a8c7 into master Jan 8, 2020

thebrianchen deleted the bc/reconnect-grpc branch January 8, 2020 14:49

firebase locked and limited conversation to collaborators Feb 8, 2020
