Skip to content

Fix Android Connectivity Monitor (v2) #1045

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 27 commits into from
Jan 8, 2020
Merged

Conversation

thebrianchen
Copy link

@thebrianchen thebrianchen commented Dec 6, 2019

Continuation of #937 but instead of binding connectivity attempts to the WatchStream, this version bases connectivity attempts on grpc's ConnectivityState.

Note that this implementation also depends on two ManagedChannel APIs that are labeled as @ExperimentalApi and require a minimum grpc version v1.6.1.: ManagedChannel.getState() and ManagedChannel.notifyWhenStateChanged().

@thebrianchen thebrianchen self-assigned this Dec 6, 2019
@googlebot googlebot added the cla: yes Override cla label Dec 6, 2019
@thebrianchen
Copy link
Author

/retest

Copy link
Contributor

@mikelehen mikelehen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some (mostly high-level) feedback / questions...

@@ -174,6 +174,9 @@ public void run() {
/** The time a stream stays open after it is marked idle. */
private static final long IDLE_TIMEOUT_MS = TimeUnit.MINUTES.toMillis(1);

/** Maximum backoff time when reconnecting. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"when reconnecting" doesn't really do anything to distinguish this from the existing BACKOFF_MAX_DELAY_MS which is also used when reconnecting.

I think the intention is that this max backoff delay is used when we're 100% sure that the connection failed on the client-side and didn't even reach the server. In practice, I think the only case we can be sure of that is DNS failures (see my other comment about ConnectException). If that's the case, then maybe we can call this DNS_FAILURE_BACKOFF_MAX_DELAY_MS or something...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about CLIENT_NETWORK_FAILURE_BACKOFF_MAX_DELAY_MS?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed that BACKOFF should maybe be first (to group this with the other backoff-related constants). So: BACKOFF_CLIENT_NETWORK_FAILURE_MAX_DELAY_MS?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW- I don't feel strongly. Just trying to figure out how to make the name parse more sensibly...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to Michael's suggestion.

@mikelehen mikelehen assigned thebrianchen and unassigned mikelehen Dec 9, 2019
@thebrianchen
Copy link
Author

Did another pass through, moved logic exclusively to gRPC side and removed the markChannelIdle() code.

Copy link
Contributor

@mikelehen mikelehen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally speaking, I think I'm tentatively a fan!

I'd recommend doing a pass over my feedback (if there's anything that seems time consuming, you can defer for now) and then run this by Gil to see what he thinks generally. I don't 100% know if he'll be onboard with recreating the channel. If we can get timing information on that to prove to ourselves that it isn't too expensive, that would be helpful.

We might finally want to run it by the gRPC team to see if we can get their blessing on this approach (or maybe they can suggest something better). If nothing else, it may be useful feedback to them regarding what our needs are, to better inform their solution for grpc/grpc-java#1943.


private void initChannelTask() {
// We execute network initialization on a separate thread to not block operations that depend on
// the AsyncQueue.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One potential concern with recreating the channel is it could be expensive. We're already doing it on a background thread for performance reasons, so it's possible this adds some meaningful delay.

As part of your logging, I'd recommend logging the start/finish of this so that we can get timing data and see how long this takes each time.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's taking around 15-40ms to reset the channel from what I've seen based on the logs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm. That is longer than I hoped. I don't know if it's a problem or not. I suggest getting input from Gil and gRPC folks (and perhaps point out this delay to gRPC and see if they have other suggestions for implementing a connection timeout).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, we're spending an extra 40 ms to bring the reaction time on failed reconnects down from two minutes to 15 seconds. While 40 ms isn't exactly cheap it creates a big enough win that it seems worthwhile.

Also note that this only kicks in when Android's own network transition logic isn't kicking in. That we're not being inundated with requests for this feature suggests that it's going to be fairly rare.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@mikelehen mikelehen assigned thebrianchen and unassigned mikelehen Dec 10, 2019

private void initChannelTask() {
// We execute network initialization on a separate thread to not block operations that depend on
// the AsyncQueue.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, we're spending an extra 40 ms to bring the reaction time on failed reconnects down from two minutes to 15 seconds. While 40 ms isn't exactly cheap it creates a big enough win that it seems worthwhile.

Also note that this only kicks in when Android's own network transition logic isn't kicking in. That we're not being inundated with requests for this feature suggests that it's going to be fairly rare.

@@ -174,6 +174,9 @@ public void run() {
/** The time a stream stays open after it is marked idle. */
private static final long IDLE_TIMEOUT_MS = TimeUnit.MINUTES.toMillis(1);

/** Maximum backoff time when reconnecting. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about CLIENT_NETWORK_FAILURE_BACKOFF_MAX_DELAY_MS?

@@ -53,6 +55,11 @@
private static final String X_GOOG_API_CLIENT_VALUE =
"gl-java/ fire/" + BuildConfig.VERSION_NAME + " grpc/";

// This timeout is used when attempting to establish a connection. If a connection attempt
// does not succeed in time, we close the stream and restart the connection, rather than having
Copy link
Contributor

@wilhuff wilhuff Dec 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timer callback does not seem to "close the stream" in any way that I can tell. Based on comments in markChannelIdle, the first stream continues? Does this actually close the stream, or is this a side effect?

From what I can tell, the connectivityAttemptTimer merely starts a new clientCall (through a ~recursive invocation of runBidiStreamingRpc). What prevents the observer from getting callbacks from the old stream once that fails ~2 minutes later?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're looking at an old version of the PR? 😬 This code is gone in the latest version, right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the outdated version. The new one can be found here.

@@ -151,6 +161,9 @@ public void backoffAndRun(Runnable task) {
} else if (currentBaseMs > maxDelayMs) {
currentBaseMs = maxDelayMs;
}

// Reset max delay to the default.
maxDelayMs = DEFAULT_BACKOFF_MAX_DELAY_MS;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ignored the comments I made on the last version of this PR.

The max delay was configured in the constructor and may not actually be DEFAULT_BACKOFF_MAX_DELAY_MS. You can't read this constant here.

Instead, save two max delay fields in the constructor. One should be final, as maxDelayMs is today and it should be used here as the value to which to reset. The second value should be the one you actually bounce around based on which kind of error is in effect.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (attempted).

@wilhuff wilhuff assigned thebrianchen and unassigned wilhuff Dec 12, 2019
Copy link
Contributor

@wilhuff wilhuff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a final suggestion.


/**
* The maximum backoff time used when calculating the next backoff. This value can be changed for
* a single backoffAndRun call, after which it resets to maxDelayMs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@wilhuff wilhuff assigned mikelehen and unassigned wilhuff Dec 20, 2019
Copy link
Contributor

@mikelehen mikelehen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one remaining concern.

I still think it may be worth running this by the gRPC folks (on the existing email thread) before we submit it so that they have a chance to sanity-check our approach and so that they're aware that we're going to these lengths to work around gRPC's current behavior. Thanks!


private void initChannelTask() {
// We execute network initialization on a separate thread to not block operations that depend on
// the AsyncQueue.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@mikelehen mikelehen assigned thebrianchen and unassigned mikelehen Dec 20, 2019
@thebrianchen thebrianchen merged commit fa9a8c7 into master Jan 8, 2020
@thebrianchen thebrianchen deleted the bc/reconnect-grpc branch January 8, 2020 14:49
@firebase firebase locked and limited conversation to collaborators Feb 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants