Skip to content

Fix Android Connectivity Monitor (v2) #1045

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 27 commits into from
Jan 8, 2020
Merged
Show file tree
Hide file tree
Changes from 23 commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
11ca8a5
did the thing
Oct 23, 2019
448311d
comment fixes
Oct 23, 2019
403f91f
Merge branch 'master' into bc/reconnect
Oct 30, 2019
00568be
resolved comments
Oct 30, 2019
c865827
Merge branch 'bc/reconnect' of github.com:firebase/firebase-android-s…
Oct 30, 2019
d714378
just kidding, had to update more comments and remove unused vars
Oct 30, 2019
55e8f46
fix onlinestatetracker constructor
Oct 30, 2019
d202893
continue, make spec tests pass
Nov 1, 2019
411ed9d
resolve comments: comments, code ordering, rename to connectivity_att…
Nov 5, 2019
1519a40
Merge branch 'master' into bc/reconnect
Nov 5, 2019
5bce0a0
separate online_state_timeout from connectivity_attempt_timeout
Nov 12, 2019
ba515b7
update comments
Nov 12, 2019
110fdc3
Merge branch 'master' into bc/reconnect
Nov 20, 2019
c9cafd5
Merge branch 'master' into bc/reconnect
Dec 3, 2019
b77e02e
update comments
Dec 3, 2019
31d0ad8
working with logging comments for future debugging
Dec 6, 2019
8b2ad17
ready for review
Dec 6, 2019
af9ed48
Merge branch 'master' into bc/reconnect-grpc
Dec 6, 2019
1cca6a4
resolve michael comments with runBidi, has comments
Dec 9, 2019
86b4733
working in grpc exclusively with logs
Dec 10, 2019
873d83d
remove logs
Dec 10, 2019
d321316
remove markChannelIdle()
Dec 10, 2019
a5dba09
change close() from protected to private
Dec 10, 2019
51fac1c
added logging, fixed comments
Dec 10, 2019
595046f
fix backoff maxDelay, add comments, some renaming
Dec 12, 2019
a3fc304
Merge branch 'master' into bc/reconnect-grpc
Jan 8, 2020
36e448f
comment fixes and always clear connectivity timer
Jan 8, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -174,6 +174,11 @@ public void run() {
/** The time a stream stays open after it is marked idle. */
private static final long IDLE_TIMEOUT_MS = TimeUnit.MINUTES.toMillis(1);

/**
* Maximum backoff time for reconnecting when we know the connection is failed on the client-side.
*/
private static final long DNS_FAILURE_BACKOFF_MAX_DELAY_MS = TimeUnit.SECONDS.toMillis(10);

@Nullable private DelayedTask idleTimer;

private final FirestoreChannel firestoreChannel;
Expand Down Expand Up @@ -290,18 +295,22 @@ private void close(State finalState, Status status) {
if (code == Code.OK) {
// If this is an intentional close ensure we don't delay our next connection attempt.
backoff.reset();

} else if (code == Code.RESOURCE_EXHAUSTED) {
Logger.debug(
getClass().getSimpleName(),
"(%x) Using maximum backoff delay to prevent overloading the backend.",
System.identityHashCode(this));
backoff.resetToMax();

} else if (code == Code.UNAUTHENTICATED) {
// "unauthenticated" error means the token was rejected. Try force refreshing it in case it
// just expired.
firestoreChannel.invalidateToken();
} else if (code == Code.UNAVAILABLE) {
// This exception is thrown when the gRPC connection fails on the client side, To shorten
// reconnect time, we can use a shorter max delay when reconnecting.
if (status.getCause() instanceof java.net.UnknownHostException) {
backoff.setTemporaryMaxDelay(DNS_FAILURE_BACKOFF_MAX_DELAY_MS);
}
}

if (finalState != State.Error) {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -23,13 +23,16 @@
import com.google.android.gms.tasks.Tasks;
import com.google.firebase.firestore.core.DatabaseInfo;
import com.google.firebase.firestore.util.AsyncQueue;
import com.google.firebase.firestore.util.AsyncQueue.DelayedTask;
import com.google.firebase.firestore.util.AsyncQueue.TimerId;
import com.google.firebase.firestore.util.Executors;
import com.google.firebase.firestore.util.Logger;
import com.google.firebase.firestore.util.Supplier;
import com.google.firestore.v1.FirestoreGrpc;
import io.grpc.CallCredentials;
import io.grpc.CallOptions;
import io.grpc.ClientCall;
import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.MethodDescriptor;
Expand All @@ -45,11 +48,23 @@ public class GrpcCallProvider {

private static Supplier<ManagedChannelBuilder<?>> overrideChannelBuilderSupplier;

private final Task<ManagedChannel> channelTask;
private Task<ManagedChannel> channelTask;
private final AsyncQueue asyncQueue;

private CallOptions callOptions;

// A timer that elapses after CONNECTIVTY_ATTEMPT_TIMEOUT_MS, at which point we restart the
// channel and try connecting again.
private DelayedTask connectivityAttemptTimer;

// This timeout is used when attempting to establish a connection in gRPC. If a connection attempt
// does not succeed in time, we restart the channel, rather than having it hang indefinitely.
private static final int CONNECTIVITY_ATTEMPT_TIMEOUT_MS = 15 * 1000;

private final Context context;
private final DatabaseInfo databaseInfo;
private final CallCredentials firestoreHeaders;

/**
* Helper function to globally override the channel that RPCs use. Useful for testing when you
* want to bypass SSL certificate checking.
Expand All @@ -69,24 +84,11 @@ public static void overrideChannelBuilder(
DatabaseInfo databaseInfo,
CallCredentials firestoreHeaders) {
this.asyncQueue = asyncQueue;
this.context = context;
this.databaseInfo = databaseInfo;
this.firestoreHeaders = firestoreHeaders;

// We execute network initialization on a separate thread to not block operations that depend on
// the AsyncQueue.
this.channelTask =
Tasks.call(
Executors.BACKGROUND_EXECUTOR,
() -> {
ManagedChannel channel = initChannel(context, databaseInfo);
FirestoreGrpc.FirestoreStub firestoreStub =
FirestoreGrpc.newStub(channel)
.withCallCredentials(firestoreHeaders)
// Ensure all callbacks are issued on the worker queue. If this call is
// removed, all calls need to be audited to make sure they are executed on the
// right thread.
.withExecutor(asyncQueue.getExecutor());
callOptions = firestoreStub.getCallOptions();
return channel;
});
initChannelTask();
}

/** Sets up the SSL provider and configures the gRPC channel. */
Expand Down Expand Up @@ -198,4 +200,71 @@ void shutdown() {
Thread.currentThread().interrupt();
}
}

/**
* Cancels the connectivityStateTimer if the new state indicates gRPC is online. Starts the timer
* if the new state indicates gRPC is connecting.
*/
private void onConnectivityStateChange(ManagedChannel channel) {
ConnectivityState newState = channel.getState(false);
// Check that the new state is online, then cancel timer.
if (newState == ConnectivityState.READY) {
clearConnectivityTimer();
} else if (newState == ConnectivityState.CONNECTING) {
connectivityAttemptTimer =
asyncQueue.enqueueAfterDelay(
TimerId.CONNECTIVITY_ATTEMPT_TIMER,
CONNECTIVITY_ATTEMPT_TIMEOUT_MS,
() -> {
clearConnectivityTimer();
resetChannel(channel);
});
}
// Re-listen for next state change.
channel.notifyWhenStateChanged(
newState, () -> asyncQueue.enqueueAndForget(() -> onConnectivityStateChange(channel)));
}

/**
* Shuts down and re-initializes the channel.
*
* <p>This is used when the connectivity attempt timer elapses and we need to reset the gRPC
* channel to re-establish connectivity.
*/
private void resetChannel(ManagedChannel channel) {
asyncQueue.enqueueAndForget(
() -> {
channel.shutdownNow();
initChannelTask();
});
}

private void initChannelTask() {
// We execute network initialization on a separate thread to not block operations that depend on
// the AsyncQueue.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One potential concern with recreating the channel is it could be expensive. We're already doing it on a background thread for performance reasons, so it's possible this adds some meaningful delay.

As part of your logging, I'd recommend logging the start/finish of this so that we can get timing data and see how long this takes each time.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's taking around 15-40ms to reset the channel from what I've seen based on the logs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm. That is longer than I hoped. I don't know if it's a problem or not. I suggest getting input from Gil and gRPC folks (and perhaps point out this delay to gRPC and see if they have other suggestions for implementing a connection timeout).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, we're spending an extra 40 ms to bring the reaction time on failed reconnects down from two minutes to 15 seconds. While 40 ms isn't exactly cheap it creates a big enough win that it seems worthwhile.

Also note that this only kicks in when Android's own network transition logic isn't kicking in. That we're not being inundated with requests for this feature suggests that it's going to be fairly rare.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

this.channelTask =
Tasks.call(
Executors.BACKGROUND_EXECUTOR,
() -> {
ManagedChannel channel = initChannel(context, databaseInfo);
onConnectivityStateChange(channel);
FirestoreGrpc.FirestoreStub firestoreStub =
FirestoreGrpc.newStub(channel)
.withCallCredentials(firestoreHeaders)
// Ensure all callbacks are issued on the worker queue. If this call is
// removed, all calls need to be audited to make sure they are executed on the
// right thread.
.withExecutor(asyncQueue.getExecutor());
callOptions = firestoreStub.getCallOptions();
return channel;
});
}

/** Clears the connectivity timer if it exists. */
private void clearConnectivityTimer() {
if (connectivityAttemptTimer != null) {
connectivityAttemptTimer.cancel();
connectivityAttemptTimer = null;
}
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -143,8 +143,8 @@ void handleWatchStreamFailure(Status status) {
logClientOfflineWarningIfNecessary(
String.format(
Locale.ENGLISH,
"Connection failed %d times. Most recent error: %s",
MAX_WATCH_STREAM_FAILURES,
"Backend didn't respond within %d seconds. Most recent error: %s\n",
ONLINE_STATE_TIMEOUT_MS / 1000,
status));
setAndBroadcastState(OnlineState.OFFLINE);
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -74,7 +74,12 @@ public enum TimerId {
* A timer used to retry transactions. Since there can be multiple concurrent transactions,
* multiple of these may be in the queue at a given time.
*/
RETRY_TRANSACTION
RETRY_TRANSACTION,
/**
* A timer used to monitor when a connection attempt in gRPC is unsuccessful and retry
* accordingly.
*/
CONNECTIVITY_ATTEMPT_TIMER
}

/**
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -36,8 +36,8 @@ public class ExponentialBackoff {
private final TimerId timerId;
private final long initialDelayMs;
private final double backoffFactor;
private final long maxDelayMs;

private long maxDelayMs;
private long currentBaseMs;
private long lastAttemptTime;
private DelayedTask timerTask;
Expand Down Expand Up @@ -103,6 +103,16 @@ public void resetToMax() {
currentBaseMs = maxDelayMs;
}

/**
* Set the backoff's maximum delay for only the next call to backoffAndRun, after which the delay
* will be reset to maxDelayMs.
*
* @param newMax The temporary maximum delay to set.
*/
public void setTemporaryMaxDelay(long newMax) {
maxDelayMs = newMax;
}

/**
* Waits for currentDelayMs, increases the delay and runs the specified task. If there was a
* pending backoff task waiting to run already, it will be canceled.
Expand Down Expand Up @@ -151,6 +161,9 @@ public void backoffAndRun(Runnable task) {
} else if (currentBaseMs > maxDelayMs) {
currentBaseMs = maxDelayMs;
}

// Reset max delay to the default.
maxDelayMs = DEFAULT_BACKOFF_MAX_DELAY_MS;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ignored the comments I made on the last version of this PR.

The max delay was configured in the constructor and may not actually be DEFAULT_BACKOFF_MAX_DELAY_MS. You can't read this constant here.

Instead, save two max delay fields in the constructor. One should be final, as maxDelayMs is today and it should be used here as the value to which to reset. The second value should be the one you actually bounce around based on which kind of error is in effect.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (attempted).

}

public void cancel() {
Expand Down