Improve recovery #245

acogoluegnes · 2022-11-25T17:22:09Z

Create a locator connection for each provided URI. This way a connection can take over when the locator node goes down. This speeds up recovery.

Track scheduled tasks. This is likely to be disabled in a stable release. Useful to track down unfinished tasks.

Add retry to operations in the consumer coordinator.

Refresh consumer candidate nodes if the re-assignment of a consumer times out.

These improvements are based on feedback from a rolling restart in K8S using stream-perf-test: not all producers and consumers recovered after all nodes had been restarted. The changes in this commit mitigate this problem.
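The first change, one locator connection per provided URI, can be sketched as follows. This is a minimal illustration and not the client's actual code: `LocatorFailoverSketch`, `Locator`, and the node host names are hypothetical, and a real locator would wrap a network connection rather than a boolean flag.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these types are illustrative stand-ins, not library classes.
final class LocatorFailoverSketch {

  // Stand-in for a locator connection; "open" simulates the connection state.
  static final class Locator {
    final String uri;
    volatile boolean open = true;

    Locator(String uri) {
      this.uri = uri;
    }
  }

  final List<Locator> locators = new ArrayList<>();

  // One locator is created up front for each provided URI.
  LocatorFailoverSketch(List<String> uris) {
    for (String uri : uris) {
      locators.add(new Locator(uri));
    }
  }

  // Returns the first locator that is still open, so another connection
  // can take over as soon as the current locator node goes down.
  Locator current() {
    for (Locator locator : locators) {
      if (locator.open) {
        return locator;
      }
    }
    throw new IllegalStateException("No locator available");
  }
}
```

Because the other connections already exist, a lookup after a node failure only has to skip the closed entry instead of waiting for a reconnection.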

This layer would contain the client connections for a given node. It was difficult to keep those structures consistent, so we're better off not using this layer. Client connections are now looked up with a linear scan, which is good enough, as the number of connections should remain under a thousand.
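A linear-scan lookup of this kind might look like the sketch below. It is an assumption-laden illustration: `ConnectionLookupSketch`, `ClientConnection`, and `findByNode` are hypothetical names, and the real lookup criteria are not stated in this pull request.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch: a single flat list replaces a per-node index.
final class ConnectionLookupSketch {

  // Stand-in for a client connection bound to a node.
  static final class ClientConnection {
    final String node;

    ClientConnection(String node) {
      this.node = node;
    }
  }

  private final List<ClientConnection> connections = new CopyOnWriteArrayList<>();

  void add(String node) {
    connections.add(new ClientConnection(node));
  }

  // O(n) scan over all connections: there is no second structure to keep
  // consistent, and n is expected to stay small.
  Optional<ClientConnection> findByNode(String node) {
    for (ClientConnection connection : connections) {
      if (connection.node.equals(node)) {
        return Optional.of(connection);
      }
    }
    return Optional.empty();
  }
}
```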

Improve recovery

acogoluegnes added 3 commits November 29, 2022 11:22

Complete re-assignment if consumer is closed

f753c1e

Prevent concurrent consumer recovery

10a0887

Another recovery can kick in while a first one is in progress. As we added retry, the first one can retry and refresh its connection, so both will succeed and we end up with more consumers than expected.
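One common way to prevent two recoveries from running at once is a compare-and-set guard, sketched below. This is not necessarily how the commit does it: `RecoveryGuardSketch` and `recoverOnce` are hypothetical names used only for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: at most one recovery runs at a time per consumer.
final class RecoveryGuardSketch {

  private final AtomicBoolean recovering = new AtomicBoolean(false);

  // Runs the recovery unless one is already in progress.
  // Returns true if this call ran the recovery, false if it was skipped.
  boolean recoverOnce(Runnable recovery) {
    if (!recovering.compareAndSet(false, true)) {
      return false; // another recovery already owns this consumer
    }
    try {
      recovery.run();
      return true;
    } finally {
      recovering.set(false); // allow later recoveries
    }
  }
}
```

With such a guard, a second recovery triggered while the first one is retrying is dropped, so only one consumer is re-created.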

acogoluegnes force-pushed the recovery-improvements branch from 065fd49 to 10a0887 November 29, 2022 10:22

acogoluegnes added 8 commits November 29, 2022 18:06

Retry consumer recovery on more conditions

c2a1626

Retry when the stream is not available, and even when there are no candidates at the moment (with some retry delay added for the latter).
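Retrying on selected conditions with a delay between attempts can be sketched like this. The helper and the exception type are hypothetical (`RetrySketch`, `NoCandidateException`), and the fixed delay is an assumption; the actual code may use a different policy.

```java
import java.time.Duration;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical sketch: retry an operation while the failure is considered transient.
final class RetrySketch {

  // Illustrative stand-in for "no candidate node at the moment".
  static final class NoCandidateException extends RuntimeException {}

  static <T> T retry(
      Supplier<T> operation,
      Predicate<RuntimeException> retryable,
      int maxAttempts,
      Duration delay) {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    RuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return operation.get();
      } catch (RuntimeException e) {
        if (!retryable.test(e)) {
          throw e; // not a transient condition: fail immediately
        }
        last = e;
      }
      if (attempt < maxAttempts) {
        try {
          Thread.sleep(delay.toMillis()); // wait before the next attempt
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("Interrupted while waiting to retry", ie);
        }
      }
    }
    throw last; // every attempt failed with a retryable error
  }
}
```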

Release permits only if possible

39fe702

Rethrow StreamExceptions in Client

1b71be8

Instead of wrapping them in StreamExceptions. This makes timeout exceptions easier to deal with.
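The difference between rethrowing and wrapping can be illustrated with the sketch below. The two exception classes are simplified stand-ins defined here for the example, and `call` is a hypothetical helper, not the actual `Client` code.

```java
import java.util.function.Supplier;

// Hypothetical sketch: keep the concrete exception type when it is already
// a stream exception, and wrap only unrelated failures.
final class RethrowSketch {

  // Simplified stand-ins for the exception hierarchy.
  static class StreamException extends RuntimeException {
    StreamException(String message) {
      super(message);
    }

    StreamException(Throwable cause) {
      super(cause);
    }
  }

  static class TimeoutStreamException extends StreamException {
    TimeoutStreamException(String message) {
      super(message);
    }
  }

  static <T> T call(Supplier<T> request) {
    try {
      return request.get();
    } catch (StreamException e) {
      throw e; // rethrow as-is: callers can still catch the timeout subtype
    } catch (RuntimeException e) {
      throw new StreamException(e); // wrap everything else
    }
  }
}
```

A caller can then catch `TimeoutStreamException` directly instead of unwrapping the cause of a generic exception.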

Base retry completion on backoff delay policy

e62b82e

Bump SLF4J and logback for performance tool

f168ca8

References #200

Merge branch 'main' into recovery-improvements

53cda39

acogoluegnes added this to the 0.9.0 milestone Dec 14, 2022

acogoluegnes marked this pull request as ready for review December 14, 2022 16:03

acogoluegnes merged commit b329976 into main Dec 14, 2022

acogoluegnes deleted the recovery-improvements branch December 14, 2022 16:03

github-actions bot pushed a commit that referenced this pull request Dec 14, 2022

Merge pull request #245 from rabbitmq/recovery-improvements

8196584
