Improve recovery #245

Merged
merged 11 commits into from
Dec 14, 2022

Conversation

acogoluegnes
Contributor

Create a locator connection for each provided URI. This way a connection can take over when the locator node goes down. This speeds up recovery.
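As a rough illustration of the failover idea (not the client's actual code), the sketch below assumes one locator candidate per configured URI and picks the first one that is still usable. The class name, the `pickLocator` helper, the `isUsable` predicate, and the host names are all hypothetical.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Illustrative only: names are hypothetical, not the client's internals.
final class LocatorFailoverSketch {

  // Returns the first URI whose locator connection is currently usable.
  static Optional<String> pickLocator(List<String> uris, Predicate<String> isUsable) {
    return uris.stream().filter(isUsable).findFirst();
  }

  public static void main(String[] args) {
    List<String> uris = List.of("node-0:5552", "node-1:5552", "node-2:5552");
    // Pretend node-0 is restarting: the connection to node-1 takes over.
    Optional<String> locator = pickLocator(uris, uri -> !uri.startsWith("node-0"));
    System.out.println(locator.orElse("no locator available"));
  }
}
```

Because a connection already exists for every URI, switching locators is a lookup rather than a reconnect, which is where the recovery speed-up would come from.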

Track scheduled tasks. This is likely to be disabled in a stable release. Useful to track down unfinished tasks.

Add retry to operations in the consumer coordinator.

Refresh consumer candidate nodes if the re-assignment of a consumer times out.
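The two points above (retrying coordinator operations, and refreshing candidates when a re-assignment times out) could be combined roughly as follows. This is a minimal sketch under stated assumptions: `retry`, `refreshCandidates`, the attempt limit, and the fixed delay are hypothetical and do not reflect the coordinator's real API or back-off policy.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

// Illustrative only: a generic retry helper, not the coordinator's code.
final class RetrySketch {

  static <T> T retry(
      Callable<T> operation, Runnable refreshCandidates, int maxAttempts, Duration delay)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return operation.call();
      } catch (InterruptedException e) {
        // Do not retry after interruption; restore the flag and give up.
        Thread.currentThread().interrupt();
        throw e;
      } catch (TimeoutException e) {
        // The candidate node may be gone: refresh before the next attempt.
        last = e;
        refreshCandidates.run();
      } catch (Exception e) {
        last = e;
      }
      if (attempt < maxAttempts) {
        Thread.sleep(delay.toMillis());
      }
    }
    throw last;
  }
}
```

A caller would pass the re-assignment as `operation` and a metadata lookup as `refreshCandidates`, so that a timed-out attempt does not keep targeting a node that has gone away.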

These improvements are based on feedback from a rolling restart in K8S using stream-perf-test: not all producers and consumers recovered after all nodes had been restarted. The changes in this commit mitigate this problem.

Another recovery can kick in while a first one is in progress.
As we added retry, the first one can retry and refresh its
connection, so both succeed and we end up with more
consumers than expected.

When the stream is not available, and even when there are no
candidates at the moment (added some retry delay for the latter).

It would contain the client connections for a given
node. It was difficult to maintain the consistency
between those structures, so we're better off not using
this layer.

Client connections are now looked up with a linear scan,
which is good enough, as the number of connections
should remain under a thousand.
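
A linear scan of this kind might look like the sketch below. It assumes a flat list of connections, each knowing its node and its current subscription count; `ClientConnection`, `findWithCapacity`, and the capacity limit are hypothetical names for illustration, not the client's internals.

```java
import java.util.List;
import java.util.Optional;

// Illustrative only: a flat list replaces the per-node lookup structure.
final class LinearScanSketch {

  // Hypothetical stand-in for a client connection tracked by the coordinator.
  static final class ClientConnection {
    final String node;
    final int subscriptions;

    ClientConnection(String node, int subscriptions) {
      this.node = node;
      this.subscriptions = subscriptions;
    }
  }

  // Scans every connection: O(n), acceptable for fewer than a thousand entries.
  static Optional<ClientConnection> findWithCapacity(
      List<ClientConnection> connections, String node, int maxSubscriptions) {
    for (ClientConnection c : connections) {
      if (c.node.equals(node) && c.subscriptions < maxSubscriptions) {
        return Optional.of(c);
      }
    }
    return Optional.empty();
  }
}
```

With a single list there is only one structure to update when a connection opens or closes, which avoids the consistency problem described above at the cost of a scan per lookup.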
Instead of wrapping them in StreamExceptions. This makes
timeout exceptions easier to deal with.
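
One way to read this note is that a timeout surfaces as its own exception type, so callers can catch it directly instead of inspecting the cause of a generic wrapper. The sketch below shows that pattern only; `SketchStreamException`, `SketchTimeoutException`, and `await` are hypothetical stand-ins, not the library's classes or methods.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative only: stand-in exception types, not the client's actual classes.
final class TimeoutSketch {

  static class SketchStreamException extends RuntimeException {
    SketchStreamException(Throwable cause) {
      super(cause);
    }
  }

  static class SketchTimeoutException extends RuntimeException {
    SketchTimeoutException(Throwable cause) {
      super(cause);
    }
  }

  static <T> T await(CompletableFuture<T> future, Duration timeout) {
    try {
      return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Dedicated type: callers can catch the timeout without unwrapping.
      throw new SketchTimeoutException(e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new SketchStreamException(e);
    } catch (ExecutionException e) {
      throw new SketchStreamException(e.getCause());
    }
  }
}
```

Retry logic can then react to `SketchTimeoutException` specifically, for example by refreshing candidates, while treating other failures differently.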
@acogoluegnes acogoluegnes added this to the 0.9.0 milestone Dec 14, 2022
@acogoluegnes acogoluegnes marked this pull request as ready for review December 14, 2022 16:03
@acogoluegnes acogoluegnes merged commit b329976 into main Dec 14, 2022
@acogoluegnes acogoluegnes deleted the recovery-improvements branch December 14, 2022 16:03
github-actions bot pushed a commit that referenced this pull request Dec 14, 2022