-
You haven't provided a lot of detail on what exactly you do in your test. Assuming that you restart a node hosting a stream leader replica, there are several factors at play.
Overall, it can take 20 to 60s with all defaults. For a lot of environments, that would hardly be "very long". With heartbeat timeouts lowered to 5s I can see how it could get into the 10-15s range. Even if some tuning gave you a sub-5s recovery time, there would be a very real cost: false positives when networks or peers slow down.
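If the slow part is failure detection rather than leader election, some of this can be tuned on the client side. Here is a minimal sketch against the stream Java client's `EnvironmentBuilder`; the values are illustrative rather than recommendations, and Stream PerfTest may not expose all of these settings:

```java
import java.time.Duration;

import com.rabbitmq.stream.BackOffDelayPolicy;
import com.rabbitmq.stream.Environment;

public class TunedEnvironment {
    public static void main(String[] args) {
        Environment environment = Environment.builder()
            .uri("rabbitmq-stream://localhost:5552")
            // Lower heartbeat: dead TCP connections are noticed sooner,
            // at the cost of more false positives on slow networks.
            .requestedHeartbeat(Duration.ofSeconds(5))
            // Retry connection recovery every 2 seconds instead of the default.
            .recoveryBackOffDelayPolicy(
                BackOffDelayPolicy.fixedWithInitialDelay(
                    Duration.ofSeconds(2), Duration.ofSeconds(2)))
            .build();
        // ... create producers/consumers, then:
        environment.close();
    }
}
```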
-
In the case of Kubernetes or similar container orchestration systems, there is one more factor at play: the Kubernetes Service that client connections go through. How it behaves during a failover entirely depends on the type and configuration of the K8S service.
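For context, a hypothetical sketch of the two kinds of Service that come up later in this thread, following the RabbitMQ cluster operator's naming for a cluster called message-broker (labels and ports are assumptions for illustration):

```yaml
# Client-facing Service: ClusterIP by default; switching the type
# (LoadBalancer, NodePort) changes what clients observe during a failover.
apiVersion: v1
kind: Service
metadata:
  name: message-broker
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: message-broker
  ports:
    - name: stream
      port: 5552
---
# Headless Service: no virtual IP, just per-pod DNS records such as
# message-broker-server-1.message-broker-nodes.<namespace>.
apiVersion: v1
kind: Service
metadata:
  name: message-broker-nodes
spec:
  clusterIP: None
  selector:
    app.kubernetes.io/name: message-broker
  ports:
    - name: stream
      port: 5552
```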
-
The default is 10s. Lowering it to 5s could help. I don't know if Stream PerfTest exposes this setting (likely not), and I would not recommend very low values, for the aforementioned real (not hypothetical) risk of false positives around timeouts.
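Assuming the setting discussed here is the Environment's RPC timeout (whose documented default is indeed 10 seconds), lowering it in the Java client would look roughly like this, reusing the skeleton from the earlier sketch:

```java
import java.time.Duration;

import com.rabbitmq.stream.Environment;

// Assumption: the 10s default discussed above is the Environment's RPC timeout.
Environment environment = Environment.builder()
    .uri("rabbitmq-stream://localhost:5552")
    .rpcTimeout(Duration.ofSeconds(5)) // default is 10s; very low values risk false timeouts
    .build();
```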
-
Hi again. Over the last few days I focused on testing the RabbitMQ cluster a bit more and on gathering logs from different places. I created 3 extra VMs (which are at the same time Kubernetes nodes) and placed one of the RabbitMQ cluster pods on each of them. To be sure that the failover delay does not come from the load balancer, I started the PerfTool as a pod inside the Kubernetes cluster, like this:

PerfTool as pod

What happened (note: the errors sometimes only appear after the second iteration of the failover test):

PerfTool error messages

It states: java.net.UnknownHostException: message-broker-server-1.message-broker-nodes.message-broker: Name or service not known, but that hostname is actually valid. I ran nslookup on the same pod and could resolve it; the hostname is generated automatically by the RMQ cluster operator. "message-broker-nodes" in this case is the headless service created by the RMQ cluster operator. I don't know whether this is a bug or not.

If I use the --load-balancer flag during the PerfTest, I don't get these errors, but I don't know if that makes sense. I would expect the cluster Service created by the RMQ cluster operator not to be treated as a load balancer: by default it is a ClusterIP service, and its endpoints can be reached from anywhere in the cluster with the help of the headless service, which also comes automatically with the RMQ Cluster Operator.

If you are interested in pursuing the errors coming from the PerfTool, I can provide some more information below. To better understand what is happening in the background, I installed Hubble from Cilium to capture the traffic. Here are the results:

Traffic between PerfTool (Client) and RMQ Cluster
At the beginning:
After the failure:

And here are also the logs from the RabbitMQ cluster nodes that stayed up during the failover:

Logs from RabbitMQ cluster nodes

If I read these logs correctly, it took around 15 seconds in this case to elect the new leader, right?

p.s.
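Regarding the --load-balancer flag: if I understand it correctly, it corresponds to the address-resolver workaround described in the stream Java client documentation. The client normally asks the cluster for the stream topology and then reconnects to the advertised per-node hostnames (the message-broker-server-*.message-broker-nodes.* names from the exception above); with an address resolver it always goes back through the single entry point instead. A rough sketch, with message-broker standing in for the client-facing service:

```java
import com.rabbitmq.stream.Address;
import com.rabbitmq.stream.Environment;

// Every (re)connection goes through this one entry point instead of
// the per-node hostnames advertised by the cluster.
Address entryPoint = new Address("message-broker", 5552);
Environment environment = Environment.builder()
    .host(entryPoint.host())
    .port(entryPoint.port())
    .addressResolver(address -> entryPoint)
    .build();
```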
-
@albionb96 I fixed a bug in the stream Java client that was likely the cause of the long recovery times and of the unnecessary consumer connections under certain conditions.
-
@albionb96 You can enable logging for a couple of classes:
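As a starting point, assuming the tool logs through SLF4J with Logback, a hypothetical logback.xml that enables debug logging for the whole com.rabbitmq.stream package (a superset of any individual classes) could look like this:

```xml
<!-- Hypothetical logback.xml: package-level debug logging for the stream client. -->
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <!-- Covers connection recovery, topology updates, producer/consumer coordination. -->
  <logger name="com.rabbitmq.stream" level="debug"/>
  <root level="info">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```

With the standalone JAR, standard Logback behavior lets you pass this as java -Dlogback.configurationFile=file:logback.xml -jar stream-perf-test.jar ..., assuming Logback is on the classpath.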
-
As has already been mentioned, streams rely on Erlang monitors for failure detection. In the case of a network partition, monitors of processes on a remote node depend on the Erlang distribution connection timing out before the monitor notifications are sent. By default, this can take around a minute, and that is what we see in this case. Because the nodes are force-stopped, they leave dangling Erlang distribution TCP connections on the other nodes. Those nodes don't know that a peer is gone, as they rely on the TCP connection for this and it looks like it is still there. The only viable option here is to lower that distribution timeout.
Of course, a more orderly shutdown procedure, where the server being shut down has enough time to send FIN or RST packets (as would be the case for upgrades, maintenance, etc.) so that the remote nodes can detect that the connection is lost, will not have this undue delay, and detection will be near-instantaneous.
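Assuming the timeout meant here is the Erlang kernel's net_ticktime (its 60-second default matches the "around a minute" above), it can be lowered via RabbitMQ's advanced.config; the value below is illustrative, and lower values make spurious node-down events more likely:

```erlang
%% advanced.config (assumption: net_ticktime is the timeout meant above).
%% The Erlang distribution declares a peer dead when no traffic or ticks
%% are seen for roughly net_ticktime seconds (default: 60).
[
  {kernel, [
    {net_ticktime, 15}
  ]}
].
```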
-
Describe the bug
Hi,
I ran some failover tests with both stream-perf-test-0.9.0 and stream-perf-test-0.12.0, and although I went through all the available configuration parameters of the tool as well as the config parameters of the RabbitMQ cluster, I was not able to remove or even reduce the downtime shown in the client metrics when a node is killed (the leader node or a node with connections on it).
The downtime varies from 20 to 60 seconds. I also tried running the Docker image of the client inside my Kubernetes cluster, and there the downtime was even longer. I don't know why it takes the client so long to automatically recover the connections.
Reproduction steps
Expected behavior
I would expect not to see any significant downtime during these failover tests, but the perf tool's metrics (latency, rabbitmq_stream_published_total, etc.) show downtimes of 20-60 seconds. I think that is too long for anyone who wants to run RabbitMQ in an HA fashion behind an application that can tolerate, say, at most 1-5 seconds of downtime on failure.
Additional context
I don't know the internals of the Java client in this case, but although it says that by default it creates only one consumer and one producer, I see more than one consumer connection in the connections overview:
