Skip to content

Under certain condition client connection can block for a long period #236

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
dimas opened this issue Jan 7, 2017 · 10 comments
Closed

Under certain condition client connection can block for a long period #236

dimas opened this issue Jan 7, 2017 · 10 comments

Comments

@dimas
Copy link
Contributor

dimas commented Jan 7, 2017

When you have a low heartbeat interval you normally expect RabbitMQ client to quickly recognise connection loss. However, there is a scenario when it is not the case. Pre-requisites for that are:

  • AMQP over SSL. Won't happen on a plain socket
  • some weird network condition causing certain packet to get lost
  • unfortunate timing. Client should be sending message around that "network event". To increase probability and reproduce it more reliably I just made client send a message every second.

So the client just periodically publishes a message. Then happens that mysterious network even that by the looks of it blocks packets passing in one direction, from client to the server, while allowing packets in opposite direction. (There can be more than that as some active network equipment is definitely involved and I blame it for the root cause)

Still, you would expect, a few missing heartbeats (which I set for 30 seconds) and connection will be killed. Not exactly.

  1. The RabbitMQ server indeed very quickly detect missing heartbeats and closes the connection
  2. The client basically hangs for 15 minutes (!) and only after that time it declares connection dead.

There are three threads involved. So my interpretation of what happens is that

  • Application thread. It tries to publish message and goes all the way down to SSL socket write+flush. Socket's write/flush blocks because client cannot get confirmation from the remote TCP stack.
"pinger-1-Thread-1" daemon prio=10 tid=0x9f743688 nid=0x527 runnable [0x9f0fe000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketOutputStream.socketWrite0(Native Method)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
    at sun.security.ssl.OutputRecord.writeBuffer(OutputRecord.java:375)
    at sun.security.ssl.OutputRecord.write(OutputRecord.java:363)
    at sun.security.ssl.SSLSocketImpl.writeRecordInternal(SSLSocketImpl.java:801)
    at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:773)
    at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:119)
    - locked <0xa5870a10> (a sun.security.ssl.AppOutputStream)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    - locked <0xa5872a38> (a java.io.BufferedOutputStream)
    at java.io.DataOutputStream.flush(DataOutputStream.java:123)
    at com.rabbitmq.client.impl.SocketFrameHandler.flush(SocketFrameHandler.java:149)
    at com.rabbitmq.client.impl.AMQConnection.flush(AMQConnection.java:521)
    at com.rabbitmq.client.impl.AMQCommand.transmit(AMQCommand.java:124)
    at com.rabbitmq.client.impl.AMQChannel.quiescingTransmit(AMQChannel.java:333)
    - locked <0xa06a4370> (a java.lang.Object)
    at com.rabbitmq.client.impl.AMQChannel.transmit(AMQChannel.java:309)
    - locked <0xa06a4370> (a java.lang.Object)
    at com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:656)
    at com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:639)
    at com.rabbitmq.client.impl.ChannelN.basicPublish(ChannelN.java:630)
...

Note that write() timeout unlike read() timeout is not controlled by setSoTimeout() but chosen by TCP stack based on calculated round-trip-time and number of retries.

  • HeartbeatSender thread is here only for completeness. It does not play any role in this race condition as it just tries to write to the socket and gets completely blocked because lock to output stream is currently held by application thread:
"pool-1-thread-2" prio=10 tid=0x9f74fb48 nid=0x525 waiting for monitor entry [0x9f3fe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:94)
    - waiting to lock <0xa5872a38> (a java.io.BufferedOutputStream)
    at java.io.DataOutputStream.writeByte(DataOutputStream.java:153)
    at com.rabbitmq.client.impl.Frame.writeTo(Frame.java:188)
    at com.rabbitmq.client.impl.SocketFrameHandler.writeFrame(SocketFrameHandler.java:144)
    - locked <0xa5872a50> (a java.io.DataOutputStream)
    at com.rabbitmq.client.impl.HeartbeatSender$HeartbeatRunnable.run(HeartbeatSender.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:701)
  • finally, there is connection main loop thread thread that is supposed to detect missing heartbeats from the remote side and kill the connection. Unfortunately, as it reads from the socket, it receives graceful connection closure notification from the remote SSL stack (when server closes the connection because of missing heartbeat) so it starts closing connection on the local side too. But closing it apparently involves notifying the remote side of the action. And this makes the main thread to attempt writing to the same socket. And, just to remind, all the appropriate locks are held by the application thread
"AMQP Connection 10.19.36.12:5671" prio=10 tid=0x9f70c578 nid=0x524 waiting on condition [0x9f57e000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0xa5884f88> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:838)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:871)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1201)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
    at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:771)
    at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:644)
    at sun.security.ssl.SSLSocketImpl.sendAlert(SSLSocketImpl.java:1819)
    at sun.security.ssl.SSLSocketImpl.warning(SSLSocketImpl.java:1668)
    at sun.security.ssl.SSLSocketImpl.closeInternal(SSLSocketImpl.java:1470)
    - locked <0xa5870948> (a sun.security.ssl.SSLSocketImpl)
    at sun.security.ssl.SSLSocketImpl.recvAlert(SSLSocketImpl.java:1763)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1024)
    - locked <0xa5870948> (a sun.security.ssl.SSLSocketImpl)
    - locked <0xa58709d8> (a java.lang.Object)
    at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:840)
    at sun.security.ssl.AppInputStream.read(AppInputStream.java:94)
    - locked <0xa587d330> (a sun.security.ssl.AppInputStream)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
    - locked <0xa5885000> (a java.io.BufferedInputStream)
    at java.io.DataInputStream.readUnsignedByte(DataInputStream.java:288)
    at com.rabbitmq.client.impl.Frame.readFrom(Frame.java:94)
    at com.rabbitmq.client.impl.SocketFrameHandler.readFrame(SocketFrameHandler.java:138)
    - locked <0xa5881498> (a java.io.DataInputStream)
    at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:541)
    at java.lang.Thread.run(Thread.java:701)

15 minutes later application thread received a SocketTimeout exception from java.net.SocketOutputStream.socketWrite0() and everything finally aborts.

It does look as a bug in SSLSocket in the first place - the moment they start calling write() from read() method, they cannot guarantee that timeout configured on the socket will work. (Again, my assumption is that write timeout is not configurable). And this can affect connections of any type not just AMQP.
However I am not overly optimistic that issue like that could be easily fixed. After all, it was in the standard library from the day one.

So the question is more like - can something be done at rabbitmq client level instead?

PS: line numbers in stack traces are for JRE 1.6.0_27 (but from the looks of it, Java 8 library is not much different)

@michaelklishin
Copy link
Contributor

michaelklishin commented Jan 7, 2017 via email

@michaelklishin
Copy link
Contributor

michaelklishin commented Jan 7, 2017

I believe this is a duplicate of #194 and should not happen with NIO (#11) simply because Socket#write isn't used there. Go give 4.0.1 a try (NIO is opt-in). Closing for now.

@dimas
Copy link
Contributor Author

dimas commented Jan 7, 2017

Reproduce with both 3.3.5 and 3.6.5

I do not think it is the same as #194 but I agree it is definitely related. Both revolve around the fact that Socket#write blocks.
I have read about NIO support in 4.0 already but I doubt we can use it. Anyways, given you providing a way around the problem and I cannot propose anything better at the moment, I guess the issue can be kept closed.

Thanks

@michaelklishin
Copy link
Contributor

michaelklishin commented Jan 7, 2017 via email

@dimas
Copy link
Contributor Author

dimas commented Jan 7, 2017

In addition to OpenJDK we also use 3rd party JVM for some low-end devices. I doubt that JVM supports NIO but I am checking...

@michaelklishin
Copy link
Contributor

michaelklishin commented Jan 7, 2017 via email

@dimas
Copy link
Contributor Author

dimas commented Jan 7, 2017

I understand ) but that JVM is a stripped down version because it is designed for set-top boxes and the like. I doubt many of these devices require NIO so it was never their priority.

@dimas
Copy link
Contributor Author

dimas commented Jan 7, 2017

JFYI, the fix for #194 (v4.0.1 without NIO) does not fix my problem confirming it is a slightly different issue.
The NIO mode seem to work much better and instead of 15 minutes connection loss is detected in about 30 seconds as it should. However

  1. during these 30 seconds basicPublish was returning successfully making the application to believe connection is fully operational while it was not. But I guess this is by design
  2. more interestingly there were some assertion logged:
Exception in thread "rabbitmq-connection-shutdown-amqp://[email protected]:5671/" java.lang.AssertionError: BlockingCell can only be set once
	at com.rabbitmq.utility.BlockingCell.set(BlockingCell.java:141)
	at com.rabbitmq.client.impl.AMQConnection.doFinalShutdown(AMQConnection.java:681)
	at com.rabbitmq.client.impl.AMQConnection.handleIoError(AMQConnection.java:663)
	at com.rabbitmq.client.impl.nio.NioLoop$1.run(NioLoop.java:280)
	at java.lang.Thread.run(Thread.java:701)

Not sure if this is something expected, AssertionError never is...

@acogoluegnes
Copy link
Contributor

With NIO, outbound frames are queued in an in-memory queue and dequeued by the NIO thread, that's why basicPublish (and any asynchronous AMQP method) doesn't block, even if network connectivity has been lost. It can block if the queue if full, but this is another problem.

The issue looks like #194 nevertheless and the workaround needs some specific settings:

connectionFactory.setRequestedHeartbeat(5);
ExecutorService shutdownExecutor = Executors.newSingleThreadExecutor();
connectionFactory.setShutdownExecutor(shutdownExecutor);

Could you try those (with 4.0.1 and the default, blocking mode)? It's been successful for some users.

@dimas
Copy link
Contributor Author

dimas commented Jan 9, 2017

@acogoluegnes, nope, I am afraid I still had the same 15 minutes delay. I did not reduce heartbeat from 30 to 5 though but still I would expect it to fail much quicker.
I still believe it is a different problem here. Because sslSocket.read() may call sslSocket.write(), the whole read operation gets blocked for 15 minutes - it just does not obey soSocketTimeout anymore so I am not sure how you expect it to finish quicker.
Cheers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants