Skip to content

NIO main thread can terminate due to an unhandled javax.net.ssl.SSLException #237

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
dimas opened this issue Jan 16, 2017 · 26 comments
Closed
Assignees
Milestone

Comments

@dimas
Copy link
Contributor

dimas commented Jan 16, 2017

I described my problem here - #236
As per suggestion, I tried using v4.0.1 version of the library with NIO to work around the issue.
The good thing is that that version does indeed detect connection problem much faster that 15 minutes that it used to take for previous version. The bad thing is that NIO main thread catches unhandled exception and terminates rendering the whole application unusable.

SEVERE: Unhandled exception. Aborting.
Throwable occurred: java.lang.AssertionError: BlockingCell can only be set once
	at com.rabbitmq.utility.BlockingCell.set (BlockingCell.java:141)
	at com.rabbitmq.client.impl.AMQConnection.doFinalShutdown (AMQConnection.java:681)
	at com.rabbitmq.client.impl.AMQConnection.handleHeartbeatFailure (AMQConnection.java:654)
	at com.rabbitmq.client.impl.nio.NioLoop.run (NioLoop.java:78)
	at java.lang.Thread.run (Unknown Source, bco=11)

I cannot easily reproduce it as I do not know what that bloody piece of network equipment does but it feels like it just stops outgoing packets from he client machine at some point while still allowing incoming ones.

previous exceptions that lead to the one above

Jan 16, 2017 11:45:56 AM com.rabbitmq.client.impl.nio.NioLoop run
WARNING: Error during reading frames
Throwable occurred: javax.net.ssl.SSLException: closed in read
	at com.rabbitmq.client.impl.nio.SslEngineByteBufferInputStream.read (SslEngineByteBufferInputStream.java:89)
	at java.io.DataInputStream.readUnsignedByte (Unknown Source, bco=4)
	at com.rabbitmq.client.impl.Frame.readFrom (Frame.java:91)
	at com.rabbitmq.client.impl.nio.NioLoop.run (NioLoop.java:149)
	at java.lang.Thread.run (Unknown Source, bco=11)
Jan 16, 2017 11:45:56 AM com.rabbitmq.client.impl.ForgivingExceptionHandler log
SEVERE: An unexpected connection driver error occured
Throwable occurred: com.rabbitmq.client.MissedHeartbeatException: Heartbeat missing with heartbeat = 30 seconds
	at com.rabbitmq.client.impl.AMQConnection.handleHeartbeatFailure (AMQConnection.java:648)
	at com.rabbitmq.client.impl.nio.NioLoop.run (NioLoop.java:78)
	at java.lang.Thread.run (Unknown Source, bco=11)
Jan 16, 2017 11:45:56 AM com.rabbitmq.client.impl.ForgivingExceptionHandler log
SEVERE: An unexpected connection driver error occured
Throwable occurred: javax.net.ssl.SSLException: closed in read
	at com.rabbitmq.client.impl.nio.SslEngineByteBufferInputStream.read (SslEngineByteBufferInputStream.java:89)
	at java.io.DataInputStream.readUnsignedByte (Unknown Source, bco=4)
	at com.rabbitmq.client.impl.Frame.readFrom (Frame.java:91)
	at com.rabbitmq.client.impl.nio.NioLoop.run (NioLoop.java:149)
	at java.lang.Thread.run (Unknown Source, bco=11)
@michaelklishin
Copy link
Contributor

The best we can do about TLS exceptions is to start a connection shutdown.

@michaelklishin michaelklishin changed the title NIO main thread dies in 4.0.1 NIO main thread can terminate due to an unhandled javax.net.ssl.SSLException Jan 16, 2017
@michaelklishin
Copy link
Contributor

In fact, in this specific case there were missing heartbeats detected anyway, so it's even less clear what can be done. Not terminate is about as much as I can suggest.

@acogoluegnes
Copy link
Contributor

The catch clause handles only Exceptions, not Errors (which AssertionError is a subclass of). Catching Error would avoid crashing the whole loop.

@acogoluegnes
Copy link
Contributor

Maybe we'd better off catching only AssertionError (basically handling the case @dimas faces), as handling Errors in applications isn't recommended. I would also create another issue for 5.0, where BlockingCell would stop throwing AssertionError, as those are too dramatic in those cases (and then stop handling AssertionError in NioLoop.

@michaelklishin
Copy link
Contributor

@acogoluegnes that makes sense to me.

@acogoluegnes
Copy link
Contributor

@dimas Can you try a 4.0.2 snapshot before we release 4.0.2?

@dimas
Copy link
Contributor Author

dimas commented Jan 16, 2017

@acogoluegnes , nope. Got

SEVERE: Unhandled exception. Aborting.
Throwable occurred: java.lang.AssertionError: BlockingCell can only be set once
	at com.rabbitmq.utility.BlockingCell.set (BlockingCell.java:141)
	at com.rabbitmq.client.impl.AMQConnection.doFinalShutdown (AMQConnection.java:681)
	at com.rabbitmq.client.impl.AMQConnection.handleIoError (AMQConnection.java:663)
	at com.rabbitmq.client.impl.nio.NioLoop$1.run (NioLoop.java:283)
	at java.lang.Thread.run (Unknown Source, bco=11)

this time

@dimas
Copy link
Contributor Author

dimas commented Jan 16, 2017

Btw, I know very little what your BlockingCell is but from the looks of it, it should be IllegalStateException or smt like that not an Error.
And regardless of that - have you considered changing nio loop to catch Throwables ? Yeah, I know about all the guidelines and stuff but this loop is a critical piece and unless you expect application authors to write their own watchdogs on the top of AQMP client code, it must be super-reliable and almost non-destructible...

@acogoluegnes
Copy link
Contributor

@dimas Does it also stop the NIO thread? It shouldn't I guess.

@michaelklishin
Copy link
Contributor

michaelklishin commented Jan 16, 2017

If a network connection goes down in a way that takes a while to detect (or degrates dramatically) then continuations will fail eventually. The question is whether new connections are functional after that.

@dimas
Copy link
Contributor Author

dimas commented Jan 16, 2017

Why shouldn't it? You see, the stack trace is slightly different. I bet that "Unhandled exception. Aborting." is logged by Thread code as this exception leaked "run" method of NioLoop.

Regarding my comment above, I see you cannot easily handle Throwable anyway as there is no try/catch for each iteration, just a one big try/catch around the loop to log the exception.... I do not know what are the best practices in this area but now I feel much less safe with the rabbitmq client as I can see how easy to break it. Not sure if at any point you may call user callbacks from IO loop but if you can - God knows what can be thrown from these.

@dimas
Copy link
Contributor Author

dimas commented Jan 16, 2017

@michaelklishin , no, nothing works. The only NIO loop thread you have is aborted

@michaelklishin
Copy link
Contributor

@dimas well, since you have a way to reproduce and we don't then please do find out what wasn't caught. I'm not sold on the idea of catching every (or nearly every) possible exception, this kind of overly defensive programming works great until it doesn't. @acogoluegnes thoughts?

@acogoluegnes
Copy link
Contributor

@dimas Yes, BlockingCell is a bit of a drama queen, I filled in #239 for this, but it'll ship in 5.0.0 because it's an API change.

0e60f7b handled the AssertionError. We can also handle it in the failure you just found. This one is asynchronous, so it shouldn't make the whole NIO loop crash too.

Regarding your remark about user callbacks, they're called asynchronously, in other threads.

@acogoluegnes
Copy link
Contributor

@michaelklishin I'm not sold on catching every single Error, there may be errors that the client wouldn't be able to recover from, letting the JVM process getting crazy. It may be better to crash in those cases.

acogoluegnes added a commit that referenced this issue Jan 16, 2017
@michaelklishin
Copy link
Contributor

@acogoluegnes right, so we should identify a reasonable subset. @dimas can you help us identify what other exceptions (or call sites) aren't covered right now? Your testing environment seems to be a good way of breaking things in less-than-obvious ways :)

@acogoluegnes
Copy link
Contributor

@dimas I pushed another snapshot, where the BlockingCell error is caught, can you give a try?

@dimas
Copy link
Contributor Author

dimas commented Jan 16, 2017

Finally checked the code :) Of course you are right and the last exception cannot abort the NIO loop, I was just misreading the stack trace. Not sure why it caused such a devastating result then. I bet we install some custom handler for all uncaught exceptions somewhere that terminates the whole app. Maybe it was the root cause of everything. Will check.

Anyways, 4.0.2-20170116.163201-13 snapshot seems to do the trick! Although I had to disable automatic recovery. We do not use automatic recovery anyway so I would have to do that step in any case. With auto recovery it killed itself with OOM. Can look into that separately if you need me to.

@michaelklishin
Copy link
Contributor

@dimas a heap dump from when enabling automatic recovery causes runaway heap growth would be interesting. What's the heap size limit you use for your JVMs?

@dimas
Copy link
Contributor Author

dimas commented Jan 16, 2017

It is a very memory constrained environment. I believe we limit memory to 9 megs or something like that. But this is not a standard JVM so you cannot directly compare to OpenJDK for example.
I managed to reproduce OOM couple of times after initial report but do not have thread dump yet because as I mentioned above, we install unhandled exception handler that terminates the app - it was actually added to deal with these OOM exceptions in the first place because when it is thrown in a thread, most libraries do not survive after that point so we better restart the whole process.
With that knowledge, are you still interested for me to proceed in order to reproduce it?

Also, when do you expect to release 4.0.2 ?

Thanks

@michaelklishin
Copy link
Contributor

michaelklishin commented Jan 16, 2017 via email

@dimas
Copy link
Contributor Author

dimas commented Jan 16, 2017

Just browsing through the code it does not look like the client code depends on the fact that BlockingCell throws AssertionError much. You do not have catches for it except the ones you added as part of this issue.
So changing BlockingCell to throw IllegalStateException here

throw new AssertionError("BlockingCell can only be set once");
and IllegalArgumentException here
throw new AssertionError("Timeout cannot be less than zero");
should have no negative effects on the client at all. Both are uncheched exceptions so method signature does not change.
Also, it does not seem you document any of your APIs that they may throw AssertionError so it is probably not part of their contracts?
If you worrying about some external code depending on the current contract of BlockingCell itself, this probably can be solved by using anonymous subclass in the client code where you currently use BlockingCell. The subclass would just converting these exceptions into RuntimeExceptions like

  new BlockingCell<Object>() {

    @Override
    public synchronized Object get(long timeout) throws InterruptedException, TimeoutException {
        try {
            return super.get(timeout);
        } catch (AssertionError e) {
            throw new IllegalArgumentException(e.getMessage());
        }
    }

    @Override
    public synchronized void set(Object newValue) {
        try {
            super.set(newValue);
        } catch (AssertionError e) {
            throw new IllegalStateException(e.getMessage());
        }
    }
  }

Or maybe the subclass can be not anonymous but a normal one if you need to use it in multiple places.
Later, when in 5.0 you create a new BlockingCell, you can just remove that subclassing.

Although simplest is probably to copy&paste BlockingCell into a new class, fix exceptions there and use the new class it in the client while deprecating the old one.

That would allow you removing catches for AssertionError you added today and make thing the whole thing a bit more reliable...

Just thinking.

@dimas
Copy link
Contributor Author

dimas commented Jan 16, 2017

Also you have couple other places where you throw AssertionErrors, probably makes sense to fix them to RuntimeExceptions too unless it is the part of the contract:

@acogoluegnes
Copy link
Contributor

@dimas Thanks for all those suggestions, they definitely make sense, and we'll include them in #239. We're trying to comply to semantic versioning, so we cannot change thrown exceptions, even if they're not part of the method signature (someone could catch those exceptions and this would break their code). This is how it goes with legacy code :-)
The 4.0.2 workarounds should do the job and we'll get rid of them in 5.0.0, when we're done with AssertionErrors.
We'll ship 4.0.2 this week.

@michaelklishin
Copy link
Contributor

@acogoluegnes we can ship a milestone today ;)

@dimas
Copy link
Contributor Author

dimas commented Jan 17, 2017

Yes, 4.0.2 should be enough to get me going. Although I will be very much looking forward for 5.0 where you remove all AssertionError of course as any of them can still kill the app.

Thanks very much guys for really quick turnaround!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants