-
Notifications
You must be signed in to change notification settings - Fork 585
NIO main thread can terminate due to an unhandled javax.net.ssl.SSLException #237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
The best we can do about TLS exceptions is to start a connection shutdown. |
In fact, in this specific case there were missing heartbeats detected anyway, so it's even less clear what can be done. Not terminate is about as much as I can suggest. |
The |
Maybe we'd better off catching only |
@acogoluegnes that makes sense to me. |
@dimas Can you try a 4.0.2 snapshot before we release 4.0.2? |
@acogoluegnes , nope. Got
this time |
Btw, I know very little what your BlockingCell is but from the looks of it, it should be IllegalStateException or smt like that not an Error. |
@dimas Does it also stop the NIO thread? It shouldn't I guess. |
If a network connection goes down in a way that takes a while to detect (or degrates dramatically) then continuations will fail eventually. The question is whether new connections are functional after that. |
Why shouldn't it? You see, the stack trace is slightly different. I bet that "Unhandled exception. Aborting." is logged by Thread code as this exception leaked "run" method of NioLoop. Regarding my comment above, I see you cannot easily handle Throwable anyway as there is no try/catch for each iteration, just a one big try/catch around the loop to log the exception.... I do not know what are the best practices in this area but now I feel much less safe with the rabbitmq client as I can see how easy to break it. Not sure if at any point you may call user callbacks from IO loop but if you can - God knows what can be thrown from these. |
@michaelklishin , no, nothing works. The only NIO loop thread you have is aborted |
@dimas well, since you have a way to reproduce and we don't then please do find out what wasn't caught. I'm not sold on the idea of catching every (or nearly every) possible exception, this kind of overly defensive programming works great until it doesn't. @acogoluegnes thoughts? |
@dimas Yes, 0e60f7b handled the Regarding your remark about user callbacks, they're called asynchronously, in other threads. |
@michaelklishin I'm not sold on catching every single |
@acogoluegnes right, so we should identify a reasonable subset. @dimas can you help us identify what other exceptions (or call sites) aren't covered right now? Your testing environment seems to be a good way of breaking things in less-than-obvious ways :) |
@dimas I pushed another snapshot, where the |
Finally checked the code :) Of course you are right and the last exception cannot abort the NIO loop, I was just misreading the stack trace. Not sure why it caused such a devastating result then. I bet we install some custom handler for all uncaught exceptions somewhere that terminates the whole app. Maybe it was the root cause of everything. Will check. Anyways, 4.0.2-20170116.163201-13 snapshot seems to do the trick! Although I had to disable automatic recovery. We do not use automatic recovery anyway so I would have to do that step in any case. With auto recovery it killed itself with OOM. Can look into that separately if you need me to. |
@dimas a heap dump from when enabling automatic recovery causes runaway heap growth would be interesting. What's the heap size limit you use for your JVMs? |
It is a very memory constrained environment. I believe we limit memory to 9 megs or something like that. But this is not a standard JVM so you cannot directly compare to OpenJDK for example. Also, when do you expect to release 4.0.2 ? Thanks |
Yeah, overrunning a 9 MB heap is not something I'd consider a big deal for
this library.
4.0.2 can be released in a day or two.
…On Tue, Jan 17, 2017 at 2:22 AM, Dmitry Andrianov ***@***.***> wrote:
It is a very memory constrained environment. I believe we limit memory to
9 megs or something like that. But this is not a standard JVM so you cannot
directly compare to OpenJDK for example.
I managed to reproduce OOM couple of times after initial report but do not
have thread dump yet because as I mentioned above, we install unhandled
exception handler that terminates the app - it was actually added to deal
with these OOM exceptions in the first place because when it is thrown in a
thread, most libraries do not survive after that point so we better restart
the whole process.
With that knowledge, are you still interested for me to proceed in order
to reproduce it?
Also, when do you expect to release 4.0.2 ?
Thanks
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#237 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAAEQrYRlIOYRsJYCS70uruNANoMjFHpks5rS_uggaJpZM4LkecS>
.
--
MK
Staff Software Engineer, Pivotal/RabbitMQ
|
Just browsing through the code it does not look like the client code depends on the fact that BlockingCell throws AssertionError much. You do not have catches for it except the ones you added as part of this issue.
Also, it does not seem you document any of your APIs that they may throw AssertionError so it is probably not part of their contracts? If you worrying about some external code depending on the current contract of BlockingCell itself, this probably can be solved by using anonymous subclass in the client code where you currently use BlockingCell. The subclass would just converting these exceptions into RuntimeExceptions like new BlockingCell<Object>() {
@Override
public synchronized Object get(long timeout) throws InterruptedException, TimeoutException {
try {
return super.get(timeout);
} catch (AssertionError e) {
throw new IllegalArgumentException(e.getMessage());
}
}
@Override
public synchronized void set(Object newValue) {
try {
super.set(newValue);
} catch (AssertionError e) {
throw new IllegalStateException(e.getMessage());
}
}
} Or maybe the subclass can be not anonymous but a normal one if you need to use it in multiple places. Although simplest is probably to copy&paste BlockingCell into a new class, fix exceptions there and use the new class it in the client while deprecating the old one. That would allow you removing catches for AssertionError you added today and make thing the whole thing a bit more reliable... Just thinking. |
Also you have couple other places where you throw AssertionErrors, probably makes sense to fix them to RuntimeExceptions too unless it is the part of the contract:
|
@dimas Thanks for all those suggestions, they definitely make sense, and we'll include them in #239. We're trying to comply to semantic versioning, so we cannot change thrown exceptions, even if they're not part of the method signature (someone could catch those exceptions and this would break their code). This is how it goes with legacy code :-) |
@acogoluegnes we can ship a milestone today ;) |
Yes, 4.0.2 should be enough to get me going. Although I will be very much looking forward for 5.0 where you remove all AssertionError of course as any of them can still kill the app. Thanks very much guys for really quick turnaround! |
I described my problem here - #236
As per suggestion, I tried using v4.0.1 version of the library with NIO to work around the issue.
The good thing is that that version does indeed detect connection problem much faster that 15 minutes that it used to take for previous version. The bad thing is that NIO main thread catches unhandled exception and terminates rendering the whole application unusable.
I cannot easily reproduce it as I do not know what that bloody piece of network equipment does but it feels like it just stops outgoing packets from he client machine at some point while still allowing incoming ones.
previous exceptions that lead to the one above
The text was updated successfully, but these errors were encountered: