Intermittent SSL socket connection failure #213

edstover · 2016-08-10T16:43:44Z

Neo4j Java driver version: 1.0.4
Neo4j server version: 3.0.3

We are using a 3-node HA cluster setup in AWS with 2 ELBs; one for read operations pointing to the slave nodes, and one for write operations pointing to the master node. The ELBs are configured to use the Neo4j management end-points for health checks and to fail over when one of the nodes goes down and the 'master' moves. The ELBs are also configured to pass SSL traffic to the back-end servers, so SSL termination is done on the Neo4j server instances. Our application code has Neo4j Driver object instances for read and write operations that connect to the corresponding ELB instance using the BOLT protocol and requiring encryption.

The problem we are having is periodic failure by the Neo4j Driver to establish an SSL connection. It seems that after some period of inactivity, a request to read something from the graph results in a failure to establish an SSL connection. Issuing the same request again succeeds.

Here is the relevant stack trace:

org.neo4j.driver.v1.exceptions.ClientException: Failed to establish SSL socket connection.
        at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.unwrap(TLSSocketChannel.java:179)
        at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.read(TLSSocketChannel.java:374)
        at org.neo4j.driver.internal.connector.socket.BufferingChunkedInput.readNextPacket(BufferingChunkedInput.java:408)
        at org.neo4j.driver.internal.connector.socket.BufferingChunkedInput.readChunkSize(BufferingChunkedInput.java:344)
        at org.neo4j.driver.internal.connector.socket.BufferingChunkedInput.read(BufferingChunkedInput.java:246)
        at org.neo4j.driver.internal.connector.socket.BufferingChunkedInput.fillScratchBuffer(BufferingChunkedInput.java:215)
        at org.neo4j.driver.internal.connector.socket.BufferingChunkedInput.readByte(BufferingChunkedInput.java:109)
        at org.neo4j.driver.internal.packstream.PackStream$Unpacker.unpackStructHeader(PackStream.java:441)
        at org.neo4j.driver.internal.messaging.PackStreamMessageFormatV1$Reader.read(PackStreamMessageFormatV1.java:397)
        at org.neo4j.driver.internal.connector.socket.SocketClient.receiveOne(SocketClient.java:130)
        at org.neo4j.driver.internal.connector.socket.SocketClient.receiveAll(SocketClient.java:124)
        at org.neo4j.driver.internal.connector.socket.SocketConnection.receiveAll(SocketConnection.java:121)
        at org.neo4j.driver.internal.connector.socket.SocketConnection.sync(SocketConnection.java:100)
        at org.neo4j.driver.internal.connector.ConcurrencyGuardingConnection.sync(ConcurrencyGuardingConnection.java:122)
        at org.neo4j.driver.internal.pool.PooledConnection.sync(PooledConnection.java:144)
        at org.neo4j.driver.internal.InternalSession.close(InternalSession.java:130)

Here are relevant code snippets:

Driver neo4jReadDriver = GraphDatabase.driver(serverURI,
            AuthTokens.basic(username, password),
            Config.build()
                    .withEncryptionLevel(Config.EncryptionLevel.REQUIRED)
                    .toConfig());

private StatementResult run(Driver neo4jDriver, String statementTemplate, Map<String, Object> statementParameters) {
    try (Session neo4jSession = neo4jDriver.session()) {
        return neo4jSession.run(statementTemplate, statementParameters);
    }
}

String cypherStatement = "<cypher>";
HashMap<String, Object> params = new HashMap<>();
StatementResult result = run(neo4jReadDriver, cypherStatement, params);

The SSL connection failure happens at the end of the 'try' block when the session is closed. An immediate re-try of the same call succeeds.

Have the Neo4j drivers been tested in HA configurations using AWS and ELBs? Are there any recommended configuration settings when deploying into AWS and using ELBs?


didvae · 2016-09-02T10:16:19Z

Hi, any update on this?
We are having the same problem. At the end of the exception we can read:

Caused by: org.neo4j.driver.internal.packstream.PackStream$Unexpected: Expected a struct, but got: 71
        at org.neo4j.driver.internal.packstream.PackStream$Unpacker.unpackStructHeader(PackStream.java:450)
        at org.neo4j.driver.internal.messaging.PackStreamMessageFormatV1$Reader.read(PackStreamMessageFormatV1.java:397)
        at org.neo4j.driver.internal.connector.socket.SocketClient.receiveOne(SocketClient.java:130)
        at org.neo4j.driver.internal.connector.socket.SocketClient.receiveAll(SocketClient.java:124)
        at org.neo4j.driver.internal.connector.socket.SocketConnection.receiveAll(SocketConnection.java:121)

ivinskyi · 2016-09-09T16:00:25Z

Same here. Fails randomly once in 20-30 queries, stacktrace:

Caused by: org.neo4j.driver.v1.exceptions.ClientException: Failed to establish SSL socket connection.
        at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.unwrap(TLSSocketChannel.java:179)
        at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.read(TLSSocketChannel.java:374)
        at org.neo4j.driver.internal.connector.socket.BufferingChunkedInput.readNextPacket(BufferingChunkedInput.java:408)
        at org.neo4j.driver.internal.connector.socket.BufferingChunkedInput.readChunkSize(BufferingChunkedInput.java:344)
        at org.neo4j.driver.internal.connector.socket.BufferingChunkedInput.read(BufferingChunkedInput.java:246)
        at org.neo4j.driver.internal.connector.socket.BufferingChunkedInput.fillScratchBuffer(BufferingChunkedInput.java:215)
        at org.neo4j.driver.internal.connector.socket.BufferingChunkedInput.readByte(BufferingChunkedInput.java:109)
        at org.neo4j.driver.internal.packstream.PackStream$Unpacker.unpackStructHeader(PackStream.java:441)
        at org.neo4j.driver.internal.messaging.PackStreamMessageFormatV1$Reader.read(PackStreamMessageFormatV1.java:397)
        at org.neo4j.driver.internal.connector.socket.SocketClient.receiveOne(SocketClient.java:130)
        at org.neo4j.driver.internal.connector.socket.SocketConnection.receiveOne(SocketConnection.java:135)
        at org.neo4j.driver.internal.connector.ConcurrencyGuardingConnection.receiveOne(ConcurrencyGuardingConnection.java:150)
        at org.neo4j.driver.internal.pool.PooledConnection.receiveOne(PooledConnection.java:170)
        at org.neo4j.driver.internal.InternalStatementResult.tryFetchNext(InternalStatementResult.java:303)
        at org.neo4j.driver.internal.InternalStatementResult.hasNext(InternalStatementResult.java:181)
        at org.neo4j.driver.internal.InternalStatementResult.list(InternalStatementResult.java:251)
        at org.neo4j.driver.internal.InternalStatementResult.list(InternalStatementResult.java:245)
      ...
        ... 7 more
        Suppressed: org.neo4j.driver.v1.exceptions.ClientException: Unable to read response from server: Expected a struct, but got: 6c
            at org.neo4j.driver.internal.connector.socket.SocketConnection.mapRecieveError(SocketConnection.java:168)
            at org.neo4j.driver.internal.connector.socket.SocketConnection.receiveAll(SocketConnection.java:126)
            at org.neo4j.driver.internal.connector.socket.SocketConnection.sync(SocketConnection.java:100)
            at org.neo4j.driver.internal.connector.ConcurrencyGuardingConnection.sync(ConcurrencyGuardingConnection.java:122)
            at org.neo4j.driver.internal.pool.PooledConnection.sync(PooledConnection.java:144)
            at org.neo4j.driver.internal.InternalSession.close(InternalSession.java:130)
            ... 10 more
        Caused by: org.neo4j.driver.internal.packstream.PackStream$Unexpected: Expected a struct, but got: 6c
            at org.neo4j.driver.internal.packstream.PackStream$Unpacker.unpackStructHeader(PackStream.java:450)
            at org.neo4j.driver.internal.messaging.PackStreamMessageFormatV1$Reader.read(PackStreamMessageFormatV1.java:397)
            at org.neo4j.driver.internal.connector.socket.SocketClient.receiveOne(SocketClient.java:130)
            at org.neo4j.driver.internal.connector.socket.SocketClient.receiveAll(SocketClient.java:124)
            at org.neo4j.driver.internal.connector.socket.SocketConnection.receiveAll(SocketConnection.java:121)
            ... 15 more

pontusmelke · 2016-09-09T16:23:26Z

Hi, sorry for not getting back to you earlier. We are looking into the issue.

Regards,
Pontus

zone-tech · 2016-09-13T07:18:51Z

@pontusmelke , hi, same here, any workaround?

kevanghobadi · 2016-09-19T19:35:33Z

Hello, we are also hitting this issue, and a sleep before we create our session seems to reduce the frequency. Any progress? Thanks!

crichey · 2016-09-19T21:49:31Z

I believe this is a bug that is being addressed in the next patch release.


edstover · 2016-09-20T16:34:01Z

Here is some additional information regarding this issue:

Set up and configured an HAProxy load balancer, thinking the issue was centered on the AWS ELBs, but found that the same problem occurs when the back-end connection between the load balancer and the Neo4j servers times out. HAProxy has a default timeout of 2 hours, so it takes longer to manifest, but it still happens. For testing purposes, reduced the HAProxy timeout to 60 seconds (same as the AWS ELB default) and observed the same connection failure that we saw with AWS ELBs.
Suspecting an issue with SSL pass-through at the load balancers, changed the Neo4j driver to use Config.EncryptionLevel.NONE. The connection failures persisted, but the exception thrown by the driver was different: Broken Pipe.
To deal with the continuing connection issues when using load balancers, implemented a re-try loop around the calls that execute Cypher statements, with a maximum of 10 iterations and a 1 second wait between attempts, as follows:

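import java.util.Map;

import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.Session;
import org.neo4j.driver.v1.StatementResult;
import org.neo4j.driver.v1.exceptions.ClientException;

static final int maxRetries = 10;
static final long retryWaitIntervalMS = 1000L;

// Runs a single Cypher statement, re-trying when the driver throws a ClientException.
private StatementResult run(Driver neo4jDriver, String statementTemplate, Map<String, Object> statementParameters) {

    int tries = 0;
    ClientException lastFailure = null;

    while (tries < maxRetries) {
        tries += 1;

        if (tries > 1) {
            try {
                // Wait before the next attempt.
                Thread.sleep(retryWaitIntervalMS);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop re-trying.
                Thread.currentThread().interrupt();
                throw new RuntimeException("Interrupted while waiting to retry.", e);
            }
        }

        try (Session neo4jSession = neo4jDriver.session()) {
            StatementResult result = neo4jSession.run(statementTemplate, statementParameters);

            // hasNext() reads from the server, so a broken connection can surface here.
            if ((result != null) && result.hasNext()) {
                return result;
            }

            // No records: callers must be prepared for a null result.
            return null;
        } catch (ClientException cex) {
            // Log the exception here, then fall through to the next attempt.
            lastFailure = cex;
        } catch (Exception ex) {
            // Log the exception here; anything else is not re-tried.
            throw ex;
        }
    }

    throw new RuntimeException(String.format("Neo4j statement execution failed.  Tried %d times.", tries), lastFailure);
}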

Encountered the same connection failures when executing Cypher statements inside a transaction. Also implemented a re-try loop around transactions, but that could not be encapsulated in a low-level private method like the single Cypher statement execution illustrated above. When a connection failure occurs inside a transaction, the transaction fails, so a re-try requires the transaction to be re-executed from the beginning. This is ugly to implement because it elevates the re-try into the business logic of the application.
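
One way to keep the re-try out of the business logic is to hand the whole transaction body to a small helper as a function, so the helper can run it again from the beginning after a failure. What follows is a minimal sketch under assumptions, not driver API: `UnitOfWorkRetry` and `runWithRetry` are hypothetical names, and treating every `RuntimeException` as retryable is a simplification. In real code the `Supplier` would open a session, begin the transaction, run the statements, and commit.

```java
import java.util.function.Supplier;

/** Hypothetical helper (not part of the Neo4j driver): re-runs a whole unit of work. */
final class UnitOfWorkRetry {

    private UnitOfWorkRetry() {
    }

    /**
     * Runs {@code work} up to {@code maxTries} times, sleeping {@code waitMillis}
     * between attempts. The work must be safe to execute again from the start.
     */
    static <T> T runWithRetry(Supplier<T> work, int maxTries, long waitMillis) {
        if (maxTries < 1) {
            throw new IllegalArgumentException("maxTries must be at least 1");
        }
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxTries; attempt++) {
            try {
                return work.get();
            } catch (RuntimeException failure) {
                // Assumption: every RuntimeException is treated as retryable here.
                lastFailure = failure;
            }
            if (attempt < maxTries) {
                try {
                    Thread.sleep(waitMillis);
                } catch (InterruptedException interrupted) {
                    // Restore the interrupt flag and give up.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", interrupted);
                }
            }
        }
        throw new IllegalStateException("Unit of work failed after " + maxTries + " tries", lastFailure);
    }
}
```

A caller would then write something like `UnitOfWorkRetry.runWithRetry(() -> transferFunds(driver), 10, 1000L)`, where `transferFunds` is a hypothetical method containing the complete transaction. The re-try policy stays in one place, although the caller still has to make sure the unit of work is safe to repeat.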

I hope you find this information helpful.

zhenlineo · 2016-10-13T08:16:13Z

This fix might help in figuring out the real cause: #249

zhenlineo · 2016-10-21T08:45:26Z

Hi, after investigating this problem, it has become clear that the timeout on the load balancer kills the connections between the client and the server, which results in this error.

The driver always assumes that the connection between the server and the client is persistent and can be reused, so it is not designed to handle such timeouts in the network. The problem is caused by the fact that the driver pools connections and reuses them all the time, so if a pooled connection breaks, this exception is thrown.

The solution to this problem is probably to increase or turn off the timeout on the load balancer, or to add retries in your code.
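
The failure mode described above, and one way a pool could guard against it, can be modelled without the driver. The sketch below is a toy model and not the driver's implementation: `IdleCheckingPool`, `FakeConnection`, and the injected clock are all hypothetical. Instead of handing out a pooled connection that has been idle longer than a threshold (which would be chosen to be shorter than the load balancer's idle timeout), it discards that connection and creates a new one.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for a network connection. */
final class FakeConnection {
    final int id;
    long lastUsedMillis;

    FakeConnection(int id, long nowMillis) {
        this.id = id;
        this.lastUsedMillis = nowMillis;
    }
}

/** Toy pool that refuses to reuse connections idle longer than maxIdleMillis. */
final class IdleCheckingPool {
    private final Deque<FakeConnection> idle = new ArrayDeque<>();
    private final LongSupplier clock;
    private final long maxIdleMillis;
    private int nextId = 1;

    IdleCheckingPool(LongSupplier clock, long maxIdleMillis) {
        this.clock = clock;
        this.maxIdleMillis = maxIdleMillis;
    }

    /** Returns a recently used pooled connection, or a new one if none qualifies. */
    FakeConnection acquire() {
        long now = clock.getAsLong();
        FakeConnection candidate;
        while ((candidate = idle.poll()) != null) {
            if (now - candidate.lastUsedMillis <= maxIdleMillis) {
                return candidate;
            }
            // Idle for too long: drop it (a real pool would also close the socket).
        }
        return new FakeConnection(nextId++, now);
    }

    /** Puts the connection back into the pool and records when it was last used. */
    void release(FakeConnection connection) {
        connection.lastUsedMillis = clock.getAsLong();
        idle.push(connection);
    }
}
```

With a 60 second idle timeout on the load balancer, a threshold of 30 seconds in this model means a connection released more than 30 seconds ago is never reused, so the pool does not return one that the load balancer may already have closed.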

Regards,
Zhen

zhenlineo · 2016-10-21T08:48:25Z

Please update to 1.0.6 and you will get a better error message for this "Failed to establish SSL socket connection" error.

tavolate · 2016-11-04T18:46:49Z

We have the same problem.

tavolate · 2016-11-07T08:11:26Z

Can you post your workaround? Where did you insert the retry logic?

edstover · 2016-11-07T18:27:11Z

@tavolate I posted my retry logic above. In my case, I encapsulated this code in a base class used by all of my Neo4j data access classes so that all Cypher queries are executed with retry capability.

tavolate · 2016-11-07T21:01:26Z

Thank you, @edstover, but we are using Spring with Neo4j and I'm looking for the best place to encapsulate the retry logic. Suggestions?

lutovich · 2017-05-02T13:54:37Z

Hello,

I just want to update this old ticket with references to a couple of new APIs that might be helpful here:

Connection liveness check timeout configuration setting: https://github.com/neo4j/neo4j-java-driver/blob/1.2.0/driver/src/main/java/org/neo4j/driver/v1/Config.java#L291-L317. This should force the driver to re-acquire a connection when it is set to a value lower than the load balancer's idle connection timeout.
Transaction function APIs that allow retries with exponential backoff. More info can be found in the docs: https://neo4j.com/docs/developer-manual/current/drivers/sessions-transactions/#driver-transactions-transaction-functions.
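
The exponential backoff idea mentioned above can be illustrated independently of the driver. This is a sketch under assumptions, not the driver's retry logic: the class name `Backoff`, the doubling factor, and the cap are hypothetical choices, and the driver's actual policy may differ (for example by adding random jitter).

```java
/** Hypothetical helper (not part of the Neo4j driver): capped exponential backoff delays. */
final class Backoff {

    private Backoff() {
    }

    /**
     * Returns the wait before retry number {@code attempt} (0-based): the initial
     * delay doubled {@code attempt} times, never exceeding {@code maxMillis}.
     */
    static long delayMillis(int attempt, long initialMillis, long maxMillis) {
        if (attempt < 0 || initialMillis <= 0 || maxMillis < initialMillis) {
            throw new IllegalArgumentException("invalid backoff parameters");
        }
        long delay = initialMillis;
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            // Doubling would exceed the cap (or overflow), so clamp instead.
            delay = (delay > maxMillis / 2) ? maxMillis : delay * 2;
        }
        return delay;
    }
}
```

With an initial delay of 1000 ms and a cap of 8000 ms, attempts 0 through 4 wait 1000, 2000, 4000, 8000, and 8000 ms. Compared with a fixed 1 second wait, this backs off more gently on a server or load balancer that is slow to recover.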

Hope this helps.

technige self-assigned this Aug 11, 2016

zhenlineo closed this as completed Oct 21, 2016

zhenlineo mentioned this issue Nov 24, 2016

Intermittent error "This socket has been ended by the other party" neo4j/neo4j-javascript-driver#144

Closed
