
Intermittent SSL socket connection failure #213

Closed
edstover opened this issue Aug 10, 2016 · 15 comments

@edstover

Neo4j Java driver version: 1.0.4
Neo4j server version: 3.0.3

We are using a 3-node HA cluster setup in AWS with 2 ELBs; one for read operations pointing to the slave nodes, and one for write operations pointing to the master node. The ELBs are configured to use the Neo4j management end-points for health checks and to fail over when one of the nodes goes down and the 'master' moves. The ELBs are also configured to pass SSL traffic to the back-end servers, so SSL termination is done on the Neo4j server instances.  Our application code has Neo4j Driver object instances for read and write operations that connect to the corresponding ELB instance using the BOLT protocol and requiring encryption.  

The problem we are having is periodic failure by the Neo4j Driver to establish an SSL connection.  It seems that after some period of inactivity, a request to read something from the graph results in a failure to establish an SSL connection.  Issuing the same request again succeeds.

Here is the relevant stack trace:

org.neo4j.driver.v1.exceptions.ClientException: Failed to establish SSL socket connection.
        at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.unwrap(TLSSocketChannel.java:179)
        at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.read(TLSSocketChannel.java:374)
        at org.neo4j.driver.internal.connector.socket.BufferingChunkedInput.readNextPacket(BufferingChunkedInput.java:408)
        at org.neo4j.driver.internal.connector.socket.BufferingChunkedInput.readChunkSize(BufferingChunkedInput.java:344)
        at org.neo4j.driver.internal.connector.socket.BufferingChunkedInput.read(BufferingChunkedInput.java:246)
        at org.neo4j.driver.internal.connector.socket.BufferingChunkedInput.fillScratchBuffer(BufferingChunkedInput.java:215)
        at org.neo4j.driver.internal.connector.socket.BufferingChunkedInput.readByte(BufferingChunkedInput.java:109)
        at org.neo4j.driver.internal.packstream.PackStream$Unpacker.unpackStructHeader(PackStream.java:441)
        at org.neo4j.driver.internal.messaging.PackStreamMessageFormatV1$Reader.read(PackStreamMessageFormatV1.java:397)
        at org.neo4j.driver.internal.connector.socket.SocketClient.receiveOne(SocketClient.java:130)
        at org.neo4j.driver.internal.connector.socket.SocketClient.receiveAll(SocketClient.java:124)
        at org.neo4j.driver.internal.connector.socket.SocketConnection.receiveAll(SocketConnection.java:121)
        at org.neo4j.driver.internal.connector.socket.SocketConnection.sync(SocketConnection.java:100)
        at org.neo4j.driver.internal.connector.ConcurrencyGuardingConnection.sync(ConcurrencyGuardingConnection.java:122)
        at org.neo4j.driver.internal.pool.PooledConnection.sync(PooledConnection.java:144)
        at org.neo4j.driver.internal.InternalSession.close(InternalSession.java:130)

Here are relevant code snippets:

Driver neo4jReadDriver = GraphDatabase.driver(serverURI,
            AuthTokens.basic(username, password),
            Config.build()
                    .withEncryptionLevel(Config.EncryptionLevel.REQUIRED)
                    .toConfig());

private StatementResult run(Driver neo4jDriver, String statementTemplate, Map<String, Object> statementParameters) {
    try (Session neo4jSession = neo4jDriver.session()) {
        return neo4jSession.run(statementTemplate, statementParameters);
    }
}

String cypherStatement = "<cypher>";
HashMap<String, Object> params = new HashMap<>();
StatementResult result = run(neo4jReadDriver, cypherStatement, params);

The SSL connection failure happens at the end of the 'try' block when the session is closed. An immediate re-try of the same call succeeds.

Have the Neo4j drivers been tested in HA configurations using AWS and ELBs? Are there any recommended configuration settings for the driver when deploying into AWS and using ELBs?

@technige technige self-assigned this Aug 11, 2016
@didvae

didvae commented Sep 2, 2016

Hi, any update on this?
We are having the same problem. At the end of the exception we can read:

Caused by: org.neo4j.driver.internal.packstream.PackStream$Unexpected: Expected a struct, but got: 71
        at org.neo4j.driver.internal.packstream.PackStream$Unpacker.unpackStructHeader(PackStream.java:450)
        at org.neo4j.driver.internal.messaging.PackStreamMessageFormatV1$Reader.read(PackStreamMessageFormatV1.java:397)
        at org.neo4j.driver.internal.connector.socket.SocketClient.receiveOne(SocketClient.java:130)
        at org.neo4j.driver.internal.connector.socket.SocketClient.receiveAll(SocketClient.java:124)
        at org.neo4j.driver.internal.connector.socket.SocketConnection.receiveAll(SocketConnection.java:121)

@ivinskyi

ivinskyi commented Sep 9, 2016

Same here. Fails randomly once in 20-30 queries, stacktrace:

Caused by: org.neo4j.driver.v1.exceptions.ClientException: Failed to establish SSL socket connection.
        at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.unwrap(TLSSocketChannel.java:179)
        at org.neo4j.driver.internal.connector.socket.TLSSocketChannel.read(TLSSocketChannel.java:374)
        at org.neo4j.driver.internal.connector.socket.BufferingChunkedInput.readNextPacket(BufferingChunkedInput.java:408)
        at org.neo4j.driver.internal.connector.socket.BufferingChunkedInput.readChunkSize(BufferingChunkedInput.java:344)
        at org.neo4j.driver.internal.connector.socket.BufferingChunkedInput.read(BufferingChunkedInput.java:246)
        at org.neo4j.driver.internal.connector.socket.BufferingChunkedInput.fillScratchBuffer(BufferingChunkedInput.java:215)
        at org.neo4j.driver.internal.connector.socket.BufferingChunkedInput.readByte(BufferingChunkedInput.java:109)
        at org.neo4j.driver.internal.packstream.PackStream$Unpacker.unpackStructHeader(PackStream.java:441)
        at org.neo4j.driver.internal.messaging.PackStreamMessageFormatV1$Reader.read(PackStreamMessageFormatV1.java:397)
        at org.neo4j.driver.internal.connector.socket.SocketClient.receiveOne(SocketClient.java:130)
        at org.neo4j.driver.internal.connector.socket.SocketConnection.receiveOne(SocketConnection.java:135)
        at org.neo4j.driver.internal.connector.ConcurrencyGuardingConnection.receiveOne(ConcurrencyGuardingConnection.java:150)
        at org.neo4j.driver.internal.pool.PooledConnection.receiveOne(PooledConnection.java:170)
        at org.neo4j.driver.internal.InternalStatementResult.tryFetchNext(InternalStatementResult.java:303)
        at org.neo4j.driver.internal.InternalStatementResult.hasNext(InternalStatementResult.java:181)
        at org.neo4j.driver.internal.InternalStatementResult.list(InternalStatementResult.java:251)
        at org.neo4j.driver.internal.InternalStatementResult.list(InternalStatementResult.java:245)
      ...
        ... 7 more
        Suppressed: org.neo4j.driver.v1.exceptions.ClientException: Unable to read response from server: Expected a struct, but got: 6c
            at org.neo4j.driver.internal.connector.socket.SocketConnection.mapRecieveError(SocketConnection.java:168)
            at org.neo4j.driver.internal.connector.socket.SocketConnection.receiveAll(SocketConnection.java:126)
            at org.neo4j.driver.internal.connector.socket.SocketConnection.sync(SocketConnection.java:100)
            at org.neo4j.driver.internal.connector.ConcurrencyGuardingConnection.sync(ConcurrencyGuardingConnection.java:122)
            at org.neo4j.driver.internal.pool.PooledConnection.sync(PooledConnection.java:144)
            at org.neo4j.driver.internal.InternalSession.close(InternalSession.java:130)
            ... 10 more
        Caused by: org.neo4j.driver.internal.packstream.PackStream$Unexpected: Expected a struct, but got: 6c
            at org.neo4j.driver.internal.packstream.PackStream$Unpacker.unpackStructHeader(PackStream.java:450)
            at org.neo4j.driver.internal.messaging.PackStreamMessageFormatV1$Reader.read(PackStreamMessageFormatV1.java:397)
            at org.neo4j.driver.internal.connector.socket.SocketClient.receiveOne(SocketClient.java:130)
            at org.neo4j.driver.internal.connector.socket.SocketClient.receiveAll(SocketClient.java:124)
            at org.neo4j.driver.internal.connector.socket.SocketConnection.receiveAll(SocketConnection.java:121)
            ... 15 more

@pontusmelke
Contributor

Hi, sorry for not getting back to you earlier. We are looking into the issue.

Regards,
Pontus

@zone-tech

@pontusmelke , hi, same here, any workaround?

@kevanghobadi

Hello, we are also hitting this issue, and a sleep before we create our session seems to reduce the frequency. Any progress? Thanks!

@crichey

crichey commented Sep 19, 2016

I believe this is a bug that is being addressed in the next patch release.


@edstover
Author

Here is some additional information regarding this issue:

  1. Set up and configured an HAProxy load balancer, thinking the issue was centered around the AWS ELBs, but found that the same problem exists when the back-end connection between the load balancer and the Neo4j servers times out. HAProxy has a default timeout of 2 hours, so the problem takes longer to manifest, but it still happens. For testing purposes, reduced the HAProxy timeout to 60 seconds (same as the AWS ELB default) and observed the same connection failure that we saw with AWS ELBs.

  2. Suspecting an issue with SSL pass-through at the load balancers, changed the Neo4j driver to use Config.EncryptionLevel.NONE. The connection failures persisted, but the exception thrown by the driver was different -- Broken Pipe.

  3. To deal with the continuing connection issues when using load balancers, implemented a re-try loop with a max of 10 iterations and a 1 second wait between calls to execute Cypher statements, as follows:

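```java
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.Session;
import org.neo4j.driver.v1.StatementResult;
import org.neo4j.driver.v1.exceptions.ClientException;

// The logger is a stand-in for whatever logging the application uses.
private static final Logger LOGGER = Logger.getLogger("neo4j.retry");
static final int maxRetries = 10;
static final long retryWaitIntervalMS = 1000L;

private StatementResult run(Driver neo4jDriver, String statementTemplate, Map<String, Object> statementParameters) {
    int tries = 0;

    while (tries < maxRetries) {
        tries += 1;

        // Wait before every attempt except the first.
        if (tries > 1) {
            try {
                Thread.sleep(retryWaitIntervalMS);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop retrying.
                Thread.currentThread().interrupt();
                throw new RuntimeException("Interrupted while waiting to retry.", e);
            }
        }

        try (Session neo4jSession = neo4jDriver.session()) {
            StatementResult result = neo4jSession.run(statementTemplate, statementParameters);

            if ((result != null) && result.hasNext()) {
                return result;
            }

            return null;
        } catch (ClientException cex) {
            // Connection-level failure: log it and fall through to the next attempt.
            LOGGER.log(Level.WARNING, "Neo4j statement failed on attempt " + tries, cex);
        } catch (Exception ex) {
            // Anything else is logged and rethrown without a retry.
            LOGGER.log(Level.SEVERE, "Neo4j statement failed", ex);
            throw ex;
        }
    }

    throw new RuntimeException(String.format("Neo4j statement execution failed.  Tried %s times.", tries));
}
```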
  4. Encountered the same connection failures when executing Cypher statements inside a transaction. Also implemented a re-try loop around transactions, but that could not be encapsulated in a low-level private method like the single Cypher statement execution illustrated above. When a connection failure occurs inside a transaction, the transaction fails, so a retry requires the transaction to be re-executed from the beginning. This is ugly to implement because it elevates the re-try into the business logic of the application.
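
One possible way to keep a transaction-level retry out of the business logic is to hand the whole unit of work to a helper as a lambda, so that every attempt re-executes it from the beginning. The sketch below is self-contained and only illustrates that shape. `RetryingWork`, `runWithRetry`, and `TransientFailure` are hypothetical names, not part of the Neo4j driver; `TransientFailure` stands in for the driver's `ClientException`, and the fixed wait mirrors the statement-level loop above.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class RetryingWork {

    /** Hypothetical stand-in for a connection-level failure such as ClientException. */
    public static class TransientFailure extends RuntimeException {
        public TransientFailure(String message) {
            super(message);
        }
    }

    /**
     * Runs the whole unit of work up to maxAttempts times, waiting waitMillis
     * between attempts. Only TransientFailure triggers another attempt.
     */
    public static <T> T runWithRetry(int maxAttempts, long waitMillis, Supplier<T> work) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return work.get();
            } catch (TransientFailure e) {
                last = e; // remember the failure and try again
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(waitMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // keep the interrupt visible
                    break;
                }
            }
        }
        throw new RuntimeException("Unit of work failed after retries", last);
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Simulated unit of work: the first two attempts hit a stale connection.
        String outcome = runWithRetry(10, 10L, () -> {
            if (calls.incrementAndGet() < 3) {
                throw new TransientFailure("stale connection");
            }
            return "committed";
        });
        System.out.println(outcome + " after " + calls.get() + " attempts");
    }
}
```

In a real application the lambda would open the session, begin the transaction, run the statements, and commit, so the calling code only supplies the work and never sees the loop.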

I hope you find this information helpful.

@zhenlineo
Contributor

This fix might help in figuring out the real cause: #249

@zhenlineo
Contributor

Hi, after investigating this problem, it became clear that the timeout on the load balancer kills the connections between the client and the server, which results in this error.

The driver always assumes that the connection between the server and the client is persistent and can be reused, so it is not designed to handle this kind of network timeout. The driver pools connections and reuses them, so if a pooled connection has been broken, this exception is thrown the next time it is used.

The solution to this problem is probably to increase or turn off the timeout on the load balancer, or to add retries in your code.

Regards,
Zhen

@zhenlineo
Contributor

Please update to 1.0.6, which gives a better error message for this "Failed to establish SSL socket connection" error.

@tavolate

tavolate commented Nov 4, 2016

We have the same problem.

@tavolate

tavolate commented Nov 7, 2016

Can you post your workaround? Where did you insert the retry logic?

@edstover
Author

edstover commented Nov 7, 2016

@tavolate I posted my retry logic above. In my case, I encapsulated this code in a base class used by all of my Neo4j data access classes so that all Cypher queries are executed with retry capability.

@tavolate

tavolate commented Nov 7, 2016

Thank you @edstover, but we are using Spring with Neo4j and I'm looking for the best place to encapsulate the retry logic. Suggestions?

@lutovich
Contributor

lutovich commented May 2, 2017

Hello,

I just want to update this old ticket with references to a couple of new APIs which might be helpful here:

  1. Connection liveness check timeout configuration setting: https://github.com/neo4j/neo4j-java-driver/blob/1.2.0/driver/src/main/java/org/neo4j/driver/v1/Config.java#L291-L317. This should force the driver to re-acquire a connection when it is set to a value lower than the load balancer's idle connection timeout.
  2. Transaction function APIs that allow retries with exponential backoff. More info can be found in the docs: https://neo4j.com/docs/developer-manual/current/drivers/sessions-transactions/#driver-transactions-transaction-functions.
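
To illustrate the idea behind the first setting (the linked lines should correspond to `withConnectionLivenessCheckTimeout` on the config builder, but treat that name as an assumption and check it against your driver version), here is a minimal, self-contained sketch of a pool that tests a connection before reuse once it has been idle for longer than a threshold. It is not the driver's implementation: `LivenessCheckingPool`, `Conn`, `FakeConn`, and `ping()` are hypothetical names, and the clock is injected only so the example can simulate time passing.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

public class LivenessCheckingPool {

    /** Hypothetical minimal connection: ping() reports whether it still works. */
    public interface Conn {
        boolean ping();
    }

    /** Fake connection for the demo; set open to false to simulate a drop. */
    public static class FakeConn implements Conn {
        public boolean open = true;

        @Override
        public boolean ping() {
            return open;
        }
    }

    /** An idle connection together with the time it was returned to the pool. */
    private static final class Idle {
        final Conn conn;
        final long since;

        Idle(Conn conn, long since) {
            this.conn = conn;
            this.since = since;
        }
    }

    private final Deque<Idle> idle = new ArrayDeque<>();
    private final Supplier<Conn> factory;
    private final long livenessCheckTimeoutMillis;
    private final LongSupplier clock;

    public LivenessCheckingPool(Supplier<Conn> factory, long livenessCheckTimeoutMillis, LongSupplier clock) {
        this.factory = factory;
        this.livenessCheckTimeoutMillis = livenessCheckTimeoutMillis;
        this.clock = clock;
    }

    /** Reuses an idle connection, pinging it first if it has been idle too long. */
    public Conn acquire() {
        Idle entry;
        while ((entry = idle.poll()) != null) {
            boolean idleTooLong = clock.getAsLong() - entry.since >= livenessCheckTimeoutMillis;
            if (!idleTooLong || entry.conn.ping()) {
                return entry.conn;
            }
            // Idle too long and the ping failed: drop it and look at the next one.
        }
        return factory.get();
    }

    public void release(Conn conn) {
        idle.push(new Idle(conn, clock.getAsLong()));
    }

    public static void main(String[] args) {
        long[] now = {0L};
        LivenessCheckingPool pool = new LivenessCheckingPool(FakeConn::new, 30_000L, () -> now[0]);

        FakeConn first = (FakeConn) pool.acquire();
        pool.release(first);

        now[0] += 60_000L;  // idle for as long as the load balancer's timeout
        first.open = false; // the load balancer silently dropped the connection

        Conn second = pool.acquire();
        System.out.println("fresh connection handed out: " + (second != first));
    }
}
```

The threshold only helps if it is shorter than the load balancer's idle timeout (60 seconds in the ELB setup described earlier in this thread); otherwise a dropped connection can still be handed out without being tested.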

Hope this helps.
