Auto refresh a list of cluster nodes in TarantoolClusterClient #34

Closed
hypersolid opened this issue Jul 25, 2018 · 6 comments · Fixed by #136

@hypersolid

https://github.com/tarantool/vshard is the current clustering solution for Tarantool. It introduces the concept of "routers": Tarantool instances that route requests to the appropriate "storage" instances where the data is actually stored.

I think we could add a TarantoolClusterClient for Tarantool clusters, which will round-robin requests across a group of Tarantool instances. It would also be useful to be able to pass a custom sharding function to it (for custom sharding solutions).

@sharonovd

Since we now have TarantoolClusterClient, we could add the following to it:
[ ] the ability to extract the list of available nodes from the first connected node and add them to TarantoolClusterClientConfig
[ ] the ability to refresh this list periodically

The procedure called on the Tarantool side to determine the list of available nodes must be pluggable, since we have several use cases (a sharded cluster, a replica set, the 'cluster' module, etc.).

We may consider round-robin between the available instances.

@Totktonada
Member

Notes from a meeting with Dmitry (@sharonovd) (Jan 31).

  • Is it possible that the procedure will return an ID that is not a URI?

    • The procedure should return a list of URIs. It is okay for the Java connector to expect that all URIs are in the inet domain ([user[:pass]@]host:port), because the connector does not support Unix sockets.
  • Should we store arguments in TarantoolClusterClientConfig and pass them to the procedure?

    • It is not really needed, so let's implement it only if it is simple.
  • What should we do if the node we are currently connected to disappears from the list?

    • Raw idea: if the last requests completed without errors, we can let the remaining requests finish on the current node, but close it for new requests. It doesn't look like a simple solution.
    • We can retry all in-flight requests on the new node, but we still need to handle DML requests correctly (ensure we don't execute them twice).
  • Do we want to support a list of instances that can be asked for cluster URLs instead of just one?

    • We need to define our behaviour for the case when a node is dead while we fetch the list for the first time or update it.
    • It seems that we need to implement round-robin across these informational nodes too.
    • Raw idea: the round-robin logic can be separated from the particular actions one wants to perform on these nodes. It can be implemented as a separate class that holds a list of URLs but does not hold a connection to any of them. It provides a call method and connects to a node (trying each across the list) right during a call to this method (see the sketch after this list).
  • NB: look at the Couchbase API.
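To make the last raw idea a bit more concrete, here is a minimal sketch in Java. Everything in it (the RoundRobinCaller name, the call signature, the exception handling) is hypothetical and is not part of the connector API; it only illustrates keeping the round-robin order separate from the action performed on a node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch: round-robin over a list of node addresses without
// holding a permanent connection to any of them.
class RoundRobinCaller {

    private final List<String> addresses;                 // "host:port" strings
    private final AtomicInteger position = new AtomicInteger(0);

    RoundRobinCaller(List<String> addresses) {
        this.addresses = new ArrayList<>(addresses);
    }

    // Applies the action to the next node in round-robin order; on failure,
    // tries the remaining nodes before giving up.
    <T> T call(Function<String, T> action) {
        RuntimeException lastError = null;
        for (int i = 0; i < addresses.size(); i++) {
            int index = Math.floorMod(position.getAndIncrement(), addresses.size());
            try {
                return action.apply(addresses.get(index)); // e.g. connect and invoke the procedure
            } catch (RuntimeException e) {
                lastError = e;                             // remember and try the next node
            }
        }
        throw new RuntimeException("all informational nodes failed", lastError);
    }
}
```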

@Totktonada
Member

Do we want to support a list of instances that can be asked for cluster URLs instead of just one?

I thought about that again and decided that there is not much sense in supporting a list of instances here. There are three possible situations when we need to switch from one instance to another to obtain the list of nodes in a cluster, all related to unavailability of the original node: accidental node downtime, planned maintenance on the node, and a cluster update that moves the node out.

Accidental node downtime will break fetching the list of nodes, but DQL/DML requests will be switched by the cluster client to another node. Updating the node list is not a final goal; it will be stuck for some time, but DQL/DML will keep working. So this case gives us no reason to expand the 'info node' into a list of such nodes.

Planned maintenance, when the node is temporarily down, is much the same.

Moving a node out of a cluster is a kind of planned work. Expanding the 'info node' into a list would not save us from the situation when all nodes in the list are down (sooner or later). So we should anyway include updating the info node address / list in the plan for such a cluster reconstruction. A list does not add any value here.

@ponomarevDmitri

ponomarevDmitri commented Feb 19, 2019

OK, the final algorithm will consist of the following steps:

  1. Configuring. A connector client must specify the 'info node' address as a host:port string and the username/password as separate fields in the cluster config (as is done in the current cluster config). The config must also specify a function on the Tarantool instance side that returns a list of strings consisting of 'host:port' addresses of nodes (let's call them 'target nodes').
  2. Fetching the list of nodes. The connector fetches the list of cluster nodes by calling this function on the 'info node'. (This operation can run in the background.)
  3. Updating the target node list.
    If the newly obtained list does not contain the current node, wait in the background for the responses to the requests that have already been sent to this node. Retry failed requests on nodes from the new list. Instantly failing the already sent CUD requests makes less sense: if the node is down, all requests will be retried automatically anyway, and if the node is alive, the requests will be completed. It is also more transparent for a connector client, who will know nothing about the hypothetical node failure.
  4. The client round-robins among the nodes in the list.
  5. Update the node list by timeout (return to step 2). A sketch of this background loop is given after this list.
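For illustration, the background refresh described in steps 2-5 could look roughly like the sketch below. The NodeListRefresher name, the discovery supplier ("call the configured function on the info node") and the applyNewAddresses consumer ("update the target node list inside the cluster client") are all hypothetical stand-ins, not the connector's actual API.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical sketch of the periodic node-list refresh loop (steps 2-5).
class NodeListRefresher {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    void start(Supplier<List<String>> discovery,
               Consumer<List<String>> applyNewAddresses,
               long periodSeconds) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                // step 2: fetch the node list from the 'info node' in the background
                List<String> addresses = discovery.get();
                if (addresses != null && !addresses.isEmpty()) {
                    // step 3: update the target node list (handling the case when
                    // the current node is missing is left to the cluster client)
                    applyNewAddresses.accept(addresses);
                }
            } catch (RuntimeException ignored) {
                // a failed discovery attempt is not fatal; retry on the next tick
            }
        }, 0, periodSeconds, TimeUnit.SECONDS);            // step 5: repeat by timeout
    }

    void stop() {
        scheduler.shutdownNow();
    }
}
```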

@Totktonada
Member

There is the question of how to handle the situation when a new list of nodes
does not contain the node we are currently connected to. We need to switch the
cluster client to one of the new nodes. It is possible that we'll have some
number of requests 'in flight': they were sent, but responses have not been
received yet.

There are the following ways to handle this situation:

  1. Pause new requests (don't send them, but add them to a separate queue),
    wait for the 'in flight' requests, then switch the node.
  2. Pause switching to a new node (and send all new requests to the old node)
    until we detect one of two situations: there are no requests 'in flight' or
    a reconnect occurs.
  3. Wait for responses to the 'in flight' requests, but send all new requests
    to the new node.

The third way requires storing two sockets in a cluster client and handling
both. readPacket() and other utility methods from TarantoolBase need to be
factored out to be able to work with an arbitrary socket. That is why the
refactoring in PR #135 was made. Maybe other refactoring is needed as well.

I'm not sure it was the right thing to inherit TarantoolClusterClient from
TarantoolClientImpl (in the cluster client we add a round-robin socket provider
and a retry strategy to a regular client). Maybe it would be more natural to
store several TarantoolClientImpl instances in TarantoolClusterClient?
Not sure, so consider this a raw idea.

I would say that the 2nd way should be the least intrusive considering our
current architecture (a sketch of it is given below). But it is up to the
developer to decide which way is best, considering the amount of work (the
deadline is near), the complexity of the changes (the size of the patch) and
the resulting code readability.
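A very rough sketch of the 2nd way, assuming hypothetical hook points (onRequestSent / onResponseReceived) called by the writer and reader threads; none of these names are the connector's real API, and a real implementation would also need to handle the "a reconnect occurs" branch and coordinate with the writer thread to avoid races.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of way 2: keep sending new requests to the old node and
// postpone the switch until there are no requests 'in flight'.
class DeferredNodeSwitch {

    private final AtomicInteger inFlight = new AtomicInteger(0);
    private final AtomicReference<String> pendingAddress = new AtomicReference<>();

    // Called when a fresh node list no longer contains the current node.
    void requestSwitch(String newAddress) {
        pendingAddress.set(newAddress);
        trySwitch();
    }

    // Called by the writer thread right before a request goes to the socket.
    void onRequestSent() {
        inFlight.incrementAndGet();
    }

    // Called by the reader thread when a response (or an error) is received.
    void onResponseReceived() {
        if (inFlight.decrementAndGet() == 0) {
            trySwitch();   // the last in-flight request finished: safe to switch
        }
    }

    private void trySwitch() {
        String address = pendingAddress.get();
        if (address != null && inFlight.get() == 0
                && pendingAddress.compareAndSet(address, null)) {
            reconnect(address);
        }
    }

    private void reconnect(String address) {
        // placeholder: close the old socket and connect to the new address
        System.out.println("switching to " + address);
    }
}
```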

Totktonada changed the title from "Add support for tarantool clusters (e.g. tarantool/vshard routers)" to "Auto refresh a list of cluster nodes in TarantoolClusterClient" on Mar 21, 2019
nicktorwald added a commit that referenced this issue Mar 22, 2019
Refactor the SocketChannelProvider implementations. Now we have two of them,
SingleSocketChannelProviderImpl and RoundRobinSocketProviderImpl, used by the
simple and cluster clients respectively. To achieve this, a
BaseSocketChannelProvider was extracted.

Add a service discovery implementation based on a Tarantool stored
procedure which is called to obtain a new list of cluster nodes.

Integrate service discovery with the current cluster client. The client is
now aware of potential node changes via a configurable background job
which periodically checks whether the node addresses have changed. If so,
the client refreshes the RoundRobinSocketProviderImpl and replaces the old
nodes with the new ones. It also requires some additional effort when
the current node is missing from the list: we need to reconnect to another
node as soon as possible with a minimal delay between client requests.
To achieve this we currently try to catch the moment when the requests in
progress have completed and we can finish the reconnection process.

Closes: #34
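As an aside, the addresses returned by such a discovery procedure are plain '[user[:pass]@]host:port' strings (see the meeting notes above), so turning them into socket addresses could look roughly like the hypothetical helper below; it is an illustration, not the connector's actual parsing code. For example, parsing ["guest@127.0.0.1:3301", "127.0.0.2:3302"] yields two socket addresses.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: convert "[user[:pass]@]host:port" strings returned by a
// discovery procedure into socket addresses.
final class DiscoveredAddresses {

    static List<InetSocketAddress> parse(List<String> uris) {
        List<InetSocketAddress> result = new ArrayList<>();
        for (String uri : uris) {
            // drop the optional user[:pass]@ prefix
            String hostPort = uri.contains("@") ? uri.substring(uri.indexOf('@') + 1) : uri;
            int colon = hostPort.lastIndexOf(':');
            String host = hostPort.substring(0, colon);
            int port = Integer.parseInt(hostPort.substring(colon + 1));
            result.add(new InetSocketAddress(host, port));
        }
        return result;
    }
}
```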
@Totktonada
Member

Do we want to support a list of instances that can be asked for cluster URLs instead of just one?

Now I realize that I missed the case when we initially connect a cluster client. At this point it is worth having several addresses to bootstrap the cluster client, because it is possible that one node is down at that moment.

Also, there is not much need to split discovery and DQL/DML nodes when both are lists.

nicktorwald added a commit that referenced this issue Apr 1, 2019
- Avoid a possible race between the reading, writing and reconnecting
threads when a reconnection process is started.
It might have happened that a lagged thread (reading or writing) could
reset the state to RECONNECT after the reconnecting thread had already
started and set the state to 0. As a result, all further attempts to
reconnect would never happen. Now the reconnect thread holds the state
for as long as it is required.

- Avoid another possible race between the reading and writing threads when
they are started during the reconnection process.
It might have happened that one of the threads crashed while it was
starting and the other, slightly lagged, thread set its flag. This could
lead to the reconnecting thread seeing a RECONNECT + R/W state instead
of pure RECONNECT. Again, this case broke all subsequent reconnection
attempts. Now the reading and writing threads take into account whether
the RECONNECT state is already set or not.

- Replace LockSupport with ReentrantLock.Condition for suspending and waking
up a thread. Our cluster tests and the standalone demo app show
that LockSupport is not as safe a memory barrier as it could be. The
reconnect thread relies on a visibility guarantee between park-unpark
invocations which, actually, sometimes doesn't work. Also, according to
the Javadocs, LockSupport is more of an internal component for building
high-level blocking primitives; it is not recommended to use this class
directly. It was replaced by the ReentrantLock.Condition primitive, which
is based on LockSupport but uses it properly inside.

Fixes: #142
Affects: #34, #136
nicktorwald added a commit that referenced this issue Apr 15, 2019
The state of a client is a set of the following flags: {READING, WRITING,
RECONNECT, CLOSED}. Let's call the state where no flags are set
UNINITIALIZED.

A reader thread sets READING, performs reading until an error or an
interruption occurs, drops READING, and tries to trigger reconnection
(if the state allows it, see below). A writer thread does quite the same
things, but with the WRITING flag. The key point here is that
reconnection is triggered from a reader/writer thread and only when
certain conditions are met.

The prerequisite for setting RECONNECT and signalling (unparking) the
connector thread is that the client is in the UNINITIALIZED state.

There are several problems here:

- Say a reader stalls for a bit after dropping READING, then a writer
drops WRITING and triggers reconnection. Then the reader wakes up and
sets RECONNECT again.

- Calling unpark() N times for a connector thread while it is alive does
not make it skip the next N park() calls, so the problem above is not
just about an extra reconnection; it also leaves the connector thread
stuck.

- Say a reader stalls just before setting READING. A writer is hit by an
I/O error and triggers reconnection (sets RECONNECT, unparks the
connector). Then the reader wakes up and sets the READING+RECONNECT
state, which does not allow the connector thread to proceed further (it
expects pure RECONNECT). Even when the reader drops READING, it will not
wake up (unpark) the connector thread, because RECONNECT was already set
(the state is not UNINITIALIZED).

This commit introduces several changes that eliminate the problems
above:

- Use ReentrantLock + Condition instead of park() / unpark() so that
reconnect signals are never missed, no matter whether the connector is
parked.

- Ensure that the reader and writer threads of one generation (created
on the same reconnection iteration) trigger reconnection only once.

- Hold the RECONNECT state for most of the time the connector works
(while acquiring a new socket, connecting, and reading the Tarantool
greeting) and prevent setting READING/WRITING while RECONNECT is set.

Fixes: #142
Affects: #34, #136
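
(A minimal sketch of how such a flag set might be represented, assuming
bit flags packed into an AtomicInteger. The ClientState class and its
helpers are hypothetical and only illustrate the state model described
above, not the connector's actual implementation.)

import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of the flag-based client state described above.
public class ClientState {
    public static final int UNINITIALIZED = 0;
    public static final int READING   = 1;       // reader thread is active
    public static final int WRITING   = 1 << 1;  // writer thread is active
    public static final int RECONNECT = 1 << 2;  // reconnection is in progress
    public static final int CLOSED    = 1 << 3;  // client is closed for good

    private final AtomicInteger state = new AtomicInteger(UNINITIALIZED);

    // Atomically sets a flag; returns false if the client is already closed.
    public boolean acquire(int flag) {
        for (;;) {
            int current = state.get();
            if ((current & CLOSED) != 0) {
                return false;
            }
            if (state.compareAndSet(current, current | flag)) {
                return true;
            }
        }
    }

    // Drops a flag (e.g. READING when the reader thread exits).
    public void release(int flag) {
        for (;;) {
            int current = state.get();
            if (state.compareAndSet(current, current & ~flag)) {
                return;
            }
        }
    }

    // Reconnection may only be triggered from the UNINITIALIZED state.
    public boolean tryTriggerReconnect() {
        return state.compareAndSet(UNINITIALIZED, RECONNECT);
    }
}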
nicktorwald added a commit that referenced this issue Apr 15, 2019
The state of a client is a set of the following flags: {READING, WRITING,
RECONNECT, CLOSED}. Let's call the state where no flags are set
UNINITIALIZED.

A reader thread sets READING, performs reading until an error or an
interruption occurs, drops READING, and tries to trigger reconnection
(if the state allows it, see below). A writer thread does quite the same
things, but with the WRITING flag. The key point here is that
reconnection is triggered from a reader/writer thread and only when
certain conditions are met.

The prerequisite for setting RECONNECT and signalling (unparking) the
connector thread is that the client is in the UNINITIALIZED state.

There are several problems here:

- Say a reader stalls for a bit after dropping READING, then a writer
drops WRITING and triggers reconnection. Then the reader wakes up and
sets RECONNECT again.

- Calling unpark() N times for a connector thread while it is alive does
not make it skip the next N park() calls, so the problem above is not
just about an extra reconnection; it also leaves the connector thread
stuck.

- Say a reader stalls just before setting READING. A writer is hit by an
I/O error and triggers reconnection (sets RECONNECT, unparks the
connector). Then the reader wakes up and sets the READING+RECONNECT
state, which does not allow the connector thread to proceed further (it
expects pure RECONNECT). Even when the reader drops READING, it will not
wake up (unpark) the connector thread, because RECONNECT was already set
(the state is not UNINITIALIZED).

This commit introduces several changes that eliminate the problems
above:

- Use ReentrantLock + Condition instead of park() / unpark() so that
reconnect signals are never missed, no matter whether the connector is
parked.

- Ensure that the reader and writer threads of one generation (created
on the same reconnection iteration) trigger reconnection only once.

- Hold the RECONNECT state for most of the time the connector works
(while acquiring a new socket, connecting, and reading the Tarantool
greeting) and prevent setting READING/WRITING while RECONNECT is set.

- Ensure that a new reconnection iteration starts only after the old
reader and writer threads have exited (because we cannot receive a
reconnection signal until we send it).

Fixes: #142
Affects: #34, #136
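
(A minimal sketch of the ReentrantLock + Condition signalling mentioned
in the first bullet above. The ReconnectSignal class and its field names
are hypothetical; the point is that, unlike park()/unpark(), a request
raised before the connector starts waiting is remembered in a flag and
therefore cannot be lost.)

import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch of lossless reconnect signalling.
public class ReconnectSignal {
    private final Lock lock = new ReentrantLock();
    private final Condition needReconnect = lock.newCondition();
    private boolean reconnectRequested;

    // Called by a reader/writer thread that hit an error.
    public void requestReconnect() {
        lock.lock();
        try {
            reconnectRequested = true;
            needReconnect.signal();
        } finally {
            lock.unlock();
        }
    }

    // Called by the connector thread; returns once a reconnect was requested,
    // even if the request happened before this method was entered.
    public void awaitReconnectRequest() throws InterruptedException {
        lock.lock();
        try {
            while (!reconnectRequested) {
                needReconnect.await();
            }
            reconnectRequested = false;
        } finally {
            lock.unlock();
        }
    }
}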