Curious behavior: idleTimeoutMillis works locally, but not on "serverless" environments #2718
Comments
It's the lambda environment freeze / thaw causing this issue. You can't use a long lived connection pool with a lambda as there is no guarantee that the lambda will continue running to clear the expired connections or keep the underlying TCP socket alive. See this thread and the nested linked issue for more details: #2243 |
Hi @sehrope, thank you! I read issue #2243 before creating mine, and my app is not suffering from "connection timeout". But you pointed out the freeze / thaw behavior again, and I'm "stress" testing my app to keep all lambdas warm, for example:

ab -n 1000 -c 100 -l https://xxx

At some point (near the 900th request) everything hangs, because there are no more available connections. But then I tried to create the Pool with
It was way slower, but in the end all 1000 requests were served and no connections were left hanging (tbh 5 of them were, I don't know why). So now I'm wondering: if the freeze condition is the problem, how fast does this happen to a lambda? |
Additional note:
Consumes all connections again, without recovering/closing connections any more. Maybe some lambdas aren't being frozen because they became idle, but because they were forcibly frozen to make room for new lambdas in a high-load scenario? [edit] My urge to use Pools is because creating a new connection every time versus reusing one from the Pool is the difference between 80ms and 1.5ms. |
I think that's possible as the docs say it could be frozen or bounced at any time: https://docs.aws.amazon.com/lambda/latest/dg/runtimes-context.html The last line also says not to assume you have connections open across requests:
If there's multiple copies of the lambda running or if AWS spins up new ones, it's possible that the server's max connection limit is reached and a newly created lambda cannot open any new connections until the old ones on other lambdas are closed. And they might not actually be gracefully closed if the lambda is frozen so it'd hang until the server finally drops them due to TCP errors. |
Yes. What really bugs me is that, even using |
So @sehrope, just to map out all scenarios and double-check that I have the right models in my mind, consider 1 request from User to App, in which the App makes 3 round trips (queries) to the Database (AWS RDS, tiniest tier).

Scenario 1 (slowest | safer)
For every single query, create a new client to open a new connection, run the query, and right afterward destroy the client and close the connection.

"latency": {
"first_query": 60.39779100008309,
"second_query": 62.02611799980514,
"third_query": 87.15444500022568
},

✅ No connection left hanging under low load.

$ ab -n 1000 -c 100 -l https://xxx
# > Time taken for tests: 16.601 seconds

Scenario 2 (slower | safe)
For 1 request, open just one connection, reuse it across the subsequent queries of that request, and close it at the end. It's safe, but less safe than Scenario 1 because you need to close the Database connections yourself, and in every code branch (see the sketch at the end of this comment).
But it pays off, I guess:

"latency": {
"first_query": 58.97044100007042,
"second_query": 2.284815000137314,
"third_query": 0.9723079998511821
},

✅ No connection left hanging under low load.

$ ab -n 1000 -c 100 -l https://xxx
# > Time taken for tests: 6.699 seconds

Scenario 3 (fast | unsafe)
For 1 request, use a Pool declared in the global scope to manage all connected Clients (create and reuse). You don't need to close Clients yourself; the Pool does this once idleTimeoutMillis is reached.

First request:

"latency": {
"first_query": 61.41757299995515,
"second_query": 3.38494899997022,
"third_query": 1.2718850000528619
},

Second request, made before reaching idleTimeoutMillis:

"latency": {
"first_query": 1.6486390000209212,
"second_query": 1.6974199999822304,
"third_query": 1.25238800002262
},

✅ No connection left hanging under low load.

Maybe you could kinda sync

Scenario 4 (faster | unsafer)
Always keep an open connection in the global scope. It's faster because there's zero overhead (important to note that the Pool overhead is minimal, and it can manage multiple Clients), but it is one level less safe, because of all the problems from Scenario 3, plus Postgres closing your connection from its side, just like mentioned in issue #2243.

First request:

"latency": {
"first_query": 56.0580700000282377,
"second_query": 1.2427840000018477,
"third_query": 0.8340720001142472
},

Second request:

"latency": {
"first_query": 0.8882809999631718,
"second_query": 0.9511439999332651,
"third_query": 0.5543670000042766
},

✅ No connection left hanging under low load.
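A minimal sketch of the Scenario 2 pattern, just to make the "close it in every code branch" point concrete (the handler shape, table names, and parameters are hypothetical):

```js
// Scenario 2: one client per request, closed in a finally block so no
// code branch (success or error) can leave the connection hanging.
const { Client } = require('pg');

exports.handler = async function (event) {
  const client = new Client(); // connection settings come from PG* environment variables
  await client.connect();
  try {
    // All queries of this request reuse the same connection.
    const user = await client.query('SELECT * FROM users WHERE id = $1;', [event.userId]);
    const posts = await client.query('SELECT * FROM posts WHERE user_id = $1;', [event.userId]);
    return { user: user.rows[0], posts: posts.rows };
  } finally {
    await client.end();
  }
};
```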
|
Scenario 2 is not completely safe under load
In a load of 5000 requests (50 concurrent), I received 22

Anyway, differences between Scenario 2

Scenario 4
|
Alternatively you could create a new pool for every request, and clean it up every time. Did you handle errors on |
True! I will combine this strategy with

I didn't, and it's a great point, but are you sure a "waiting to open" connection will throw this message? I thought it was due to the fact that a request called

node-postgres/packages/pg/lib/client.js, lines 568 to 573 (commit 684cd09)

What I could do is to handle conditions for |
That sounds like a bug in either the pool itself or your pool connection management. Regardless of load, the pool should not close connections that are checked out, and it should not hand out connections it knows it has closed.

Are you calling
Does your code use |
@sehrope On Scenario 2 I wasn't using a Pool: I was manually creating one client, assigning it to the global scope, and for every new request checking if it already existed; if so, reusing it, otherwise creating a new one. Only at the end of that request would I close the client and redefine its variable to |
Oh yes, that'd definitely break under load. Clients have an internal query command queue, so concurrent usage of a client kind of works until it doesn't (i.e. anything that requires connection state management, like a transaction). In practice it's a terrible idea to rely on the queue behavior, and sharing clients concurrently is a big no-no. How about having your request handler get or create a pool at the start of the request and, if it's the only active request upon completion, destroy the pool? Something like this (not tested, just a skeleton!):

const pg = require('pg');

let pool = null;
let poolUsageCount = 0;

// App code should use this to get the pool rather than referencing the variable directly:
exports.getPool = () => {
  if (!pool) {
    throw new Error('Called getPool() outside of an active request');
  }
  return pool;
};

exports.handler = async function (event, context) {
  if (!pool) {
    pool = new pg.Pool();
  }
  // Count every active request, not only the one that created the pool,
  // so it stays balanced with the decrement in the finally block below.
  poolUsageCount += 1;
  try {
    await myRealHandler(event, context);
  } finally {
    poolUsageCount -= 1;
    if (poolUsageCount === 0) {
      await pool.end();
      pool = null;
    }
  }
};

Under load there would be enough concurrent active requests that the pool does not get destroyed. It'd be reused by concurrent requests, and any open connections held in the pool would be reused as well. But if the request count ever drops low enough that nothing is running, it would ensure the pool is destroyed and recreated in case the lambda is frozen / thawed before the next request. You might have to tweak it a bit more to have the top-level request handler destroy the pool before sending back the response, to ensure it all happens before AWS thinks it's okay to scrap the lambda. |
@sehrope great!! I'm avoiding having the controller manage the pool, but I'm not discarding this strategy, thank you 🤝

Meanwhile, I started my tests with:

Pool: idleTimeoutMillis: 5000
Postgres:

While everything looks great at every request under 7000ms, after passing 7000ms the next request blew up in my face. The most important part:

"stack": [
  "error: terminating connection due to idle-session timeout",
  "    at Parser.parseErrorMessage (/var/task/node_modules/pg-protocol/dist/parser.js:287:98)",
  "    at Parser.handlePacket (/var/task/node_modules/pg-protocol/dist/parser.js:126:29)"
]

It's clearer to me now that in a Lambda the Pool stops its internal work as soon as the request is returned, so it's unable to disconnect the client after idleTimeoutMillis.
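Whatever the exact timing, it helps to attach an error handler to the Pool so a connection the server terminates while idle (idle-session timeout, or the socket dying while the lambda is frozen) doesn't crash the process. A minimal sketch, assuming a Pool created in the global scope; the configuration values are illustrative:

```js
const { Pool } = require('pg');

const pool = new Pool({ max: 1, idleTimeoutMillis: 5000 });

// pg emits 'error' on the pool when an idle client's connection dies.
// Without a listener, that surfaces as an unhandled 'error' event and
// brings the whole process down; with it, the pool simply discards the
// broken client and a new connection is opened on the next query.
pool.on('error', (err) => {
  console.error('Idle client error:', err.message);
});
```
|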
Btw, there is no concurrent client sharing in this case (and the
My mistake, the connection timeout sets the

node-postgres/packages/pg/lib/client.js, lines 103 to 107 (commit 21ccd4f)

In this case, I don't think the connection timeout would result in that exact error message. If you don't use a pool, that error should only be triggered by using a client after calling client.end(). The connection timeout might result in "Connection terminated unexpectedly" errors.
|
@boromisp Oh dang, you're right. Lambda only sends a single request at a time to an instance. @filipedeschamps Forget my idea of keeping it open if it's concurrently being used. That would only work if the requests are actually concurrent. |
I think I found an almost-sweet spot. It's not the fastest strategy, because you take the hit on the first query (to open the connection), but the Pool reuses the client for every subsequent query in that request and closes the client right after. So the pool is using:

With

First query result:

"latency": {
"first_query": 13.636365999933332,
"second_query": 2.0373400000389665,
"third_query": 0.6963150000665337
},

Load results, 1000 requests:

Load results, 5000 requests:

No connections left hanging. But I'm confident it will not work for every load, because of this:

Pool:

It's clear to me that the Pool's internal timers completely freeze right after the request is returned. Next, I'm going to try setting Postgres to |
To my surprise, somehow even after 60 seconds the Lambda keeps the socket connection open.

Pool: idleTimeoutMillis: 10000
Postgres:

First request works as expected, and 2 minutes later, sending the second request makes the app crash with

[edit] It also crashed after 5 minutes. |
For now, my final choice will be:

Pool:
Postgres:

Load of 5,000 requests (100 concurrent):
Connections left hanging:

One request example:

"latency": {
"first_query": 19.316103999968618,
"second_query": 2.637129999930039,
"third_query": 1.5327200000174344
},

In comparison, if I open one new connection for every query of the request:

Load of 5,000 requests (100 concurrent):
Connections left hanging:

One request example:

"latency": {
"first_query": 19.859559000004083,
"second_query": 17.381792000029236,
"third_query": 18.042578999884427
}, |
Final solution (already working in production): use the Pool as much as you can, but keep checking "opened connections" versus "available connections", and if they're too close, end the pool so its connections are closed and a fresh pool is created on the next query. My repository is still private, but here's the full solution:

// database.js - exports a Singleton with `query()` and `getNewClient()` methods
import { Pool, Client } from 'pg';
import retry from 'async-retry';
import { ServiceError } from 'errors/index.js';
const configurations = {
user: process.env.POSTGRES_USER,
host: process.env.POSTGRES_HOST,
database: process.env.POSTGRES_DB,
password: process.env.POSTGRES_PASSWORD,
port: process.env.POSTGRES_PORT,
connectionTimeoutMillis: 5000,
idleTimeoutMillis: 30000,
max: 1,
ssl: {
rejectUnauthorized: false,
},
allowExitOnIdle: true,
};
// https://github.com/filipedeschamps/tabnews.com.br/issues/84
if (['test', 'development'].includes(process.env.NODE_ENV) || process.env.CI) {
delete configurations.ssl;
}
const cache = {
pool: null,
maxConnections: null,
reservedConnections: null,
openedConnections: null,
openedConnectionsLastUpdate: null,
};
async function query(query, params) {
let client;
try {
client = await tryToGetNewClientFromPool();
return await client.query(query, params);
} catch (error) {
const errorObject = new ServiceError({
message: error.message,
context: {
query: query.text,
},
errorUniqueCode: 'INFRA:DATABASE:QUERY',
stack: new Error().stack,
});
console.error(errorObject);
throw errorObject;
} finally {
if (client) {
const tooManyConnections = await checkForTooManyConnections(client);
if (tooManyConnections) {
client.release();
await cache.pool.end();
cache.pool = null;
} else {
client.release();
}
}
}
}
async function tryToGetNewClientFromPool() {
const clientFromPool = await retry(newClientFromPool, {
retries: 50,
minTimeout: 0,
factor: 2,
});
return clientFromPool;
async function newClientFromPool() {
if (!cache.pool) {
cache.pool = new Pool(configurations);
}
return await cache.pool.connect();
}
}
async function checkForTooManyConnections(client) {
const currentTime = new Date().getTime();
const openedConnectionsMaxAge = 10000;
const maxConnectionsTolerance = 0.9;
if (cache.maxConnections === null || cache.reservedConnections === null) {
const [maxConnections, reservedConnections] = await getConnectionLimits();
cache.maxConnections = maxConnections;
cache.reservedConnections = reservedConnections;
}
if (
cache.openedConnections === null ||
cache.openedConnectionsLastUpdate === null ||
currentTime - cache.openedConnectionsLastUpdate > openedConnectionsMaxAge
) {
const openedConnections = await getOpenedConnections();
cache.openedConnections = openedConnections;
cache.openedConnectionsLastUpdate = currentTime;
}
if (cache.openedConnections > (cache.maxConnections - cache.reservedConnections) * maxConnectionsTolerance) {
return true;
}
return false;
async function getConnectionLimits() {
const [maxConnectionsResult, reservedConnectionResult] = await client.query(
'SHOW max_connections; SHOW superuser_reserved_connections;'
);
return [
maxConnectionsResult.rows[0].max_connections,
reservedConnectionResult.rows[0].superuser_reserved_connections,
];
}
async function getOpenedConnections() {
const openConnectionsResult = await client.query(
'SELECT numbackends as opened_connections FROM pg_stat_database where datname = $1',
[process.env.POSTGRES_DB]
);
return openConnectionsResult.rows[0].opened_connections;
}
}
async function getNewClient() {
try {
const client = await tryToGetNewClient();
return client;
} catch (error) {
const errorObject = new ServiceError({
message: error.message,
errorUniqueCode: 'INFRA:DATABASE:GET_NEW_CONNECTED_CLIENT',
stack: new Error().stack,
});
console.error(errorObject);
throw errorObject;
}
}
async function tryToGetNewClient() {
const client = await retry(newClient, {
retries: 50,
minTimeout: 0,
factor: 2,
});
return client;
// You need to close the client when you are done with it
// using the client.end() method.
async function newClient() {
const client = new Client(configurations);
await client.connect();
return client;
}
}
export default Object.freeze({
query,
getNewClient,
});
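A quick usage sketch of this module, just to show the intended call sites; the route handler shape, import path, and SQL are hypothetical:

```js
import database from 'infra/database.js'; // hypothetical path to the module above

export default async function handler(request, response) {
  // One-off query: the module checks out a pooled client, runs the query,
  // and releases (or recycles) the client in its finally block.
  const results = await database.query({
    text: 'SELECT * FROM users WHERE id = $1;',
    values: [request.query.id],
  });

  // For work that needs a dedicated connection (e.g. a transaction),
  // get a standalone client and close it yourself.
  const client = await database.getNewClient();
  try {
    await client.query('BEGIN;');
    // ...queries that must share the same connection...
    await client.query('COMMIT;');
  } catch (error) {
    await client.query('ROLLBACK;');
    throw error;
  } finally {
    await client.end();
  }

  response.status(200).json(results.rows);
}
```
|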
Final results

First strategy
Open and close connection for every single query:

Last strategy
Manage Pool based on opened vs available connections:
|
@filipedeschamps Thank you so much for investigating this issue. I had the same problem and am quite happy with the following solution:

const MAX_SIGNED_32_BIT_INTEGER = 2147483647;
...
{
idleTimeoutMillis: MAX_SIGNED_32_BIT_INTEGER,
reapIntervalMillis: MAX_SIGNED_32_BIT_INTEGER,
}
...

This configuration effectively deactivates the reap cycle in tarn.js, meaning that idle connections are never cleaned up, regardless of whether the lambda process is hot or cold. Given the operational characteristics of lambdas, server-side cleanup does not work reliably, so disabling it seems sensible. Additionally, I set the

I also implemented graceful shutdown listeners in the lambda to ensure all remaining connections are cleaned up before the process is terminated, although this step seems optional since the server would clean these up after 30 seconds anyway. More details can be found in this Stack Overflow discussion. A good option is also to look into RDS Proxy to achieve an external db connection pool.
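A minimal sketch of such a shutdown listener, assuming a pool in the module scope and a Node.js runtime that actually receives SIGTERM before shutdown (as far as I know, Lambda only delivers it when at least one extension is registered):

```js
const { Pool } = require('pg');

const pool = new Pool({ max: 1 });

// Best-effort cleanup: close any remaining idle connections before the
// execution environment is torn down. If the signal never arrives, the
// server-side timeout cleans these up anyway.
process.on('SIGTERM', async () => {
  try {
    await pool.end();
  } finally {
    process.exit(0);
  }
});
```
|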
Instead of deactivating the application-side db connection pool completely, I went with setting the database idle timeout slightly higher. The full approach is described here: https://blog.stefanwaldhauser.me/posts/lambda_db_connection_leak/
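If I read that approach correctly, the idea is to keep the pool's own reaping and give the database a slightly larger idle timeout as a safety net for connections a frozen lambda never got to close. A sketch of the two sides, with illustrative values only:

```js
const { Pool } = require('pg');

// Application side: the pool reaps idle clients while the lambda is running.
const pool = new Pool({
  max: 1,
  idleTimeoutMillis: 30000, // illustrative value
});

// Database side (PostgreSQL 14+): a slightly higher idle timeout, so the
// server only cleans up connections the pool never got a chance to close.
// One way to set it, with an illustrative role name and value:
//   ALTER ROLE app_user SET idle_session_timeout = '45s';
```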
|
Yes! I've had a hard time with this, since |
I was skeptical to write this issue since it might involve other things, not just `pg`, but I will give it a try.

So, everything works as expected on localhost: after I release a client to the Pool with `client.release()` and `idleTimeoutMillis` reaches its limit, my Pool correctly emits the `remove` event (signaling that it correctly destroyed the client and also the real connection to the Database). Also, I can use my localhost to connect to the same Database I'm using in Production (RDS Postgres) and it works as expected.

But on environments like Vercel (which uses AWS Lambdas under the hood), for some very curious reason, `idleTimeoutMillis` seems to not work as expected. It looks like after releasing a Client, it stays in the Pool as `idle` forever, not destroying the client/connection after reaching `idleTimeoutMillis` and leaving a lot of `idle` connections hanging on RDS.

To make things stranger, if I force a client to close using `client.release(true)`, the `remove` event from the Pool is emitted and the connection is closed as expected... but forcing this for every client ruins the purpose of using a Pool in the first place.

I don't know if there's some different behavior of the event loop in this kind of environment, and then the internal `pg` timers don't get run, or something like this.