Update records while streaming a big table #2768

Closed
jasperkuperus opened this issue Jun 30, 2022 · 2 comments
@jasperkuperus

I'm streaming through a big table using:

const QueryStream = require('pg-query-stream');

const stream = client.query(
  new QueryStream(`
    SELECT * FROM my_table
  `),
);

In the stream.on('data', ...) handler, I execute a query that updates that same row. My goal is to stream through the big table efficiently without using too much RAM.
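Concretely, the pattern looks roughly like this (simplified; `client` is an already-connected `pg` client, and the `processed`/`id` columns are placeholders for illustration):

```javascript
const QueryStream = require('pg-query-stream');

const stream = client.query(new QueryStream('SELECT * FROM my_table'));

stream.on('data', (row) => {
  // This UPDATE is queued on the same connection that is busy
  // streaming the SELECT, so it only runs after the stream ends.
  client.query('UPDATE my_table SET processed = true WHERE id = $1', [row.id]);
});

stream.on('end', () => {
  console.log('done streaming');
});
```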

However, when I add some logging, I clearly see that the complete stream is read first, before any of the updates are executed. And when I explicitly pause the stream to allow the updates to happen, it deadlocks.

Reading issue #660, it looks like a query is only executed after its predecessor is done. If that applies to my case, it explains what's going on. Is this assumption right?

Now, I've tried to work around this by using a second connection for the updates, and that works fine. But in my particular scenario I have to do two sequential passes over the table, and the second pass should see the updates made during the first pass. All of it should happen in a single transaction: any failure in the first or second pass should roll back the complete operation.

Any suggestions on how to best approach this?

@richardabendroth

By design, PostgreSQL does not handle multiple queries simultaneously on the same connection. Transactions also cannot be shared across multiple connections. I'm not entirely sure what you need to achieve here; you'd generally look at connection pools if you need query parallelism in your application, but I understand that in your case this is not an option.

It reads like you are distributing part of your application logic across your node application (data manipulation and validation?) and the database (data integrity management?). That requires you to transmit large amounts of data back (and forth?) for a single "function" of your overall application. If I may suggest: have a look at stored procedures or INSTEAD OF triggers in PostgreSQL. You might be able to offload most of the functionality into the database and reduce the overhead significantly. Instead of

BEGIN; 
...
UPDATE ... RETURNING *;
-- stuff in node
UPDATE ... RETURNING *;
-- stuff in node
ROLLBACK/COMMIT;

you would be looking at something like

client.query('SELECT some_stored_procedure($1)', [someArrayOrJsonObject]);

that can also return a table if needed or

client.query('INSERT INTO some_view_with_insteadof_triggers VALUES ... RETURNING *', ...);

that would defer the operation to a trigger procedure updating/checking other tables.
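A minimal sketch of the INSTEAD OF variant, assuming a hypothetical `my_table(id, amount)` and a staging view (all names here are illustrative, not from your schema):

```sql
CREATE VIEW my_table_ops AS
  SELECT id, amount FROM my_table;

CREATE FUNCTION my_table_ops_insert() RETURNS trigger AS $$
BEGIN
  -- Defer the real work to the database: update the base table
  -- (and run any related checks) in one server-side step.
  UPDATE my_table
     SET amount = my_table.amount + NEW.amount
   WHERE my_table.id = NEW.id;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER my_table_ops_ins
  INSTEAD OF INSERT ON my_table_ops
  FOR EACH ROW EXECUTE FUNCTION my_table_ops_insert();
```

With that in place, a plain `INSERT` into the view from node fires the trigger, so the per-row round trips collapse into single statements.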

Having written that: you will need to look into PL/pgSQL and how triggers, trigger procedures, stored procedures (functions), and views with INSTEAD OF triggers work.

As a general rule of thumb: if the transaction overhead (the data that needs to be transmitted back and forth for a single transaction) seems too large, the decision of where you do what within your application's "function" needs some rethinking.

@jasperkuperus
Author

@richardabendroth Thanks for your reply, that confirms my suspicions. The work I had to do required some heavy business logic calculations in node. Writing a stored procedure could have worked, but it would have duplicated my business logic and potentially introduced implementation differences between the two. In my specific case, I've loosened the requirement of doing the two passes in one transaction. With that change, using two clients, one for streaming the rows and one for updating the rows in a transaction, worked out for me.
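For anyone landing here later, the shape of what worked for me, roughly (table and column names are placeholders, and this needs a reachable database to actually run):

```javascript
const { Pool } = require('pg');
const QueryStream = require('pg-query-stream');

const pool = new Pool();

async function run() {
  const reader = await pool.connect(); // streams the rows
  const writer = await pool.connect(); // runs the updates in a transaction

  try {
    await writer.query('BEGIN');

    const stream = reader.query(new QueryStream('SELECT * FROM my_table'));
    for await (const row of stream) {
      // Heavy business logic in node, then write via the second connection.
      await writer.query(
        'UPDATE my_table SET processed = true WHERE id = $1',
        [row.id],
      );
    }

    await writer.query('COMMIT');
  } catch (err) {
    await writer.query('ROLLBACK');
    throw err;
  } finally {
    reader.release();
    writer.release();
  }
}
```

Note the caveat from above: the reader's SELECT runs against its own snapshot, so it won't see the writer's uncommitted updates. That's exactly why I dropped the single-transaction requirement for the two passes.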
