Fix double-sync crash on postgres 9.x #2367

brianc · 2020-10-07T19:55:02Z

An attempt to fix #1105

It looks like on rare, seemingly random occasions postgres 9.x sends both ErrorResponse and CommandComplete when a query times out. In an attempt to fix this, I have changed the query to only send a sync on an error message if it hasn't already sent a sync. Judging from the docs this should never happen, and indeed on newer versions of postgres the bug is not reproducible... but lots of folks still run postgres 9.x in production. I'm not super fond of this fix, as it feels like it's working around a possible race condition in postgres 9.x, so I'd love some additional eyes & ideas on this one. Also, I want to run the test suite and make sure everything passes across the whole test matrix on travis.

One option is to put this behind an environment variable like PGSKIPDOUBLESYNC or something and allow people to opt in to this change. I'm worried about a circumstance (not possible, according to the docs) where we receive two messages requiring a sync and only send one sync back.
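For readers following along, here is a minimal sketch of the "only send a sync if one hasn't already been sent" guard described above. It is not the code from this PR: `FakeConnection`, `GuardedQuery`, `syncSent`, and `sendSyncOnce` are hypothetical names, and it assumes both the error handler and the command-complete handler would otherwise each send a sync.

```javascript
// Hypothetical stand-in for a connection; it only records what was sent.
class FakeConnection {
  constructor() {
    this.sent = [];
  }
  sync() {
    this.sent.push('sync');
  }
}

// Hypothetical query object illustrating the guard, not the pg internals.
class GuardedQuery {
  constructor() {
    this.syncSent = false; // assumption: one flag per query
  }
  sendSyncOnce(connection) {
    if (this.syncSent) return; // a sync already went out for this query
    this.syncSent = true;
    connection.sync();
  }
  handleError(err, connection) {
    this.sendSyncOnce(connection);
  }
  handleCommandComplete(msg, connection) {
    this.sendSyncOnce(connection);
  }
}

// Simulate the server sending both messages for a single query.
const connection = new FakeConnection();
const query = new GuardedQuery();
query.handleError(new Error('statement timeout'), connection);
query.handleCommandComplete({ command: 'SELECT' }, connection);
// connection.sent is now ['sync'] rather than ['sync', 'sync']
```

The guard works regardless of which of the two messages arrives first, which matters if the ordering is the result of a race.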

brianc · 2020-10-07T21:58:51Z

@charmander @sehrope any thoughts here? All tests are passing which is 👍 but it feels like I'm working around an issue or possible race condition w/in older versions of postgres itself?

charmander · 2020-10-07T23:26:00Z

@brianc Did you build a tool that could reproduce this pretty reliably in order to narrow it down to 9.x, or does the evidence come from somewhere else? Maybe we can narrow it down to a specific version and find the change in PostgreSQL to confirm the fix.

sehrope · 2020-10-08T15:10:06Z

I'm able to reproduce this error (without the fix) on 9.3 through 10 (not just 9.x). I can't reproduce it on v11, v12, or v13 after a bunch of repeated runs. It also seems less frequent on v10 than on earlier versions, so I'm guessing some kind of server race condition is causing the difference. Hoping to have some more time to look into it later this week.

sehrope · 2020-10-08T15:12:24Z

Oh and can confirm that @brianc's fix removes the issue on those older versions. At least in my repeated testing.

Still feels weird working around a potential server FEBE bug. If there's truly a protocol break in older versions might make more sense to scrap the entire connection as who knows what else could be out of sync. Thrashing connection pools is never fun but I worry either this or the fix would lead to an error in some other location too.

packages/pg/test/integration/gh-issues/1105-tests.js

brianc · 2020-10-08T15:15:14Z

> Still feels weird working around a potential server FEBE bug. If there's truly a protocol break in older versions might make more sense to scrap the entire connection as who knows what else could be out of sync.

Yeah, I'm asking in #postgres on IRC to see if anyone on the postgres development team knows more about this. Scrapping the connection isn't a bad idea; I'll look at doing that as well.

brianc · 2020-10-08T15:32:37Z

Okay here is the matrix of test runs which fail without the patch applied.

brianc · 2020-10-08T15:37:06Z

> If there's truly a protocol break in older versions might make more sense to scrap the entire connection as who knows what else could be out of sync.

It's not trivial to trash a client, error out any pending and in-flight queries, and stop processing other incoming messages w/ the current architecture. I think this might risk introducing more places where things could go wrong, plus unexpected errors or behaviors, but I'll fiddle with it. I am thinking one "safer" (less invasive) fix could be to have an environment variable like PGSKIPDOUBLESYNC=true or something, so folks can opt in to the fix?
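A minimal sketch of what such an opt-in check might look like. The variable name PGSKIPDOUBLESYNC is only the one floated in this thread, and `shouldSkipDoubleSync` is a hypothetical helper, not something that exists in pg.

```javascript
// Hypothetical helper: returns true only when the user explicitly opts in.
// The env object is a parameter so the check can be exercised without
// mutating process.env.
function shouldSkipDoubleSync(env = process.env) {
  return env.PGSKIPDOUBLESYNC === 'true';
}

// Usage: default behavior is unchanged unless the variable is set to "true".
const optedIn = shouldSkipDoubleSync({ PGSKIPDOUBLESYNC: 'true' });
const notOptedIn = shouldSkipDoubleSync({});
```

Comparing against the exact string 'true' keeps values like an empty string or '0' from enabling the workaround by accident.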

brianc · 2020-10-08T18:51:18Z

Okay... so w/ some help from @sehrope and a postgres maintainer on IRC I've changed the approach somewhat. During the discussion the maintainer (I believe their github name is @RhodiumToad) suggested pipelining the sync message at the end of the other pipelined messages. Sending sync immediately after execute allows the backend to process all the messages and respond with only one of CommandComplete or ErrorResponse. This fixes the issue in the test w/o requiring any hacks or discarding messages. It also speeds up some of my crude benchmarks by over 10% because we save one network round trip, which is a super awesome bonus side effect. This may give a larger speedup on slower networks, and could potentially address the problems in #2097 and #1774 (not sure though).
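Roughly, the pipelined ordering described above can be sketched like this. It is an illustration under assumptions, not the merged code: `RecordingConnection` and `sendPipelined` are hypothetical names, and the stand-in connection only records message names instead of writing to a socket.

```javascript
// Hypothetical stand-in for a connection; it only records message names.
class RecordingConnection {
  constructor() {
    this.sent = [];
  }
  parse() { this.sent.push('Parse'); }
  bind() { this.sent.push('Bind'); }
  describe() { this.sent.push('Describe'); }
  execute() { this.sent.push('Execute'); }
  sync() { this.sent.push('Sync'); }
}

// Send the whole extended-query sequence up front, with Sync right after
// Execute, instead of waiting for a response before sending Sync.
function sendPipelined(connection, text, values) {
  connection.parse({ text });
  connection.bind({ values });
  connection.describe({ type: 'P' });
  connection.execute({ rows: 0 });
  connection.sync();
}

// With Sync already pipelined, the response handlers send nothing further.
function handleCommandComplete(connection) {}
function handleError(connection) {}

const pipelined = new RecordingConnection();
sendPipelined(pipelined, 'SELECT pg_sleep($1)', [1]);
handleError(pipelined);
handleCommandComplete(pipelined);
// pipelined.sent is ['Parse', 'Bind', 'Describe', 'Execute', 'Sync']
```

Because exactly one Sync goes out per query no matter which responses come back, the double-sync case cannot arise on the client side, and the client no longer waits a round trip before sending Sync.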

packages/pg/lib/query.js

Fix double-sync crash on postgres 9.x

0bc0d9a

brianc mentioned this pull request Oct 7, 2020

TypeError: Cannot read property 'name' of null #1105

Closed

Remove fix to fail tests

e006b03

sehrope reviewed Oct 8, 2020

packages/pg/test/integration/gh-issues/1105-tests.js (outdated)

Apply fix

73b940a

brianc added 3 commits October 8, 2020 10:39

Update comments

7f06bc0

Change when sync is sent during pipelining

27637c1

Fixes based on postgres maintainer advice

ba231d5

Comments & cleanup

ab995bb

brianc merged commit d8681fc into master Oct 8, 2020

junaid1460 reviewed Mar 2, 2021

packages/pg/lib/query.js
