Skip to content

Fix double-sync crash on postgres 9.x #2367

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Oct 8, 2020

Conversation

brianc
Copy link
Owner

@brianc brianc commented Oct 7, 2020

An attempt to fix #1105

It looks like in rare, random occasions postgres 9.x is sending both ErrorResponse and CommandComplete when a query times out. In an attempt to fix this, I have changed the query to only send a sync on an error message if it hasn't already sent a sync. Judging from the docs this should never happen, and indeed on newer versions of postgres the bug is not reproducible...but lots of folks still run postgres 9.x in production. I'm not super fond of this fix as it feels like it's working around a possible race condition in postgres 9.x so I'd love some additional eyes & ideas on this one. Also, want to run the test suite and make sure everything passes in the whole test matrix on travis.

One option is to put this behind some environment variable like PGSKIPDOUBLESYNC or something and allow people to opt in to this change. I'm worried about a (according to the docs, not possible) circumstance where we receive two messages requiring a sync and we only send one sync back.

@brianc
Copy link
Owner Author

brianc commented Oct 7, 2020

@charmander @sehrope any thoughts here? All tests are passing which is 👍 but it feels like I'm working around an issue or possible race condition w/in older versions of postgres itself?

@charmander
Copy link
Collaborator

@brianc Did you build a tool that could reproduce this pretty reliably in order to narrow it down to 9.x, or does the evidence come from somewhere else? Maybe we can narrow it down to a specific version and find the change in PostgreSQL to confirm the fix.

@sehrope
Copy link
Contributor

sehrope commented Oct 8, 2020

I'm able to reproduce this error (without the fix) on 9.3 through 10 (not just 9.x). I can't reproduce it on v11, v12, or v13 after a bunch of repeated runs. It also seems less frequent on v10 vs earlier versions so I'm guessing it is some kind of server race condition to cause the difference. Hoping to have some more time to look into it later this week.

@sehrope
Copy link
Contributor

sehrope commented Oct 8, 2020

Oh and can confirm that @brianc's fix removes the issue on those older versions. At least in my repeated testing.

Still feels weird working around a potential server FEBE bug. If there's truly a protocol break in older versions might make more sense to scrap the entire connection as who knows what else could be out of sync. Thrashing connection pools is never fun but I worry either this or the fix would lead to an error in some other location too.

@brianc
Copy link
Owner Author

brianc commented Oct 8, 2020

Still feels weird working around a potential server FEBE bug. If there's truly a protocol break in older versions might make more sense to scrap the entire connection as who knows what else could be out of sync.

Yeah I'm asking in #postgres on IRC to see if anyone on the postgres development team knows more about this. This isn't a bad idea, i'll look at doing this as well.

@brianc
Copy link
Owner Author

brianc commented Oct 8, 2020

Okay here is the matrix of test runs which fail without the patch applied.

@brianc
Copy link
Owner Author

brianc commented Oct 8, 2020

If there's truly a protocol break in older versions might make more sense to scrap the entire connection as who knows what else could be out of sync.

It's not trivial to trash a client and error out any pending and in-flight queries and stop processing other incoming messages w/ the current architecture. I think this might risk introducing more places things could go wrong and unexpected errors or behaviors, but I'll fiddle with it. I am thinking one "safer" (less invasive) fix could be to have an environment variable like PGSKIPDOUBLESYNC=true or something & folks can opt-in to the fix?

@brianc
Copy link
Owner Author

brianc commented Oct 8, 2020

okay...so w/ some help from @sehrope and a postgres maintainer from IRC I've changed the approach somewhat. When discussing the maintainer (I believe their github name is @RhodiumToad) suggested to pipeline the sync message at the end of the other pipelined messages. Sending sync immediately after execute allows the backend to process all the messages and only respond with either CommandComplete or ErrorResponse. This fixes the issue in the test w/o requiring any hacks or discarding messages. It also speeds up some of my crude benchmarks by over 10% because we save 1 network round trip which is a super awesome bonus side effect. This may have a larger speedup on slower networks, and could potentially address the problem in #2097 and #1774 (not sure though)

@brianc brianc merged commit d8681fc into master Oct 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TypeError: Cannot read property 'name' of null
4 participants