-
-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Fix double-sync crash on postgres 9.x #2367
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@charmander @sehrope any thoughts here? All tests are passing which is 👍 but it feels like I'm working around an issue or possible race condition w/in older versions of postgres itself? |
@brianc Did you build a tool that could reproduce this pretty reliably in order to narrow it down to 9.x, or does the evidence come from somewhere else? Maybe we can narrow it down to a specific version and find the change in PostgreSQL to confirm the fix. |
I'm able to reproduce this error (without the fix) on 9.3 through 10 (not just 9.x). I can't reproduce it on v11, v12, or v13 after a bunch of repeated runs. It also seems less frequent on v10 vs earlier versions so I'm guessing it is some kind of server race condition to cause the difference. Hoping to have some more time to look into it later this week. |
Oh and can confirm that @brianc's fix removes the issue on those older versions. At least in my repeated testing. Still feels weird working around a potential server FEBE bug. If there's truly a protocol break in older versions might make more sense to scrap the entire connection as who knows what else could be out of sync. Thrashing connection pools is never fun but I worry either this or the fix would lead to an error in some other location too. |
Yeah I'm asking in #postgres on IRC to see if anyone on the postgres development team knows more about this. This isn't a bad idea, i'll look at doing this as well. |
Okay here is the matrix of test runs which fail without the patch applied. |
It's not trivial to trash a client and error out any pending and in-flight queries and stop processing other incoming messages w/ the current architecture. I think this might risk introducing more places things could go wrong and unexpected errors or behaviors, but I'll fiddle with it. I am thinking one "safer" (less invasive) fix could be to have an environment variable like |
okay...so w/ some help from @sehrope and a postgres maintainer from IRC I've changed the approach somewhat. When discussing the maintainer (I believe their github name is @RhodiumToad) suggested to pipeline the |
An attempt to fix #1105
It looks like in rare, random occasions postgres 9.x is sending both
ErrorResponse
andCommandComplete
when a query times out. In an attempt to fix this, I have changed the query to only send async
on an error message if it hasn't already sent a sync. Judging from the docs this should never happen, and indeed on newer versions of postgres the bug is not reproducible...but lots of folks still run postgres 9.x in production. I'm not super fond of this fix as it feels like it's working around a possible race condition in postgres 9.x so I'd love some additional eyes & ideas on this one. Also, want to run the test suite and make sure everything passes in the whole test matrix on travis.One option is to put this behind some environment variable like
PGSKIPDOUBLESYNC
or something and allow people to opt in to this change. I'm worried about a (according to the docs, not possible) circumstance where we receive two messages requiring a sync and we only send one sync back.