
segfault in nginx under high concurrency #21


Closed
buddy-ekb opened this issue Dec 5, 2013 · 10 comments

@buddy-ekb

I suspect there is a bug in ngx_postgres' connection cache code. While stress-testing nginx with "ab -c 20 ...", it occasionally segfaulted, so I ran it under valgrind. Here is the result:

==25581== Invalid read of size 8
==25581== at 0x406DAB: ngx_destroy_pool (ngx_palloc.c:76)
==25581== by 0x4B6457: ngx_postgres_upstream_free_connection (ngx_postgres_upstream.c:584)
==25581== by 0x4B00FE: ngx_postgres_keepalive_free_peer (ngx_postgres_keepalive.c:218)
==25581== by 0x4B60DC: ngx_postgres_upstream_free_peer (ngx_postgres_upstream.c:509)
==25581== by 0x4B6674: ngx_postgres_upstream_finalize_request (ngx_postgres_util.c:79)
==25581== by 0x4B4DCD: ngx_postgres_upstream_done (ngx_postgres_processor.c:507)
==25581== by 0x4B4D46: ngx_postgres_upstream_get_ack (ngx_postgres_processor.c:488)
==25581== by 0x4B4951: ngx_postgres_upstream_get_result (ngx_postgres_processor.c:366)
==25581== by 0x4B40F0: ngx_postgres_process_events (ngx_postgres_processor.c:76)
==25581== by 0x4AF7D2: ngx_postgres_rev_handler (ngx_postgres_handler.c:314)
==25581== by 0x46415C: ngx_http_upstream_handler (ngx_http_upstream.c:976)
==25581== by 0x437C1D: ngx_epoll_process_events (ngx_epoll_module.c:691)
==25581== by 0x42853F: ngx_process_events_and_timers (ngx_event.c:248)
==25581== by 0x4347E4: ngx_single_process_cycle (ngx_process_cycle.c:315)
==25581== by 0x403DF4: main (nginx.c:404)
==25581== Address 0x5641930 is 96 bytes inside a block of size 256 free'd
==25581== at 0x4A063F0: free (vg_replace_malloc.c:446)
==25581== by 0x406E39: ngx_destroy_pool (ngx_palloc.c:87)
==25581== by 0x44E8D2: ngx_http_close_connection (ngx_http_request.c:3489)
==25581== by 0x44E56D: ngx_http_close_request (ngx_http_request.c:3350)
==25581== by 0x44E0C1: ngx_http_lingering_close_handler (ngx_http_request.c:3209)
==25581== by 0x44DF6B: ngx_http_set_lingering_close (ngx_http_request.c:3171)
==25581== by 0x44C858: ngx_http_finalize_connection (ngx_http_request.c:2493)
==25581== by 0x44C443: ngx_http_finalize_request (ngx_http_request.c:2384)
==25581== by 0x4B6800: ngx_postgres_upstream_finalize_request (ngx_postgres_util.c:140)
==25581== by 0x4B4DCD: ngx_postgres_upstream_done (ngx_postgres_processor.c:507)
==25581== by 0x4B4D46: ngx_postgres_upstream_get_ack (ngx_postgres_processor.c:488)
==25581== by 0x4B4951: ngx_postgres_upstream_get_result (ngx_postgres_processor.c:366)
==25581== by 0x4B40F0: ngx_postgres_process_events (ngx_postgres_processor.c:76)
==25581== by 0x4AF7D2: ngx_postgres_rev_handler (ngx_postgres_handler.c:314)
==25581== by 0x46415C: ngx_http_upstream_handler (ngx_http_upstream.c:976)
==25581== by 0x437C1D: ngx_epoll_process_events (ngx_epoll_module.c:691)
==25581== by 0x42853F: ngx_process_events_and_timers (ngx_event.c:248)
==25581== by 0x4347E4: ngx_single_process_cycle (ngx_process_cycle.c:315)
==25581== by 0x403DF4: main (nginx.c:404)

I believe that ngx_postgres_keepalive_free_peer (ngx_postgres_keepalive.c:164) somehow fails to correctly isolate an nginx memory pool from the freeing that later occurs at ngx_postgres_util.c:140, which eventually leads to an attempt to read invalid data from a freed and reused memory area.
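
To make the suspected pattern concrete, here is a rough standalone sketch (plain malloc/free and made-up names, not the module's actual code): the keepalive cache keeps a pointer into memory owned by the request's pool, the pool is destroyed when the request is finalized, and a later cleanup step still reads through the stale pointer.

/* Rough standalone sketch of the suspected use-after-free pattern.
 * Names are made up; malloc/free stand in for nginx's pool allocator. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    char *saved;               /* points into memory owned by "pool" below   */
} cached_peer_t;

int main(void)
{
    char *pool = malloc(256);  /* stand-in for the request's memory pool     */
    cached_peer_t cache;

    cache.saved = pool + 96;   /* keepalive code caches a pointer into it    */

    free(pool);                /* request finalized: the pool is destroyed,  */
                               /* as at ngx_postgres_util.c:140 in the trace */

    printf("%d\n", cache.saved[0]);  /* later cleanup still reads through the
                                      * stale pointer -- the "Invalid read of
                                      * size 8" that valgrind reports above  */
    return 0;
}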

I'm happy to provide any further information to whomever it may concern.

@agentzh
Contributor

agentzh commented Dec 5, 2013

@buddy-ekb Thank you for the report! We'll look into this. BTW, what versions of nginx and ngx_postgres are you using?

@PiotrSikora
Contributor

@buddy-ekb: are you running it under Valgrind with master_process off? If so, that's a known issue (and the reason 1.0 isn't released yet).

I'd be interested in seeing a core dump from the segfaults, though.

@buddy-ekb
Author

@agentzh
The nginx version is 1.5.7; the binary is for Linux x86_64, compiled with ngx_postgres 1.0rc3 and linked against the PostgreSQL 9.3.1 client library from the PGDG RPM repository.

@PiotrSikora
I put the necessary files here: https://www.dropbox.com/sh/m19ozujak1jucup/_9U4V5JXgY

Yes, I was running nginx under valgrind with master_process off, but it also dumped the provided core without valgrind, and it had been dumping cores with master_process on as well, so the problem is very likely not limited to valgrind.
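
For context, a typical setup for these two scenarios looks roughly like the following (paths and sizes are placeholders, not my exact configuration):

# single-process run for valgrind, e.g.:
#   valgrind --tool=memcheck --leak-check=full objs/nginx -p $PWD -c conf/nginx.conf
daemon           off;
master_process   off;
worker_processes 1;
error_log        logs/error.log debug;

# for a normal (master + workers) run, to get core dumps on a segfault:
worker_rlimit_core 512m;
working_directory  /tmp/cores;   # must be writable by the worker processes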

@agentzh
Contributor

agentzh commented Dec 10, 2013

@buddy-ekb Thank you for the info! I'll try to look into this issue in the coming days :)

@PiotrSikora
Contributor

@buddy-ekb: the core dump you provided is from nginx running with master_process off; could you please provide one from nginx running normally? I tend to think the core dumps in production are different from this one... Thanks!

@buddy-ekb
Author

@PiotrSikora here you are https://www.dropbox.com/sh/krmghdszj28bkkv/vc1uCpMsIu

A few remarks on them:

  • All the cores were dumped with master_process on;
  • The only difference between core.19220 and core.19339 is that daemon is on in the latter case;
  • core.19361 was dumped after signalling the master process to quit (the same master as for core.19339);
  • I had to rebuild the nginx binary from source because I had made several additions to ngx_postgres-1.0rc3/src/ngx_postgres_upstream.c (I tried to insert some logging in the hope of catching the bug quickly, but had no luck). The line numbers are now in sync with the vanilla sources of nginx and ngx_postgres.

agentzh added a commit that referenced this issue Aug 1, 2014
…g logging was enabled in the nginx build. thanks buddy-ekb for the report in #21.
@agentzh
Contributor

agentzh commented Aug 1, 2014

@buddy-ekb Sorry for the delay on my side! I've just committed a fix for your problem to git master. Will you try it out on your side? Thank you!

@buddy-ekb
Author

With the patch the stress testing has been passing without a hitch. Thank you very much, @agentzh !

@agentzh
Contributor

agentzh commented Aug 5, 2014

@buddy-ekb BTW, I've noticed that you've enabled debug logging via the --with-debug option (and this segfault only happened with debug logging enabled). If you are serious about performance, it's better to disable this in your nginx (or OpenResty) build.

@buddy-ekb
Author

@agentzh, thank you for the remark!

Though I consider this performance penalty negligible compared to the cost of transaction handling inside PostgreSQL, even for a simple "SELECT 1". As a matter of fact, I ran some benchmarks on localhost (debug vs. no debug, IPv4 vs. Unix sockets, IPv4 vs. IPv6), and it seems to me there is no significant difference between these options.
