
segfault in nginx under high concurrency #21


Closed
buddy-ekb opened this issue Dec 5, 2013 · 10 comments

@buddy-ekb

I suspect there is a bug in ngx_postgres' connection cache code. While stress-testing nginx with "ab -c 20 ...", it occasionally segfaulted, so I ran it under valgrind. Here is the result:

==25581== Invalid read of size 8
==25581== at 0x406DAB: ngx_destroy_pool (ngx_palloc.c:76)
==25581== by 0x4B6457: ngx_postgres_upstream_free_connection (ngx_postgres_upstream.c:584)
==25581== by 0x4B00FE: ngx_postgres_keepalive_free_peer (ngx_postgres_keepalive.c:218)
==25581== by 0x4B60DC: ngx_postgres_upstream_free_peer (ngx_postgres_upstream.c:509)
==25581== by 0x4B6674: ngx_postgres_upstream_finalize_request (ngx_postgres_util.c:79)
==25581== by 0x4B4DCD: ngx_postgres_upstream_done (ngx_postgres_processor.c:507)
==25581== by 0x4B4D46: ngx_postgres_upstream_get_ack (ngx_postgres_processor.c:488)
==25581== by 0x4B4951: ngx_postgres_upstream_get_result (ngx_postgres_processor.c:366)
==25581== by 0x4B40F0: ngx_postgres_process_events (ngx_postgres_processor.c:76)
==25581== by 0x4AF7D2: ngx_postgres_rev_handler (ngx_postgres_handler.c:314)
==25581== by 0x46415C: ngx_http_upstream_handler (ngx_http_upstream.c:976)
==25581== by 0x437C1D: ngx_epoll_process_events (ngx_epoll_module.c:691)
==25581== by 0x42853F: ngx_process_events_and_timers (ngx_event.c:248)
==25581== by 0x4347E4: ngx_single_process_cycle (ngx_process_cycle.c:315)
==25581== by 0x403DF4: main (nginx.c:404)
==25581== Address 0x5641930 is 96 bytes inside a block of size 256 free'd
==25581== at 0x4A063F0: free (vg_replace_malloc.c:446)
==25581== by 0x406E39: ngx_destroy_pool (ngx_palloc.c:87)
==25581== by 0x44E8D2: ngx_http_close_connection (ngx_http_request.c:3489)
==25581== by 0x44E56D: ngx_http_close_request (ngx_http_request.c:3350)
==25581== by 0x44E0C1: ngx_http_lingering_close_handler (ngx_http_request.c:3209)
==25581== by 0x44DF6B: ngx_http_set_lingering_close (ngx_http_request.c:3171)
==25581== by 0x44C858: ngx_http_finalize_connection (ngx_http_request.c:2493)
==25581== by 0x44C443: ngx_http_finalize_request (ngx_http_request.c:2384)
==25581== by 0x4B6800: ngx_postgres_upstream_finalize_request (ngx_postgres_util.c:140)
==25581== by 0x4B4DCD: ngx_postgres_upstream_done (ngx_postgres_processor.c:507)
==25581== by 0x4B4D46: ngx_postgres_upstream_get_ack (ngx_postgres_processor.c:488)
==25581== by 0x4B4951: ngx_postgres_upstream_get_result (ngx_postgres_processor.c:366)
==25581== by 0x4B40F0: ngx_postgres_process_events (ngx_postgres_processor.c:76)
==25581== by 0x4AF7D2: ngx_postgres_rev_handler (ngx_postgres_handler.c:314)
==25581== by 0x46415C: ngx_http_upstream_handler (ngx_http_upstream.c:976)
==25581== by 0x437C1D: ngx_epoll_process_events (ngx_epoll_module.c:691)
==25581== by 0x42853F: ngx_process_events_and_timers (ngx_event.c:248)
==25581== by 0x4347E4: ngx_single_process_cycle (ngx_process_cycle.c:315)
==25581== by 0x403DF4: main (nginx.c:404)

I believe that ngx_postgres_keepalive_free_peer (ngx_postgres_keepalive.c:164) somehow fails to correctly isolate an nginx memory pool from the freeing that later occurs at ngx_postgres_util.c:140, which eventually leads to an attempt to read invalid data from a freed and reused memory area.
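
To make the suspected pattern concrete, here is a rough standalone sketch (plain malloc/free and made-up names, not the module's actual code): the keepalive cache keeps a pointer into memory owned by the request's pool, the pool is destroyed when the request is finalized, and a later cleanup step still reads through the stale pointer.

/* Rough standalone sketch of the suspected use-after-free pattern.
 * Names are made up; malloc/free stand in for nginx's pool allocator. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    char *saved;               /* points into memory owned by "pool" below   */
} cached_peer_t;

int main(void)
{
    char *pool = malloc(256);  /* stand-in for the request's memory pool     */
    cached_peer_t cache;

    cache.saved = pool + 96;   /* keepalive code caches a pointer into it    */

    free(pool);                /* request finalized: the pool is destroyed,  */
                               /* as at ngx_postgres_util.c:140 in the trace */

    printf("%d\n", cache.saved[0]);  /* later cleanup still reads through the
                                      * stale pointer -- the "Invalid read of
                                      * size 8" that valgrind reports above  */
    return 0;
}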

I'm happy to provide any further information to whomever it may concern.

@agentzh
Contributor

agentzh commented Dec 5, 2013

@buddy-ekb Thank you for the report! We'll look into this. BTW, what versions of nginx and ngx_postgres are you using?

@PiotrSikora
Contributor

@buddy-ekb: are you running it under Valgrind with master_process off? If so, that's a known issue (and the reason 1.0 isn't released yet).

I'd be interested in seeing a core dump from the segfaults, though.

@buddy-ekb
Author

@agentzh
The nginx version is 1.5.7; the binary is for Linux x86_64, compiled with ngx_postgres 1.0rc3 and linked against the PostgreSQL 9.3.1 client library from the PGDG RPM repository.

@PiotrSikora
I put the necessary files here: https://www.dropbox.com/sh/m19ozujak1jucup/_9U4V5JXgY

Yes, I was running nginx under valgrind with master_process off, but it also dumped the provided core without valgrind, and it had been dumping cores with master_process on as well, so the problem is very likely not limited to valgrind.
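
For context, a typical setup for these two scenarios looks roughly like the following (paths and sizes are placeholders, not my exact configuration):

# single-process run for valgrind, e.g.:
#   valgrind --tool=memcheck --leak-check=full objs/nginx -p $PWD -c conf/nginx.conf
daemon           off;
master_process   off;
worker_processes 1;
error_log        logs/error.log debug;

# for a normal (master + workers) run, to get core dumps on a segfault:
worker_rlimit_core 512m;
working_directory  /tmp/cores;   # must be writable by the worker processes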

@agentzh
Contributor

agentzh commented Dec 10, 2013

@buddy-ekb Thank you for the info! I'll try to look into this issue in the coming days :)

@PiotrSikora
Contributor

@buddy-ekb: the core dump you provided is from nginx running with master_process off; could you please provide one from nginx running normally? I tend to think the core dumps in production are different from this one... Thanks!

@buddy-ekb
Author

@PiotrSikora here you are https://www.dropbox.com/sh/krmghdszj28bkkv/vc1uCpMsIu

A few remarks on them:

  • All the cores were dumped with master_process on;
  • The only difference between core.19220 and core.19339 is that daemon is on in the latter case;
  • core.19361 was dumped after signalling the master process to quit (the same master as for core.19339);
  • I had to rebuild the nginx binary from source because I had made several additions to ngx_postgres-1.0rc3/src/ngx_postgres_upstream.c (I tried to insert some logging in the hope of catching the bug quickly, but had no luck). The line numbers are now in sync with the vanilla sources of nginx and ngx_postgres.

agentzh added a commit that referenced this issue Aug 1, 2014
…g logging was enabled in the nginx build. thanks buddy-ekb for the report in #21.
@agentzh
Contributor

agentzh commented Aug 1, 2014

@buddy-ekb Sorry for the delay on my side! I've just committed a fix for your problem to git master. Will you try it out on your side? Thank you!

@buddy-ekb
Author

With the patch the stress testing has been passing without a hitch. Thank you very much, @agentzh !

@agentzh
Contributor

agentzh commented Aug 5, 2014

@buddy-ekb BTW, I've noticed that you've enabled debug logging via the --with-debug option (and this segfault only happened with debug logging enabled). If you are serious about performance, it's better to disable this in your nginx (or OpenResty) build.

@buddy-ekb
Author

@agentzh, thank you for the remark!

Though I consider this performance penalty negligible compared to the cost of transaction handling inside PostgreSQL, even for a simple "SELECT 1". As a matter of fact, I ran some benchmarks on localhost (debug vs. no debug, IPv4 vs. Unix sockets, IPv4 vs. IPv6), and it seems to me there is no significant difference between these options.
