
test: flaky issue that replicas can't be terminated with SIGTERM on OSX #32

Open
@avtikhon

Description

Tarantool version:
| Tarantool 2.7.0-82-gad1aa096b
| Target: Darwin-x86_64-Debug
| Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr/local -DENABLE_BACKTRACE=ON
| Compiler: /Library/Developer/CommandLineTools/usr/bin/cc /Library/Developer/CommandLineTools/usr/bin/c++
| C_FLAGS: -Wno-unknown-pragmas -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -msse2 -std=c11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-gnu-alignof-expression -Werror
| CXX_FLAGS: -Wno-unknown-pragmas -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -msse2 -std=c++11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-invalid-offsetof -Wno-gnu-alignof-expression -Werror

OS version:
OSX *

Bug description:
On heavily loaded OSX hosts, stress testing with replicas may fail while dropping cluster replicas, because test-run stops instances using only the SIGTERM signal.
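The mechanism behind the failure can be sketched in a few lines of Python (a hypothetical illustration, not test-run's actual code): a process that ignores SIGTERM survives `terminate()` and only dies once the caller escalates to SIGKILL, which cannot be caught or ignored on POSIX systems.

```python
import subprocess
import sys
import time

# Stand-in for a stuck replica: a child that ignores SIGTERM (POSIX only).
child_src = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "time.sleep(60)\n"
)

proc = subprocess.Popen([sys.executable, "-c", child_src])
time.sleep(0.5)  # give the child time to install its handler

proc.terminate()  # sends SIGTERM -- ignored by the child
try:
    proc.wait(timeout=2)
    result = "terminated by SIGTERM"
except subprocess.TimeoutExpired:
    proc.kill()   # sends SIGKILL, which cannot be ignored
    proc.wait()
    result = "had to SIGKILL"

print(result)
```

This terminate-then-kill escalation is the shape of the fix proposed in test-run#186, which step 2 below temporarily reverts to reproduce the bug.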

Steps to reproduce:

  1. If the patch from PR tarantool/test-run#244 ("Add test_timeout to limit test run time") is already committed, set --test-timeout 1000 --no-output-timeout=40, where --test-timeout must be greater than --no-output-timeout.

  2. If the patch from PR tarantool/test-run#186 ("Add the 'timeout' option into 'stop server' command and kill a server…") is already committed, revert it.

  3. Run the stress test in a loop until the issue occurs:

( ( c=0
    while ./test-run.py --builddir ~/tnt -j 20 \
          `for r in {1..40} ; do echo replication/box_set_replication_stress ; done` \
          --force --long --test-timeout 1000 --no-output-timeout=40
    do
        date; c=$(($c+1)); echo RUN ITERATIONS: $c
    done
    echo FAILED ON ITERATION: $c ) | tee a.log
  echo HANGED:; grep "Test hung" a.log | wc -l )
  4. Eventually the issue occurs and the output looks like:
No output during 60 seconds. Will abort after 40 seconds without output. List of workers not reporting the status:
- 003_replication [replication/box_set_replication_stress.test.lua, memtx] at var/003_replication/box_set_replication_stress.result:36
- 005_replication [replication/box_set_replication_stress.test.lua, memtx] at var/005_replication/box_set_replication_stress.result:36
- 012_replication [replication/box_set_replication_stress.test.lua, memtx] at var/012_replication/box_set_replication_stress.result:36
- 016_replication [replication/box_set_replication_stress.test.lua, memtx] at var/016_replication/box_set_replication_stress.result:36
Test hung! Result content mismatch:
--- replication/box_set_replication_stress.result	Tue Nov 24 03:15:41 2020
+++ var/003_replication/box_set_replication_stress.result	Sat Nov 28 10:37:34 2020
@@ -34,5 +34,3 @@
 
 -- Cleanup.
 test_run:drop_cluster(SERVERS)
- | ---
- | ...
Test hung! Result content mismatch:
--- replication/box_set_replication_stress.result	Tue Nov 24 03:15:41 2020
+++ var/005_replication/box_set_replication_stress.result	Sat Nov 28 10:36:55 2020
@@ -34,5 +34,3 @@
 
 -- Cleanup.
 test_run:drop_cluster(SERVERS)
- | ---
- | ...
Test hung! Result content mismatch:
--- replication/box_set_replication_stress.result	Tue Nov 24 03:15:41 2020
+++ var/012_replication/box_set_replication_stress.result	Sat Nov 28 10:37:14 2020
@@ -34,5 +34,3 @@
 
 -- Cleanup.
 test_run:drop_cluster(SERVERS)
- | ---
- | ...
Test hung! Result content mismatch:
--- replication/box_set_replication_stress.result	Tue Nov 24 03:15:41 2020
+++ var/016_replication/box_set_replication_stress.result	Sat Nov 28 10:37:12 2020
@@ -34,5 +34,3 @@
 
 -- Cleanup.
 test_run:drop_cluster(SERVERS)
- | ---
- | ...

[Main process] No output from workers. It seems that we hang. Send SIGKILL to workers; exiting...
---------------------------------------------------------------------------------
Statistics:
* pass: 76
FAILED ON ITERATION: 1
HANGED:
       4

  5. Find the leftover instances; here, left behind by 4 tests:
tntmac02:test tntmac02.tarantool.i$ ps
  PID TTY           TIME CMD
42749 ttys000    0:02.27 -bash
58879 ttys000    0:04.52 tarantool master.lua <running>   
58881 ttys000    0:04.30 tarantool master.lua <running>   
58934 ttys000    0:04.49 tarantool master.lua <running>   
58941 ttys000    0:04.68 tarantool master.lua <running>   
59266 ttys000    0:02.25 tarantool master_quorum1.lua <running>   
59311 ttys000    0:05.46 tarantool master_quorum2.lua <running>   
59448 ttys000    0:02.12 tarantool master_quorum1.lua <running>   
59461 ttys000    0:02.10 tarantool master_quorum1.lua <running>   
59559 ttys000    0:04.63 tarantool master_quorum2.lua <running>   
59593 ttys000    0:05.19 tarantool master_quorum2.lua <running>   
59935 ttys000    0:02.11 tarantool master_quorum1.lua <running>   
59961 ttys000    0:04.90 tarantool master_quorum2.lua <running>   
36478 ttys001    0:00.21 -bash
  6. Try to kill all leftover instances with SIGTERM; the first replicas (master_quorum1.lua) are not terminated:
tntmac02:test tntmac02.tarantool.i$ kill -15 58879 58881 58934 58941 59266 59311 59448 59461 59559 59593 59935 59961
tntmac02:test tntmac02.tarantool.i$ ps
  PID TTY           TIME CMD
42749 ttys000    0:02.28 -bash
59266 ttys000    0:02.26 tarantool master_quorum1.lua <running>   
59448 ttys000    0:02.13 tarantool master_quorum1.lua <running>   
59461 ttys000    0:02.11 tarantool master_quorum1.lua <running>   
59935 ttys000    0:02.12 tarantool master_quorum1.lua <running>   
36478 ttys001    0:00.21 -bash
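A generic manual workaround (not part of test-run) is to escalate to SIGKILL for anything SIGTERM leaves behind. A sketch, using a stand-in process that traps SIGTERM instead of a real tarantool instance:

```shell
# Stand-in for a stuck replica: a child that traps (ignores) SIGTERM.
bash -c 'trap "" TERM; sleep 300' &
pid=$!
sleep 1                     # let the child install its trap

kill -15 "$pid"             # SIGTERM, as test-run sends -- ignored here
sleep 1
if kill -0 "$pid" 2>/dev/null; then
    kill -9 "$pid"          # SIGKILL cannot be trapped or ignored
    echo "escalated to SIGKILL"
fi
wait "$pid" 2>/dev/null || true
```

Against the real leftover instances above, the same escalation would be `kill -9` on the PIDs that survive step 6.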
  7. Check the log artifacts collected after the previous steps:

artifacts.zip

Optional (but very desirable):

  • coredump
  • backtrace
  • netstat
