Description
Tarantool version:
| Tarantool 2.7.0-82-gad1aa096b
| Target: Darwin-x86_64-Debug
| Build options: cmake . -DCMAKE_INSTALL_PREFIX=/usr/local -DENABLE_BACKTRACE=ON
| Compiler: /Library/Developer/CommandLineTools/usr/bin/cc /Library/Developer/CommandLineTools/usr/bin/c++
| C_FLAGS: -Wno-unknown-pragmas -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -msse2 -std=c11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-gnu-alignof-expression -Werror
| CXX_FLAGS: -Wno-unknown-pragmas -fexceptions -funwind-tables -fno-omit-frame-pointer -fno-stack-protector -fno-common -msse2 -std=c++11 -Wall -Wextra -Wno-strict-aliasing -Wno-char-subscripts -Wno-invalid-offsetof -Wno-gnu-alignof-expression -Werror
OS version:
OSX *
Bug description:
On heavily loaded OSX hosts, stress testing with replicas may fail while dropping cluster replicas, because test-run stops them using only the SIGTERM signal.
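The failure mode described above (a process that does not exit on SIGTERM) is usually handled by escalating to SIGKILL after a grace period. The following is a minimal sketch of that pattern using only the Python standard library. It is not test-run's actual code: the function name `stop_with_escalation`, the `timeout` parameter, and the SIGTERM-ignoring child used as a stand-in for a stuck replica are all assumptions made for illustration.

```python
import subprocess
import sys


def stop_with_escalation(proc, timeout=5.0):
    """Send SIGTERM, wait up to `timeout` seconds, then fall back to SIGKILL.

    Hypothetical helper; returns the name of the last signal requested.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=timeout)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX, cannot be caught or ignored
        proc.wait()  # reap the child so it does not remain a zombie
        return "SIGKILL"


if __name__ == "__main__":
    # Stand-in for a stuck replica (assumption): a child that ignores SIGTERM.
    child_code = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    child = subprocess.Popen(
        [sys.executable, "-c", child_code], stdout=subprocess.PIPE
    )
    child.stdout.readline()  # wait until the child has installed its handler
    print(stop_with_escalation(child, timeout=1.0))
    child.stdout.close()
```

The grace period keeps the normal shutdown path (SIGTERM) intact for healthy instances and only forces termination for those that fail to exit in time.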
Steps to reproduce:
- If the patch from PR tarantool/test-run#244 (Add test_timeout to limit test run time) is already committed, then set --test-timeout 1000 --no-output-timeout=40, where --test-timeout must be greater than --no-output-timeout.
- If the patch from PR tarantool/test-run#186 (Add the 'timeout' option into 'stop server' command and kill a server…) is already committed, then revert it.
- Run the stress test in a loop until the issue occurs:
( ( c=0; while ./test-run.py --builddir ~/tnt -j 20 `for r in {1..40} ; do echo replication/box_set_replication_stress ; done` --force --long --test-timeout 1000 --no-output-timeout=40; do date; c=$(($c+1)); echo RUN ITERATIONS: $c; done ; echo FAILED ON ITERATION: $c ) | tee a.log; echo HANGED:; grep "Test hung" a.log | wc -l )
- This reproduces the issue and prints output like:
No output during 60 seconds. Will abort after 40 seconds without output. List of workers not reporting the status:
- 003_replication [replication/box_set_replication_stress.test.lua, memtx] at var/003_replication/box_set_replication_stress.result:36
- 005_replication [replication/box_set_replication_stress.test.lua, memtx] at var/005_replication/box_set_replication_stress.result:36
- 012_replication [replication/box_set_replication_stress.test.lua, memtx] at var/012_replication/box_set_replication_stress.result:36
- 016_replication [replication/box_set_replication_stress.test.lua, memtx] at var/016_replication/box_set_replication_stress.result:36
Test hung! Result content mismatch:
--- replication/box_set_replication_stress.result Tue Nov 24 03:15:41 2020
+++ var/003_replication/box_set_replication_stress.result Sat Nov 28 10:37:34 2020
@@ -34,5 +34,3 @@
-- Cleanup.
test_run:drop_cluster(SERVERS)
- | ---
- | ...
Test hung! Result content mismatch:
--- replication/box_set_replication_stress.result Tue Nov 24 03:15:41 2020
+++ var/005_replication/box_set_replication_stress.result Sat Nov 28 10:36:55 2020
@@ -34,5 +34,3 @@
-- Cleanup.
test_run:drop_cluster(SERVERS)
- | ---
- | ...
Test hung! Result content mismatch:
--- replication/box_set_replication_stress.result Tue Nov 24 03:15:41 2020
+++ var/012_replication/box_set_replication_stress.result Sat Nov 28 10:37:14 2020
@@ -34,5 +34,3 @@
-- Cleanup.
test_run:drop_cluster(SERVERS)
- | ---
- | ...
Test hung! Result content mismatch:
--- replication/box_set_replication_stress.result Tue Nov 24 03:15:41 2020
+++ var/016_replication/box_set_replication_stress.result Sat Nov 28 10:37:12 2020
@@ -34,5 +34,3 @@
-- Cleanup.
test_run:drop_cluster(SERVERS)
- | ---
- | ...
[Main process] No output from workers. It seems that we hang. Send SIGKILL to workers; exiting...
---------------------------------------------------------------------------------
Statistics:
* pass: 76
FAILED ON ITERATION: 1
HANGED:
4
- Find the leftover instances, for example here from the 4 hung tests:
tntmac02:test tntmac02.tarantool.i$ ps
PID TTY TIME CMD
42749 ttys000 0:02.27 -bash
58879 ttys000 0:04.52 tarantool master.lua <running>
58881 ttys000 0:04.30 tarantool master.lua <running>
58934 ttys000 0:04.49 tarantool master.lua <running>
58941 ttys000 0:04.68 tarantool master.lua <running>
59266 ttys000 0:02.25 tarantool master_quorum1.lua <running>
59311 ttys000 0:05.46 tarantool master_quorum2.lua <running>
59448 ttys000 0:02.12 tarantool master_quorum1.lua <running>
59461 ttys000 0:02.10 tarantool master_quorum1.lua <running>
59559 ttys000 0:04.63 tarantool master_quorum2.lua <running>
59593 ttys000 0:05.19 tarantool master_quorum2.lua <running>
59935 ttys000 0:02.11 tarantool master_quorum1.lua <running>
59961 ttys000 0:04.90 tarantool master_quorum2.lua <running>
36478 ttys001 0:00.21 -bash
- Try to kill all leftover instances with SIGTERM; the 1st replica of each test (master_quorum1.lua) is not terminated:
tntmac02:test tntmac02.tarantool.i$ kill -15 58879 58881 58934 58941 59266 59311 59448 59461 59559 59593 59935 59961
tntmac02:test tntmac02.tarantool.i$ ps
PID TTY TIME CMD
42749 ttys000 0:02.28 -bash
59266 ttys000 0:02.26 tarantool master_quorum1.lua <running>
59448 ttys000 0:02.13 tarantool master_quorum1.lua <running>
59461 ttys000 0:02.11 tarantool master_quorum1.lua <running>
59935 ttys000 0:02.12 tarantool master_quorum1.lua <running>
36478 ttys001 0:00.21 -bash
- Check the log artifacts after all previous steps:
Optional (but very desirable):
- coredump
- backtrace
- netstat
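When triaging a log such as a.log from the reproduction steps, it can help to list which workers were stuck, not only count the "Test hung" lines. The sketch below does both. The line layout it parses is inferred solely from the sample output in this report and is not a documented test-run format; `summarize_hangs` and `WORKER_RE` are hypothetical names.

```python
import re

# Assumed layout, taken from the sample output in this report:
# - 003_replication [replication/x.test.lua, memtx] at var/003_replication/x.result:36
WORKER_RE = re.compile(
    r"^- (?P<worker>\S+) \[(?P<test>[^,\]]+), (?P<engine>[^\]]+)\]"
    r" at (?P<path>\S+):(?P<line>\d+)$"
)


def summarize_hangs(log_text):
    """Return (count of lines containing 'Test hung', list of worker records)."""
    hung = 0
    workers = []
    for raw in log_text.splitlines():
        line = raw.strip()
        if "Test hung" in line:  # same criterion as: grep "Test hung" | wc -l
            hung += 1
        match = WORKER_RE.match(line)
        if match:
            record = match.groupdict()
            record["line"] = int(record["line"])
            workers.append(record)
    return hung, workers


if __name__ == "__main__":
    sample = (
        "- 003_replication [replication/box_set_replication_stress.test.lua, memtx]"
        " at var/003_replication/box_set_replication_stress.result:36\n"
        "Test hung! Result content mismatch:\n"
    )
    count, stuck = summarize_hangs(sample)
    print(count, [w["worker"] for w in stuck])
```

In the output shown in this report, every stuck worker stops at result line 36, which is the `test_run:drop_cluster(SERVERS)` cleanup step visible in the diffs.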