Assertion failed: (e->txn->signature >= 0), function txn_limbo_read_confirm #6057

olegrok · 2021-04-29T10:55:38Z

After some actions I got:

Assertion failed: (e->txn->signature >= 0), function txn_limbo_read_confirm, file /Users/a.barulev/Projects/sdk/tarantool-2.8/src/box/txn_limbo.c, line 413.
Process 1881 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = hit program assert
    frame #4: 0x0000000100118d5c tarantool`txn_limbo_read_confirm + 476
tarantool`txn_limbo_read_confirm:
->  0x100118d5c <+476>: jmp    0x100118d61               ; <+481>
    0x100118d61 <+481>: movq   -0x18(%rbp), %rax
    0x100118d65 <+485>: movq   0x10(%rax), %rdi
    0x100118d69 <+489>: callq  0x1001126e0               ; txn_complete_success
Target 0: (tarantool) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = hit program assert
    frame #0: 0x00007fff202e2946 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff20311615 libsystem_pthread.dylib`pthread_kill + 263
    frame #2: 0x00007fff20266411 libsystem_c.dylib`abort + 120
    frame #3: 0x00007fff202657e8 libsystem_c.dylib`__assert_rtn + 314
  * frame #4: 0x0000000100118d5c tarantool`txn_limbo_read_confirm + 476
    frame #5: 0x00000001001189c5 tarantool`txn_limbo_ack + 645
    frame #6: 0x0000000100114096 tarantool`txn_commit + 614
    frame #7: 0x000000010011e904 tarantool`box_process_rw + 420
    frame #8: 0x00000001001269a5 tarantool`box_process1 + 389
    frame #9: 0x00000001001274f2 tarantool`box_replace + 114
    frame #10: 0x0000000100222ba8 tarantool`lbox_replace + 152
    frame #11: 0x00000001002bbd83 tarantool`lj_BC_FUNCC + 68
    frame #12: 0x00000001002c417a tarantool`lua_pcall + 490
    frame #13: 0x000000010024ba33 tarantool`luaT_call + 35
    frame #14: 0x00000001002447e6 tarantool`lua_fiber_run_f + 118
    frame #15: 0x0000000100003cca tarantool`fiber_cxx_invoke(int (*)(__va_list_tag*), __va_list_tag*) + 26
    frame #16: 0x000000010027236b tarantool`fiber_loop + 187
    frame #17: 0x0000000100bafc37 tarantool`coro_init + 87
(lldb) list

The problem reproduces 10 out of 10 times on macOS and 1 out of 10 times on Linux with the same logic.
I can't extract a reproducer because the problem appears on one TDG dev branch under Tarantool Enterprise 2.8.1. Ask me for additional details if the stack trace above is not enough.

Gerold103 · 2021-05-27T19:48:19Z

Reproducer:

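fiber = require('fiber')
box.cfg{
    replication_synchro_quorum = 1,
    replication_synchro_timeout = 1000000,
}
-- One synchronous space and one plain (asynchronous) space.
s = box.schema.create_space('test', {is_sync = true})
_ = s:create_index('pk')
s2 = box.schema.create_space('test2')
_ = s2:create_index('pk')

box.error.injection.set("ERRINJ_WAL_DELAY_COUNTDOWN", 1)
lsn = box.info.lsn
f = fiber.create(function() s:replace{1} end)
-- Wait until the sync transaction itself is written to WAL.
-- Note: `not a == b` parses as `(not a) == b` in Lua, so use `~=`.
while box.info.lsn ~= lsn + 1 do fiber.sleep(0.1) end

-- Add an async transaction to the limbo while the CONFIRM write is pending.
f2 = fiber.create(function() s2:replace{2} end)
-- Release the WAL so the pending writes can finish.
box.error.injection.set("ERRINJ_WAL_DELAY_COUNTDOWN", 0)
box.error.injection.set("ERRINJ_WAL_DELAY", false)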

A new async transaction can be added to the limbo while a CONFIRM WAL write is in progress for all the pending sync transactions. When that CONFIRM WAL write finishes, it may find that the first entry in the limbo is now an async transaction not yet written to WAL.

This is a suspicious situation: on one hand, the async transaction has no blocking sync transactions before it and can be considered complete; on the other hand, its WAL write is not done, so it is not complete. Before this patch it resulted in a crash, because the limbo didn't consider the situation possible at all.

Now, when CONFIRM covers not yet written async transactions, they are removed from the limbo and turned into plain transactions. When their WAL write is done, they see that they no longer have the TXN_WAIT_SYNC flag and don't need to interact with the limbo at all.

It is important to remove them from the limbo right when the CONFIRM is done, because otherwise their limbo entry might never be removed when this happens on a replica. On a replica, limbo entries are removed only by CONFIRM/ROLLBACK/PROMOTE. If an async transaction were first in the limbo queue, it wouldn't be deleted until the next sync transaction appeared.

This replica case is not possible now, though, because all synchro entries on the applier are written in a blocking way. Nonetheless, if that ever becomes non-blocking, the code should handle it correctly.

Closes #6057
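The behavior described in the commit message can be sketched with a toy model. This is not Tarantool's code: `toy_txn`, `toy_limbo`, and every function below are hypothetical names invented for illustration, the queue is a fixed array instead of the real list, and acks, rollbacks, and fibers are left out. It only assumes, as the assertion text suggests, that a transaction's `signature` stays negative until its WAL write is done.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: the names below are hypothetical, not Tarantool's structs. */
enum { TOY_SIGNATURE_UNKNOWN = -1, TOY_LIMBO_CAPACITY = 16 };

struct toy_txn {
	bool is_sync;     /* Needs a confirmation before it may commit. */
	bool wait_sync;   /* Stands in for the TXN_WAIT_SYNC flag. */
	long signature;   /* Negative until the WAL write is done, then the LSN. */
	bool is_complete;
};

struct toy_limbo {
	struct toy_txn *queue[TOY_LIMBO_CAPACITY];
	size_t head; /* Index of the oldest entry still in the limbo. */
	size_t tail; /* Index one past the newest entry. */
};

static bool
toy_limbo_is_empty(const struct toy_limbo *limbo)
{
	return limbo->head == limbo->tail;
}

static void
toy_limbo_append(struct toy_limbo *limbo, struct toy_txn *txn)
{
	assert(limbo->tail < TOY_LIMBO_CAPACITY);
	txn->wait_sync = true;
	limbo->queue[limbo->tail++] = txn;
}

/* Apply a CONFIRM that covers every sync transaction with LSN <= lsn. */
static void
toy_limbo_read_confirm(struct toy_limbo *limbo, long lsn)
{
	while (!toy_limbo_is_empty(limbo)) {
		struct toy_txn *txn = limbo->queue[limbo->head];
		/* Stop at the first sync transaction the CONFIRM does not cover. */
		if (txn->is_sync && (txn->signature < 0 || txn->signature > lsn))
			break;
		limbo->head++;
		txn->wait_sync = false;
		/*
		 * An async transaction may not be in the WAL yet. Putting
		 * assert(txn->signature >= 0) here would model the crash.
		 * Instead the entry leaves the limbo as a plain transaction
		 * and completes later, on its own WAL write.
		 */
		if (txn->signature >= 0)
			txn->is_complete = true;
	}
}

/* Called when the transaction's own WAL write finishes with the given LSN. */
static void
toy_txn_on_wal_write(struct toy_txn *txn, long lsn)
{
	txn->signature = lsn;
	/* A plain transaction completes here without touching the limbo. */
	if (!txn->wait_sync)
		txn->is_complete = true;
}
```

Replaying the reproducer against this model: append a sync transaction and finish its WAL write, append an async transaction while the CONFIRM is still being written, then call `toy_limbo_read_confirm`. The sync transaction completes, the async one leaves the limbo with `wait_sync` cleared, and it completes only when `toy_txn_on_wal_write` is called for it.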


(cherry picked from commit 2a0a56c)

olegrok added crash bug labels Apr 29, 2021

kyukhin added this to the 2.7.3 milestone Apr 29, 2021

kyukhin added 5sp qsync replication labels Apr 29, 2021

Gerold103 self-assigned this May 27, 2021

Gerold103 closed this as completed in 2a0a56c Jun 1, 2021

olegrok mentioned this issue Jul 30, 2021

Run TDG test suite with enabled mvcc and synchro before release #6275

Closed
