Assertion failed: (e->txn->signature >= 0), function txn_limbo_read_confirm #6057

Closed
olegrok opened this issue Apr 29, 2021 · 1 comment
olegrok commented Apr 29, 2021

After some actions I got:

Assertion failed: (e->txn->signature >= 0), function txn_limbo_read_confirm, file /Users/a.barulev/Projects/sdk/tarantool-2.8/src/box/txn_limbo.c, line 413.
Process 1881 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = hit program assert
    frame #4: 0x0000000100118d5c tarantool`txn_limbo_read_confirm + 476
tarantool`txn_limbo_read_confirm:
->  0x100118d5c <+476>: jmp    0x100118d61               ; <+481>
    0x100118d61 <+481>: movq   -0x18(%rbp), %rax
    0x100118d65 <+485>: movq   0x10(%rax), %rdi
    0x100118d69 <+489>: callq  0x1001126e0               ; txn_complete_success
Target 0: (tarantool) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = hit program assert
    frame #0: 0x00007fff202e2946 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff20311615 libsystem_pthread.dylib`pthread_kill + 263
    frame #2: 0x00007fff20266411 libsystem_c.dylib`abort + 120
    frame #3: 0x00007fff202657e8 libsystem_c.dylib`__assert_rtn + 314
  * frame #4: 0x0000000100118d5c tarantool`txn_limbo_read_confirm + 476
    frame #5: 0x00000001001189c5 tarantool`txn_limbo_ack + 645
    frame #6: 0x0000000100114096 tarantool`txn_commit + 614
    frame #7: 0x000000010011e904 tarantool`box_process_rw + 420
    frame #8: 0x00000001001269a5 tarantool`box_process1 + 389
    frame #9: 0x00000001001274f2 tarantool`box_replace + 114
    frame #10: 0x0000000100222ba8 tarantool`lbox_replace + 152
    frame #11: 0x00000001002bbd83 tarantool`lj_BC_FUNCC + 68
    frame #12: 0x00000001002c417a tarantool`lua_pcall + 490
    frame #13: 0x000000010024ba33 tarantool`luaT_call + 35
    frame #14: 0x00000001002447e6 tarantool`lua_fiber_run_f + 118
    frame #15: 0x0000000100003cca tarantool`fiber_cxx_invoke(int (*)(__va_list_tag*), __va_list_tag*) + 26
    frame #16: 0x000000010027236b tarantool`fiber_loop + 187
    frame #17: 0x0000000100bafc37 tarantool`coro_init + 87
(lldb) list

The problem reproduces 10 out of 10 times on macOS and 1 out of 10 times on Linux with the same logic.
I can't extract a reproducer, since the problem appears on a TDG dev branch under Tarantool Enterprise 2.8.1. Ask me for additional details if the stack trace above is not enough.

olegrok added the crash and bug labels Apr 29, 2021
kyukhin added this to the 2.7.3 milestone Apr 29, 2021
@Gerold103

Reproducer:

fiber = require('fiber')
box.cfg{
    replication_synchro_quorum = 1,
    replication_synchro_timeout = 1000000,
}
s = box.schema.create_space('test', {is_sync = true})
_ = s:create_index('pk')
s2 = box.schema.create_space('test2')
_ = s2:create_index('pk')

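-- Let one WAL write pass, then delay the following ones.
box.error.injection.set("ERRINJ_WAL_DELAY_COUNTDOWN", 1)
lsn = box.info.lsn
-- Sync transaction: its CONFIRM write is the one that gets delayed.
f = fiber.create(function() s:replace{1} end)
-- Wait until the sync transaction's row is written to WAL.
while box.info.lsn ~= lsn + 1 do fiber.sleep(0.1) end

-- Async transaction is added to the limbo while CONFIRM is in progress.
f2 = fiber.create(function() s2:replace{2} end)
box.error.injection.set("ERRINJ_WAL_DELAY_COUNTDOWN", 0)
-- Release the delayed WAL write.
box.error.injection.set("ERRINJ_WAL_DELAY", false)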

Gerold103 self-assigned this May 27, 2021
Gerold103 added a commit that referenced this issue May 27, 2021
It is possible that a new async transaction is added to the limbo
when there is an in-progress CONFIRM WAL write for all the pending
sync transactions.

Then, when the CONFIRM WAL write is done, it might see that the
first entry in the limbo is now an async transaction not yet
written to WAL. This is a suspicious situation: on one hand, the
async transaction has no blocking sync txns before it and can be
considered complete; on the other hand, its WAL write is not
done, so it is not complete.

Before this patch it resulted in a crash: the limbo didn't
consider the situation possible at all.

Now, when CONFIRM covers not yet written async transactions, they
are removed from the limbo and turned into plain transactions.

When their WAL write is done, they see that they no longer have
the TXN_WAIT_SYNC flag and don't even need to interact with the
limbo.
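
The behaviour described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the actual Tarantool patch: the `toy_*` names, the struct layout, and the flag values are all hypothetical, and only the idea (a negative `signature` means the WAL write is not finished, and such a transaction is detached from the limbo instead of being completed) is taken from the assertion text and the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flags, loosely modelled on the names in the commit message. */
enum {
	TOY_TXN_WAIT_SYNC = 1 << 0, /* the txn is waiting in the limbo */
	TOY_TXN_WAIT_ACK  = 1 << 1, /* the txn is synchronous */
};

struct toy_txn {
	int flags;
	long signature;   /* negative until the WAL write is done */
	bool in_limbo;
	bool is_complete;
};

/*
 * Apply a CONFIRM to the first `count` limbo entries.
 * An entry whose WAL write is still in flight is not completed here;
 * it is only detached from the limbo and loses its limbo flags.
 */
static void
toy_limbo_read_confirm(struct toy_txn **queue, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		struct toy_txn *txn = queue[i];
		txn->in_limbo = false;
		txn->flags &= ~(TOY_TXN_WAIT_SYNC | TOY_TXN_WAIT_ACK);
		if (txn->signature < 0)
			continue; /* becomes a plain txn, finishes on WAL write */
		txn->is_complete = true;
	}
}

/*
 * WAL write completion. A txn without the limbo flag completes right
 * away; a txn still in the limbo keeps waiting for CONFIRM.
 */
static void
toy_txn_on_wal_write(struct toy_txn *txn, long lsn)
{
	assert(lsn >= 0);
	txn->signature = lsn;
	if ((txn->flags & TOY_TXN_WAIT_SYNC) == 0)
		txn->is_complete = true;
}
```

In this model, confirming a queue that holds a written sync transaction followed by an unwritten async one completes the first and merely detaches the second; the second then completes in `toy_txn_on_wal_write` without touching the limbo.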

It is important to remove them from the limbo right when the
CONFIRM is done. Otherwise their limbo entry might never be
removed when this happens on a replica. On a replica, limbo
entries are removed only by CONFIRM/ROLLBACK/PROMOTE. If an async
transaction were in the first position in the limbo queue, it
wouldn't be deleted until the next sync transaction appears.

This replica case is not possible now, though, because all synchro
entries on the applier are written in a blocking way. Nonetheless,
if it ever becomes non-blocking, the code should handle it
correctly.

Closes #6057
Gerold103 added a commit that referenced this issue May 27, 2021
Gerold103 added a commit that referenced this issue May 28, 2021
Gerold103 added a commit that referenced this issue Jun 1, 2021
(cherry picked from commit 2a0a56c)
Gerold103 added a commit that referenced this issue Jun 1, 2021
(cherry picked from commit 2a0a56c)