Assertion failed: (e->txn->signature >= 0), function txn_limbo_read_confirm #6057
Reproducer:
-- Needs a debug build: error injections are not compiled into release builds.
fiber = require('fiber')
box.cfg{
    replication_synchro_quorum = 1,
    replication_synchro_timeout = 1000000,
}
-- One synchronous space and one plain asynchronous space.
s = box.schema.create_space('test', {is_sync = true})
_ = s:create_index('pk')
s2 = box.schema.create_space('test2')
_ = s2:create_index('pk')
box.error.injection.set("ERRINJ_WAL_DELAY_COUNTDOWN", 1)
lsn = box.info.lsn
-- Start a sync transaction and wait until its row reaches the WAL.
f = fiber.create(function() s:replace{1} end)
while box.info.lsn ~= lsn + 1 do fiber.sleep(0.1) end
-- Start an async transaction while the sync one is still unconfirmed.
f2 = fiber.create(function() s2:replace{2} end)
box.error.injection.set("ERRINJ_WAL_DELAY_COUNTDOWN", 0)
box.error.injection.set("ERRINJ_WAL_DELAY", false)
Gerold103 added a commit that referenced this issue on May 27, 2021
It is possible that a new async transaction is added to the limbo while a CONFIRM WAL write for all the pending sync transactions is in progress. Then, when the CONFIRM WAL write is done, it might see that the first entry in the limbo is now an async transaction not yet written to WAL. This is a suspicious situation: on one hand the async transaction does not have any blocking sync txns before it and can be considered complete; on the other hand its WAL write is not done, so it is not complete.

Before this patch it resulted in a crash: the limbo didn't consider the situation possible at all. Now, when CONFIRM covers not yet written async transactions, they are removed from the limbo and turned into plain transactions. When their WAL write is done, they see they no longer have the TXN_WAIT_SYNC flag and don't even need to interact with the limbo.

It is important to remove them from the limbo right when the CONFIRM is done, because otherwise their limbo entry might never be removed when this happens on a replica. On a replica the limbo entries are removed only by CONFIRM/ROLLBACK/PROMOTE. If there were an async transaction in the first position in the limbo queue, it wouldn't be deleted until the next sync transaction appears.

This replica case is not possible now though, because all synchro entries on the applier are written in a blocking way. Nonetheless, if it ever becomes non-blocking, the code should handle it correctly.

Closes #6057
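The behavior described in this commit message can be illustrated with a toy model in plain Lua. This is only a minimal sketch, not Tarantool's C implementation: `new_limbo`, `limbo_append`, `limbo_read_confirm`, `on_wal_write` and the `wait_sync` / `done` fields are hypothetical names. Here `wait_sync` stands in for the TXN_WAIT_SYNC flag, and a `nil` `lsn` stands in for a transaction whose WAL write has not finished yet.

```lua
-- Toy model of the limbo queue. All names are illustrative assumptions,
-- not Tarantool's internal API.
local function new_limbo()
    return { queue = {} }
end

-- Queue a transaction. While it sits in the limbo it carries wait_sync,
-- which plays the role of the TXN_WAIT_SYNC flag in this model.
local function limbo_append(limbo, txn)
    txn.wait_sync = true
    table.insert(limbo.queue, txn)
end

-- Handle a finished CONFIRM that covers sync transactions up to confirm_lsn.
local function limbo_read_confirm(limbo, confirm_lsn)
    while #limbo.queue > 0 do
        local txn = limbo.queue[1]
        -- Stop at the first sync transaction this CONFIRM does not cover.
        if txn.is_sync and (txn.lsn == nil or txn.lsn > confirm_lsn) then
            break
        end
        table.remove(limbo.queue, 1)
        txn.wait_sync = false
        if txn.lsn ~= nil then
            -- Already written to WAL: the transaction is complete.
            txn.done = true
        end
        -- Otherwise this is a not yet written async transaction. It has
        -- left the limbo and will complete once its own WAL write finishes.
    end
end

-- Called when the WAL write of a transaction finishes.
local function on_wal_write(txn, lsn)
    txn.lsn = lsn
    if not txn.wait_sync then
        -- A plain transaction: nothing to do with the limbo.
        txn.done = true
    end
end

-- Scenario from the issue: an async transaction is queued while the
-- CONFIRM for the preceding sync transaction is still being written.
local limbo = new_limbo()
local sync_txn = { is_sync = true }
local async_txn = { is_sync = false }
limbo_append(limbo, sync_txn)
on_wal_write(sync_txn, 1)
limbo_append(limbo, async_txn)
limbo_read_confirm(limbo, 1)
assert(#limbo.queue == 0 and not async_txn.done)
on_wal_write(async_txn, 2)
assert(async_txn.done)
```

In this model the pre-patch crash corresponds to treating every confirmed entry as already written. The sketch instead detaches the unwritten async entry and lets `on_wal_write` complete it later.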
Gerold103 added a commit that referenced this issue on May 27, 2021 (same commit message as above)
Gerold103 added a commit that referenced this issue on May 28, 2021 (same commit message as above)
Gerold103 added a commit that referenced this issue on Jun 1, 2021 (same commit message as above, cherry picked from commit 2a0a56c)
Gerold103 added a commit that referenced this issue on Jun 1, 2021 (same commit message as above, cherry picked from commit 2a0a56c)
After some actions I got:
The problem reproduces 10 out of 10 times on macOS and 1 out of 10 times on Linux with the same logic.
I can't extract a reproducer, since the problem appears on one TDG dev branch under Tarantool Enterprise 2.8.1. Ask me for additional details if the stack trace above is not enough.