Introduce a mechanism to refine expected behavior of regression tests by solver. #7292


Closed
jimgrundy opened this issue Nov 2, 2022 · 9 comments
Labels: aws, aws-high, bug

Comments

@jimgrundy
Collaborator

This PR introduced some new tests for unions in structs: #7278
Those tests pass with some solver configurations (built-in) but not others (Z3).
We have a mechanism to indicate which tests are expected to pass and which to fail, but we don't have a mechanism to indicate expected pass/fail status by solver.

The lack of such a mechanism is a hindrance for SMT back-end development because it discourages checking in tests that don't pass with all solvers. If we do check in such tests, they will fail in some configurations, and every subsequent PR author is doomed to investigate the failure and establish that it isn't related to their change. Or, worse, people stop paying attention to failures that don't block check-in.

A mechanism needs to be added to handle this elegantly. If that can't be done soon then this change should be backed out until it is.

@feliperodri added the bug and aws labels on Nov 2, 2022
@tautschnig
Collaborator

We have a mechanism to indicate which tests are expected to pass and which to fail, but we don't have a mechanism to indicate expected pass/fail status by solver.

I am not sure I can agree with this statement: don't the tags that we have:

  • "broken-smt-backend" -- broken for all solvers
  • "broken-cprover-smt-backend" -- broken just with the built-in solver
  • "broken-z3-smt-backend" -- broken with Z3, but not the built-in solver

do the trick? What else are we missing?
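For context, a minimal sketch of where these tags live: each regression test directory contains a test.desc file whose first line gives the test level (CORE, THOROUGH, FUTURE or KNOWNBUG) followed by optional labels such as the ones above. The test below is hypothetical; the layout follows the format described in the regression README.

```
CORE broken-z3-smt-backend
main.c
--smt2
^EXIT=0$
^SIGNAL=0$
^VERIFICATION SUCCESSFUL$
--
^warning: ignoring
```

Read as: a core test that every configuration runs, except that a Z3-backed run is expected to fail it, so a Z3-specific job can filter it out by label.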

tautschnig added a commit to tautschnig/cbmc that referenced this issue Nov 3, 2022
The existing tags permit more precise tracking to pinpoint where the issues actually are.

Fixes: diffblue#7292
@TGWDB
Contributor

TGWDB commented Nov 3, 2022

We have a mechanism to indicate which tests are expected to pass and which to fail, but we don't have a mechanism to indicate expected pass/fail status by solver.

I am not sure I can agree with this statement: don't the tags that we have:

  • "broken-smt-backend" -- broken for all solvers
  • "broken-cprover-smt-backend" -- broken just with the built-in solver
  • "broken-z3-smt-backend" -- broken with Z3, but not the built-in solver

do the trick? What else are we missing?

We also have CI with cvc, so that's one more solver for the potential list.

That said, this arose from issues with the KNOWNBUG run that does not (I'm pretty sure?) test all the solvers we have in CI.

Just how fine-grained do we need to be in CI here? We seem to (as per the quoted message) have pretty good fine-tuning, but it is not always understood or used.

@jimgrundy
Collaborator Author

So, after discussion with @tautschnig I think what I want is the following:

  • These particular tests to be tagged, using the set of tags we have, to indicate accurately in which configurations they are expected to pass and in which they are expected to fail (see the sketch after this list for how a solver-specific CI job would consume such tags).
  • We conduct a review of the tests in the "core" set (run on PRs) and tag them all as "required" (you can't merge your PR unless the outcome matches the expectations indicated by the tags), perhaps with a couple of well-motivated exceptions (csmith, coverage). This will force the next person who checks in a test to tag it accurately.
  • We write some documentation about what tags are available for tests and what they mean, and place it somewhere prominent (it should get rendered into a developer section of our web site) so that a person who encounters this knows what they need to do.
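To make the first point concrete, here is a hedged sketch of how a Z3-specific CI job could consume such tags. The -C (core) and -X (exclude label) flags are as documented for regression/test.pl; the exact command line of any real CI job is an assumption, not a quote from the workflow files.

```sh
# Run the core regression tests against the SMT2/Z3 back-end, skipping
# tests labelled as broken for all SMT solvers or for Z3 specifically.
./test.pl -c "cbmc --smt2" -C \
  -X broken-smt-backend \
  -X broken-z3-smt-backend
```

The built-in solver job would instead exclude broken-smt-backend and broken-cprover-smt-backend, so each configuration runs exactly the tests it is expected to pass.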

@jimgrundy
Collaborator Author

Bumping this to "high" after discussion this morning to get it assigned to the right person/people. We may need @tautschnig to do the documentation, but perhaps Diffblue can handle the other aspects?

@tautschnig
Collaborator

tautschnig commented Nov 15, 2022

* These particular tests to be tagged, using the set of tags we have, to indicate accurately in which configurations they are expected to pass and in which they are expected to fail.

This was completed in #7290.

* We conduct a review of the tests in the "core" set (run on PRs) and tag them all as "required" (you can't merge your PR unless the outcome matches the expectations indicated by the tags), perhaps with a couple of well-motivated exceptions (csmith, coverage). This will force the next person who checks in a test to tag it accurately.

This requires admin permissions for the repository. @TGWDB could you please co-ordinate this activity?

* We write some documentation about what tags are available for tests and what they mean, and place it somewhere prominent (it should get rendered into a developer section of our web site) so that a person who encounters this knows what they need to do.

It turns out we already have this information in regression/readme.md. We need to figure out which file people might actually look at, and I'd lean towards just renaming this file to README.md so that it doesn't get lost as easily among all the directories. Update: done in #7347.

@NlightNFotis
Contributor

@tautschnig Is there something outstanding here, or are we good to close this ticket?

@tautschnig
Collaborator

@tautschnig Is there something outstanding here, or are we good to close this ticket?

I think the piece left open is:

We conduct a review of the tests in the "core" set (run on PRs) and tag them all as "required" (you can't merge your PR unless the outcome matches the expectations indicated by the tags), perhaps with a couple of well-motivated exceptions (csmith, coverage). This will force the next person who checks in a test to tag it accurately.

This requires someone with admin permissions to change which of the CI jobs are marked "Required."
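For reference, a minimal sketch of what that admin-only change looks like from the command line, assuming the GitHub CLI and classic branch protection on the develop branch. Note that this endpoint replaces the list of required contexts wholesale, so the complete set of required jobs would need to be supplied:

```sh
# Mark CI jobs as required status checks on develop (admin rights needed).
# PATCH replaces the existing contexts, so list every required job here.
gh api --method PATCH \
  repos/diffblue/cbmc/branches/develop/protection/required_status_checks \
  -F strict=true \
  -f "contexts[]=check-ubuntu-20_04-cmake-gcc-KNOWNBUG" \
  -f "contexts[]=check-ubuntu-20_04-make-clang-smt-z3"
```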

@NlightNFotis
Contributor

@tautschnig We have now updated the list of CI jobs that must succeed before a PR can be merged; the following extra jobs are now required:

  • compileXen
  • compileLinux
  • check-ubuntu-20_04-cmake-gcc-KNOWNBUG
  • check-ubuntu-20_04-cmake-gcc-THOROUGH
  • check-ubuntu-20_04-make-clang-smt-z3
  • check-doxygen
  • windows-msi-package
  • ubuntu-18_04-package
  • check-docker-image

These are in addition to the ones we marked as required before our last meeting.

Is there anything we're missing here, and if not, can we mark this as now done?

@tautschnig
Collaborator

Thank you @NlightNFotis! I think "Publish CBMC documentation / publish" could also be marked as such, but otherwise I think we're done here. Feel free to close!

@TGWDB closed this as completed on Dec 15, 2022