\author Martin Brain, Peter Schrammel, Johannes Kloos

The purpose of this section is to explain several key concepts used throughout
the CPROVER framework on a high level, ignoring the details of the actual
implementation. In particular, we will discuss different ways to represent
programs in memory, three important analysis methods and some commonly used terms.

In general, it is easy to go from representations that are
close to the source code to representations that focus on specific
semantic aspects of the program.

The representations that the CPROVER framework uses mirror those
used in modern compilers such as LLVM and gcc. We will point out those
places where the CPROVER framework does things differently, attempting to give
rationales wherever possible.

One in-depth resource for most of this section is the classic
[compiler construction textbook](https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools)
by Aho, Lam, Sethi and Ullman.

To illustrate the different concepts, we will consider a small example
program. While the program is in C, the general ideas apply to other
languages as well.

The first step in representing a program in memory is to parse the
program, at the same time checking for syntax errors, and store the
parsing result in memory.

The key data structure that stores the result of this step is known as an
**Abstract Syntax Tree**, or **AST** for short
(cf. [Wikipedia](https://en.wikipedia.org/wiki/Abstract_syntax_tree)).
ASTs are still relatively
close to the source code, and represent the structure of the source code
while abstracting away from irrelevant details, e.g., dropping
parentheses, semicolons and braces as long as those are only used for
grouping.

Consider, for example, the declaration of
`atoi`. This gives rise to a subtree modeling the fact that we have a function
`atoi` with return type `int` and an unnamed argument of type `const char *`.
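Written out as C source, such a declaration is simply the standard prototype
(reproduced here for convenience):

```
int atoi(const char *);
```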
We can represent this using a tree that has, for instance, the
following structure (this is a simplified version of the tree that the CPROVER framework
uses internally):

\dot "AST for the atoi declaration"

In general, for analyses based around Abstract Interpretation (see below), it
is usually preferable to use a CFG representation, while other analyses,
such as variable scope detection, may be easier to perform on ASTs.

The general idea is to present the program as a graph. The nodes of the graph
are instructions or sequences of instructions. In general, the nodes are
**basic blocks**: A basic block is a sequence of statements that is always
executed linearly from beginning to end. The edges of the graph describe how
the program execution may move from one basic block to the next.
Note that a single statement always forms a basic block on its own; this is
the representation used inside the CPROVER framework. In the examples below,
we try to use maximal basic blocks (i.e., basic blocks that are as large as
possible); this can be advantageous for some analyses.
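As a quick, hypothetical illustration (independent of the factorial example
discussed next), consider the following function; the comments mark where
basic blocks begin and end:

```
#include <stdio.h>

int clamp(int v, int limit)
{
  int result = v;       /* first basic block: runs linearly ...        */
  if(result > limit)    /* ... up to and including this test           */
    result = limit;     /* second basic block: only on the true branch */
  return result;        /* third basic block: both paths join here     */
}

int main(void)
{
  printf("%d\n", clamp(15, 10)); /* prints 10 */
  return 0;
}
```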

Let us consider the factorial function as an example. As a reminder,
here is the code:

In the CPROVER framework, we provide a precise implementation of &Phi;, using
explicitly tracked information about which branches were taken by the program.
There are also some differences in how loops are handled (finite unrolling in
CPROVER, versus a &Phi;-based approach in compilers); this approach will be
discussed in a later chapter.

For the time being, let us come back to `factorial`. We can now give
an SSA using &Phi; functions:

The SSA is an extremely helpful representation when one
wishes to perform model checking on the program (see next section),
since it is much easier to extract the logic formulas used in this
technique from an SSA compared to a CFG (or, even worse, an AST). That
being said, the CPROVER framework takes a different route, opting to convert to
an intermediate representation known as GOTO programs instead.

\section analysis_techniques_section Analysis techniques

\subsection BMC_section Bounded model checking

One of the most important analysis techniques provided by the CPROVER framework,
implemented in the CBMC (and JBMC) tools, is **bounded model checking**,
a specific instance of a method known as
[model checking](https://en.wikipedia.org/wiki/Model_checking).

The basic question that model checking tries to answer is: Given some system
(in our case, a program) and some property, can we find an execution of the
system such that it reaches a state where the property holds?
If yes, we would like to know how the program reaches this state - at the very
least, we want to see what inputs are required, but in general, we would prefer
having a **trace**, which shows what statements are executed and in which order.

As it turns out, model checking for programs is, in general, a hard problem.
Part of the reason for this is that many model checking algorithms strive for
a form of ''completeness'' where they either find a trace or return a proof
that such a trace cannot possibly exist.

Since we are interested in generating test cases, we prefer a different
approach: It may be that a certain target state is reachable only after
a very long execution, or not at all, but this information does not help
us in constructing test cases. For this reason, we introduce an
**execution bound** that describes how deep we go when analyzing a program.

Model checking techniques using such execution bounds are known as
**bounded model checking**; they will return either a trace, or a statement
that says ''the target state could not be reached in n steps'', for a
given bound n. Thus, for a given bound, we always get an
**underapproximation** of all states that can be reached: We
can certainly find those reachable within the given bound, but we may
miss states that can be reached only with more steps. Conversely, we will
never claim that a state is unreachable within a certain bound if there is,
in fact, a way of reaching it within that bound.

The bounded model checking techniques used by the CPROVER framework are
based on *symbolic model checking*, a family of model checking techniques that
work on sets of program states and use advanced tools such as SAT solvers (more
on that below) to calculate the set of reachable states. The key step here
is to encode both the program and the set of states using an appropriate logic,
mostly *propositional logic* and (fragments of) *first-order logic*.

In the following, we will quickly discuss propositional logic, in combination
with SAT solving, and show how to build a simple bounded model checker for
a finite-state program. Actual bounded model checking for software requires
a number of additional steps and concepts, which will be introduced as required
later on.

\subsubsection propositional_logic_subsubsection Propositional logic and SAT solving

Many of the concepts in this section can be found in more detail in the
Wikipedia article on [Propositional logic](https://en.wikipedia.org/wiki/Propositional_calculus).

Let us start by looking at **propositional formulas**. A propositional formula
consists of **propositional variables**, say *x*, *y* and *z*, that can
take the Boolean values **true** and **false**, connected together with
logical operators (often called *junctors*), namely *and*, *or* and *not*.
Sometimes, one introduces additional junctors, such as *xor* or *implies*,
but these can be defined in terms of the three basic junctors just described.

Examples of propositional formulas would be ''*x* and *y*'' or
''not *x* or *y* or *z*''. We can **evaluate** formulas by setting each
variable to a Boolean value and reducing using the following rules:
- *x* and **false** = **false** and *x* = **false** or **false** = not **true** = **false**
- *x* or **true** = **true** or *x* = **true** and **true** = not **false** = **true**
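To connect this to code, here is a small C program (a sketch written for this
section, not part of the framework) that evaluates the example formula
''not *x* or *y* or *z*'' for every possible assignment; each line of output
is one row of the formula's truth table:

```
#include <stdio.h>

int main(void)
{
  /* enumerate all assignments to x, y and z */
  for(int x = 0; x <= 1; x++)
    for(int y = 0; y <= 1; y++)
      for(int z = 0; z <= 1; z++)
      {
        int value = !x || y || z; /* ''not x or y or z'' */
        printf("x=%d y=%d z=%d : %d\n", x, y, z, value);
      }
  return 0;
}
```

Enumerating all assignments like this is only feasible for very small
formulas; the paragraphs below discuss the general problem and the solvers
that handle it in practice.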

An important related question is: Given a propositional formula,
is there a variable assignment that makes it evaluate to true?
This is known as the [**SAT problem**](https://en.wikipedia.org/wiki/Boolean_satisfiability_problem).
The most important things to know about SAT are:
1. It forms the basis for bounded model checking algorithms;
2. It is a very hard problem to solve in general: It is NP-complete,
   meaning that it is easy to check a solution, but (as far as we know) hard
   to find one;
3. There has been impressive research in
   [**SAT solvers**](https://en.wikipedia.org/wiki/Boolean_satisfiability_problem#Algorithms_for_solving_SAT)
   that work well in practice for the kinds of formulas that we encounter in
   model checking. A commonly-used SAT solver is [minisat](http://minisat.se).
4. SAT solvers use a specific input format for propositional formulas, known
   as the [**conjunctive normal form**](https://en.wikipedia.org/wiki/Conjunctive_normal_form).
   For details, see the linked Wikipedia page;
   roughly, a conjunctive normal form formula is a propositional
   formula with a specific shape: At the lowest level are
   *atoms*, which are propositional variables ''*x*'' and
   negated propositional variables ''not *x*'';
   the next layer above are *clauses*, which are sequences
   of atoms connected with ''or'', e.g. ''not *x* or *y* or *z*''.
   The top layer consists of sequences of clauses,
   connected with ''and''.

As an example of how to use a SAT solver, consider the following
formula (in conjunctive normal form):

''(*x* or *y*) and (*x* or not *y*) and *x* and *y*''

We can represent this formula in the minisat input format as:

```
p cnf 2 4
1 2 0
1 -2 0
1 0
2 0
```
Compare the [Minisat user guide](https://www.dwheeler.com/essays/minisat-user-guide.html).
Try to run minisat on this example. What would you expect, and what result do you get?

Next, try running minisat on the following formula:

''(*x* or *y*) and (*x* or not *y*) and (not *x*) and *y*''
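For reference, this second formula can be encoded in the same input format
(keeping the numbering used above, 1 for *x* and 2 for *y*):

```
p cnf 2 4
1 2 0
1 -2 0
-1 0
2 0
```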
What changed? Why?

\subsubsection bmc_subsubsection How bounded model checking works

- How does it work? The program state is encoded as propositional or
  first-order propositions; a step function describes how one execution step
  transforms the state, and the behavior of the whole program is described by
  composing step functions (the sketch below illustrates this on a toy example).
- The question to solve: Is there a model, i.e., a satisfying assignment,
  for the resulting formula?
- From such a model, a trace of the program execution can be read off.
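To make these bullet points more concrete, here is a small, self-contained C
sketch (written for this section; it is not how CBMC is implemented). It
performs the encoding step by hand for a toy system - a 2-bit counter that
starts at 0 and is incremented once per step - and prints, in the minisat
input format shown earlier, a formula that is satisfiable exactly if the
counter has the value 3 after the given number of steps. The essential shape -
a constraint for the initial state, one copy of the step function per unrolled
step, and a constraint describing the target - is the one sketched in the
bullet points above.

```
#include <stdio.h>
#include <stdlib.h>

/* Variables 2*i+1 and 2*i+2 encode bit 0 and bit 1 of the counter after
   i steps.  The printed formula is satisfiable iff the counter, starting
   at 0, has the value 3 after exactly `bound` steps. */
int main(int argc, char **argv)
{
  int bound = argc > 1 ? atoi(argv[1]) : 3;
  int vars = 2 * (bound + 1);
  int clauses = 2 + 6 * bound + 2;
  printf("p cnf %d %d\n", vars, clauses);

  /* initial state: both bits are false */
  printf("-1 0\n-2 0\n");

  /* one copy of the step function per step:
     new bit 0 <-> not old bit 0, new bit 1 <-> old bit 1 xor old bit 0 */
  for(int i = 0; i < bound; i++)
  {
    int o0 = 2 * i + 1, o1 = 2 * i + 2;
    int n0 = 2 * i + 3, n1 = 2 * i + 4;
    printf("%d %d 0\n%d %d 0\n", o0, n0, -o0, -n0);
    printf("%d %d %d 0\n", -n1, o1, o0);
    printf("%d %d %d 0\n", -n1, -o1, -o0);
    printf("%d %d %d 0\n", n1, -o1, o0);
    printf("%d %d %d 0\n", n1, o1, -o0);
  }

  /* target: both bits set (counter value 3) after `bound` steps */
  printf("%d 0\n%d 0\n", 2 * bound + 1, 2 * bound + 2);
  return 0;
}
```

Feeding the output for a bound of 3 to minisat should be reported as
satisfiable (the counter counts 0, 1, 2, 3), while a bound of 2 should not.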

\subsubsection smt_etc_subsubsection Where to go from here

The above section gives only a superficial overview of how SAT solving and
bounded model checking work. Inside the CPROVER framework, we use a
significantly more advanced engine, with numerous optimizations to the
basic algorithms presented above. One feature that stands out is that
we do not reduce everything to propositional logic, but instead use a more
powerful logic, namely quantifier-free first-order logic. The main
difference is that instead of propositional variables, we allow expressions
that return Boolean values, such as comparisons between numbers or string
matching expressions. This gives us a richer logic to express properties.
Of course, a simple SAT solver cannot deal with such formulas, which is why we
go to [*SMT solvers*](https://en.wikipedia.org/wiki/Satisfiability_modulo_theories)
instead - these solvers can deal with specific classes
of first-order formulas (like the ones we produce).
One well-known SMT solver is [Z3](https://github.com/Z3Prover/z3).

\subsection static_analysis_section Static analysis

While BMC analyses the program by transforming everything to logic
formulas and, essentially, running the program on sets of concrete
states, another approach to learn about a program is based on the idea
of interpreting an abstract version of the program. This is known
as [**abstract interpretation**](https://en.wikipedia.org/wiki/Abstract_interpretation).
Abstract interpretation is one of the
main methods in the area of **static analysis**.

The key idea is that instead of looking at concrete program states,
we work with a sufficiently-precise abstraction (e.g., ''variable x is odd''),
and perform the computation using such abstract values. Coming back to our
running example, we wish to prove that the factorial function never returns 0.

An abstract interpretation is made up of four ingredients:
1. An **abstract domain**, which represents the analysis results.
2. A family of **transformers**, which describe how basic programming
   language constructs modify the state.
3. A way to map from pairs of program locations (e.g., positions in the
   program code) and variable names to values in the abstract domain.
4. An algorithm to compute a ''fixed point'', computing a map as described
   in the previous step that describes the behavior of the program.

The first ingredient we need for abstract interpretation is the
**abstract domain**. An abstract domain is a set $D$ (or, if you
prefer, a data type) with the following properties:
- There is a function merge that takes two elements of $D$ and returns
an element of $D$. This function is associative (merge(x, merge(y, z)) =
merge(merge(x, y), z)).

It is easy but tedious to check that all conditions hold.

The domain allows us to express what we know about a given variable or
value at a given program location; in our example, whether it is zero or not.
The way we use the abstract domain is that for each program point, we have a
map from visible variables to elements of the abstract domain,
describing what we know about the values of the variables at this point.

For instance, consider the `factorial` example again. After running the
first basic block, we know that `fac` and `i` both contain 1, so we have
a map that associates both `fac` and `i` with "not 0".
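To make this tangible, here is a minimal C sketch of what such a
zero/not-zero domain and its merge function might look like. The names and
the extra BOTTOM/TOP elements are illustrative only; the concrete domain used
in the text is described above, and the framework's actual implementation
differs.

```
#include <stdio.h>

/* Illustrative names, not the data structures used in the framework. */
typedef enum { BOTTOM, ZERO, NOT_ZERO, TOP } domval;

/* merge: associative and commutative, with BOTTOM as the neutral element */
domval merge(domval a, domval b)
{
  if(a == b) return a;
  if(a == BOTTOM) return b;
  if(b == BOTTOM) return a;
  return TOP; /* the two facts disagree: the value may or may not be zero */
}

int main(void)
{
  /* after the first basic block (fac = 1; i = 1;) both map to NOT_ZERO */
  domval fac = NOT_ZERO, i = NOT_ZERO;
  /* merging information that agrees keeps the fact... */
  printf("%d\n", merge(fac, i) == NOT_ZERO);
  /* ...while merging ZERO with NOT_ZERO loses it */
  printf("%d\n", merge(ZERO, NOT_ZERO) == TOP);
  return 0;
}
```

The sketch only shows the algebraic skeleton; in the framework, domains are
considerably richer and are attached to program locations as described in the
surrounding text.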

The second ingredient we need is the family of **abstract state transformers**.
An abstract state transformer describes how a specific expression or
statement modifies a given abstract state.

The abstract transformers have to reflect the effect of the
underlying program instructions. There is a formal description of this
property, using *Galois connections*; for the details, it is best to
look at the literature.

The third ingredient is straightforward: We use a simple map
from program locations and variable names to values in the abstract
domain. In more complex analyses, more involved forms of maps may be
used (e.g., to handle arbitrary procedure calls, or to account for the
heap).

At this point, we have almost all the ingredients we need to set up an abstract
interpretation. To actually analyze a function, we take its CFG and
perform a *fixpoint algorithm*.
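As a rough, self-contained illustration of such a fixpoint iteration (a toy,
hard-coded CFG and the zero/not-zero domain sketched earlier; this is not the
framework's implementation), consider the following program. It repeatedly
propagates abstract values along the edges of a small CFG with a loop until
nothing changes any more:

```
#include <stdio.h>

typedef enum { BOTTOM, ZERO, NOT_ZERO, TOP } domval;

static domval merge(domval a, domval b)
{
  if(a == b) return a;
  if(a == BOTTOM) return b;
  if(b == BOTTOM) return a;
  return TOP;
}

#define NODES 4

/* transformer for each node, applied to the state at its entry */
static domval transform(int node, domval in)
{
  switch(node)
  {
  case 0: return NOT_ZERO; /* x = 1;       */
  case 2: return ZERO;     /* x = 0;       */
  default: return in;      /* test / no-op */
  }
}

int main(void)
{
  /* edges of a small CFG: 0 -> 1, 1 -> 2, 1 -> 3, 2 -> 1 (loop back) */
  const int edge_from[] = {0, 1, 1, 2};
  const int edge_to[]   = {1, 2, 3, 1};
  const int edges = 4;
  const char *names[] = {"bottom", "zero", "not-zero", "top"};

  domval in[NODES], out[NODES];
  for(int n = 0; n < NODES; n++) in[n] = out[n] = BOTTOM;
  in[0] = TOP; /* nothing is known about x when the function is entered */

  /* iterate until nothing changes any more: the fixed point */
  int changed = 1;
  while(changed)
  {
    changed = 0;
    for(int n = 0; n < NODES; n++)
    {
      /* entry state: merge of the exit states of all predecessors
         (the entry node keeps its initial state) */
      domval new_in = n == 0 ? in[0] : BOTTOM;
      for(int e = 0; e < edges; e++)
        if(edge_to[e] == n)
          new_in = merge(new_in, out[edge_from[e]]);
      domval new_out = transform(n, new_in);
      if(new_in != in[n] || new_out != out[n])
      {
        in[n] = new_in;
        out[n] = new_out;
        changed = 1;
      }
    }
  }

  for(int n = 0; n < NODES; n++)
    printf("node %d: in=%s out=%s\n", n, names[in[n]], names[out[n]]);
  return 0;
}
```

For this toy CFG the iteration stabilizes after a few rounds and reports that
at the join node the variable may or may not be zero, since the two incoming
facts disagree.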

\section Glossary_section Glossary

\subsection instrument_subsection Instrument

To instrument a piece of code means to modify it by (usually) inserting new
fragments of code that, when executed, tell us something useful about the code

In general, instrumentation
makes it easier for a given analysis to do its job, regardless of whether that
is achieved by executing the instrumented code or by just analyzing it in some
other way.
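As a small, generic illustration (not taken from the framework's sources), a
division can be instrumented with an assertion that makes the implicit
''divisor is non-zero'' requirement explicit, so that a test run or an
analysis can report a violation:

```
#include <assert.h>
#include <stdio.h>

/* original code */
int quotient(int a, int b)
{
  return a / b;
}

/* instrumented version: the inserted assertion tells us, at analysis time
   or at run time, whether the division could fail */
int quotient_instrumented(int a, int b)
{
  assert(b != 0);
  return a / b;
}

int main(void)
{
  printf("%d\n", quotient_instrumented(6, 3)); /* fine: prints 2 */
  /* quotient_instrumented(1, 0) would abort with a failed assertion */
  return 0;
}
```

In the framework, such checks are inserted automatically into the program
representation rather than into the C source; the example above only
illustrates the principle.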

\subsection flattening_lowering_subsection Flattening and Lowering

As we have seen above, we often operate on many different representations
of programs, such as ASTs, control flow graphs, SSA programs, logical formulas
in BMC and so on. Each of these forms is good for certain kinds of analyses,
transformations or optimizations.

One important kind of step in dealing with program representations is
going from one representation to another. Often, such steps go
from a more ''high-level'' representation (closer to the source code)
to a more ''low-level'' representation. Such transformation steps are
known as **flattening** or **lowering** steps, and tend to be more-or-less
irreversible.
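A typical example of such a lowering step (shown here only schematically, in
plain C rather than in any of the framework's internal representations) is
rewriting structured control flow into explicit jumps:

```
/* high-level form */
int sum_high(int n)
{
  int s = 0;
  for(int i = 1; i <= n; i++)
    s += i;
  return s;
}

/* the same function after lowering the loop to explicit gotos */
int sum_low(int n)
{
  int s = 0;
  int i = 1;
loop_head:
  if(!(i <= n))
    goto loop_exit;
  s += i;
  i++;
  goto loop_head;
loop_exit:
  return s;
}
```

Recovering the original `for` loop from the lowered form is still possible in
this small case, but in general such steps lose structure, which is why they
are hard to reverse.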

\subsection verification_condition_subsection Verification Condition

In the CPROVER framework, the term **verification condition** is used
in a somewhat non-standard way. Let a program and a set of assertions
be given. We transform the program into an SSA and turn it into a
logical formula, as described above. Note that in this case, the
formula will also contain information about what the program does
after the assertion is reached: This part of the formula is, in fact,
irrelevant for deciding whether the program can satisfy the assertion or
not. The *verification condition* is the part of the formula that only
covers the program execution until the line that checks the assertion
has been executed, with everything that comes after it removed.
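As a small, hypothetical illustration of the idea (the notation is simplified
and does not follow the framework's actual output):

```
#include <assert.h>

void example(int x)       /* x plays the role of an unknown input */
{
  int y = x + 1;          /* SSA constraint:  y1 == x1 + 1 */
  assert(y != 0);         /* property to check:  y1 != 0   */
  int z = y * 2;          /* SSA constraint:  z1 == y1 * 2 */
  (void)z;
}

int main(void)
{
  example(5);
  return 0;
}

/* The verification condition for the assertion keeps only the part of the
   formula up to the assertion, roughly
       (y1 == x1 + 1)  =>  (y1 != 0)
   The constraint z1 == y1 * 2 is dropped: it describes what happens after
   the assertion and cannot affect it.  The condition does not hold for all
   inputs - picking x1 == -1 gives y1 == 0 - so the assertion can fail. */
```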