Skip to content

Commit 28d235e

Browse files
authored
Merge pull request #7470 from NlightNFotis/docs_goto_programs
Add documentation for the central data structures in CBMC IR
2 parents 32318b7 + e55d40e commit 28d235e

File tree

4 files changed

+213
-3
lines changed

4 files changed

+213
-3
lines changed

.github/workflows/publish.yaml

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -10,11 +10,14 @@ jobs:
1010
uses: actions/checkout@v3
1111

1212
- name: Install doxygen
13-
run: sudo apt install doxygen graphviz pandoc
13+
run: sudo apt install doxygen graphviz pandoc npm
1414

1515
- name: Install python modules
1616
run: sudo python3 -m pip install gitpython pandocfilters
1717

18+
- name: Install mermaid diagram filter
19+
run: sudo npm install --global mermaid-filter
20+
1821
- name: Build documentation
1922
run: cd doc/doxygen-root && make && touch html/.nojekyll
2023

doc/architectural/cbmc-architecture.md

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -55,7 +55,14 @@ tools process this format, either directly from the front-end or from
5555
it’s saved output. These include a wide range of analysis and
5656
transformation tools (see \ref other-tools).
5757

58-
# Concepts #
58+
# Concepts
59+
60+
## Central data structures
61+
62+
For an explanation of the data structures involved in the modeling of a GOTO
63+
program (the GOTO Intermediate Representation used by CBMC and assorted tools)
64+
please see \subpage central-data-structures .
65+
5966
## {C, java bytecode} → Parse tree → Symbol table → GOTO programs → GOTO program transformations → BMC → counterexample (goto_tracet) → printing ##
6067

6168
To be documented.
Lines changed: 185 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,185 @@
1+
\ingroup module_hidden
2+
3+
\page central-data-structures Central Data Structures
4+
5+
# Central Data Structures
6+
7+
The following is some light technical documentation of the major data structures
8+
used in the input transformation to Intermediate Representation (IR) inside
9+
CBMC and the assorted CProver tools.
10+
11+
## GOTO models
12+
13+
The `goto_modelt` class is the top-level data structure that CBMC (and the other
14+
tools) use for holding the GOTO intermediate representation. The following
15+
diagram is a simplified view of how the data structures relate to each other -
16+
17+
```mermaid
18+
erDiagram
19+
goto_modelt {
20+
symbol_tablet symbol_table
21+
goto_functionst goto_functions
22+
}
23+
goto_modelt ||--|| symbol_tablet : ""
24+
goto_modelt ||--|| goto_functionst : ""
25+
symbol_tablet {
26+
symbol_map symbols
27+
}
28+
symbol_tablet ||--o{ symbolt : ""
29+
symbolt {
30+
string name
31+
}
32+
goto_functionst {
33+
function_map functions
34+
}
35+
goto_functionst ||--o{ goto_functiont : ""
36+
goto_functiont {
37+
goto_programt body
38+
ordered_collection_of parameter_identifiers
39+
}
40+
goto_functiont ||--|| goto_programt : ""
41+
goto_programt {
42+
ordered_collection_of instructions
43+
}
44+
goto_programt ||--o{ instructiont : ""
45+
instructiont {
46+
enumeration instruction_type
47+
goto_program_codet code
48+
boolean_expression guard
49+
source_locationt source_location
50+
}
51+
instructiont ||--|| goto_program_instruction_typet : ""
52+
instructiont ||--o| goto_program_codet : ""
53+
instructiont ||--o| source_locationt : ""
54+
goto_program_codet ||--o| source_locationt : ""
55+
```
56+
57+
A `goto_modelt` is effectively a pair, consisting of:
58+
59+
* A collection of GOTO functions.
60+
* A symbol table containing symbol definitions which may be referred to by
61+
symbol expressions. Symbol expressions are found in the goto functions and the
62+
(sub-)expressions for the definitions of symbols.
63+
64+
In pseudocode, the type looks this:
65+
66+
```js
67+
type goto_modelt {
68+
type goto_functionst = map<identifier, goto_functiont>
69+
type symbol_tablet = map<identifier, symbolt>
70+
}
71+
```
72+
73+
There is an abstract interface for goto models provided by the
74+
`abstract_goto_modelt` class. This is defined and documented in the file
75+
[`src/goto-programs/abstract_goto_model.h`](../../src/goto-programs/abstract_goto_model.h).
76+
Ideally code which takes a goto model as input should accept any implementation
77+
of the abstract interface rather than accepting the concrete `goto_modelt`
78+
exclusively. This helps with compatibility with `jbmc` which uses lazy loading.
79+
See the `lazy_goto_modelt` class which is defined and documented in
80+
[`jbmc/src/java_bytecode/lazy_goto_model.h`](../../jbmc/src/java_bytecode/lazy_goto_model.h)
81+
for details of lazy loading.
82+
83+
For further information about symbols see the `symbolt` class which is defined
84+
and documented in [`src/util/symbol.h`](../../src/util/symbol.h) and the
85+
`symbol_exprt` (symbol expression) class which is defined and documented in
86+
[`src/util/std_expr.h`](../../src/util/std_expr.h).
87+
88+
## goto_functiont
89+
90+
A `goto_functiont` is also defined as a pair. It's designed to represent a function
91+
at the goto level, and effectively it's the following data type (in pseudocode):
92+
93+
```js
94+
type goto_functiont {
95+
goto_programt body
96+
list<identifiers> parameters
97+
}
98+
```
99+
100+
The `goto_programt` denoting the `body` of the function will be the subject of
101+
a more elaborate explanation in the next section. The in-memory structure of a
102+
goto function allows the `body` to be optional, but only functions with bodies
103+
are included in the serialised goto binaries.
104+
105+
The `parameters` subcomponent is a list of identifiers that are to be looked-up
106+
in the symbol-table for their definitions.
107+
108+
## goto_programt
109+
110+
A goto program is a sequence of GOTO instructions (`goto_instructiont`). For
111+
details of the `goto_programt` class itself see the documentation in
112+
[`goto_program.h`](../../src/goto-programs/goto_program.h). For further details
113+
of the life cycle of goto programs see
114+
[`src/goto-programs/README.md`](https://github.com/diffblue/cbmc/blob/develop/src/goto-programs/README.md).
115+
116+
An instruction (`goto_instructiont`) is a triple (an element with three subcomponents),
117+
describing a GOTO level instruction with the following 3 component subtypes,
118+
again in pseudocode:
119+
120+
```js
121+
type goto_instructiont {
122+
instruction_type instr_type
123+
instruction statement
124+
guard boolean_expr
125+
}
126+
```
127+
128+
The pseudocode subtypes above require some more elaboration, so we will provide some extra
129+
detail to illuminate some details at the code level:
130+
131+
* The `instruction_type` above corresponds to the `goto_program_instruction_typet` type
132+
listed in [`goto_program.h`](../../src/goto-programs/goto_program.h)
133+
* Some illustrative instruction types are `assign`, `function_call`, `return`, etc.
134+
* The `instruction` is a statement represented by a `goto_instruction_codet`.
135+
* A `goto_instruction_codet` is an alias of `codet` (a data structure broadly representing
136+
a statement inside CBMC) that contains the actual code to be executed.
137+
* You can distinguish different statements by using the result of the
138+
`get_statement` member function of the `codet` class.
139+
* The `guard` is an `exprt` (a data structure broadly representing an expression inside CBMC)
140+
that is expected to have type `bool`.
141+
* This is optional - not every instruction is expected to have a `guard` associated with it.
142+
143+
## source_locationt
144+
145+
Another important data structure that needs to be discussed at this point is
146+
`source_locationt`.
147+
148+
`source_locationt` are attached into various `exprt`s (the data structure representing
149+
various expressions, usually the result of some early processing, e.g. the result of the
150+
frontend parsing a file).
151+
152+
This means that `codet`s, representing GOTO instructions *also* have a `source_locationt`
153+
attached to them, by virtue of inheriting from `exprt`.
154+
155+
For the most part, `source_locationt` is something that is only being used when we print
156+
various nodes (for property listing, counterexample/trace printing, etc).
157+
158+
It might be possible that a specific source location might point to a CBMC instrumentation
159+
primitive (which might be reported as a `built-in-addition`) or there might even be no-source
160+
location (because it might be part of harnesses generated as an example, that have no presence
161+
in the user code).
162+
163+
## irept
164+
165+
`irep`s are the underlying data structure used to implement many of the data
166+
structures in CBMC and the assorted tools. These include the `exprt`, `typet`,
167+
`codet` and `source_locationt` classes. This is a tree data structure where
168+
each node is expected to contain a string/ID and may have child nodes stored in
169+
both a sequence of child nodes and a map of strings/IDs to child nodes. This
170+
enables the singular `irept` data structure to be used to model graphs such as
171+
ASTs, CFGs, etc.
172+
173+
The classes extending `irept` define how the higher level concepts are mapped
174+
onto the underlying tree data structure. For this reason it is usually advised
175+
that the member functions of the sub-classes should be used to access the data
176+
held, rather than the member functions of the base `irept` class. This aids
177+
potential future restructuring and associates accesses of the data with the
178+
member functions which have the doxygen explaining what the data is.
179+
180+
The strings/IDs held within `irept` are of type `irep_idt`. These can be
181+
converted to `std::string` using the `id2string` function. There is a mechanism
182+
provided for casting expressions and types in
183+
[`src/util/expr_cast.h`](../../src/util/expr_cast.h). In depth documentation
184+
of the `irept` class itself can be found in
185+
[`src/util/irep.h`](../../src/util/irep.h).

doc/doxygen-root/doxygen-markdown/markdown.sh

Lines changed: 16 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -34,6 +34,21 @@ $BINDIR/append-last-modified-dates.py $FILES
3434
# $BINDIR/pandoc-codeblock-repair.sh > $file
3535
# done
3636

37+
echo
38+
echo "Rendering mermaid diagrams"
39+
FILES=$(grep --include=\*.md -rl . -e "\`\`\`mermaid")
40+
for file in $FILES; do
41+
echo $file
42+
tmp=/tmp/${file%.*}2.md
43+
mkdir -p $(dirname $tmp)
44+
cp $file $tmp
45+
# Note that gfm is GitHub Flavour Markdown. The double \\ at the start of
46+
# lines is unescaped using sed in order to fix doxygen tags, which are
47+
# broken by pandoc.
48+
pandoc --read=gfm --write=gfm --wrap=none --filter=mermaid-filter $tmp |
49+
$BINDIR/pandoc-codeblock-repair.sh | sed 's/^\\\\/\\/' > $file
50+
done
51+
3752
cprovers=$(find . -name cprover-manual)
3853
cprover=${cprovers[0]}
3954

@@ -49,7 +64,7 @@ echo "Running pandoc filter over cprover-manual markdown files"
4964
FILES=$(find $cprover -name '*.md')
5065
for file in $FILES; do
5166
echo $file
52-
tmp=/tmp/${file%.*}2.md
67+
tmp=/tmp/${file%.*}3.md
5368
mkdir -p $(dirname $tmp)
5469
cp $file $tmp
5570
pandoc --write=markdown_phpextra --wrap=none --filter=$BINDIR/pandoc-cprover-link-filter.py $tmp |

0 commit comments

Comments
 (0)