Skip to content

Add documentation for the central data structures in CBMC IR #7470

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
Feb 7, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
80fc758
Add documentation for the central data structures in CBMC IR (goto-pr…
NlightNFotis Jan 10, 2023
59bd183
Change the wording to closer correspond to C++ types and concepts
NlightNFotis Jan 17, 2023
b8544bb
Consistently refer to GOTO as GOTO
thomasspriggs Jan 24, 2023
2cc0467
Correct tense of `used` to `use`.
thomasspriggs Jan 24, 2023
7846485
Clarify what `goto_modelt` is used for
thomasspriggs Jan 24, 2023
692ab26
Replace "main" with "top level" in goto model explaination
thomasspriggs Jan 25, 2023
a0a6874
Add mermaid diagram support to documentation site build
thomasspriggs Jan 27, 2023
983e506
Add goto model entity relationship diagram
thomasspriggs Jan 25, 2023
ea76349
Make central-data-structures a subpage of the architecture page
thomasspriggs Jan 26, 2023
8da4a6f
Correct "list of GOTO functions" to "collection"
thomasspriggs Jan 30, 2023
89647bb
Reword description of the symbol table
thomasspriggs Jan 30, 2023
634e7b1
Remove pseudocode from the description of `goto_programt`
thomasspriggs Jan 30, 2023
9a6b9b4
Reword description of goto program
thomasspriggs Jan 30, 2023
87dc9a3
Add explaination of heirarchy of goto model classes
thomasspriggs Feb 1, 2023
4c4174e
Show that `goto_functionst` is a map not a list
thomasspriggs Feb 1, 2023
1b15dcc
Reword explaination of symbol table
thomasspriggs Feb 1, 2023
29be381
Add links to `symbolt` and `symbol_exprt` classes
thomasspriggs Feb 1, 2023
8aef906
Note optional nature of goto program bodies
thomasspriggs Feb 1, 2023
8b87460
Separate the documentation of the `irept` class
thomasspriggs Feb 1, 2023
d2093fd
Provide more extensive overview of `irept` itself
thomasspriggs Feb 1, 2023
3129464
Add link to goto program documentation from central data structures
thomasspriggs Feb 2, 2023
e55d40e
Replace relative link with web link
thomasspriggs Feb 3, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 4 additions & 1 deletion .github/workflows/publish.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -10,11 +10,14 @@ jobs:
uses: actions/checkout@v3

- name: Install doxygen
run: sudo apt install doxygen graphviz pandoc
run: sudo apt install doxygen graphviz pandoc npm

- name: Install python modules
run: sudo python3 -m pip install gitpython pandocfilters

- name: Install mermaid diagram filter
run: sudo npm install --global mermaid-filter

- name: Build documentation
run: cd doc/doxygen-root && make && touch html/.nojekyll

Expand Down
9 changes: 8 additions & 1 deletion doc/architectural/cbmc-architecture.md
Original file line number Diff line number Diff line change
Expand Up @@ -55,7 +55,14 @@ tools process this format, either directly from the front-end or from
it’s saved output. These include a wide range of analysis and
transformation tools (see \ref other-tools).

# Concepts #
# Concepts

## Central data structures

For an explanation of the data structures involved in the modeling of a GOTO
program (the GOTO Intermediate Representation used by CBMC and assorted tools)
please see \subpage central-data-structures .

## {C, java bytecode} → Parse tree → Symbol table → GOTO programs → GOTO program transformations → BMC → counterexample (goto_tracet) → printing ##

To be documented.
Expand Down
185 changes: 185 additions & 0 deletions doc/architectural/central-data-structures.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,185 @@
\ingroup module_hidden

\page central-data-structures Central Data Structures

# Central Data Structures

The following is some light technical documentation of the major data structures
used in the input transformation to Intermediate Representation (IR) inside
CBMC and the assorted CProver tools.

## GOTO models

The `goto_modelt` class is the top-level data structure that CBMC (and the other
tools) use for holding the GOTO intermediate representation. The following
diagram is a simplified view of how the data structures relate to each other -

```mermaid
erDiagram
goto_modelt {
symbol_tablet symbol_table
goto_functionst goto_functions
}
goto_modelt ||--|| symbol_tablet : ""
goto_modelt ||--|| goto_functionst : ""
symbol_tablet {
symbol_map symbols
}
symbol_tablet ||--o{ symbolt : ""
symbolt {
string name
}
goto_functionst {
function_map functions
}
goto_functionst ||--o{ goto_functiont : ""
goto_functiont {
goto_programt body
ordered_collection_of parameter_identifiers
}
goto_functiont ||--|| goto_programt : ""
goto_programt {
ordered_collection_of instructions
}
goto_programt ||--o{ instructiont : ""
instructiont {
enumeration instruction_type
goto_program_codet code
boolean_expression guard
source_locationt source_location
}
instructiont ||--|| goto_program_instruction_typet : ""
instructiont ||--o| goto_program_codet : ""
instructiont ||--o| source_locationt : ""
goto_program_codet ||--o| source_locationt : ""
```

A `goto_modelt` is effectively a pair, consisting of:

* A collection of GOTO functions.
* A symbol table containing symbol definitions which may be referred to by
symbol expressions. Symbol expressions are found in the goto functions and the
(sub-)expressions for the definitions of symbols.

In pseudocode, the type looks this:

```js
type goto_modelt {
type goto_functionst = map<identifier, goto_functiont>
type symbol_tablet = map<identifier, symbolt>
}
```

There is an abstract interface for goto models provided by the
`abstract_goto_modelt` class. This is defined and documented in the file
[`src/goto-programs/abstract_goto_model.h`](../../src/goto-programs/abstract_goto_model.h).
Ideally code which takes a goto model as input should accept any implementation
of the abstract interface rather than accepting the concrete `goto_modelt`
exclusively. This helps with compatibility with `jbmc` which uses lazy loading.
See the `lazy_goto_modelt` class which is defined and documented in
[`jbmc/src/java_bytecode/lazy_goto_model.h`](../../jbmc/src/java_bytecode/lazy_goto_model.h)
for details of lazy loading.

For further information about symbols see the `symbolt` class which is defined
and documented in [`src/util/symbol.h`](../../src/util/symbol.h) and the
`symbol_exprt` (symbol expression) class which is defined and documented in
[`src/util/std_expr.h`](../../src/util/std_expr.h).

## goto_functiont

A `goto_functiont` is also defined as a pair. It's designed to represent a function
at the goto level, and effectively it's the following data type (in pseudocode):

```js
type goto_functiont {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to link this to the actual code or generated doxygen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can link to the file using raw markdown, but I don't think it can link to an actual code structure.

I need to look more into doxygen and its linking capabilities.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can link it, it might help increase the longevity. See #7470 (comment) for my experiences with the lifespan of documentation. The last thing I want is for you to put in the effort documenting things and then, in 2 years, have to do it all again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to link to either the actual code or the generated doxygen output for these things. Duplicating the documentation of the same thing in two places is not sustainable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the final documentation, doxygen is automatically linking class names to the class documentation it produces, so when the class names are written in full they get automatically picked up by it and linked.

However, given that these are central data structures and calcified by now (i.e. not experiencing significant change with regards to the actual components of the class), I'd rather the pseudocode representation stays as a fast-track way for someone to understand the essence of these data structures without the informational load and visual clutter that the class documentation pages in doxygen have.

goto_programt body
list<identifiers> parameters
}
```

The `goto_programt` denoting the `body` of the function will be the subject of
a more elaborate explanation in the next section. The in-memory structure of a
goto function allows the `body` to be optional, but only functions with bodies
are included in the serialised goto binaries.

The `parameters` subcomponent is a list of identifiers that are to be looked-up
in the symbol-table for their definitions.

## goto_programt

A goto program is a sequence of GOTO instructions (`goto_instructiont`). For
details of the `goto_programt` class itself see the documentation in
[`goto_program.h`](../../src/goto-programs/goto_program.h). For further details
of the life cycle of goto programs see
[`src/goto-programs/README.md`](https://github.com/diffblue/cbmc/blob/develop/src/goto-programs/README.md).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will this be rendered as on diffblue.github.io?!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This link follows the same format as is found here -

[COMPILING.md](https://github.com/diffblue/cbmc/blob/develop/COMPILING.md#what-architecture)

this example is rendered to https://diffblue.github.io/cbmc/compilation-and-development.html

The original link I used was causing the check-doxygen job to fail with the error - /home/runner/work/cbmc/cbmc/doc/architectural/central-data-structures.md:120: warning: unexpected token TK_EOF as the argument of ref

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, but COMPILING.md isn't actually rendered on diffblue.github.io, while goto-programs/README.md is (the latter as https://diffblue.github.io/cbmc/group__goto-programs.html). It would be great to find a way so that people aren't sent back&forth between Doxygen-rendered content and GitHub rendered pages.


An instruction (`goto_instructiont`) is a triple (an element with three subcomponents),
describing a GOTO level instruction with the following 3 component subtypes,
again in pseudocode:

```js
type goto_instructiont {
instruction_type instr_type
instruction statement
guard boolean_expr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is source_location missing from this list?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Among others? There are indeed various other data members, not all of which we should keep in there long-term, but if you are seeking to document the current state then all of them should be in here?!

}
```

The pseudocode subtypes above require some more elaboration, so we will provide some extra
detail to illuminate some details at the code level:

* The `instruction_type` above corresponds to the `goto_program_instruction_typet` type
listed in [`goto_program.h`](../../src/goto-programs/goto_program.h)
* Some illustrative instruction types are `assign`, `function_call`, `return`, etc.
* The `instruction` is a statement represented by a `goto_instruction_codet`.
* A `goto_instruction_codet` is an alias of `codet` (a data structure broadly representing
a statement inside CBMC) that contains the actual code to be executed.
* You can distinguish different statements by using the result of the
`get_statement` member function of the `codet` class.
* The `guard` is an `exprt` (a data structure broadly representing an expression inside CBMC)
that is expected to have type `bool`.
* This is optional - not every instruction is expected to have a `guard` associated with it.

## source_locationt

Another important data structure that needs to be discussed at this point is
`source_locationt`.

`source_locationt` are attached into various `exprt`s (the data structure representing
various expressions, usually the result of some early processing, e.g. the result of the
frontend parsing a file).

This means that `codet`s, representing GOTO instructions *also* have a `source_locationt`
attached to them, by virtue of inheriting from `exprt`.

For the most part, `source_locationt` is something that is only being used when we print
various nodes (for property listing, counterexample/trace printing, etc).

It might be possible that a specific source location might point to a CBMC instrumentation
primitive (which might be reported as a `built-in-addition`) or there might even be no-source
location (because it might be part of harnesses generated as an example, that have no presence
Copy link
Contributor

@thomasspriggs thomasspriggs Jan 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⛏️ "harness" is not well a well defined at this point in the documentation. We use it to mean the cbmc generated code surrounding the entry point specified to cbmc. This generated harness initialises global values, and the arguments for the entry points parameters. However my understanding is that there are users who write their own "harness" functions in C around the function they are verifying. So in this context there would be 2 harnesses when cbmc runs. The cbmc harness which calls the user written harness which calls the function the user is verifying.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of object to the notion that we allow code with no location. We don't just generate code out of thin air for no reason. I would think that every function that can generate code should be required to take a location parameter, and I'd like to see us add an assertion that it is non-empty.

We just went through this with the function contracts and had to go all the way back to the front end to make sure that the necessary location information information was being piped down to the point where code was being generated. But, the result is code that is user debuggable.

in the user code).

## irept

`irep`s are the underlying data structure used to implement many of the data
structures in CBMC and the assorted tools. These include the `exprt`, `typet`,
`codet` and `source_locationt` classes. This is a tree data structure where
each node is expected to contain a string/ID and may have child nodes stored in
both a sequence of child nodes and a map of strings/IDs to child nodes. This
enables the singular `irept` data structure to be used to model graphs such as
ASTs, CFGs, etc.

The classes extending `irept` define how the higher level concepts are mapped
onto the underlying tree data structure. For this reason it is usually advised
that the member functions of the sub-classes should be used to access the data
held, rather than the member functions of the base `irept` class. This aids
potential future restructuring and associates accesses of the data with the
member functions which have the doxygen explaining what the data is.

The strings/IDs held within `irept` are of type `irep_idt`. These can be
converted to `std::string` using the `id2string` function. There is a mechanism
provided for casting expressions and types in
[`src/util/expr_cast.h`](../../src/util/expr_cast.h). In depth documentation
of the `irept` class itself can be found in
[`src/util/irep.h`](../../src/util/irep.h).
17 changes: 16 additions & 1 deletion doc/doxygen-root/doxygen-markdown/markdown.sh
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,21 @@ $BINDIR/append-last-modified-dates.py $FILES
# $BINDIR/pandoc-codeblock-repair.sh > $file
# done

echo
echo "Rendering mermaid diagrams"
FILES=$(grep --include=\*.md -rl . -e "\`\`\`mermaid")
for file in $FILES; do
echo $file
tmp=/tmp/${file%.*}2.md
mkdir -p $(dirname $tmp)
cp $file $tmp
# Note that gfm is GitHub Flavour Markdown. The double \\ at the start of
# lines is unescaped using sed in order to fix doxygen tags, which are
# broken by pandoc.
pandoc --read=gfm --write=gfm --wrap=none --filter=mermaid-filter $tmp |
$BINDIR/pandoc-codeblock-repair.sh | sed 's/^\\\\/\\/' > $file
done

cprovers=$(find . -name cprover-manual)
cprover=${cprovers[0]}

Expand All @@ -49,7 +64,7 @@ echo "Running pandoc filter over cprover-manual markdown files"
FILES=$(find $cprover -name '*.md')
for file in $FILES; do
echo $file
tmp=/tmp/${file%.*}2.md
tmp=/tmp/${file%.*}3.md
mkdir -p $(dirname $tmp)
cp $file $tmp
pandoc --write=markdown_phpextra --wrap=none --filter=$BINDIR/pandoc-cprover-link-filter.py $tmp |
Expand Down