-
Notifications
You must be signed in to change notification settings - Fork 274
Add documentation for the central data structures in CBMC IR #7470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
80fc758
59bd183
b8544bb
2cc0467
7846485
692ab26
a0a6874
983e506
ea76349
8da4a6f
89647bb
634e7b1
9a6b9b4
87dc9a3
4c4174e
1b15dcc
29be381
8aef906
8b87460
d2093fd
3129464
e55d40e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
@@ -0,0 +1,185 @@ | ||||
\ingroup module_hidden | ||||
|
||||
\page central-data-structures Central Data Structures | ||||
|
||||
# Central Data Structures | ||||
|
||||
The following is some light technical documentation of the major data structures | ||||
used in the input transformation to Intermediate Representation (IR) inside | ||||
CBMC and the assorted CProver tools. | ||||
|
||||
## GOTO models | ||||
|
||||
The `goto_modelt` class is the top-level data structure that CBMC (and the other | ||||
tools) use for holding the GOTO intermediate representation. The following | ||||
diagram is a simplified view of how the data structures relate to each other - | ||||
|
||||
```mermaid | ||||
erDiagram | ||||
goto_modelt { | ||||
symbol_tablet symbol_table | ||||
goto_functionst goto_functions | ||||
} | ||||
goto_modelt ||--|| symbol_tablet : "" | ||||
goto_modelt ||--|| goto_functionst : "" | ||||
symbol_tablet { | ||||
symbol_map symbols | ||||
} | ||||
symbol_tablet ||--o{ symbolt : "" | ||||
symbolt { | ||||
string name | ||||
} | ||||
goto_functionst { | ||||
function_map functions | ||||
} | ||||
goto_functionst ||--o{ goto_functiont : "" | ||||
goto_functiont { | ||||
goto_programt body | ||||
ordered_collection_of parameter_identifiers | ||||
} | ||||
goto_functiont ||--|| goto_programt : "" | ||||
goto_programt { | ||||
ordered_collection_of instructions | ||||
} | ||||
goto_programt ||--o{ instructiont : "" | ||||
instructiont { | ||||
enumeration instruction_type | ||||
goto_program_codet code | ||||
boolean_expression guard | ||||
source_locationt source_location | ||||
} | ||||
instructiont ||--|| goto_program_instruction_typet : "" | ||||
instructiont ||--o| goto_program_codet : "" | ||||
instructiont ||--o| source_locationt : "" | ||||
goto_program_codet ||--o| source_locationt : "" | ||||
``` | ||||
|
||||
A `goto_modelt` is effectively a pair, consisting of: | ||||
|
||||
* A collection of GOTO functions. | ||||
* A symbol table containing symbol definitions which may be referred to by | ||||
symbol expressions. Symbol expressions are found in the goto functions and the | ||||
(sub-)expressions for the definitions of symbols. | ||||
|
||||
In pseudocode, the type looks this: | ||||
|
||||
```js | ||||
type goto_modelt { | ||||
type goto_functionst = map<identifier, goto_functiont> | ||||
type symbol_tablet = map<identifier, symbolt> | ||||
} | ||||
``` | ||||
|
||||
There is an abstract interface for goto models provided by the | ||||
`abstract_goto_modelt` class. This is defined and documented in the file | ||||
[`src/goto-programs/abstract_goto_model.h`](../../src/goto-programs/abstract_goto_model.h). | ||||
Ideally code which takes a goto model as input should accept any implementation | ||||
of the abstract interface rather than accepting the concrete `goto_modelt` | ||||
exclusively. This helps with compatibility with `jbmc` which uses lazy loading. | ||||
See the `lazy_goto_modelt` class which is defined and documented in | ||||
[`jbmc/src/java_bytecode/lazy_goto_model.h`](../../jbmc/src/java_bytecode/lazy_goto_model.h) | ||||
for details of lazy loading. | ||||
|
||||
For further information about symbols see the `symbolt` class which is defined | ||||
and documented in [`src/util/symbol.h`](../../src/util/symbol.h) and the | ||||
`symbol_exprt` (symbol expression) class which is defined and documented in | ||||
[`src/util/std_expr.h`](../../src/util/std_expr.h). | ||||
|
||||
## goto_functiont | ||||
|
||||
A `goto_functiont` is also defined as a pair. It's designed to represent a function | ||||
at the goto level, and effectively it's the following data type (in pseudocode): | ||||
|
||||
```js | ||||
type goto_functiont { | ||||
goto_programt body | ||||
list<identifiers> parameters | ||||
} | ||||
``` | ||||
|
||||
The `goto_programt` denoting the `body` of the function will be the subject of | ||||
a more elaborate explanation in the next section. The in-memory structure of a | ||||
goto function allows the `body` to be optional, but only functions with bodies | ||||
are included in the serialised goto binaries. | ||||
|
||||
The `parameters` subcomponent is a list of identifiers that are to be looked-up | ||||
in the symbol-table for their definitions. | ||||
|
||||
## goto_programt | ||||
|
||||
A goto program is a sequence of GOTO instructions (`goto_instructiont`). For | ||||
details of the `goto_programt` class itself see the documentation in | ||||
[`goto_program.h`](../../src/goto-programs/goto_program.h). For further details | ||||
of the life cycle of goto programs see | ||||
[`src/goto-programs/README.md`](https://github.com/diffblue/cbmc/blob/develop/src/goto-programs/README.md). | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What will this be rendered as on diffblue.github.io?! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This link follows the same format as is found here -
this example is rendered to https://diffblue.github.io/cbmc/compilation-and-development.html The original link I used was causing the check-doxygen job to fail with the error - There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Well, but |
||||
|
||||
An instruction (`goto_instructiont`) is a triple (an element with three subcomponents), | ||||
describing a GOTO level instruction with the following 3 component subtypes, | ||||
again in pseudocode: | ||||
|
||||
```js | ||||
type goto_instructiont { | ||||
instruction_type instr_type | ||||
instruction statement | ||||
guard boolean_expr | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Among others? There are indeed various other data members, not all of which we should keep in there long-term, but if you are seeking to document the current state then all of them should be in here?! |
||||
} | ||||
``` | ||||
|
||||
The pseudocode subtypes above require some more elaboration, so we will provide some extra | ||||
detail to illuminate some details at the code level: | ||||
|
||||
* The `instruction_type` above corresponds to the `goto_program_instruction_typet` type | ||||
listed in [`goto_program.h`](../../src/goto-programs/goto_program.h) | ||||
* Some illustrative instruction types are `assign`, `function_call`, `return`, etc. | ||||
* The `instruction` is a statement represented by a `goto_instruction_codet`. | ||||
* A `goto_instruction_codet` is an alias of `codet` (a data structure broadly representing | ||||
a statement inside CBMC) that contains the actual code to be executed. | ||||
* You can distinguish different statements by using the result of the | ||||
`get_statement` member function of the `codet` class. | ||||
* The `guard` is an `exprt` (a data structure broadly representing an expression inside CBMC) | ||||
that is expected to have type `bool`. | ||||
* This is optional - not every instruction is expected to have a `guard` associated with it. | ||||
|
||||
## source_locationt | ||||
|
||||
Another important data structure that needs to be discussed at this point is | ||||
`source_locationt`. | ||||
|
||||
`source_locationt` are attached into various `exprt`s (the data structure representing | ||||
various expressions, usually the result of some early processing, e.g. the result of the | ||||
frontend parsing a file). | ||||
|
||||
This means that `codet`s, representing GOTO instructions *also* have a `source_locationt` | ||||
attached to them, by virtue of inheriting from `exprt`. | ||||
|
||||
For the most part, `source_locationt` is something that is only being used when we print | ||||
various nodes (for property listing, counterexample/trace printing, etc). | ||||
|
||||
It might be possible that a specific source location might point to a CBMC instrumentation | ||||
primitive (which might be reported as a `built-in-addition`) or there might even be no-source | ||||
location (because it might be part of harnesses generated as an example, that have no presence | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ⛏️ "harness" is not well a well defined at this point in the documentation. We use it to mean the cbmc generated code surrounding the entry point specified to cbmc. This generated harness initialises global values, and the arguments for the entry points parameters. However my understanding is that there are users who write their own "harness" functions in C around the function they are verifying. So in this context there would be 2 harnesses when cbmc runs. The cbmc harness which calls the user written harness which calls the function the user is verifying. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I kind of object to the notion that we allow code with no location. We don't just generate code out of thin air for no reason. I would think that every function that can generate code should be required to take a location parameter, and I'd like to see us add an assertion that it is non-empty. We just went through this with the function contracts and had to go all the way back to the front end to make sure that the necessary location information information was being piped down to the point where code was being generated. But, the result is code that is user debuggable. |
||||
in the user code). | ||||
|
||||
## irept | ||||
|
||||
`irep`s are the underlying data structure used to implement many of the data | ||||
structures in CBMC and the assorted tools. These include the `exprt`, `typet`, | ||||
`codet` and `source_locationt` classes. This is a tree data structure where | ||||
each node is expected to contain a string/ID and may have child nodes stored in | ||||
both a sequence of child nodes and a map of strings/IDs to child nodes. This | ||||
enables the singular `irept` data structure to be used to model graphs such as | ||||
ASTs, CFGs, etc. | ||||
|
||||
The classes extending `irept` define how the higher level concepts are mapped | ||||
onto the underlying tree data structure. For this reason it is usually advised | ||||
that the member functions of the sub-classes should be used to access the data | ||||
held, rather than the member functions of the base `irept` class. This aids | ||||
potential future restructuring and associates accesses of the data with the | ||||
member functions which have the doxygen explaining what the data is. | ||||
|
||||
The strings/IDs held within `irept` are of type `irep_idt`. These can be | ||||
converted to `std::string` using the `id2string` function. There is a mechanism | ||||
provided for casting expressions and types in | ||||
[`src/util/expr_cast.h`](../../src/util/expr_cast.h). In depth documentation | ||||
of the `irept` class itself can be found in | ||||
[`src/util/irep.h`](../../src/util/irep.h). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to link this to the actual code or generated doxygen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can link to the file using raw markdown, but I don't think it can link to an actual code structure.
I need to look more into doxygen and its linking capabilities.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you can link it, it might help increase the longevity. See #7470 (comment) for my experiences with the lifespan of documentation. The last thing I want is for you to put in the effort documenting things and then, in 2 years, have to do it all again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to link to either the actual code or the generated doxygen output for these things. Duplicating the documentation of the same thing in two places is not sustainable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the final documentation,
doxygen
is automatically linking class names to the class documentation it produces, so when the class names are written in full they get automatically picked up by it and linked.However, given that these are central data structures and calcified by now (i.e. not experiencing significant change with regards to the actual components of the class), I'd rather the pseudocode representation stays as a fast-track way for someone to understand the essence of these data structures without the informational load and visual clutter that the class documentation pages in
doxygen
have.