Skip to content

Add documentation for the central data structures in CBMC IR #7470

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
Feb 7, 2023

Conversation

NlightNFotis
Copy link
Contributor

This PR describes the central GOTO-IR data structures
at a light level of detail (containing some pseudocode).

Please let us know if the level of detail contained in this PR
is adequate, or if there's more that's needed.

  • Each commit message has a non-empty body, explaining why the change was made.
  • Methods or procedures I have added are documented, following the guidelines provided in CODING_STANDARD.md.
  • The feature or user visible behaviour I have added or modified has been documented in the User Guide in doc/cprover-manual/
  • Regression or unit tests are included, or existing tests cover the modified code (in this case I have detailed which ones those are in the commit message).
  • My commit message includes data points confirming performance improvements (if claimed).
  • My PR is restricted to a single feature or bugfix.
  • White-space or formatting changes outside the feature-related changed lines are in commits of their own.

@codecov
Copy link

codecov bot commented Jan 10, 2023

Codecov Report

Base: 78.50% // Head: 78.50% // No change to project coverage 👍

Coverage data is based on head (e55d40e) compared to base (3f578e1).
Patch has no changes to coverable lines.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #7470   +/-   ##
========================================
  Coverage    78.50%   78.50%           
========================================
  Files         1663     1663           
  Lines       191297   191297           
========================================
  Hits        150174   150174           
  Misses       41123    41123           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

at the IR level, and effectively it's the following ADT (in pseudocode):

```js
type goto_functiont {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to link this to the actual code or generated doxygen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can link to the file using raw markdown, but I don't think it can link to an actual code structure.

I need to look more into doxygen and its linking capabilities.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can link it, it might help increase the longevity. See #7470 (comment) for my experiences with the lifespan of documentation. The last thing I want is for you to put in the effort documenting things and then, in 2 years, have to do it all again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to link to either the actual code or the generated doxygen output for these things. Duplicating the documentation of the same thing in two places is not sustainable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the final documentation, doxygen is automatically linking class names to the class documentation it produces, so when the class names are written in full they get automatically picked up by it and linked.

However, given that these are central data structures and calcified by now (i.e. not experiencing significant change with regards to the actual components of the class), I'd rather the pseudocode representation stays as a fast-track way for someone to understand the essence of these data structures without the informational load and visual clutter that the class documentation pages in doxygen have.

This is an `irept`. `irep`s are the central data structure that model most entities inside
CBMC and the assorted tools - effectively a node/map like data structure that forms a hierachical
tree that ends up modeling graphs like ASTs, CFGs, etc. (This will be further discussed in
a dedicated page).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure we should commit TODO notes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was not so much as a TODO, rather, it's a small digression to guide user expectation (that they should expect a dedicated topic on this that they need to look for into the documentation).

Do you think this should leave, or is there a better way to phrase this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe put in a link when it is available?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on not adding a TODO/promise. Maybe give a really brief overview and link to the source code/doxygen?

# Central Data Structures

The following is some light technical documentation of the major data structures
represented in the input transformation to Intermediate Representation (IR) inside
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"represented" -> "used"
Editing note, try not to use the same word for two different things in one sentence.

Comment on lines 9 to 10
The `goto_modelt` is the main data structure that CBMC (and the other tools) use to
represent GOTO-IR (the `GOTO` Intermediate Representation).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"use to ..." -> "use for the GOTO-IR (GOTO Intermediate Representation)."

## goto_functiont

A `goto_functiont` is also a product type. It's designed to represent a function
at the IR level, and effectively it's the following ADT (in pseudocode):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"IR" not defined. Above the definition was only for "GOTO-IR".

"and effectively it's the following ..." -> "like the following ..."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intermediate representation (IR) is defined in line 4 of the same file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite, you defined "GOTO-IR", not "IR". I propose change line 4:
"GOTO-IR (GOTO Intermediate Representation)" to something like "GOTO intermediate representation (IR) denoted GOTO-IR"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why say IR when we could just say GOTO?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why say IR when we could just say GOTO?

Because GOTO is a CBMC concept, and IR is a lot more likely to be a known concept to a computer scientist, so by virtue of connecting the two concepts, we are leveraging people's pre-existing understanding of a general concept to relate it to the internal one.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You already explained that in the first paragraph. Repeating it again is redundant.

Comment on lines 41 to 43
Of the two types listed above, the `parameters` one should be self-documenting,
(the values of these are later looked up in the symbol table), so we're going to
focus next on the type `goto_programt`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stylistically you should describe the parameters in order and consistently (either the type or the name), maybe something like:
"The rest of this section details the body and goto_programt, the parameters are a self-documenting list of identifiers that are later looked up in the symbol table."

Comment on lines 69 to 70
* A `goto_instruction_codet` is a `codet` (a data structure broadly representing a statement
inside CBMC) that contains the actual GOTO-IR instruction.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So a GOTO-IR modelt contains GOTO-IR functionts which contains a list of GOTO-IR bodys which are GOTO-IR programts that contain GOTO-IR instructionts that contain an instruction that contains GOTO-IR instruction_codet that contain GOTO-IR codet that contain GOTO-IR instruction_codet that is a GOTO-IR codet that contains GOTO-IR instructiont... and this is not at all confusing or circular?

Or have I misread the documentation to this point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't choose the class names 😅

Yes, your understanding is more or less correct. I'll see if I can simplify it further, but I can't promise massive improvements - this is just the description of the status quo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the type names involved seem a little obtuse, which is just why it is so important to document them. Some of the relationships as expressed by @TGWDB are not quite right however. Given that this has been misunderstood by one person already, it suggests that we haven't managed to explain the relationships sufficiently clearly in this documentation.

  • A model contains a collection of functions.
  • Each function contains a single "body" which is of type goto_programt.
  • A "body" contains an ordered collection of instructiont.
  • An instructiont may optionally contain a "code" expression which is of type codet. codet and goto_instruction_codet are currently defined to be the same type, but this is not guaranteed to be the case in future.

Note that a function has a single "body" not a collection of them. The fact that a function body is of type goto_programt is the especially confusing point, as it is not an entire program! The set of data structures are not circular. There are 4 container ship relations between 5 levels of data.

The instructiont class type works a bit like a tagged union. For some values of the instruction type, the code member is constrained to be a particular sub-class of codet. For others such as the SKIP instruction type, the code member is expected to be empty (is ID_nil). Given that instructions also have other data members intructiont is both a sum and product type.

This is an `irept`. `irep`s are the central data structure that model most entities inside
CBMC and the assorted tools - effectively a node/map like data structure that forms a hierachical
tree that ends up modeling graphs like ASTs, CFGs, etc. (This will be further discussed in
a dedicated page).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on not adding a TODO/promise. Maybe give a really brief overview and link to the source code/doxygen?


A `goto_modelt` is effectively a product type, consisting of:

* A list of GOTO functions (pseudocode: `type goto_functionst = list<goto_functiont>`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest describing this as "A collection of GOTO functions" rather than a list, in order to avoid readers potentially conflating this with the linked list data structure. The exact data structure used within the class is a map, but this is not relevant to the point here in the documentation.

Copy link
Contributor

@thomasspriggs thomasspriggs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation in this PR provides a reasonable summary of each of the individual classes involved. However I think this is still missing an overview of the way they relate to each other. This should be separate from the sections documenting the individual classes. It would be nice to have these relationships shown as a diagram.

## goto_functiont

A `goto_functiont` is also a product type. It's designed to represent a function
at the IR level, and effectively it's the following ADT (in pseudocode):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why say IR when we could just say GOTO?

at the IR level, and effectively it's the following ADT (in pseudocode):

```js
type goto_functiont {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to link to either the actual code or the generated doxygen output for these things. Duplicating the documentation of the same thing in two places is not sustainable.

}
```

Of the two types listed above, the `parameters` one should be self-documenting,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree with your assertion that "parameters" is self documenting. We should explain that a parameter identifier can be looked up in the symbol table (they do need to exist in the symbol table right?!). This may be obvious for an experienced CBMC developer but we should spell it out for those who are new to cbmc. This is important as an inexperienced developer could do something like constructing a goto function which has parameters which don't exist in the symbol table. We should also explain what information the parameter symbol should contain. Does the parameter symbol have the value set to be an instance of code_typet::parametert?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

type goto_instructiont {
instruction_type instr_type
instruction statement
guard boolean_expr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is source_location missing from this list?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Among others? There are indeed various other data members, not all of which we should keep in there long-term, but if you are seeking to document the current state then all of them should be in here?!

Comment on lines 69 to 70
* A `goto_instruction_codet` is a `codet` (a data structure broadly representing a statement
inside CBMC) that contains the actual GOTO-IR instruction.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the type names involved seem a little obtuse, which is just why it is so important to document them. Some of the relationships as expressed by @TGWDB are not quite right however. Given that this has been misunderstood by one person already, it suggests that we haven't managed to explain the relationships sufficiently clearly in this documentation.

  • A model contains a collection of functions.
  • Each function contains a single "body" which is of type goto_programt.
  • A "body" contains an ordered collection of instructiont.
  • An instructiont may optionally contain a "code" expression which is of type codet. codet and goto_instruction_codet are currently defined to be the same type, but this is not guaranteed to be the case in future.

Note that a function has a single "body" not a collection of them. The fact that a function body is of type goto_programt is the especially confusing point, as it is not an entire program! The set of data structures are not circular. There are 4 container ship relations between 5 levels of data.

The instructiont class type works a bit like a tagged union. For some values of the instruction type, the code member is constrained to be a particular sub-class of codet. For others such as the SKIP instruction type, the code member is expected to be empty (is ID_nil). Given that instructions also have other data members intructiont is both a sum and product type.

Another important data structure that needs to be discussed at this point is
`source_locationt`.

This is an `irept`. `irep`s are the central data structure that model most entities inside
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the fact that source_locationt inherits from irept relevant to this part of the documentation? I would consider this to be a low-level implementation detail and omit it from the high level overview.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but it might be important to cover irept by itself?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have separated the explanation of irept itself and added further details about it.


It might be possible that a specific source location might point to a CBMC instrumentation
primitive (which might be reported as a `built-in-addition`) or there might even be no-source
location (because it might be part of harnesses generated as an example, that have no presence
Copy link
Contributor

@thomasspriggs thomasspriggs Jan 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⛏️ "harness" is not well a well defined at this point in the documentation. We use it to mean the cbmc generated code surrounding the entry point specified to cbmc. This generated harness initialises global values, and the arguments for the entry points parameters. However my understanding is that there are users who write their own "harness" functions in C around the function they are verifying. So in this context there would be 2 harnesses when cbmc runs. The cbmc harness which calls the user written harness which calls the function the user is verifying.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of object to the notion that we allow code with no location. We don't just generate code out of thin air for no reason. I would think that every function that can generate code should be required to take a location parameter, and I'd like to see us add an assertion that it is non-empty.

We just went through this with the function contracts and had to go all the way back to the front end to make sure that the necessary location information information was being piped down to the point where code was being generated. But, the result is code that is user debuggable.

@thomasspriggs
Copy link
Contributor

FYI - For the benefit of those who may be watching this PR, the work to address the comments on this has been handed over to myself from Fotis.

@thomasspriggs thomasspriggs requested a review from a team as a code owner January 27, 2023 19:19

A `goto_modelt` is effectively a pair, consisting of:

* A list of GOTO functions,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps say "A collection" (the precise data type is immaterial here, but people might wrongly be led to think it actually is a list).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its actually a mapping from the function identifier to the GOTO function.

A `goto_modelt` is effectively a pair, consisting of:

* A list of GOTO functions,
* A symbol table containing symbol references for the symbols contained in the GOTO functions.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a "symbol reference?" Also, shouldn't the abstraction level be kept consistent and this just be "A collection of the symbols used in GOTO functions."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now reworked the explanation around symbols / the symbol_table.

Comment on lines 62 to 71
In pseudocode, the type looks this:

```js
type goto_modelt {
type goto_functionst = list<goto_functiont>
type symbol_tablet = map<identifier, symbolt>
}
```
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this add value?

a more elaborate explanation in the next section.

The `parameters` subcomponent is a list of identifiers that are to be looked-up
in the symbol-table for their values.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "values" the most appropriate term here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"value" is a word which is a little too general / overloaded in this context. I have gone for "definition" instead for the moment.

Comment on lines 94 to 97
A `goto_programt` is a list of GOTO instructions. In pseudocode, it would
look like `type goto_programt = list<goto_instructiont>`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the deal with pseudocode? Is that sentence adding any value?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed this particular example of pseudocode.

type goto_instructiont {
instruction_type instr_type
instruction statement
guard boolean_expr
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Among others? There are indeed various other data members, not all of which we should keep in there long-term, but if you are seeking to document the current state then all of them should be in here?!

Another important data structure that needs to be discussed at this point is
`source_locationt`.

This is an `irept`. `irep`s are the central data structure that model most entities inside
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but it might be important to cover irept by itself?

Copy link
Collaborator

@tautschnig tautschnig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It only occurred to me (thanks to a reminder by @zhassan-aws) that we have https://github.com/diffblue/cbmc/blob/develop/src/goto-programs/README.md. Can I please ask that we do a decent editor's job and avoid having similar information in multiple places?

@thomasspriggs
Copy link
Contributor

It only occurred to me (thanks to a reminder by @zhassan-aws) that we have https://github.com/diffblue/cbmc/blob/develop/src/goto-programs/README.md. Can I please ask that we do a decent editor's job and avoid having similar information in multiple places?

The documentation being added in this PR is intended to be an architectural overview which explains how the different data structures fit together and where the detailed documentation for each of them can be found. Of a necessity, a less detailed overview will include information which is explained in more detail in the documentation which is specific to each data structure. It is also expected that this architectural overview will include points which are not directly relevant to goto-programs, such as the section we have now added on irept.

I can update the documentation added here to link to goto-programs/README.md

@tautschnig
Copy link
Collaborator

I can update the documentation added here to link to goto-programs/README.md

Yes, please make sure there is cross-referencing and also no contradicting information (redundant information likely is unavoidable, but that's ok).


It might be possible that a specific source location might point to a CBMC instrumentation
primitive (which might be reported as a `built-in-addition`) or there might even be no-source
location (because it might be part of harnesses generated as an example, that have no presence
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of object to the notion that we allow code with no location. We don't just generate code out of thin air for no reason. I would think that every function that can generate code should be required to take a location parameter, and I'd like to see us add an assertion that it is non-empty.

We just went through this with the function contracts and had to go all the way back to the front end to make sure that the necessary location information information was being piped down to the point where code was being generated. But, the result is code that is user debuggable.

@@ -111,7 +111,7 @@ A goto program is a sequence of GOTO instructions (`goto_instructiont`). For
details of the `goto_programt` class itself see the documentation in
[`goto_program.h`](../../src/goto-programs/goto_program.h). For further details
of the life cycle of goto programs see
[`src/goto-programs/README.md`](../../src/goto-programs/README.md).
[`src/goto-programs/README.md`](https://github.com/diffblue/cbmc/blob/develop/src/goto-programs/README.md).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will this be rendered as on diffblue.github.io?!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This link follows the same format as is found here -

[COMPILING.md](https://github.com/diffblue/cbmc/blob/develop/COMPILING.md#what-architecture)

this example is rendered to https://diffblue.github.io/cbmc/compilation-and-development.html

The original link I used was causing the check-doxygen job to fail with the error - /home/runner/work/cbmc/cbmc/doc/architectural/central-data-structures.md:120: warning: unexpected token TK_EOF as the argument of ref

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, but COMPILING.md isn't actually rendered on diffblue.github.io, while goto-programs/README.md is (the latter as https://diffblue.github.io/cbmc/group__goto-programs.html). It would be great to find a way so that people aren't sent back&forth between Doxygen-rendered content and GitHub rendered pages.

NlightNFotis and others added 21 commits February 3, 2023 18:19
To avoid introducing the new term `GOTO-IR` which is not widely used
within the codebase.
Is was not previously clear that `goto_modelt` was used as a storage
class.
Main suggests that there are other entities which hold goto, rather than
expressing that all the other data structures are contained within the
model.
So that we can add a relationship diagram to the central data structures
documentation page.
In order to prevent it being included at the root level of the
documentation site.
Because the data structure used is not a list and because we should
avoid being specific here in order to maintain a high level of
abstraction.
Because the symbol table contains the symbols; not symbol expressions.
To draw attention to the link to the `goto_programt` classes'
documentation. Note that class names are automatically linked in
generating the website.
This existence of an abstract interface was previously mentioned but not
properly explained or justified.
In order to avoid the overloaded word "value".
In order to make it easier to find the documentation concerning these
classes.
Because `irept` itself should be considered to live at a separate level
of abstraction from the other classes.
In order to fix doxygen error.
Copy link
Contributor Author

@NlightNFotis NlightNFotis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - Approved.

(Cannot give green tick because Github UI doesn't allow the person who raised a PR to approve or request changes).

Copy link
Collaborator

@tautschnig tautschnig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving with the agreement that #7497 will take care of cleanup.

@NlightNFotis NlightNFotis merged commit 28d235e into diffblue:develop Feb 7, 2023
@NlightNFotis NlightNFotis deleted the docs_goto_programs branch February 7, 2023 11:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

6 participants