GOTO language documentation #7471

remi-delmas-3000 · 2023-01-10T16:28:24Z

This thread is to collect needs for the GOTO language documentation effort we're starting in 2023.

The goal is to use this thread to scope the documentation and collect questions the documentation is supposed to answer.

which fragment of GOTO shall we document ?
what aspects of the language shall we document ?
- symbols and symbol tables,
- goto models,
- goto functions,
- goto programs,
- goto expressions abstract syntax,
- typchecking rules for instructions and expressions
- block structure and loops for goto programs,
- three level indexing (as needed to express trace semantics),
- stack vs heap model,
- built-in functions and predicates,
- supporting functions such as __CPROVER_start, __CPROVER_initialize,
- supporting state variables describing platform arch and memory safety check instrumentation,
- binary file format
- trace language syntax
- trace semantics (or rather, how to map a trace back to)

NlightNFotis · 2023-01-10T16:33:47Z

We already have a documentation PR we just raised as part of our internal efforts to document more of CBMC.

It's at #7470 and covers the GOTO IR data structures with some light documentation.

celinval · 2023-01-10T18:04:36Z

Awesome! Is it possible to also document the goto file format?

martin-cs · 2023-01-11T13:03:02Z

On Tue, 2023-01-10 at 10:04 -0800, Celina G. Val wrote: Awesome! Is it possible to also document the goto file format?

I don't want to be negative but I would not recommend it. GOTO files are a direct serialisation of a bunch of C++ data structures. This is intentional. They are intended to be "internal" to the tool, even more so that `.o` files are for compilers. They are not intended to be an interface point for external tools. If you want to read / write / manipulate them the intent is that you link the relevant bits of CPROVER. See also the discussions over cross language APIs and why JSON Symtab exists. Why? If we declare that an interface then we fix: 1. The syntax / fields / representation of the data structures. 2. How they are used to represent things. 1. is annoying. They do change from time to time and I don't think anyone is claiming they are so good that we should freeze them. There is a good-faith effort to change the version number of GOTO files when this happens but I don't think anyone tests it so... 2. is more insideous and troubling and why this documentation is so needed but also hard. How constructs are expressed changes as and when things need to be fixed. They are still structurally the same but they use the ireps differently. This is why JSON Symtab based tools tend to break with an inexplicable error deep inside an entirely unrelated part of the code. I would recommend getting the documentation and, more importantly, the testing and checking of the key data structures really good before you attempt this.

martin-cs · 2023-01-11T14:02:55Z

@remi-delmas-3000 I really want to support this effort because it is so very needed. So please read this as "lessons learnt for the language documentation projects 2014, 2015, etc.":

Decide if you are writing documentation for developers / system integrators, beginning users or advanced users. See earlier discussions on the purpose of different bits of documentation.
There is already quite a bit of documentation. It is scattered. Bringing it together and rationalising is a worthwhile thing to do. Maintenance of documentation must be included in your plan to avoid this happening again.
Lifespan of up-to-date documentation, from best to worse:

Checks run during normal program operation
Checks run in debug mode
Unit / regression tests run in CI
Comments inline.
Everything in doc/
Anything out of the repository

For developer documentation, I would recommend starting with goto-programt and working down (exprt) and up ('goto_modeltandsymbol_tablet`) as they are core to pretty much everything. See also RFC: Documenting and checking "normal form" properties of goto_models #6495
For 'must always' and 'must not' requirements, best would be a literate programming implementation in the various .validate() methods. That is exactly why they were introduced.
For 'may' requirements, best would be regression or unit tests, JSON symtab might be a route.
Be prepared inconsistency between different parts of the code about what are well structured programs / semantics ( for example What is a safe overapproximation of an OTHER instruction #4323 and Killing #return_value behaves inconsistently #4336 ), dead code and untested areas of the code base.

remi-delmas-3000 · 2023-01-11T17:08:09Z

@martin-cs thanks for jumping in !

There will be two parallel documentation efforts :

(led by Diffblue) document GOTO as seen from inside CBMC for developers
(led by AWS) document GOTO as abstract language for users who need to :

produce GOTO from their own source language (they need to understand the abstract syntax/semantics of GOTO),
would like to design/specify instrumentation rules for their programs,
would like to write a GOTO interpreter that would be capable of validating a counter example trace produced by a GOTO verification tool.

For 2. we envision a document in natural language.

I would recommend getting the documentation and, more importantly, the
testing and checking of the key data structures really good before you
attempt this.

Agreed. We need to understand what is there before attempting to describe it and I'm expecting this to be a bottom-up effort, starting from the implementation.

is annoying. They do change from time to time and I don't think
anyone is claiming they are so good that we should freeze them. There
is a good-faith effort to change the version number of GOTO files when
this happens but I don't think anyone tests it so...

We mostly want to describe the abstract syntax of GOTO models (what's a goto model, a goto function, a goto program, a goto instruction, what operators are available in the expression language, and their semantics, i.e. small-step transition relation over concrete states). This could be similar to GOTO semantics by translation to FOL for instance.

Selecting a concrete syntax could come in a second step.

is more insideous and troubling and why this documentation is so needed but also hard. How constructs are expressed changes as and when things need to be fixed. They are still structurally the same but they use the ireps differently. This is why JSON Symtab based tools tend to break with an inexplicable error deep inside an entirely unrelated part of the code.

I thought that defining how things are modelled is mostly the role source-to-GOTO conversion. Do you know how often the meaning basic GOTO operators or instruction types are changed by changing their interpretation in Symex or when translating the SSA format to SAT or SMT ?

martin-cs · 2023-01-11T17:38:10Z

On Wed, 2023-01-11 at 09:08 -0800, Rémi Delmas wrote: There will be two parallel documentation efforts : 1. (led by Diffblue) document GOTO as seen from inside CBMC for developers 2. (led by AWS) document GOTO as abstract language for users who need to : * produce GOTO from their own source language (they need to understand the abstract syntax/semantics of GOTO),

How do you envisage them creating these? Which interface are you going to use?

* would like to write a GOTO interpreter that would be capable of validating a counter example trace produced by a GOTO verification tool.

I trust you know about `goto-instrument --interpreter` right? Also, if you made use of `memory-analyzer/gdb_api.h` and @tautschnig's work on GOTO programs in ELF format, you could also probably run the actual binary back-to-back to check the modelling.

For 2. we envision a document in natural language.

I would encourage you to make examples into unit tests and the like. In the past natural language documentation has not aged well or stayed tidy.

> 1. is more insideous and troubling and why this documentation is > so needed but also hard. How constructs are expressed changes as > and when things need to be fixed. They are still structurally the > same but they use the ireps differently. This is why JSON Symtab > based tools tend to break with an inexplicable error deep inside an > entirely unrelated part of the code. I thought that defining how things are modelled is mostly the role source-to-GOTO conversion.

Yes... but... it's not just there. For example, the semantics of exceptions is really in a transformation pass. The semantics of the heap is ... kind of scattered, some in the front-end, some in `ansi- c/library/` some in the symex / interpreter, some in the instrumentation.

Do you know how often the meaning basic GOTO operators or instruction types are changed by changing their interpretation in Symex or when translating the SSA format to SAT or SMT ?

Ummm... that's a bit difficult to answer. It's more how they are interpreted. For example, in the parse tree an expression can include arbitrary side-effects, function calls, etc. By the time they are in a GOTO program, they are side-effect free, have some types removed and are used in a semantics where variables are mutable. Once they become VCs they are interpreted with an immutable, logic-based semantics. Is that what you mean? Cheers, - Martin

…

-- UCU staff are currently in dispute with employers over pay, pensions and working conditions. Part of the industrial action is 'action short of a strike'. This may affect the speed at which you get responses to e-mails and requests from some staff. Details here: https://www.ucu.org.uk/article/12469/FAQs

celinval · 2023-01-12T05:17:48Z

Thanks @martin-cs for jumping in. If not goto binaries, which language do you suggest that we use to interface with CBMC?

martin-cs · 2023-01-12T12:18:16Z

@celinval I think goto binaries are a really good way of passing programs between tools. The goto-cc / goto-instrument / analysis tool (cbmc, goto-analyzer, etc.) split has shown to be a good architecture.

What I had meant to communicate in my previous response was that I strongly suggest using the CPROVER code for creating / reading / writing these because it insulates you from a number of drift and updating issues.

In cases where this is not possible (for example, you are not writing a tool in C++) then it's a bit more awkward. symtab2gb and the JSON symtab was our best attempt at this. It isn't ideal but it has worked for a number of years. See #7042 (comment) and the following discussion for some more of the history / context for this.

HTH

celinval · 2023-01-12T20:52:19Z

Hi @martin-cs, our tool is written in Rust. We do currently use the JSON format and symtab2gb but it scales poorly. The JSON files are often above 1GB and symtab2gb takes a considerable amount to convert them to goto.

martin-cs · 2023-01-13T18:00:13Z

@celinval Some thoughts:

We ran into similar issues with symtab2gb while working on https://github.com/diffblue/gnat2goto . If you want a quick band-aid, may I suggest getting symtab2gb to read zipped files. CPROVER already links zlib so it should be pretty trivial to implement and will give you a significant saving in space. I suspect some careful looking at caching in symtab2gb will improve speed. However these are just plasters.
Proposal of Interface between Kani/CProver tools #7042 contains the best discussion of interfacing issues. The key points I took away from this were:

GOTO programs really are the best thing to pass between tools.
It is best to have the CPROVER C++ read/write these. This means you either have to A. wrap it using a foreign function interface or B. have a text interface and a tool which is basically text to GOTO program.
B is easier than A but less performant. symtab2gb is an example of a B-style approach.
There is a separate, maybe orthogonal question of interfaces to instrumentation and analyses passes (on GOTO programs). The current design is for C++ executables for combinations of these (jbmc for example). Another approach is another wrapped API.

I wish I had better answers / clear guidance but I think this is pretty much the problem. HTH.

celinval · 2023-01-16T20:13:44Z

What about using something like Protocol Buffers to make B more performant and less platform / language dependent?

If we are going to make GOTO programs the proper interface to model programs between tools as identified in #7042, I would hope that we would also create a well defined serialization / deserialization format for it.

martin-cs · 2023-01-18T12:52:29Z

Protocol buffers could work. TBH the big thing that json_symtab needs (IMHO) is a way of making reference to ireps. A lot of the verbosity is many copies of the same strings and expressions.

I agree that GOTO programs need either a (widely linkable, cross language) API or (more ideal) a documented format so people can write them. I guess my first point was that I think the current format is best accessed by the API. It may be less pain to produce a new serialisation format (maybe with protocol buffers?) than to document the current one and expect other API implementations to stay in sync.

( Obviously, all of this is easier if it is documented, particular if that is in an executable / testable form. )

One aspect of this that might be a good place to start and try out ideas are the various irept child classes. These are necessary for any kind of wrapped or documented-and-built-out-of-tree API for GOTO programs. They (std_expr.h, std_types.h etc.) are highly formulaic and could clearly be auto-generated. As referenced in #7042 there was even a system that did this. Anything that can be shown to work for these has a good chance of working for GOTO programs.

NlightNFotis added the documentation label Jan 10, 2023

NlightNFotis assigned NlightNFotis, esteffin, thomasspriggs and TGWDB Jan 10, 2023

remi-delmas-3000 added the aws Bugs or features of importance to AWS CBMC users label Jan 11, 2023

NlightNFotis added this to GOTO-program standardisation Jan 30, 2023

thomasspriggs moved this to Done in GOTO-program standardisation Mar 23, 2023

thomasspriggs closed this as completed Mar 23, 2023

martin-cs mentioned this issue May 8, 2024

RFC : turn the interpreter into the reference semantics for GOTO programs #8277

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GOTO language documentation #7471

GOTO language documentation #7471

remi-delmas-3000 commented Jan 10, 2023

NlightNFotis commented Jan 10, 2023

celinval commented Jan 10, 2023

martin-cs commented Jan 11, 2023 via email

martin-cs commented Jan 11, 2023

remi-delmas-3000 commented Jan 11, 2023

martin-cs commented Jan 11, 2023 via email

celinval commented Jan 12, 2023

martin-cs commented Jan 12, 2023

celinval commented Jan 12, 2023

martin-cs commented Jan 13, 2023

celinval commented Jan 16, 2023 •

edited

Loading

martin-cs commented Jan 18, 2023

GOTO language documentation #7471

GOTO language documentation #7471

Comments

remi-delmas-3000 commented Jan 10, 2023

NlightNFotis commented Jan 10, 2023

celinval commented Jan 10, 2023

martin-cs commented Jan 11, 2023 via email

martin-cs commented Jan 11, 2023

remi-delmas-3000 commented Jan 11, 2023

martin-cs commented Jan 11, 2023 via email

celinval commented Jan 12, 2023

martin-cs commented Jan 12, 2023

celinval commented Jan 12, 2023

martin-cs commented Jan 13, 2023

celinval commented Jan 16, 2023 • edited Loading

martin-cs commented Jan 18, 2023

celinval commented Jan 16, 2023 •

edited

Loading