Skip to content

GOTO language documentation #7471

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
remi-delmas-3000 opened this issue Jan 10, 2023 · 12 comments
Closed

GOTO language documentation #7471

remi-delmas-3000 opened this issue Jan 10, 2023 · 12 comments
Assignees
Labels
aws Bugs or features of importance to AWS CBMC users documentation

Comments

@remi-delmas-3000
Copy link
Collaborator

This thread is to collect needs for the GOTO language documentation effort we're starting in 2023.

The goal is to use this thread to scope the documentation and collect questions the documentation is supposed to answer.

  • which fragment of GOTO shall we document ?
  • what aspects of the language shall we document ?
    • symbols and symbol tables,
    • goto models,
    • goto functions,
    • goto programs,
    • goto expressions abstract syntax,
    • typchecking rules for instructions and expressions
    • block structure and loops for goto programs,
    • three level indexing (as needed to express trace semantics),
    • stack vs heap model,
    • built-in functions and predicates,
    • supporting functions such as __CPROVER_start, __CPROVER_initialize,
    • supporting state variables describing platform arch and memory safety check instrumentation,
    • binary file format
    • trace language syntax
    • trace semantics (or rather, how to map a trace back to)
@NlightNFotis
Copy link
Contributor

We already have a documentation PR we just raised as part of our internal efforts to document more of CBMC.

It's at #7470 and covers the GOTO IR data structures with some light documentation.

@celinval
Copy link
Collaborator

Awesome! Is it possible to also document the goto file format?

@martin-cs
Copy link
Collaborator

martin-cs commented Jan 11, 2023 via email

@martin-cs
Copy link
Collaborator

@remi-delmas-3000 I really want to support this effort because it is so very needed. So please read this as "lessons learnt for the language documentation projects 2014, 2015, etc.":

  • Decide if you are writing documentation for developers / system integrators, beginning users or advanced users. See earlier discussions on the purpose of different bits of documentation.
  • There is already quite a bit of documentation. It is scattered. Bringing it together and rationalising is a worthwhile thing to do. Maintenance of documentation must be included in your plan to avoid this happening again.
  • Lifespan of up-to-date documentation, from best to worse:
  1. Checks run during normal program operation
  2. Checks run in debug mode
  3. Unit / regression tests run in CI
  4. Comments inline.
  5. Everything in doc/
  6. Anything out of the repository

@remi-delmas-3000 remi-delmas-3000 added the aws Bugs or features of importance to AWS CBMC users label Jan 11, 2023
@remi-delmas-3000
Copy link
Collaborator Author

@martin-cs thanks for jumping in !

There will be two parallel documentation efforts :

  1. (led by Diffblue) document GOTO as seen from inside CBMC for developers
  2. (led by AWS) document GOTO as abstract language for users who need to :
  • produce GOTO from their own source language (they need to understand the abstract syntax/semantics of GOTO),
  • would like to design/specify instrumentation rules for their programs,
  • would like to write a GOTO interpreter that would be capable of validating a counter example trace produced by a GOTO verification tool.

For 2. we envision a document in natural language.

I would recommend getting the documentation and, more importantly, the
testing and checking of the key data structures really good before you
attempt this.

Agreed. We need to understand what is there before attempting to describe it and I'm expecting this to be a bottom-up effort, starting from the implementation.

  1. is annoying. They do change from time to time and I don't think
    anyone is claiming they are so good that we should freeze them. There
    is a good-faith effort to change the version number of GOTO files when
    this happens but I don't think anyone tests it so...

We mostly want to describe the abstract syntax of GOTO models (what's a goto model, a goto function, a goto program, a goto instruction, what operators are available in the expression language, and their semantics, i.e. small-step transition relation over concrete states). This could be similar to GOTO semantics by translation to FOL for instance.

Selecting a concrete syntax could come in a second step.

  1. is more insideous and troubling and why this documentation is so needed but also hard. How constructs are expressed changes as and when things need to be fixed. They are still structurally the same but they use the ireps differently. This is why JSON Symtab based tools tend to break with an inexplicable error deep inside an entirely unrelated part of the code.

I thought that defining how things are modelled is mostly the role source-to-GOTO conversion. Do you know how often the meaning basic GOTO operators or instruction types are changed by changing their interpretation in Symex or when translating the SSA format to SAT or SMT ?

@martin-cs
Copy link
Collaborator

martin-cs commented Jan 11, 2023 via email

@celinval
Copy link
Collaborator

Thanks @martin-cs for jumping in. If not goto binaries, which language do you suggest that we use to interface with CBMC?

@martin-cs
Copy link
Collaborator

@celinval I think goto binaries are a really good way of passing programs between tools. The goto-cc / goto-instrument / analysis tool (cbmc, goto-analyzer, etc.) split has shown to be a good architecture.

What I had meant to communicate in my previous response was that I strongly suggest using the CPROVER code for creating / reading / writing these because it insulates you from a number of drift and updating issues.

In cases where this is not possible (for example, you are not writing a tool in C++) then it's a bit more awkward. symtab2gb and the JSON symtab was our best attempt at this. It isn't ideal but it has worked for a number of years. See #7042 (comment) and the following discussion for some more of the history / context for this.

HTH

@celinval
Copy link
Collaborator

Hi @martin-cs, our tool is written in Rust. We do currently use the JSON format and symtab2gb but it scales poorly. The JSON files are often above 1GB and symtab2gb takes a considerable amount to convert them to goto.

@martin-cs
Copy link
Collaborator

@celinval Some thoughts:

  • We ran into similar issues with symtab2gb while working on https://github.com/diffblue/gnat2goto . If you want a quick band-aid, may I suggest getting symtab2gb to read zipped files. CPROVER already links zlib so it should be pretty trivial to implement and will give you a significant saving in space. I suspect some careful looking at caching in symtab2gb will improve speed. However these are just plasters.
  • Proposal of Interface between Kani/CProver tools #7042 contains the best discussion of interfacing issues. The key points I took away from this were:
  • GOTO programs really are the best thing to pass between tools.
  • It is best to have the CPROVER C++ read/write these. This means you either have to A. wrap it using a foreign function interface or B. have a text interface and a tool which is basically text to GOTO program.
  • B is easier than A but less performant. symtab2gb is an example of a B-style approach.
  • There is a separate, maybe orthogonal question of interfaces to instrumentation and analyses passes (on GOTO programs). The current design is for C++ executables for combinations of these (jbmc for example). Another approach is another wrapped API.

I wish I had better answers / clear guidance but I think this is pretty much the problem. HTH.

@celinval
Copy link
Collaborator

celinval commented Jan 16, 2023

What about using something like Protocol Buffers to make B more performant and less platform / language dependent?

If we are going to make GOTO programs the proper interface to model programs between tools as identified in #7042, I would hope that we would also create a well defined serialization / deserialization format for it.

@martin-cs
Copy link
Collaborator

Protocol buffers could work. TBH the big thing that json_symtab needs (IMHO) is a way of making reference to ireps. A lot of the verbosity is many copies of the same strings and expressions.

I agree that GOTO programs need either a (widely linkable, cross language) API or (more ideal) a documented format so people can write them. I guess my first point was that I think the current format is best accessed by the API. It may be less pain to produce a new serialisation format (maybe with protocol buffers?) than to document the current one and expect other API implementations to stay in sync.

( Obviously, all of this is easier if it is documented, particular if that is in an executable / testable form. )

One aspect of this that might be a good place to start and try out ideas are the various irept child classes. These are necessary for any kind of wrapped or documented-and-built-out-of-tree API for GOTO programs. They (std_expr.h, std_types.h etc.) are highly formulaic and could clearly be auto-generated. As referenced in #7042 there was even a system that did this. Anything that can be shown to work for these has a good chance of working for GOTO programs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
aws Bugs or features of importance to AWS CBMC users documentation
Projects
Development

No branches or pull requests

7 participants