Skip to content

Commit a6d2e2a

Browse files
authored
Merge pull request #2802 from marek-trtik/doc_add_goto_binary
DOC-43: Documentation of GOTO model (de)serialisation.
2 parents d73d2db + 6cd614b commit a6d2e2a

File tree

2 files changed

+228
-0
lines changed

2 files changed

+228
-0
lines changed

src/goto-programs/README.md

Lines changed: 180 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -329,3 +329,183 @@ This stage concludes the *analysis-independent* program transformations.
329329
\subsubsection goto-program-example-1-section Unsigned mult (unsigned a, unsigned b) { int acc, i; for (i = 0; i < b; i++) acc += a; return acc; }
330330

331331
To be documented.
332+
333+
334+
\section section-goto-binary Binary Representation
335+
336+
An instance of `::goto_modelt` can be serialised to a binary stream (which is
337+
typically a file on the disk), and later deserialised from that stream back to
338+
an equivalent `::goto_modelt` instance.
339+
340+
\subsection subsection-goto-binary-serialisation Serialisation
341+
342+
The serialisation is implemented in C++ modules:
343+
- `write_goto_binary.h`
344+
- `write_goto_binary.cpp`
345+
346+
To serialise a `::goto_modelt` instance `gm` to a stream `ostr` call the
347+
function `::write_goto_binary`, e.g. `write_goto_binary(ostr, gm)`.
348+
349+
The content of the written stream will have this structure:
350+
- The header:
351+
- A magic number: byte `0x7f` followed by 3 characters `GBF`.
352+
- A version number written in the 7-bit encoding (see [number serialisation](\ref irep-serialization-numbers)). Currently, only version `4` is supported.
353+
- The symbol table:
354+
- The number of symbols in the table in the 7-bit encoding.
355+
- The array of individual symbols in the table. Each written symbol `s` has this structure:
356+
- The `::irept` instance `s.type`.
357+
- The `::irept` instance `s.value`.
358+
- The `::irept` instance `s.location`.
359+
- The string `s.name`.
360+
- The string `s.module`.
361+
- The string `s.base_name`.
362+
- The string `s.mode`.
363+
- The string `s.pretty_name`.
364+
- The number `0` in the 7-bit encoding.
365+
- The flags word in the 7-bit encoding. The bits in the flags word correspond to the following `Boolean` fields (from the most significant bit):
366+
- `s.is_weak`
367+
- `s.is_type`
368+
- `s.is_property`
369+
- `s.is_macro`
370+
- `s.is_exported`
371+
- `s.is_input`
372+
- `s.is_output`
373+
- `s.is_state_var`
374+
- `s.is_parameter`
375+
- `s.is_auxiliary`
376+
- `0` (corresponding to `s.binding`, i.e. we always clear this info)
377+
- `s.is_lvalue`
378+
- `s.is_static_lifetime`
379+
- `s.is_thread_local`
380+
- `s.is_file_local`
381+
- `s.is_extern`
382+
- `s.is_volatile`
383+
- The functions with bodies, i.e. those missing a body are skipped.
384+
- The number of functions with bodies in the 7-bit encoding.
385+
- The array of individual functions with bodies. Each written function has this structure:
386+
- The string with the name of the function.
387+
- The number of instructions in the body of the function in the 7-bit encoding.
388+
- The array of individual instructions in function's body. Each written instruction `I` has this structure:
389+
- The `::irept` instance `I.code`, i.e. data of the instruction, like arguments.
390+
- The string `I.function`, i.e. the name of the function this instruction belongs to.
391+
- The `::irept` instance `I.source_location`, i.e. the reference to the original source code (file, line).
392+
- The word in the 7-bit encoding `I.type`, i.e. the op-code of the instruction.
393+
- The `::irept` instance `I.guard`.
394+
- The empty string (representing former `I.event`).
395+
- The word in the 7-bit encoding `I.target_number`, i.e. the jump target to this instruction from other instructions.
396+
- The word in the 7-bit encoding `I.targets.size()`, i.e. the count of jump targets from this instruction.
397+
- The array of individual jump targets from this instruction, each written as a word in the 7-bit encoding.
398+
- The word in the 7-bit encoding `I.labels.size()`.
399+
- The array of individual labels, each written as a word in the 7-bit encoding.
400+
401+
An important propery of the serialisation is that each serialised `::irept`
402+
instance occurs in the stream exactly once. Namely, in the position of
403+
its first serialisation query. All other such queries save only a hash
404+
code (i.e. reference) of the `::irept` instance.
405+
406+
A similar strategy is used for serialisation of string constants
407+
shared amongst `::irept` instances. Such a string is fully saved only in
408+
the first serialisation query of an `::irept` instance it appears in and
409+
all other queries only save its integer hash code.
410+
411+
Details about serialisation of `::irept` instances, strings, and words in
412+
7-bit encoding can be found [here](\ref irep-serialization).
413+
414+
\subsection subsection-goto-binary-deserialisation Deserialisation
415+
416+
The deserialisation is implemented in C++ modules:
417+
- `read_goto_binary.h`
418+
- `read_goto_binary.cpp`
419+
- `read_bin_goto_object.h`
420+
- `read_bin_goto_object.cpp`
421+
422+
The first two modules are responsible for location of the stream with the
423+
serialised data within a passed file. And the remaining two modules
424+
perform the actual deserialisation of a `::goto_modelt` instance from
425+
the located stream.
426+
427+
To deserialise a `::goto_modelt` instance `gm` from a file
428+
`/some/path/name.gbf` call the function `::read_goto_binary`, e.g.
429+
`read_goto_binary("/some/path/name.gbf", gm, message_handler)`, where
430+
`message_handler` must be an instance of `::message_handlert` and serves
431+
for reporting issues during the process.
432+
433+
The passed binary file is assumed to have the same structure as described in
434+
the [previous subsection](\ref subsection-goto-binary-serialisation).
435+
The process of the deserialisation does not involve any seeking in the file.
436+
The content is read linearly from the beginning to the end. `::irept` instances
437+
and their string constants are deserialised into the memory only once at their
438+
first occurrences in the stream. All subsequent deserialisation queries are
439+
resolved as in-memory references to already deserialised the `::irept`
440+
instances and/or strings, based on hash code matching.
441+
442+
NOTE: The first deserialisation is detected so that the loaded hash code
443+
is new. That implies that the full definition follows right after the hash.
444+
445+
Details about serialisation of `::irept` instances, strings, and words in
446+
7-bit encoding can be found [here](\ref irep-serialization).
447+
448+
\subsubsection subsection-goto-binary-deserialisation-from-elf Deserialisation from ELF image
449+
450+
One can decide to store the serialised stream as a separate section, named
451+
`goto-cc`, into an ELF image. Then the deserialisation has a support of
452+
automatic detection of that section in an ELF file and the deserialisation
453+
will be automatically started from that section.
454+
455+
For reading the ELF images there is used an instance of `::elf_readert`
456+
implemented in the C++ module:
457+
- `elf_reader.h`
458+
- `elf_reader.cpp`
459+
460+
\subsubsection subsection-goto-binary-deserialisation-from-mach-o-fat-image Deserialisation from Mach-O fat image
461+
462+
One can decide to store the serialised stream into Mach-O fat image as a
463+
separate non-empty section with flags `CPU_TYPE_HPPA` and
464+
`CPU_SUBTYPE_HPPA_7100LC`. Then the deserialisation has a support of
465+
automatic detection of that section in a Mach-O fat image, extraction
466+
of the section from the emage into a temporary file (this is done by
467+
calling `lipo` utility with `-thin hppa7100LC` option), and the
468+
deserialisation will be automatically started from that temporary
469+
file.
470+
471+
For reading the Mach-O fat images there is used an instance of
472+
`::osx_fat_readert` implemented in the C++ module:
473+
- `osx_fat_reader.h`
474+
- `osx_fat_reader.cpp`
475+
476+
NOTE: This functionality is available only when the modules are built
477+
on a MacOS machine.
478+
479+
\subsubsection subsection-goto-binary-is-binary-file Checking file type
480+
481+
You can use function `::is_goto_binary` to check whether a passed file contains
482+
a deserialised `::goto_modelt` instance or not. This is done by checking the
483+
magic number of the stream (see subsection
484+
[Serialisation](\ref subsection-goto-binary-serialisation)). However, in the
485+
case when the deserialised data were stored into ELF or Mach-O fat image, then
486+
only the check for presence of the concrete section in the image is performed.
487+
488+
\subsubsection subsection-goto-binary-deserialisation-linking Linking Goto Models
489+
490+
Similar to linking of object files together by C/C++ linker, the module
491+
provides linking of a dereserialised `::goto_modelt` instance into a given
492+
(i.e. previously deserialised or otherwise created) `::goto_modelt` instance.
493+
494+
This is implemented in function `::read_object_and_link`. The function first
495+
deserialises the passed file into a temporary `::goto_modelt` instance, and
496+
then it performs 'linking' of the temporary into a passed destination
497+
`::goto_modelt` instance.
498+
499+
Details about linking of `::goto_modelt` instances can be found
500+
[here](\ref section-linking-goto-models).
501+
502+
503+
\section section-linking-goto-models Linking Goto Models
504+
505+
C++ modules:
506+
- `link_goto_model.h`
507+
- `link_goto_model.cpp`
508+
509+
Dependencies:
510+
- [linking folder](\ref linking).
511+
- [typecheck](\ref section-goto-typecheck).

src/util/README.md

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -281,3 +281,51 @@ To be documented.
281281
\subsubsection ast-example-3-section Java arrays: struct Array { int length, int *data };
282282

283283
To be documented.
284+
285+
\subsection section-goto-typecheck Goto Model Typecheck
286+
287+
Class `typecheckt`.
288+
289+
\subsection irep-serialization `irept` Serialization
290+
291+
The module provides serialisation and deserialisation of
292+
integer numbers, strings, and `irept` instances.
293+
294+
This is implemented in C++ modules:
295+
- `irep_serialization.h`
296+
- `irep_serialization.cpp`
297+
- `irep_hash_container.h` (applies only to `irept` instances)
298+
- `irep_hash_container.cpp` (applies only to `irept` instances)
299+
300+
\subsubsection irep-serialization-numbers Serialization of Numbers
301+
302+
A number is serialiased in 7-bit encoding. For example, given a 2-byte
303+
number in base 2, like `10101010 01010101`, then it is saves in 3 bytes,
304+
where each byte takes only 7 bits from the number, reading from the
305+
left. The 8th bit in each output byte is set to 1 except in the last
306+
byte, where the bit is 0. That 0 bit indicates the end of the
307+
encoding of the number. So, the output bytes look like this:
308+
`11010101 11010100 00000010`.
309+
310+
This is implmented in the function `::write_gb_word`.
311+
312+
The deserialisation does the oposite process and it is implemented in
313+
`irep_serializationt::read_gb_word`.
314+
315+
\subsubsection irep-serialization-strings Serialization of Strings
316+
317+
A string is encoded as classic 0-terminated C string. However,
318+
characters `0` and `\\` are escaped by writing additional `\\`
319+
before them.
320+
321+
This is implmented in the function `::write_gb_string` and the
322+
deserialisation is implemented in `irep_serializationt::read_gb_string`.
323+
324+
Each string which is stored inside an `::irept` instance is saved (meaining
325+
its characters) into the ouptut stream, only in the first serialisation
326+
query of the string. In that case, before the string there is also saved
327+
a computed integer hash code of the string. Then, all subsequent
328+
serialisation queries save only that integer hash code of the string.
329+
330+
\subsubsection irep-serialization-ireps Serialization of `irept` Instances
331+

0 commit comments

Comments
 (0)