diff --git a/src/util/README.md b/src/util/README.md index d955f320e45..df3bacc40f8 100644 --- a/src/util/README.md +++ b/src/util/README.md @@ -3,25 +3,72 @@ # Folder util -\author Martin Brain, Owen Jones +\author Martin Brain, Owen Jones, Chris Smowton \section data_structures Data Structures \ref util contains some of the key data-structures used in the CPROVER codebase. -\subsection irept_section irept: a 4-tuple (data, named-sub, comments, sub) +\subsection irept_section irept: a general-purpose polymorphic tree -See documentation at \ref irept. +See detailed documentation at \ref irept. -As that documentation says, [irept](\ref irept)s are generic tree nodes. You -should think of them as having a single string ([data](irept::data), actually -an \ref irep_idt) and lots of child nodes, some of which are numbered +[irept](\ref irept)s are generic tree nodes. You +should think of each node as holding a single string ([data](irept::data), +actually an \ref irep_idt) and lots of child nodes, some of which are numbered ([sub](\ref irept::dt::sub)) and some of which are labelled, and the label can either start with a “\#” ([comments](\ref irept::dt::comments)) or without one ([named_sub](\ref irept::dt::named_sub)). The meaning of the “\#” is that -this child should not be considered important, for example it shouldn't be -counted when comparing two [irept](\ref irept)s for equality. +this child shouldn't be counted when comparing two [irept](\ref irept)s for +equality; this is usually used when making an advisory annotation which does +not alter the semantics of the program. + +They are used to represent many kinds of structured objects throughout the +CPROVER codebase, such as expressions, types and code. An \ref exprt represents +expressions, including for example \ref equal_exprt, an equality predicate, or +\ref dereference_exprt, which represents the `*` unary dereference operator +found in C. A \ref typet represents a type, and may have other \ref typet nodes +as children: for example, \ref array_typet represents an array, and has a single +numbered child that gives the type of the array elements. Finally, \ref codet +represents imperative statements, such as \ref code_assignt, which represents an +imperative assignment. It has two numbered operands, its left- and right-hand +sides. + +Note that \ref exprt, \ref codet and similar classes deriving from \ref irept +do so in order to add constraints to the general tree (for example, providing +accessor methods with user-friendly names, or enforcing invariants that a node +must have a certain number of children), but do not override virtual methods or +add fields. + +The implementation of \ref irept allows saving memory by sharing instances of +its internal storage using a +[copy-on-write](https://en.wikipedia.org/wiki/Copy-on-write) scheme. For +example, the statement `irep1.sub()[0] = irep2;` will share rather than copy +`irep2` and its children, saving memory unless either irept is subsequently +modified, at which point a copy is made. + +\subsection irept_discussion_section Discussion + +Many other compiler infrastructures represent a polymorphic tree using nodes +specialised to a particular expression type: for example, perhaps a binary +addition operator could be represented using a tag plus two pointers to child +expressions, rather than allocating a whole irept (including a variable-length +expression vector and empty maps for storing named subexpressions). This may +save memory, but inhibits ad-hoc annotations such as tagging the addition +"does not overflow" without adding that field to addition operations globally +or maintaining a shadow data structure to track that information. This is easier +with a general irept structure that can store an arbitrary number of +arbitrarily-named child nodes. + +Another optimisation commonly seen when storing polymorphic trees is to use a +uniform node data structure (like irept) but to keep the node compact, for +example storing at most four pointers and transparently allocating extension +nodes when necessary for an unusually large expression. This provides the best +of both worlds, obtaining compact storage for common cases such as unannotated +binary expressions while permitting deviation at times. The downside is that +implementing such a system is more complex than straightforwardly using C++ +standard data structures as in irept. \subsection irep_idt_section Strings: dstringt, the string_container and the ID_*