-
Notifications
You must be signed in to change notification settings - Fork 274
Use let-expression for dereferenced pointers #4576
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use let-expression for dereferenced pointers #4576
Conversation
5fc1cba
to
70b510c
Compare
@tautschnig @kroening @peterschrammel TODOs done -- only data-gathering remains. |
03aecf6
to
62cb205
Compare
Here's a poster-child example: #include <assert.h>
#include <stdlib.h>
int main(int argc, char **argv) {
int **pptrs[50];
// Set up a straightforward vector of double-derefs that lead to
// seperate integer allocations, but do so in a complicated-enough
// way to avoid constant propagation solving the whole thing (in a
// real example some user-controlled control-flow writing to one or
// other of the arrays would provide the corresponding breakage):
for(int i = 0; i < 50; ++i) {
int **new_ptrs = malloc(2 * sizeof(int*));
new_ptrs[argc % 2] = (int*)malloc(sizeof(int) * 2);
new_ptrs[argc % 2][0] = 1;
new_ptrs[argc % 2][1] = 2;
new_ptrs[1 - (argc % 2)] = (int*)malloc(sizeof(int) * 2);
new_ptrs[1 - (argc % 2)][0] = 3;
new_ptrs[1 - (argc % 2)][1] = 4;
pptrs[i % argc] = new_ptrs;
}
// Now just try to read from the pointers -- this will generate a
// pretty big formula as the nested deref could point to any of 50
// objects, and then the outer deref must condition on its result to
// determine where the inner pointer leads.
assert(pptrs[argc % 50][argc % 2][argc % 2] == 5);
} Interestingly the multiple levels of arrays are necessary -- if straightforward derefs are used instead of array indices then this code (https://github.com/diffblue/cbmc/blob/develop/src/pointer-analysis/value_set_dereference.cpp#L40) recurses into the nested Result: Interestingly for this case the solver takes longer with the patch than without (26 seconds vs. 10), despite having very similar clause / variable counts after the simplifier / merge-ireps has got at them. This might be luck or might be systematic; if it is then I'd expect the main win here to come when the assertion violation we end up diagnosing doesn't directly depend on the big deref, or when there are a lot of pointer references meaning symex would be unlikely to finish without this patch. |
To elaborate on the remark about nested So, possibilities:
Advantage of 1: flexible -- the kind of expression is irrelevant, all pointer expressions use let binders |
62cb205
to
6f05022
Compare
Ran benchmarks on Romain's suite -- overall symex runtimes got worse by about 3-4%, though there were a couple of cases that got significantly faster. Will benchmark including the solver on Friday. I suspect the specific deref-of-deref hack is performing better than a let-bound temporary when it works correctly, so I should probably revise this to continue using that workaround when possible, and only resort to let-binding when the situation is more complicated. |
In fact it seems all the variation observed on that particular benchmark was due to the vagaries of Google's cloud, not any actual difference, as none of them reproduced on further inspection. Given this sort of thing only manifests in Java for multi-dimensional arrays it wouldn't be too surprising if the selection of Tika methods never uses the changed code. |
{ | ||
exprt let_value = let_expr.value(); | ||
clean_expr(let_value, state, false); | ||
let_value = state.rename(std::move(let_value), ns).get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as a note: some time ago @kroening proposed to make let
expressions be a scope of their own. Thus care would need to be taken as the let
expression could have symbols of local scope even when the name overlaps with something occurring elsewhere. Not the case right now, but needs care moving forward.
/// Particularly we want to make sure to insist on a local definition of \p expr | ||
/// is a large if-expression, such as `p == &o1 ? o1 : p == &o2 ? o2 : ...`, as | ||
/// can result from dereferencing a subexpression. | ||
static bool should_use_local_definition_for(const exprt &expr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least the implementation could just be replaced by using has_subexpr(ID_symbol)
, but maybe you want to remove the function altogether.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this looks for multiple symbol expressions, looking to permit x
, (int*)x
etc but not x[y]
or *(x + (y ? z : w))
or similar
src/util/nodiscard.h
Outdated
\*******************************************************************/ | ||
|
||
/// \file | ||
/// No-discard-result macro |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems independent and might warrant a PR of its own?
@@ -200,13 +200,14 @@ void goto_symext::symex_function_call_symbol( | |||
{ | |||
code_function_callt code = original_code; | |||
|
|||
symex_live_let_variablest live_let_variables(*this, state); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, when exactly do we (not) need a new such context? Can we possibly reuse frames for this purpose? It's another bit of state to track and the interplay with branching I'm not entirely sure of.
6f05022
to
65996f1
Compare
@tautschnig please re-review, have replaced the owning object with a |
5ad56b8
to
5b19b26
Compare
@tautschnig I added tests that check let-exprs are / are not introduced in several particular circumstances. I then noticed that one of the particular cases that's currently handled badly, I also have a fix for the |
f85ac5a
to
4ed2c1c
Compare
4ed2c1c
to
b733ab2
Compare
src/goto-symex/symex_dead.cpp
Outdated
ssa_exprt ssa = state.rename_ssa<L1>(ssa_exprt{code.symbol()}, ns).get(); | ||
void goto_symext::symex_dead(statet &state, const symbol_exprt &symbol_expr) | ||
{ | ||
ssa_exprt ssa = state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ugly clang-format
{ | ||
symbolt fresh_binder = get_fresh_aux_symbol( | ||
pointer.type(), | ||
"derefd_pointer", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should use the function_id here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think not, the temporary should not be in the same namespace as the function's locals. Could just be "symex" though, like the namespace occupied by invalid-object symbols
This is needed to handle nested dereferences when symex_dereference uses let expressions. It depends on a cheap-to-copy value_sett so that we can easily record and then later discard the local definition.
There are a couple of key patterns involving members that we can understand, such as member-of-typecast and member-of-byte-extract, neither of which work properly if we have an intervening let expression.
This executes let expressions like ordinary assignments, avoiding the need for all other components of cbmc to understand the little-used local definition expression. It also means that symex's usual constant propagator, value-set and other analyses are brought to bear on the definition just as if the front-end had used an instruction to make the definition.
For the time being symex always lifts these immediately, to avoid having to teach all elements of cbmc about the let expression.
We'll use the symex_dead overload that takes a symbol expression rather than an instruction to kill let-bound local definitions
For example, we're willing to directly quote `p`, `(type *)p`, `p[some_constant]` and other simple expressions directly when comparing a pointer against its possible aliases-- the case we really must prevent is a large if-expression like `q == &o1 ? o1 : q == &o2 ? o2 : ...` that can result from a nested dereference -- in this case we must use a let-bound temporary to avoid producing a very large expression that can founder subsequent algorithms such as simplification, renaming and irep-merging.
Taking inspiration from #return_value variables, which are always marked as hidden, since all let-bound variables encountered by symex are synthetic we mark those hidden too.
This is in preparation for adding post-step cleanup
This avoids both wasted storage due to let-bound variables staying in the L2 renaming map, propagator and/or value-set, and avoids generating pointless PHI-node instructions for them.
This results from expressions such as *(int*)*p, and is very nearly as easy to handle as the already-special-cased deref-of-if case. In essence we transform *(type*)(x ? y : z) into x ? *(type*)y : *(type*)z, which should yield a more compact expression due to not conflating the alias sets that can be derived from the two arms of the conditional.
b733ab2
to
87bcf4a
Compare
Seems to be the right way of going about solving this; will review in detail when I have time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. @smowton Could you please post some performance evaluation results? You did post some earlier, but there have been significant changes to this PR since. Feel free to merge once you've provided that additional data.
Total process-seconds to run the cbmc CORE tests: 700s (-20) |
Heh, I actually have another benchmark in progess... will post it here when it finishes anyhow |
Good news! It was also a neutral finding. |
Depends on #4574 and #4575.Since #4555 didn't really work out, here's my second stab at avoiding producing comically large expressions during nested dereferencing.In brief, if we have
**p
wherep
is the root of a binary tree with nodes L, R, LL, LR, RL, RR, then instead of*p
->p == &L ? L : R
and then**p
->(p == &L ? L : R) == LL ? LL : (p == &L ? L : R) == LR ? LR : ...
instead we getlet derefd = (p == &L ? L : R) in derefd == LL ? LL : derefd = LR ? LR : ...
. When we have a large alias set this avoids creating many duplicates of the inner dereference result.Anecdotally this speeds up a nested deref due to a two-dimensional array from "practically forever" to "around a second".
Putting up now for early review, but known TODOs:DataCheck whether the assignments associated with a let expression's temporary variable ought to be(gone for HIDDEN, given thatSTATE
orHIDDEN
assignments.#return_value
assignments use the same scheme)Forgo a(allowed expressions only involving onelet
in more cases when the existing pointer expression is small / simple (e.g. ifpointer
is(T*)p
there's not really any point in creating a temporary for it)symbol_exprt
to be used directly, such as(type *)p
andp[some_constant]
)Kill the temporaries created forlet
bound variables when they go out of scope, to avoid pointlessly carrying them around invalue_sett
, the constant propagator, etc for the rest of time(added an owning object returned by(added a tracking vector of instruction-local bound variables, which are killed at the end ofclean_expr
that kills the binders after the instruction is done executing)symex_step
)