Cache dereferences during symex to avoid repeatedly inserting large dereferences into SSA #5964
Conversation
Codecov Report
```
@@            Coverage Diff            @@
##           develop    #5964    +/-   ##
==========================================
- Coverage    75.12%   74.01%   -1.11%
==========================================
  Files         1435     1444       +9
  Lines       156301   157344    +1043
==========================================
- Hits        117416   116466     -950
- Misses       38885    40878    +1993
```
Force-pushed 361852a to d810eb8.
This is a very interesting patch and it would be great to have it in. I think it would be good to have some extensive performance evaluation.
@martin-cs I added some additional explanation and some performance data we've measured on awslabs/s2n.
LGTM
LGTM. Could do with a little clean-up, but is otherwise great.
@martin-cs Also follow up with examples from
Force-pushed 6f53774 to f5fd138.
@thomasspriggs I'm not that familiar with the jbmc regression tests. Would it be OK to add the changes there in a later PR? Manual checking shows the tests ought to still pass with the flag turned on; I just don't have the time to do the edits right now.
Yes, I am fine with that, which is why I approved.
Mostly looks good to me, but I really would like to understand whether it was a conscious decision not to use sharing_map; if it wasn't, then I'd ask for the experiments to be re-run.

Also, could the commit message please not use the "CSE" abbreviation without providing any kind of context?
Thank you for all the updates!
Please can you simplify the option handling / default in the tests.
```
@@ -251,7 +252,8 @@ void run_property_decider(
   "                              iteration are allowed to fail due to\n" \
   "                              complexity violations before the loop\n" \
   "                              gets blacklisted\n" \
-  " --graphml-witness filename   write the witness in GraphML format to filename\n" // NOLINT(*)
+  " --graphml-witness filename   write the witness in GraphML format to filename\n" /* NOLINT(*) */ \
+  " --symex-cache-dereferences   enable caching of repeated dereferences" \
```
It would be good if new options followed the pattern of having OPT_ and HELP_ macros.
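For reference, a minimal sketch of that pattern; the macro names below are an assumption for illustration, not the final code:

```
// Option declaration, composed into the parser's option string:
#define OPT_SYMEX_CACHE_DEREFERENCES "(symex-cache-dereferences)"

// Matching help text, composed into the help output:
#define HELP_SYMEX_CACHE_DEREFERENCES                                         \
  " --symex-cache-dereferences   enable caching of repeated dereferences\n"
```

The parse-options class would then append OPT_SYMEX_CACHE_DEREFERENCES to its option string and HELP_SYMEX_CACHE_DEREFERENCES to its help text, instead of the inline string literal shown in the diff above.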
I'm not sure I really understand why this patch has an effect. This is bothering me because I wonder if there is actually some kind of bug and this is just covering it rather than resolving it. As I understand it, this only affects cases where there is a non-trivial expression generated by a dereference on the RHS of two or more expressions, replacing the normal expansion with a cached one; see the sketch below.
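A minimal sketch of the transformation in question (the cached form below is quoted verbatim in the reply further down; the source and expanded forms are assumptions consistent with that quote):

```
// Source: the same dereference on the RHS of two expressions.
x = *p;
y = *p;

// Normally this would become, with the dereference expanded at each use:
x = (p == &a) ? a : b;
y = (p == &a) ? a : b;

// But with this patch the expansion is assigned to a cache symbol once:
cached = (p == &a) ? a : b;
x = cached;
y = cached;
```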
Am I on the right track? As far as I can see this could affect performance in a number of ways:

1. Save on rewriting / SSA. I can see that this could have an effect, but if it is having the effect that @hannes-steffenhagen-diffblue is reporting at the high end of the results then we should be doing this more widely, and possibly profiling the SSA in those cases because it shouldn't be that slow.
2. It could change the number of clauses / variables used. I'd love to think that it reduces them, but what I can't understand is how the code above doesn't result in a cache hit in the solver's expression cache. If it misses then we should fix that cache. If it hits then I don't see how this could reduce the number of clauses.
3. It could just change the formula, which rerolls the SAT solver "dice", so some will get faster (possibly a lot) and some will get slower.

I do wonder whether this is an improvement not because of 1 and 2 but because it covers a bug that results in a cache miss in the solver for two dereferences of the same thing. Maybe something is re-SSA'd and so comes out with a new name unnecessarily? Maybe the ordering of the pointer set changes so the expressions are no longer syntactically identical?

@hannes-steffenhagen-diffblue can you pull out from your experimental runs the numbers of clauses and variables? That should help narrow down why this works.
Deleting this due to an error in the runtime data.
Replying to @martin-cs here. I'm going to add some comments and details here, apologies if this is incomplete (and please let me know if this is not clear). I'll try to respond inline.
Yes, but one subtle point here: the dereference does not have to be in two or more expressions. The following is also possible:
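```
if(*p == 10 || *p == 11 || *p == 21)
```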
to become
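```
if(((p == &a) ? a : b) == 10 || ((p == &a) ? a : b) == 11 || ((p == &a) ? a : b) == 21)
```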
to be rewritten as
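```
cached = ((p == &a) ? a : b);
if(cached == 10 || cached == 11 || cached == 21)
```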
This may be significant to caching and optimisation in other places (see also responses below). Further, in the example raised in #5494 the performance is improved by (effectively) splitting expressions and caching the dereference, which is why caching also operates within a single expression.
I speculate this may be where the subtle difference above is most significant, since this allows for caching of sub-expressions (i.e. dereferences). This may have some interaction with the solver expression cache, but it seems like dereferences should be cached consistently and so this should not have too much impact. (The exception would be when expressions are not cached in the solver due to appearing in a write or under quantification scope, but it is not clear to me that these would be cachable anyway.)

As for the number of clauses/variables, I don't have detailed data (I don't think it was logged, and I don't have the actual run data for some of the runs since they were done by others). However, I can comment that the result of using the dereference cache is generally to increase both the number of clauses and the number of variables. (This seems entirely expected to me since, as in the examples above, there needs to be an extra variable for the dereference_cache variable and also a clause, but the generally improved runtime implies this is a simplification in most cases, or at least changes the form in a way that the SAT solver finds easier. Of course, most observations and guesses about the SAT solver are somewhat speculative.)
This does match quite well with observations. That said, as I mentioned above there seems to be a simplification of the expressions and so this would not appear to purely be a reroll of the SAT solver "dice".
I hope I've given some clarity above that helps here. That said, I don't think there is a clear answer either above or in theory - it's difficult to prove the absence of a potential bug. I can see reasons why even with a bug-free SAT solver expression cache this PR would improve performance, but this does not rule out there being a bug there either.
I don't think Hannes will be working on this further; I may be able to gather some of this.
On Mon, 2021-03-29 at 02:41 -0700, TGWDB wrote:
I'm going to add some comments and details here, apologies if this is
incomplete (and please let me know if this is not clear). I'll try to
respond inline.
Thanks for the fast and helpful responses. It hasn't resolved my
concerns but it is good to discuss them. I guess fundamentally I think
we shouldn't be merging something that improves performance if we don't
have evidence of why it works. Especially if it adds complexity to a
critical part of the code.
I'd like to try to help gain the understanding of why it works.
<snip>
> but with this it would become
>
> ```
> cached = (p == &a) ? a : b;
> x = cached;
> y = cached;
> ```
>
> Am I on the right track?
Yes, but one subtle point here: the dereference does not have to
be in two or more expressions. The following is also possible:
```
if(*p == 10 || *p == 11 || *p == 21)
```
to become
```
if(((p == &a) ? a : b) == 10 || ((p == &a) ? a : b) == 11 || ((p == &a)
? a : b) == 21)
```
to be rewritten as
```
cached = ((p == &a) ? a : b);
if(cached == 10 || cached == 11 || cached == 21)
```
Thanks for the example.
This may be significant to caching and optimisation in other places
(see also responses below).
Further, in the example raised [here](#5494) the performance is
improved by (effectively) splitting expressions and caching the
dereference, which is why caching also operates within a single
expression.
Maybe we should work with this example to try to explore why it helps.
> As far as I can see this could affect performance in a number of
> ways:
>
> 1. Save on rewriting / SSA. I can see that this could have an
> effect but if it is having the effect that @hannes-steffenhagen-
> diffblue is reporting at the high-end of the results then we should
> be doing this more widely and possibly profiling the SSA in those
> cases because it shouldn't be that slow.
> 2. It could change the number of clauses / variables used. I'd
> love to think that it reduces them but what I can't understand is
> how the code above doesn't result in a cache hit in the solver's
> expression cache. If it misses then we should fix that cache. If
> it hits then I don't see how this could reduce the number of
> clauses.
I *speculate* this may be where the subtle difference above is most
significant since this allows for caching of sub-expressions (i.e.
dereferences).
I can believe that this could be where it changes things but if that is
the case then it is a missed caching / optimisation in the solver and
should be fixed there.
This may have some interaction with the solver expression cache, but
it seems like dereferences should be cached consistently and so this
should not have too much impact. (The exception would be when
expressions are not cached in the solver due to appearing in a write
or under quantification scope, but it is not clear to me that these
would be cachable anyway.)
Yes, as far as I understand this patch, it does not cache more than
what the solver's expression cache handles.
As for the number of clauses/variables, I don't have detailed data (I
don't think it was logged, and I don't have the actual run data for
some of the runs since they were done by others). However, I can
comment that the result of using the dereference cache is to
generally increase both the number of clauses and number of
variables. (This seems entirely expected to me since as in the
examples above there needs to be an extra variable for the
dereference_cache variable and also a clause, but the generally
improved runtime implies this is a simplification in most cases, or
at least changing the form in a way that the SAT solver finds easier.
Of course, most observations and guesses about the SAT solver are
somewhat speculative.)
Each cached dereference should add sizeof() + 1 variables and
4*sizeof() + 1 clauses.
If there is a net decrease in the number of variables then it is clear
that the solver cache is not working correctly. If there is a slight
increase then it is not clear to me that there is simplification going
on.
> 3. It could just change the formula which rerolls the SAT solver
> "dice" so some will get faster (possibly a lot) and some will get
> slower.
This does match quite well with observations. That said, as I
mentioned above there seems to be a simplification of the expressions
and so this would not appear to purely be a reroll of the SAT solver
"dice".
I want there to be simplification but I am not sure I see the evidence
for it at the moment.
> @hannes-steffenhagen-diffblue can you pull out from your
> experimental runs the numbers of clauses and variables? That should
> help narrow down why this works.
I don't think Hannes will be working on this further, I may be able
to gather some of this.
I would appreciate that. I don't have time to do experiments myself
but some things that I would try:
*. Get @natasha-jeppu's example working.
*. See if there is an actual decrease in clauses or variables generated.
*. See how it affects the run-time of various SMT solvers.
*. Use --show-vcc and diff to establish what is actually changing.
*. Modify the solver cache to log hits and see what is happening there.
@martin-cs Continuing on this discussion. Note that I've cut parts that are not important to this point of the discussion and responded inline.
I'll make the counter-argument here for discussion; don't take this as a disagreement with the principle of your position. The counter-argument is that we should merge something that improves performance when we have evidence that it works and information on why it works. We have several benchmarks that show this improves performance on programs that have dereferences that are (re)used multiple times. We can observe that the SSA is rewritten to reduce repetition of larger expressions and replace them with single variables (the cache symbols). To me it appears fairly clear how and why it works. The main challenges I see to the above are the following:
Again, the above are to make some counter arguments for the sake of discussion. I can see validity in both sides.
See below.
See above.
Following this.
I have a modified version working now (modified since both cbmc and the repository have evolved since the original issue).
There is an increase in both. Note that an increase is expected, since the dereference caching adds new variables and clauses. The data is as follows:

Runtime data is as follows:

Here we can observe where more of the time is spent. Aside: note that this does NOT come close to the manual optimisation, since the manual optimisation changes the control flow of the function. To give a rough understanding of why: the original does (something like) the following, and the manual conversion does (something like) the following, which not only "caches" the dereferences but also changes the control flow.
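Since the concrete snippets are not reproduced here, a hypothetical illustration of the distinction (invented code, not the actual S2N functions):

```
// Dereference caching (this PR): the repeated dereference is hoisted into a
// cache variable, but the chain of comparisons stays as it is.
bool check_cached(const int *p)
{
  const int cached = *p;
  return cached == 10 || cached == 11 || cached == 21;
}

// Manual optimisation: the control flow itself is restructured as well,
// here by branching on the dereferenced value directly.
bool check_manual(const int *p)
{
  switch(*p)
  {
  case 10:
  case 11:
  case 21:
    return true;
  default:
    return false;
  }
}
```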
A quick look at the log shows that adding the
Not done (yet?).
@martin-cs I agree it would be good to explore some of this a bit more; however, practical realities are that we have a limited amount of time available to complete this work, and we think we probably have enough evidence to show that where it works, it works consistently well enough that it is worth adding as a (default disabled) feature. There's definitely more that could be done, but not the budget to do so :-) Given that most of the PR changes are actually in options handling and tests, I don't think this PR adds that much more complexity (albeit in a somewhat subtle area of code). If it's a help to resolve the conversation I'm happy to set up a three-way call?
I've done a LOT more experiments and checking of various details related to the runtime performance. There are two main results that are now very stable and in which I am confident.
src/goto-symex/goto_state.h (outdated):

```
@@ -38,6 +38,11 @@ class goto_statet
   symex_level2t level2;

+public:
+  /// This is used for eliminating repeated complicated dereferences.
+  /// This is pretty much just a simple wrapper around a map.
```
This sentence probably isn't needed anymore?
Thank you for doing further experiments. Would you mind sharing details about those that exhibit "big degradation"? Just so that someone can pick those up at a later point and investigate in more depth. In general, I think we might be solving the same problem that #4977 attempted to solve. That PR doesn't have nearly as much data, but has explanations of the expressions that make such a change desirable.
Trying to bring together two threads of discussion on this PR here: the example suggests that it's indeed double dereferences that are causing trouble, which is what #4977 tried to address. I therefore remain inclined to consider this OK to merge as-is (except for the minor comments I added earlier), and we can likely close #4977 (as @smowton had to leave it to others to address this problem). @martin-cs: with the background of #4977, would you be happy to go ahead with this?
I think there may have been some kind of problem with rebasing or merging, as there are now a bunch of Z3-specific changes in this PR. I'll have another look when things are a bit more normal.
If you can squash the fixes into the relevant commits it will make reviewing easier.
Force-pushed 6eb7c91 to ca973e2.
Force-pushed ca973e2 to 467b094.
@martin-cs Yes, I made a mess in a rebase. I've pushed now and also tidied up the commit history. The reviewing should be a lot easier and cleaner now.
@TGWDB thanks for tidying up the PR, I will have a read through everything and do another review ASAP.
This makes the dereference caching behaviour configurable
Force-pushed 467b094 to d554326.
See a set of runs here. Note that these should be considered a little noisy, so small variances are not likely to be significant. The runs are averages of a couple of invocations, all done on the same system with very limited potential for contention of CPU/memory/disk/etc. Also, these are not necessarily good samples to test with, just ones I had set up to be able to evaluate.
Interesting, thanks for the link.
Thanks for the performance data and your patience. Keeping in mind @chrisr-diffblue's constraints, I am not going to oppose this. All of the hard-blocking issues around option handling have been fixed. The OPT_ and HELP_ macros would be nice but are a more minor thing.

I remain unconvinced about exactly what is being improved, but if it is an option and is low-overhead when off, then why not. I think @tautschnig's point about @smowton's PR #4977 is very interesting and I think that is definitely worth a look if you are looking to address these problems. TBH I would rate #4977 as a higher priority.
@peterschrammel Any chance of a review from a code owner?
Another line of further investigation that was mentioned by @hannes-steffenhagen-diffblue: a significant difference between the manual optimisation in #5505 and the present PR is that the latter does not optimise away the pointer checks introduced by goto_check. Potentially, goto_check could be made smarter or goto_symex could skip over pointer checks relating to cached dereferences.
@peterschrammel I would love to get to the stage where the overhead of goto-analyzer --simplify was low enough that we could run it before symbolic execution to prune "obvious" checks.
Repeated large expressions within the SSA (and the resulting final formula) can be a problem both for symex (because we need to e.g. simplify the same or similar large expressions over and over again) and for SAT solvers. Especially for the latter, the performance impact of making changes to the formula is somewhat unpredictable, but in our testing it seemed to average out to an improvement. However, since it can lead to performance regressions in some cases and does change what traces look like a bit (due to the extra assignments), this is opt-in behaviour.
The idea here comes from an interesting example originally observed by Natasha Jeppu and raised as an issue here. The approach doesn't replicate exactly what she did in the issue, because we're changing things at the symex stage, but we still produce a noticeable (if not quite as dramatic) speedup on her example in our testing.
Here's a table with performance measured on one test machine with the default MiniSat (times in seconds). Exact numbers here are going to be hard to reproduce, but manual checking on a variety of machines showed that these are roughly consistent between different machines. Since a lot of the runtime difference for these is in the SAT solver, swapping that out is expected to produce different results, but preliminary testing suggests that the caching is still often beneficial there. The examples are the proofs from S2N, specifically on commit bb0f4c9b3f7e7ec0192dca9521a37e72190d0d30.
The runtimes are one-shots, but random manual testing doesn't show a lot of run-to-run variation, so that should be enough to evaluate a trend; however, this does mean that small changes (<5%) are mostly attributable to noise. As you can see, for most examples there isn't a particularly large difference in runtime, but a couple of examples show dramatic improvements (e.g. `s2n_hash_copy` or `s2n_hmac_free`), along with a couple of small regressions and a handful of very notable ones (e.g. `s2n_stuffer_write_base64` or `s2n_stuffer_read_expected_str`).

Unfortunately I still don't have a great working theory to explain or predict the large performance differences observed. One thing we've tried is limiting the caching to only larger dereference sets (based on expression size), but the results from that were somewhat inconclusive (it does seem to help in some cases but make things worse in others). IMHO this is still interesting enough to ship behind a flag for people to play around with. I imagine this could be useful for use cases where the same tests are run over and over again in CI (as is the case in a lot of projects in awslabs), where it ought to be possible to run once with and without the flag and just set it for subsequent runs if it turns out to help.