-
Notifications
You must be signed in to change notification settings - Fork 274
Machine learning support #633
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Is this across different CPROVER commits or across different versions of the software being analyzed? As it stands it is not clear why we need a third hash for irepts. Also be careful with computing these recursively as sharing may cause exponential increase in the time taken. @tautschnig can comment on the fun of finding these kind of issues.
"Externally" as in another program or another component? The goto-analyze branch @danpoe and @thk123 are working on has export of the dependencies to JSON with all details.
Is this information present in the usual symbol table output? If so then maybe the reduction / formatting could be done in the external tool? |
src/util/dstring.h
Outdated
inline size_t const_hash_string(const dstringt &s) | ||
{ | ||
return hash_string(string_container.get_string(s.get_no())); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In line with what @martin-cs said: what exactly does this seek to achieve? In particular, this actually invokes the much more costly std::string
version of hash_string
!?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@martin-cs, @tautschnig: There is a need for us to identify goto-program hashes across commits (of the target software) in the case of differential analysis (specifically, we record hash-tagged information from an analysis and need to relate it to a new analysis done after some changes to the code). The current hash uses the id no. which can change from build to build while we actually hash the string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks for clarifying. May I ask for such information to be included in commit messages for others to understand what's happening - that would be very helpful. Also, the const
in the name got me very confused; how about "stable"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Done.
src/util/source_location.h
Outdated
bool is_built_in=!file.compare(0,built_in.length(),built_in); | ||
is_built_in|=!file.compare(0,built_in2.length(),built_in2); | ||
return is_built_in; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems very useful! A few suggestions, though: (1) Don't make it inline, implement this in source_location.cpp. (2) Once done so, make use of has_prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further to what I've said: when moving this to the cpp file, would you mind also looking into the issues raised by the linter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I missed that file, I will do this change and address the concern raised by @martin-cs regarding the symbol table stuff.
We're creating a dependency graph on the inputs which derives from this information. It would be costly to code and then decode from json.
CBMC should benefit from listing the inputs to the program. Everything should be available in the output but I think we might not have the necessary meta-data to distinguish static variables (not sure). In any case, parsing the output would be inefficient for us. |
On Tue, 2017-03-14 at 06:01 -0700, Kurt Degiorgio wrote:
Degiorgio commented on this pull request.
> @@ -158,6 +158,12 @@ inline size_t hash_string(const dstringt &s)
return s.hash();
}
+inline size_t const_hash_string(const dstringt &s)
+{
+ return hash_string(string_container.get_string(s.get_no()));
+}
@martin-cs, @tautschnig: There is a need for us to identify goto-program hashes across commits (of the target software) in the case of differential analysis (specifically, we record hash-tagged information from an analysis and need to relate it to a new analysis done after some changes to the code). The current hash uses the id no. which can change from build to build while we actually hash the string.
Aside from performance, is there a functional reason that the default
hash should not do this? Also, are you hashing the comments field as
well? If so then changes in source location might change the hash.
|
The default hash is faster but it uses the ID which is non-constant across commits. Our hash is slower but it is constant across commits. We are are hashing exactly the same data as the default hash but with a different hash function. |
On Tue, 2017-03-14 at 06:18 -0700, Kurt Degiorgio wrote:
@martin-cs
> "Externally" as in another program or another component?
Like I said; are we talking external as in another program / address
space / serialisation or external as in another component / same address
space / data structure? From the code and the commit messages it is not
clear to me. Given your comments about performance it seems to really
matter.
The goto-analyze branch @danpoe and @thk123 are working on has export of the dependencies to JSON with all details.
We're creating a dependency graph on the inputs which derives from this information. It would be costly to code and then decode from json.
OK; the JSON is pretty much just a serialisation of the data structure.
> Is this information present in the usual symbol table output? If so then maybe the reduction / formatting could be done in the external tool?`
CBMC should benefit from listing the inputs to the program. Everything should be available in the output but I think we might not have the necessary meta-data to distinguish static variables (not sure).
$ cat ~/tmp/can-delete/static-vars.c
int global = 0;
static int static_global = 0;
int main (int argc, char **argv)
{
int local = 0;
static int static_local = 0;
return global + static_global + local + static_local;
}
$ cbmc --show-symbol-table static-vars.c
Looks OK to me...
In any case, parsing the output would be inefficient for us.
OK; this I find a little concerning. I've done some work on performance
optimisation of CPROVER and if parsing the output is a significant
performance bottleneck then you are doing something pretty weird and
there may well be a better way of achieving this.
|
On Tue, 2017-03-14 at 06:27 -0700, Kurt Degiorgio wrote:
> Aside from performance, is there a functional reason that the default hash should not do this? Also, are you hashing the comments field as well? If so then changes in source location might change the hash.
The default hash is faster but it uses the ID which is non-constant across commits. Our hash is slower but it is constant across commits. We are are hashing exactly the same data as the default hash but with a different hash function.
Yes. I understand this and that's not what I was asking.
1. You are using a different hash function for the dstrings, base don
content not position in the table. What I am asking is "Is there any
reason why the default hash could not be changed to do this rather than
use the position". I.E. should the default irept hash use your content
based hash. I can see (minor, fixable) performance reasons but I can't
see any intrinsic functionality reasons. If we could do this then it
would save having Yet Another Hash Function and improve the general
CPROVER functionality.
2. You are hashing the comments field as well. IIRC this can contain
source location information. Thus it may not match common expressions
between different versions. I can't say for sure that this is or isn't
what you want but you should be at least aware that this is an issue.
3. The caching of hash values in a much bigger issue than it appears.
Some irept's wind up with non-trivial sharing so if you don't traverse
them in a sharing aware manner then you wind up with an exponential
increase in the time taken. @tautschnig found, diagnosed and fixed this
issue for the previous hash, hence why I referenced him.
|
it is another module (the module is external to public CBMC but is tightly coupled to it) that links to dependence_graph.o and makes use of dependency tree. So it will have the same address space. With regards to symbol table, I should be able to do this externally. Parsing doesn't make sense for us since at the point where we want to use this we already have a instance of the symbol table in memory.
Our main concern here is performance and this why we did it as separate hash function. The ID itself is the hash so this will always be faster then hashing the entire string. Having said that I agree with you that it is cleaner to have a single hash function. So yes we can have only one hash function but is the performance (even if it is minor) cost acceptable ? (Whatever, we do using the ID will always be faster)
Yes this an issue. I think it would be better to hash without the comments. |
On Tue, 2017-03-14 at 07:25 -0700, Kurt Degiorgio wrote:
> Like I said; are we talking external as in another program / address space / serialisation or external as in another component / same address space / data structure? From the code and the commit messages it is not clear to me. Given your comments about performance it seems to really matter.
it is another module (the module is external to public CBMC but is tightly coupled to it) that links to dependence_graph.o and makes use of dependency tree. So it will have the same address space.
With regards to symbol table, I should be able to do this externally. Parsing doesn't make sense for us since at the point where we want to use this we already have a instance of the symbol table in memory.
OK; I just didn't know what you meant by "external".
> 1. You are using a different hash function for the dstrings, base don content not position in the table. What I am asking is "Is there any reason why the default hash could not be changed to do this rather than use the position". I.E. should the default irept hash use your content based hash. I can see (minor, fixable) performance reasons but I can't see any intrinsic functionality reasons. If we could do this then it would save having Yet Another Hash Function and improve the general CPROVER functionality.
Our main concern here is performance and this why we did it as separate hash function. The ID itself is the hash so this will always be faster then hashing the entire string. Having said that I agree with you that it is cleaner to have a single hash function. So yes we can have only one hash function but is the performance (even if it is minor) cost acceptable ? (Whatever, we do using the ID will always be faster)
I'm wary of making any arguments about performance without numbers BUT I
suspect that the difference in performance between string hash and ID
will be much smaller than the sharing awareness / caching. Even if it
isn't, I suspect that caching the dstring hashes as well will make the
performance difference non-measurable.
Could you try changing the existing hash and then profiling the
difference? I can see a number of advantages of having the hash be
purely structural / repeatable and the downsides aren't well served by
our current system.
> 2. You are hashing the comments field as well. IIRC this can contain source location information. Thus it may not match common expressions between different versions. I can't say for sure that this is or isn't what you want but you should be at least aware that this is an issue.
Yes this an issue. I think it would be better to hash without the comments.
This is the hash vs. full_hash distinction in the current code.
|
Sure, I will profile this and post back. In addition, do you have any standard benchmarks you typically run CBMC on? |
On Tue, 2017-03-14 at 09:03 -0700, Kurt Degiorgio wrote:
> Could you try changing the existing hash and then profiling the difference? I can see a number of advantages of having the hash be purely structural / repeatable and the downsides aren't well served by our current system.
Sure, I will profile this and post back. In addition, do you have any standard benchmarks you typically run CBMC on?
SV-COMP? Michael and Peter has the scripts to do this and we need
someone to set this up as a push button service for folks at DiffBlue.
|
…requested by reviewer
…en there is a need to identify goto-program hashes across commits (of the target software). This is useful for example in differential analysis where we record hash-tagged information from an analysis and need it relate it to a new analysis (done after some changes to the code). The traditional hash function cannot be used for such use-cases as the ID can change across commits.
Sure, we'll get this set up on a server. |
On Wed, 2017-03-15 at 12:00 -0700, Peter Schrammel wrote:
Sure, we'll get this set up on a server.
Thanks; I think we may have several interested users...
|
@tautschnig @martin-cs, I am planning on running the performance test for the hash functions as discussed next week, in the mean-time is it possible to move this along? |
src/util/source_location.cpp
Outdated
|
||
/*******************************************************************\ | ||
|
||
Function: is_build_in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/build/built/
Do we really not have this as a utility? If not then we really really should add it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't seems like it , I did a quick fuzzy search through the code-base and found a list of classes (listed below) that are basically partiality replicating the functionality above every time it is needed.
src/goto-programs/graphml_witness.cpp
src/goto-instrument/count_eloc.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's another one here: https://github.com/diffblue/cbmc/blob/master/src/goto-instrument/cover.cpp#L1087
I'm sure there might be yet a few more...
You could be optimisitic and change the default hash function over? Beyond that it looks kinda OK to me. |
My gut (admittedly, a pessimistic gut) tells me that its better to wait until we have some numbers. But if you guys insist I will happily to combine them. |
Doesn't seems like it , I did a quick fuzzy search through the code-base and found a list of classes (listed below) that are basically partiality replicating the functionality above every time it is needed.
src/goto-programs/graphml_witness.cpp
src/goto-instrument/count_eloc.cpp
That's not good :-\ Any chance you could put together (as part of this
or later) a patch to get these to use your new function. Good catch.
|
On Fri, 2017-03-17 at 18:31 -0700, Kurt Degiorgio wrote:
> You could be optimisitic and change the default hash function over? Beyond that it looks kinda OK to me.
My gut (admittedly, a pessimistic gut) tells me that its better to wait until we have some numbers.
If you'd be happier with that, then do so.
But if you guys insist I will happily to combine them.
I'm not insisting but, to my mind, the future should be more about how
do we get the performance we need rather than having yet another hash
function.
|
My concern here, is that the new function checks if a location is built-in or not. a lot of the code mentioned above is more specific. For instance, https://github.com/diffblue/cbmc/blob/master/src/goto-instrument/cover.cpp#L1087848 checks for '<built-in-libary'. We could still improve this by extending the current functionality of 'is-built-in' however I think it should be separate PR. |
Note that #681 already has the built-ins stuff. |
As requested I carried out some benchmarks on SV-COMP for the Floating Task. The performance hit of replacing the ID hash with a content hash is roughly 6%. This obviously preliminary and I would need to carry out more tests and some profiling to get an accurate figure. With reference to:
How should we proceed which change goes through? I am happy to change my code to make use of the function introduced in that PR. Besides this any thing I can do to move this forward ? |
On Tue, 2017-03-28 at 12:54 -0700, Kurt Degiorgio wrote:
As requested I carried out some benchmarks on SV-COMP for the Floating Task. The performance hit of replacing the ID hash with a content hash is roughly 6%. This obviously preliminary and I would need to carry out more tests and some profiling to get an accurate figure.
This is high enough to be concerning / worth the extra look but low
enough that it feels fixable. Things to work out:
1. if it is a fixed overhead (adds on 0.6 seconds to everything) or a
percentage increase.
2. if caching the hash value of the dstrings would help.
With reference to:
> Note that #681 already has the built-ins stuff.
How should we proceed which change goes through? I am happy to change my code to make use of the function introduced in that PR.
We should bundle together the logic for working out if this is a
built-in / added / internal variable / function / location into utility
functions and then use them everywhere rather than having multiple
implementation of similar checks. Whether you use the interface from
#681, or you change it (and the other locations that Peter highlighted
if you're being super helpful) to yours, I care less about. The problem
is duplicate functionality.
See also:
#709 (comment)
Besides this any thing I can do to move this forward ?
Beyond that I don't see much. You could separate out the three features
so the changes to cover_goals, identification of built-ins and the hash
so that some could be merged sooner?
|
Oks agreed, will get this done. |
On Wed, 2017-03-29 at 03:19 -0700, Kurt Degiorgio wrote:
> We should bundle together the logic for working out if this is a built-in / added / internal variable / function / location into utility functions and then use them everywhere rather than having multiple implementation of similar checks. Whether you use the interface from #681962, or you change it (and the other locations that Peter highlighted if you're being super helpful) to yours, I care less about. The problem is duplicate functionality.
Oks agreed, will get this done.
Thank you!
|
@tautschnig, @peterschrammel , @martin-cs I removed the is_build_in function since a similar function has been merged into CBMC. Assuming there no more required changes, can we move this PR along please? |
src/util/source_location.cpp
Outdated
@@ -10,6 +10,7 @@ Author: Daniel Kroening, [email protected] | |||
|
|||
#include "source_location.h" | |||
#include "file_util.h" | |||
#include "prefix.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Artifact from 'is_built_in' function. Removed it, thx for the catch!
I'm still in favour of the hash becoming the default, if the performance impact can be reduced enough. To me that is the last remaining issue. Have you tried caching the hashes of the dstring and then checking the performance impact vs. the existing hash. I can believe it would be that much. |
@martin-cs The test I did shows that there is 6% performance difference, I concur, caching the hashes will erode this gap significantly. However, I have the view that this should be a separate PR. What do you think? |
On Wed, 2017-05-03 at 06:58 -0700, Kurt Degiorgio wrote:
@martin-cs The test I did shows that there is 6% performance difference, I concur, caching the hashes will erode this gap significantly.
There are two parts to the caching; caching in the irept which is
already done by existing hashes and caching of the hashes of dstring
which isn't.
However, I have the view that this should be a separate PR. What do you think?
Yes; it probably makes sense to do it that way. Can you prepare the
hash changes as a separate PR that replaces the default hashes with the
new scheme (and caches for the dstrings). Once that is moved out I
can't see there will be many issues with this PR so we can get it
moving.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from nit picks my main concern with the PR is that it conflates several unrelated changes. This isn't for master so I'll merely say this as advice.
{ | ||
return data_deps; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are those changes about and why do they belong in here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to extract the dependency information from the graph in order to create our input dependency, we need access to the data (ie control_deps and data_deps). These are getters for those structures.
src/util/dstring.h
Outdated
return hash_string(string_container.get_string(s.get_no())); | ||
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit picking: a single blank line will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thx for the catch, fixed
@@ -50,6 +50,7 @@ void cover_goalst::mark() | |||
prop_conv.l_get(g.condition).is_true()) | |||
{ | |||
g.status=goalt::statust::COVERED; | |||
g.seconds=(current_time()-start_time); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this about?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is used to track the progression of coverage (i.e: at what time is each goal covered).
src/util/irep.cpp
Outdated
|
||
Outputs: | ||
|
||
Purpose: Stable hash, hashes the actual content of the string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit confusing: it will hash the actual contents of all contained strings (as is, the comment would rather fit stable_hash_string
than irept::stable_hash
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the comment to reflect this
This pull request does the following changes: