Machine learning support #633

Degiorgio · 2017-03-14T10:51:02Z

This pull request does the following changes:

Introduced a HASH function that works across commits. (irep.*, dstring.h,goto_program_template.h)
Exposed internal variables and definitions in analyses/dependence_graph.h. This change is required as we need to use these dependencies externally.
Small change in solvers/prop/cover_goals.* to track progression of coverage

…ui.cpp

martin-cs · 2017-03-14T11:05:43Z

Introduced a HASH function that works across commits. (irep.*, dstring.h,goto_program_template.h)

Is this across different CPROVER commits or across different versions of the software being analyzed? As it stands it is not clear why we need a third hash for irepts. Also be careful with computing these recursively as sharing may cause exponential increase in the time taken. @tautschnig can comment on the fun of finding these kind of issues.

Exposed internal variables and definitions in analyses/dependence_graph.h. This change is required
as we need to use these dependencies externally.

"Externally" as in another program or another component? The goto-analyze branch @danpoe and @thk123 are working on has export of the dependencies to JSON with all details.

New functions to extract a list of global, static variables. This required for instrumentation purposes
by external code. (source_location.h, language_ui.*)

Is this information present in the usual symbol table output? If so then maybe the reduction / formatting could be done in the external tool?

tautschnig · 2017-03-14T11:18:11Z

src/util/dstring.h

+inline size_t const_hash_string(const dstringt &s)
+{
+  return hash_string(string_container.get_string(s.get_no()));
+}


In line with what @martin-cs said: what exactly does this seek to achieve? In particular, this actually invokes the much more costly std::string version of hash_string !?

@martin-cs, @tautschnig: There is a need for us to identify goto-program hashes across commits (of the target software) in the case of differential analysis (specifically, we record hash-tagged information from an analysis and need to relate it to a new analysis done after some changes to the code). The current hash uses the id no. which can change from build to build while we actually hash the string.

I see, thanks for clarifying. May I ask for such information to be included in commit messages for others to understand what's happening - that would be very helpful. Also, the const in the name got me very confused; how about "stable"?

Agreed. Done.

tautschnig · 2017-03-14T11:20:15Z

src/util/source_location.h

+    bool is_built_in=!file.compare(0,built_in.length(),built_in);
+    is_built_in|=!file.compare(0,built_in2.length(),built_in2);
+    return is_built_in;
+  }


This seems very useful! A few suggestions, though: (1) Don't make it inline, implement this in source_location.cpp. (2) Once done so, make use of has_prefix.

Further to what I've said: when moving this to the cpp file, would you mind also looking into the issues raised by the linter?

yes, I missed that file, I will do this change and address the concern raised by @martin-cs regarding the symbol table stuff.

Degiorgio · 2017-03-14T13:18:04Z

@martin-cs

"Externally" as in another program or another component? The goto-analyze branch @danpoe and @thk123 are working on has export of the dependencies to JSON with all details.

We're creating a dependency graph on the inputs which derives from this information. It would be costly to code and then decode from json.

Is this information present in the usual symbol table output? If so then maybe the reduction / formatting could be done in the external tool?`

CBMC should benefit from listing the inputs to the program. Everything should be available in the output but I think we might not have the necessary meta-data to distinguish static variables (not sure). In any case, parsing the output would be inefficient for us.

martin-cs · 2017-03-14T13:19:58Z

On Tue, 2017-03-14 at 06:01 -0700, Kurt Degiorgio wrote: Degiorgio commented on this pull request. > @@ -158,6 +158,12 @@ inline size_t hash_string(const dstringt &s) return s.hash(); } +inline size_t const_hash_string(const dstringt &s) +{ + return hash_string(string_container.get_string(s.get_no())); +} @martin-cs, @tautschnig: There is a need for us to identify goto-program hashes across commits (of the target software) in the case of differential analysis (specifically, we record hash-tagged information from an analysis and need to relate it to a new analysis done after some changes to the code). The current hash uses the id no. which can change from build to build while we actually hash the string.

Aside from performance, is there a functional reason that the default hash should not do this? Also, are you hashing the comments field as well? If so then changes in source location might change the hash.

Degiorgio · 2017-03-14T13:27:49Z

Aside from performance, is there a functional reason that the default hash should not do this? Also, are you hashing the comments field as well? If so then changes in source location might change the hash.

The default hash is faster but it uses the ID which is non-constant across commits. Our hash is slower but it is constant across commits. We are are hashing exactly the same data as the default hash but with a different hash function.

martin-cs · 2017-03-14T13:42:31Z

On Tue, 2017-03-14 at 06:18 -0700, Kurt Degiorgio wrote: @martin-cs > "Externally" as in another program or another component?

Like I said; are we talking external as in another program / address space / serialisation or external as in another component / same address space / data structure? From the code and the commit messages it is not clear to me. Given your comments about performance it seems to really matter.

The goto-analyze branch @danpoe and @thk123 are working on has export of the dependencies to JSON with all details. We're creating a dependency graph on the inputs which derives from this information. It would be costly to code and then decode from json.

OK; the JSON is pretty much just a serialisation of the data structure.

> Is this information present in the usual symbol table output? If so then maybe the reduction / formatting could be done in the external tool?` CBMC should benefit from listing the inputs to the program. Everything should be available in the output but I think we might not have the necessary meta-data to distinguish static variables (not sure).

$ cat ~/tmp/can-delete/static-vars.c int global = 0; static int static_global = 0; int main (int argc, char **argv) { int local = 0; static int static_local = 0; return global + static_global + local + static_local; } $ cbmc --show-symbol-table static-vars.c Looks OK to me...

In any case, parsing the output would be inefficient for us.

OK; this I find a little concerning. I've done some work on performance optimisation of CPROVER and if parsing the output is a significant performance bottleneck then you are doing something pretty weird and there may well be a better way of achieving this.

martin-cs · 2017-03-14T13:57:51Z

On Tue, 2017-03-14 at 06:27 -0700, Kurt Degiorgio wrote: > Aside from performance, is there a functional reason that the default hash should not do this? Also, are you hashing the comments field as well? If so then changes in source location might change the hash. The default hash is faster but it uses the ID which is non-constant across commits. Our hash is slower but it is constant across commits. We are are hashing exactly the same data as the default hash but with a different hash function.

Yes. I understand this and that's not what I was asking. 1. You are using a different hash function for the dstrings, base don content not position in the table. What I am asking is "Is there any reason why the default hash could not be changed to do this rather than use the position". I.E. should the default irept hash use your content based hash. I can see (minor, fixable) performance reasons but I can't see any intrinsic functionality reasons. If we could do this then it would save having Yet Another Hash Function and improve the general CPROVER functionality. 2. You are hashing the comments field as well. IIRC this can contain source location information. Thus it may not match common expressions between different versions. I can't say for sure that this is or isn't what you want but you should be at least aware that this is an issue. 3. The caching of hash values in a much bigger issue than it appears. Some irept's wind up with non-trivial sharing so if you don't traverse them in a sharing aware manner then you wind up with an exponential increase in the time taken. @tautschnig found, diagnosed and fixed this issue for the previous hash, hence why I referenced him.

Degiorgio · 2017-03-14T14:25:01Z

Like I said; are we talking external as in another program / address space / serialisation or external as in another component / same address space / data structure? From the code and the commit messages it is not clear to me. Given your comments about performance it seems to really matter.

it is another module (the module is external to public CBMC but is tightly coupled to it) that links to dependence_graph.o and makes use of dependency tree. So it will have the same address space.

With regards to symbol table, I should be able to do this externally. Parsing doesn't make sense for us since at the point where we want to use this we already have a instance of the symbol table in memory.

You are using a different hash function for the dstrings, base don content not position in the table. What I am asking is "Is there any reason why the default hash could not be changed to do this rather than use the position". I.E. should the default irept hash use your content based hash. I can see (minor, fixable) performance reasons but I can't see any intrinsic functionality reasons. If we could do this then it would save having Yet Another Hash Function and improve the general CPROVER functionality.

Our main concern here is performance and this why we did it as separate hash function. The ID itself is the hash so this will always be faster then hashing the entire string. Having said that I agree with you that it is cleaner to have a single hash function. So yes we can have only one hash function but is the performance (even if it is minor) cost acceptable ? (Whatever, we do using the ID will always be faster)

You are hashing the comments field as well. IIRC this can contain source location information. Thus it may not match common expressions between different versions. I can't say for sure that this is or isn't what you want but you should be at least aware that this is an issue.

Yes this an issue. I think it would be better to hash without the comments.

martin-cs · 2017-03-14T15:53:55Z

On Tue, 2017-03-14 at 07:25 -0700, Kurt Degiorgio wrote: > Like I said; are we talking external as in another program / address space / serialisation or external as in another component / same address space / data structure? From the code and the commit messages it is not clear to me. Given your comments about performance it seems to really matter. it is another module (the module is external to public CBMC but is tightly coupled to it) that links to dependence_graph.o and makes use of dependency tree. So it will have the same address space. With regards to symbol table, I should be able to do this externally. Parsing doesn't make sense for us since at the point where we want to use this we already have a instance of the symbol table in memory.

OK; I just didn't know what you meant by "external".

> 1. You are using a different hash function for the dstrings, base don content not position in the table. What I am asking is "Is there any reason why the default hash could not be changed to do this rather than use the position". I.E. should the default irept hash use your content based hash. I can see (minor, fixable) performance reasons but I can't see any intrinsic functionality reasons. If we could do this then it would save having Yet Another Hash Function and improve the general CPROVER functionality. Our main concern here is performance and this why we did it as separate hash function. The ID itself is the hash so this will always be faster then hashing the entire string. Having said that I agree with you that it is cleaner to have a single hash function. So yes we can have only one hash function but is the performance (even if it is minor) cost acceptable ? (Whatever, we do using the ID will always be faster)

I'm wary of making any arguments about performance without numbers BUT I suspect that the difference in performance between string hash and ID will be much smaller than the sharing awareness / caching. Even if it isn't, I suspect that caching the dstring hashes as well will make the performance difference non-measurable. Could you try changing the existing hash and then profiling the difference? I can see a number of advantages of having the hash be purely structural / repeatable and the downsides aren't well served by our current system.

> 2. You are hashing the comments field as well. IIRC this can contain source location information. Thus it may not match common expressions between different versions. I can't say for sure that this is or isn't what you want but you should be at least aware that this is an issue. Yes this an issue. I think it would be better to hash without the comments.

This is the hash vs. full_hash distinction in the current code.

Degiorgio · 2017-03-14T16:03:28Z

Could you try changing the existing hash and then profiling the difference? I can see a number of advantages of having the hash be purely structural / repeatable and the downsides aren't well served by our current system.

Sure, I will profile this and post back. In addition, do you have any standard benchmarks you typically run CBMC on?

martin-cs · 2017-03-14T16:08:50Z

On Tue, 2017-03-14 at 09:03 -0700, Kurt Degiorgio wrote: > Could you try changing the existing hash and then profiling the difference? I can see a number of advantages of having the hash be purely structural / repeatable and the downsides aren't well served by our current system. Sure, I will profile this and post back. In addition, do you have any standard benchmarks you typically run CBMC on?

SV-COMP? Michael and Peter has the scripts to do this and we need someone to set this up as a push button service for folks at DiffBlue.

…requested by reviewer

…en there is a need to identify goto-program hashes across commits (of the target software). This is useful for example in differential analysis where we record hash-tagged information from an analysis and need it relate it to a new analysis (done after some changes to the code). The traditional hash function cannot be used for such use-cases as the ID can change across commits.

peterschrammel · 2017-03-15T19:00:06Z

Sure, we'll get this set up on a server.

martin-cs · 2017-03-15T19:07:56Z

On Wed, 2017-03-15 at 12:00 -0700, Peter Schrammel wrote: Sure, we'll get this set up on a server.

Thanks; I think we may have several interested users...

Degiorgio · 2017-03-17T11:01:18Z

@tautschnig @martin-cs, I am planning on running the performance test for the hash functions as discussed next week, in the mean-time is it possible to move this along?

martin-cs · 2017-03-17T21:34:35Z

src/util/source_location.cpp

+
+/*******************************************************************\
+
+Function: is_build_in


s/build/built/

Do we really not have this as a utility? If not then we really really should add it.

Doesn't seems like it , I did a quick fuzzy search through the code-base and found a list of classes (listed below) that are basically partiality replicating the functionality above every time it is needed.

src/goto-programs/graphml_witness.cpp
src/goto-instrument/count_eloc.cpp

There's another one here: https://github.com/diffblue/cbmc/blob/master/src/goto-instrument/cover.cpp#L1087
I'm sure there might be yet a few more...

martin-cs · 2017-03-17T21:35:47Z

You could be optimisitic and change the default hash function over? Beyond that it looks kinda OK to me.

Degiorgio · 2017-03-18T01:31:07Z

You could be optimisitic and change the default hash function over? Beyond that it looks kinda OK to me.

My gut (admittedly, a pessimistic gut) tells me that its better to wait until we have some numbers. But if you guys insist I will happily to combine them.

martin-cs · 2017-03-18T01:33:42Z

Doesn't seems like it , I did a quick fuzzy search through the code-base and found a list of classes (listed below) that are basically partiality replicating the functionality above every time it is needed. src/goto-programs/graphml_witness.cpp src/goto-instrument/count_eloc.cpp

That's not good :-\ Any chance you could put together (as part of this or later) a patch to get these to use your new function. Good catch.

martin-cs · 2017-03-18T01:34:57Z

On Fri, 2017-03-17 at 18:31 -0700, Kurt Degiorgio wrote: > You could be optimisitic and change the default hash function over? Beyond that it looks kinda OK to me. My gut (admittedly, a pessimistic gut) tells me that its better to wait until we have some numbers.

If you'd be happier with that, then do so.

But if you guys insist I will happily to combine them.

I'm not insisting but, to my mind, the future should be more about how do we get the performance we need rather than having yet another hash function.

Degiorgio · 2017-03-24T11:14:30Z

That's not good :-\ Any chance you could put together (as part of this or later) a patch to get these to use your new function. Good catch.

My concern here, is that the new function checks if a location is built-in or not. a lot of the code mentioned above is more specific. For instance, https://github.com/diffblue/cbmc/blob/master/src/goto-instrument/cover.cpp#L1087848 checks for '<built-in-libary'. We could still improve this by extending the current functionality of 'is-built-in' however I think it should be separate PR.

tautschnig · 2017-03-24T14:48:55Z

Note that #681 already has the built-ins stuff.

Degiorgio · 2017-03-28T19:54:12Z

As requested I carried out some benchmarks on SV-COMP for the Floating Task. The performance hit of replacing the ID hash with a content hash is roughly 6%. This obviously preliminary and I would need to carry out more tests and some profiling to get an accurate figure.

With reference to:

Note that #681 already has the built-ins stuff.

How should we proceed which change goes through? I am happy to change my code to make use of the function introduced in that PR. Besides this any thing I can do to move this forward ?

martin-cs · 2017-03-28T23:11:24Z

On Tue, 2017-03-28 at 12:54 -0700, Kurt Degiorgio wrote: As requested I carried out some benchmarks on SV-COMP for the Floating Task. The performance hit of replacing the ID hash with a content hash is roughly 6%. This obviously preliminary and I would need to carry out more tests and some profiling to get an accurate figure.

This is high enough to be concerning / worth the extra look but low enough that it feels fixable. Things to work out: 1. if it is a fixed overhead (adds on 0.6 seconds to everything) or a percentage increase. 2. if caching the hash value of the dstrings would help.

With reference to: > Note that #681 already has the built-ins stuff. How should we proceed which change goes through? I am happy to change my code to make use of the function introduced in that PR.

We should bundle together the logic for working out if this is a built-in / added / internal variable / function / location into utility functions and then use them everywhere rather than having multiple implementation of similar checks. Whether you use the interface from #681, or you change it (and the other locations that Peter highlighted if you're being super helpful) to yours, I care less about. The problem is duplicate functionality. See also: #709 (comment)

Besides this any thing I can do to move this forward ?

Beyond that I don't see much. You could separate out the three features so the changes to cover_goals, identification of built-ins and the hash so that some could be merged sooner?

Degiorgio · 2017-03-29T10:19:16Z

We should bundle together the logic for working out if this is a built-in / added / internal variable / function / location into utility functions and then use them everywhere rather than having multiple implementation of similar checks. Whether you use the interface from #681962, or you change it (and the other locations that Peter highlighted if you're being super helpful) to yours, I care less about. The problem is duplicate functionality.

Oks agreed, will get this done.

martin-cs · 2017-03-29T10:23:50Z

On Wed, 2017-03-29 at 03:19 -0700, Kurt Degiorgio wrote: > We should bundle together the logic for working out if this is a built-in / added / internal variable / function / location into utility functions and then use them everywhere rather than having multiple implementation of similar checks. Whether you use the interface from #681962, or you change it (and the other locations that Peter highlighted if you're being super helpful) to yours, I care less about. The problem is duplicate functionality. Oks agreed, will get this done.

Thank you!

Degiorgio · 2017-05-03T10:52:18Z

@tautschnig, @peterschrammel , @martin-cs I removed the is_build_in function since a similar function has been merged into CBMC. Assuming there no more required changes, can we move this PR along please?

martin-cs · 2017-05-03T11:27:04Z

src/util/source_location.cpp

@@ -10,6 +10,7 @@ Author: Daniel Kroening, [email protected]

 #include "source_location.h"
 #include "file_util.h"
+#include "prefix.h"


What's this for?

Artifact from 'is_built_in' function. Removed it, thx for the catch!

martin-cs · 2017-05-03T11:28:47Z

I'm still in favour of the hash becoming the default, if the performance impact can be reduced enough. To me that is the last remaining issue. Have you tried caching the hashes of the dstring and then checking the performance impact vs. the existing hash. I can believe it would be that much.

Degiorgio · 2017-05-03T13:58:04Z

@martin-cs The test I did shows that there is 6% performance difference, I concur, caching the hashes will erode this gap significantly. However, I have the view that this should be a separate PR. What do you think?

martin-cs · 2017-05-04T07:45:28Z

On Wed, 2017-05-03 at 06:58 -0700, Kurt Degiorgio wrote: @martin-cs The test I did shows that there is 6% performance difference, I concur, caching the hashes will erode this gap significantly.

There are two parts to the caching; caching in the irept which is already done by existing hashes and caching of the hashes of dstring which isn't.

However, I have the view that this should be a separate PR. What do you think?

Yes; it probably makes sense to do it that way. Can you prepare the hash changes as a separate PR that replaces the default hashes with the new scheme (and caches for the dstrings). Once that is moved out I can't see there will be many issues with this PR so we can get it moving.

tautschnig

Apart from nit picks my main concern with the PR is that it conflates several unrelated changes. This isn't for master so I'll merely say this as advice.

tautschnig · 2017-05-04T07:54:08Z

src/analyses/dependence_graph.h

+  {
+    return data_deps;
+  }
+


What are those changes about and why do they belong in here?

In order to extract the dependency information from the graph in order to create our input dependency, we need access to the data (ie control_deps and data_deps). These are getters for those structures.

tautschnig · 2017-05-04T07:55:32Z

src/util/dstring.h

+  return hash_string(string_container.get_string(s.get_no()));
+}
+
+


Nit picking: a single blank line will do.

thx for the catch, fixed

tautschnig · 2017-05-04T07:55:53Z

src/solvers/prop/cover_goals.cpp

@@ -50,6 +50,7 @@ void cover_goalst::mark()
       prop_conv.l_get(g.condition).is_true())
    {
      g.status=goalt::statust::COVERED;
+      g.seconds=(current_time()-start_time);


What is this about?

This is used to track the progression of coverage (i.e: at what time is each goal covered).

tautschnig · 2017-05-04T07:58:15Z

src/util/irep.cpp

+
+ Outputs:
+
+ Purpose: Stable hash, hashes the actual content of the string. 


This is a bit confusing: it will hash the actual contents of all contained strings (as is, the comment would rather fit stable_hash_string than irept::stable_hash).

Changed the comment to reflect this

…io/cbmc into machine-learning-support

dcattaruzza and others added 9 commits March 7, 2017 10:23

const_hash for irep and dstring

f5f7b24

goto-template const_hash

23287ed

fixed missing and duplicate errors caused on merge

456088a

Minor change that exposes some typedefs and internal variables

ddc580d

Variable instrumentation code.

928865f

Tracking progression of coverage.

611c905

Simplified and removed C-specific instrumentation code from language_…

def160b

…ui.cpp

Printing of coverage goals has been removed from bmc_cover.cpp

14fde02

CPPLINT conformance

09565f7

tautschnig reviewed Mar 14, 2017

View reviewed changes

Degiorgio added 2 commits March 15, 2017 06:58

Removed changes to langapi, minor changes to util/source_location as …

7d6191b

…requested by reviewer

martin-cs reviewed Mar 17, 2017

View reviewed changes

martin-cs mentioned this pull request Mar 28, 2017

set internal field in specific variables in the counterexample trace for test-gen-support #709

Merged

Removed 'is_built_in' function.

3ceed49

martin-cs reviewed May 3, 2017

View reviewed changes

Removed unnecessary include directive.

d0a5d31

tautschnig reviewed May 4, 2017

View reviewed changes

Degiorgio and others added 3 commits May 4, 2017 07:00

Removed extra blank line.

3c33f66

Fix to comment in stable_hash

27da466

Merge branch 'machine-learning-support' of https://github.com/Degiorg…

e47fd39

…io/cbmc into machine-learning-support

tautschnig changed the base branch from machine-learning-support to develop August 22, 2017 12:32

tautschnig assigned Degiorgio Aug 23, 2017

Degiorgio closed this Feb 17, 2018


		/*******************************************************************\

		Function: is_build_in

		return hash_string(string_container.get_string(s.get_no()));
		}


		Outputs:

		Purpose: Stable hash, hashes the actual content of the string.

Machine learning support #633

Machine learning support #633

Uh oh!

Conversation

Degiorgio commented Mar 14, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

martin-cs commented Mar 14, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Degiorgio commented Mar 14, 2017

Uh oh!

martin-cs commented Mar 14, 2017 via email

Uh oh!

Degiorgio commented Mar 14, 2017

Uh oh!

martin-cs commented Mar 14, 2017 via email

Uh oh!

martin-cs commented Mar 14, 2017 via email

Uh oh!

Degiorgio commented Mar 14, 2017

Uh oh!

martin-cs commented Mar 14, 2017 via email

Uh oh!

Degiorgio commented Mar 14, 2017

Uh oh!

martin-cs commented Mar 14, 2017 via email

Uh oh!

peterschrammel commented Mar 15, 2017

Uh oh!

martin-cs commented Mar 15, 2017 via email

Uh oh!

Degiorgio commented Mar 17, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

martin-cs commented Mar 17, 2017

Uh oh!

Degiorgio commented Mar 18, 2017

Uh oh!

martin-cs commented Mar 18, 2017 via email

Uh oh!

martin-cs commented Mar 18, 2017 via email

Uh oh!

Degiorgio commented Mar 24, 2017

Uh oh!

tautschnig commented Mar 24, 2017

Uh oh!

Degiorgio commented Mar 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

martin-cs commented Mar 28, 2017 via email

Uh oh!

Degiorgio commented Mar 29, 2017

Uh oh!

martin-cs commented Mar 29, 2017 via email

Uh oh!

Degiorgio commented May 3, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Degiorgio commented Mar 14, 2017 •

edited

Loading

Degiorgio commented Mar 28, 2017 •

edited

Loading