Skip to content

Machine learning support #633

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 16 commits into from

Conversation

Degiorgio
Copy link
Contributor

@Degiorgio Degiorgio commented Mar 14, 2017

This pull request does the following changes:

  1. Introduced a HASH function that works across commits. (irep.*, dstring.h,goto_program_template.h)
  2. Exposed internal variables and definitions in analyses/dependence_graph.h. This change is required as we need to use these dependencies externally.
  3. Small change in solvers/prop/cover_goals.* to track progression of coverage

@martin-cs
Copy link
Collaborator

Introduced a HASH function that works across commits. (irep.*, dstring.h,goto_program_template.h)

Is this across different CPROVER commits or across different versions of the software being analyzed? As it stands it is not clear why we need a third hash for irepts. Also be careful with computing these recursively as sharing may cause exponential increase in the time taken. @tautschnig can comment on the fun of finding these kind of issues.

Exposed internal variables and definitions in analyses/dependence_graph.h. This change is required
as we need to use these dependencies externally.

"Externally" as in another program or another component? The goto-analyze branch @danpoe and @thk123 are working on has export of the dependencies to JSON with all details.

New functions to extract a list of global, static variables. This required for instrumentation purposes
by external code. (source_location.h, language_ui.*)

Is this information present in the usual symbol table output? If so then maybe the reduction / formatting could be done in the external tool?

inline size_t const_hash_string(const dstringt &s)
{
return hash_string(string_container.get_string(s.get_no()));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In line with what @martin-cs said: what exactly does this seek to achieve? In particular, this actually invokes the much more costly std::string version of hash_string !?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martin-cs, @tautschnig: There is a need for us to identify goto-program hashes across commits (of the target software) in the case of differential analysis (specifically, we record hash-tagged information from an analysis and need to relate it to a new analysis done after some changes to the code). The current hash uses the id no. which can change from build to build while we actually hash the string.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for clarifying. May I ask for such information to be included in commit messages for others to understand what's happening - that would be very helpful. Also, the const in the name got me very confused; how about "stable"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Done.

bool is_built_in=!file.compare(0,built_in.length(),built_in);
is_built_in|=!file.compare(0,built_in2.length(),built_in2);
return is_built_in;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems very useful! A few suggestions, though: (1) Don't make it inline, implement this in source_location.cpp. (2) Once done so, make use of has_prefix.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further to what I've said: when moving this to the cpp file, would you mind also looking into the issues raised by the linter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I missed that file, I will do this change and address the concern raised by @martin-cs regarding the symbol table stuff.

@Degiorgio
Copy link
Contributor Author

@martin-cs

"Externally" as in another program or another component? The goto-analyze branch @danpoe and @thk123 are working on has export of the dependencies to JSON with all details.

We're creating a dependency graph on the inputs which derives from this information. It would be costly to code and then decode from json.

Is this information present in the usual symbol table output? If so then maybe the reduction / formatting could be done in the external tool?`

CBMC should benefit from listing the inputs to the program. Everything should be available in the output but I think we might not have the necessary meta-data to distinguish static variables (not sure). In any case, parsing the output would be inefficient for us.

@martin-cs
Copy link
Collaborator

martin-cs commented Mar 14, 2017 via email

@Degiorgio
Copy link
Contributor Author

Aside from performance, is there a functional reason that the default hash should not do this? Also, are you hashing the comments field as well? If so then changes in source location might change the hash.

The default hash is faster but it uses the ID which is non-constant across commits. Our hash is slower but it is constant across commits. We are are hashing exactly the same data as the default hash but with a different hash function.

@martin-cs
Copy link
Collaborator

martin-cs commented Mar 14, 2017 via email

@martin-cs
Copy link
Collaborator

martin-cs commented Mar 14, 2017 via email

@Degiorgio
Copy link
Contributor Author

Like I said; are we talking external as in another program / address space / serialisation or external as in another component / same address space / data structure? From the code and the commit messages it is not clear to me. Given your comments about performance it seems to really matter.

it is another module (the module is external to public CBMC but is tightly coupled to it) that links to dependence_graph.o and makes use of dependency tree. So it will have the same address space.

With regards to symbol table, I should be able to do this externally. Parsing doesn't make sense for us since at the point where we want to use this we already have a instance of the symbol table in memory.

  1. You are using a different hash function for the dstrings, base don content not position in the table. What I am asking is "Is there any reason why the default hash could not be changed to do this rather than use the position". I.E. should the default irept hash use your content based hash. I can see (minor, fixable) performance reasons but I can't see any intrinsic functionality reasons. If we could do this then it would save having Yet Another Hash Function and improve the general CPROVER functionality.

Our main concern here is performance and this why we did it as separate hash function. The ID itself is the hash so this will always be faster then hashing the entire string. Having said that I agree with you that it is cleaner to have a single hash function. So yes we can have only one hash function but is the performance (even if it is minor) cost acceptable ? (Whatever, we do using the ID will always be faster)

  1. You are hashing the comments field as well. IIRC this can contain source location information. Thus it may not match common expressions between different versions. I can't say for sure that this is or isn't what you want but you should be at least aware that this is an issue.

Yes this an issue. I think it would be better to hash without the comments.

@martin-cs
Copy link
Collaborator

martin-cs commented Mar 14, 2017 via email

@Degiorgio
Copy link
Contributor Author

Could you try changing the existing hash and then profiling the difference? I can see a number of advantages of having the hash be purely structural / repeatable and the downsides aren't well served by our current system.

Sure, I will profile this and post back. In addition, do you have any standard benchmarks you typically run CBMC on?

@martin-cs
Copy link
Collaborator

martin-cs commented Mar 14, 2017 via email

…en there is a need to identify goto-program hashes across commits (of the target software). This is useful for example in differential analysis where we record hash-tagged information from an analysis and need it relate it to a new analysis (done after some changes to the code). The traditional hash function cannot be used for such use-cases as the ID can change across commits.
@peterschrammel
Copy link
Member

Sure, we'll get this set up on a server.

@martin-cs
Copy link
Collaborator

martin-cs commented Mar 15, 2017 via email

@Degiorgio
Copy link
Contributor Author

@tautschnig @martin-cs, I am planning on running the performance test for the hash functions as discussed next week, in the mean-time is it possible to move this along?


/*******************************************************************\

Function: is_build_in
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/build/built/

Do we really not have this as a utility? If not then we really really should add it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't seems like it , I did a quick fuzzy search through the code-base and found a list of classes (listed below) that are basically partiality replicating the functionality above every time it is needed.

src/goto-programs/graphml_witness.cpp
src/goto-instrument/count_eloc.cpp

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's another one here: https://github.com/diffblue/cbmc/blob/master/src/goto-instrument/cover.cpp#L1087
I'm sure there might be yet a few more...

@martin-cs
Copy link
Collaborator

You could be optimisitic and change the default hash function over? Beyond that it looks kinda OK to me.

@Degiorgio
Copy link
Contributor Author

You could be optimisitic and change the default hash function over? Beyond that it looks kinda OK to me.

My gut (admittedly, a pessimistic gut) tells me that its better to wait until we have some numbers. But if you guys insist I will happily to combine them.

@martin-cs
Copy link
Collaborator

martin-cs commented Mar 18, 2017 via email

@martin-cs
Copy link
Collaborator

martin-cs commented Mar 18, 2017 via email

@Degiorgio
Copy link
Contributor Author

That's not good :-\ Any chance you could put together (as part of this or later) a patch to get these to use your new function. Good catch.

My concern here, is that the new function checks if a location is built-in or not. a lot of the code mentioned above is more specific. For instance, https://github.com/diffblue/cbmc/blob/master/src/goto-instrument/cover.cpp#L1087848 checks for '<built-in-libary'. We could still improve this by extending the current functionality of 'is-built-in' however I think it should be separate PR.

@tautschnig
Copy link
Collaborator

Note that #681 already has the built-ins stuff.

@Degiorgio
Copy link
Contributor Author

Degiorgio commented Mar 28, 2017

As requested I carried out some benchmarks on SV-COMP for the Floating Task. The performance hit of replacing the ID hash with a content hash is roughly 6%. This obviously preliminary and I would need to carry out more tests and some profiling to get an accurate figure.

With reference to:

Note that #681 already has the built-ins stuff.

How should we proceed which change goes through? I am happy to change my code to make use of the function introduced in that PR. Besides this any thing I can do to move this forward ?

@martin-cs
Copy link
Collaborator

martin-cs commented Mar 28, 2017 via email

@Degiorgio
Copy link
Contributor Author

We should bundle together the logic for working out if this is a built-in / added / internal variable / function / location into utility functions and then use them everywhere rather than having multiple implementation of similar checks. Whether you use the interface from #681962, or you change it (and the other locations that Peter highlighted if you're being super helpful) to yours, I care less about. The problem is duplicate functionality.

Oks agreed, will get this done.

@martin-cs
Copy link
Collaborator

martin-cs commented Mar 29, 2017 via email

@Degiorgio
Copy link
Contributor Author

@tautschnig, @peterschrammel , @martin-cs I removed the is_build_in function since a similar function has been merged into CBMC. Assuming there no more required changes, can we move this PR along please?

@@ -10,6 +10,7 @@ Author: Daniel Kroening, [email protected]

#include "source_location.h"
#include "file_util.h"
#include "prefix.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Artifact from 'is_built_in' function. Removed it, thx for the catch!

@martin-cs
Copy link
Collaborator

I'm still in favour of the hash becoming the default, if the performance impact can be reduced enough. To me that is the last remaining issue. Have you tried caching the hashes of the dstring and then checking the performance impact vs. the existing hash. I can believe it would be that much.

@Degiorgio
Copy link
Contributor Author

@martin-cs The test I did shows that there is 6% performance difference, I concur, caching the hashes will erode this gap significantly. However, I have the view that this should be a separate PR. What do you think?

@martin-cs
Copy link
Collaborator

martin-cs commented May 4, 2017 via email

Copy link
Collaborator

@tautschnig tautschnig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from nit picks my main concern with the PR is that it conflates several unrelated changes. This isn't for master so I'll merely say this as advice.

{
return data_deps;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are those changes about and why do they belong in here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to extract the dependency information from the graph in order to create our input dependency, we need access to the data (ie control_deps and data_deps). These are getters for those structures.

return hash_string(string_container.get_string(s.get_no()));
}


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit picking: a single blank line will do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx for the catch, fixed

@@ -50,6 +50,7 @@ void cover_goalst::mark()
prop_conv.l_get(g.condition).is_true())
{
g.status=goalt::statust::COVERED;
g.seconds=(current_time()-start_time);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this about?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used to track the progression of coverage (i.e: at what time is each goal covered).


Outputs:

Purpose: Stable hash, hashes the actual content of the string.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit confusing: it will hash the actual contents of all contained strings (as is, the comment would rather fit stable_hash_string than irept::stable_hash).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the comment to reflect this

@tautschnig tautschnig changed the base branch from machine-learning-support to develop August 22, 2017 12:32
@Degiorgio Degiorgio closed this Feb 17, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants