Skip to content

Performance optimisation - Use unordered_map in numbering instead of map #5539

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Nov 10, 2020

Conversation

thomasspriggs
Copy link
Contributor

@thomasspriggs thomasspriggs commented Oct 27, 2020

The commit which specifically gives the performance improvement is titled "Make union_find use hashes for numbering". The other commits are related refactoring.

Using unordered_map in numbering instead of map, changes union_find, which changes arrays, where the performance issue was seen. This change avoids slow lookups into std::map caused by long running irept::operator< calls. I have an example where cbmc spent approximately 70% of runtime in irept::operator< before this change and approximately 1% of runtime in hashing/ordering after this change.

Run time on example exercising arrays code before this PR -
real 2m39.018s
user 2m37.394s
sys 0m1.608s

Run time on example exercising arrays code after this PR -
real 0m30.897s
user 0m29.476s
sys 0m1.404s

Gprof before (top results up to has_byte_operator + irept::hash()) -

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 44.90     21.08    21.08  2025113     0.00     0.00  irept::compare(irept const&) const
 26.84     33.68    12.60 4483321688     0.00     0.00  irept::number_of_non_comments(std::map<dstringt, irept, std::less<dstringt>, std::allocator<std::pair<dstringt const, irept> > > const&)
  6.54     36.75     3.07 52261322     0.00     0.00  Minisat::SimpSolver::addClause_(Minisat::vec<Minisat::Lit>&)
  3.43     38.36     1.61 52261322     0.00     0.00  Minisat::Solver::addClause_(Minisat::vec<Minisat::Lit>&)
  2.26     39.42     1.06    58881     0.00     0.00  has_byte_operator(exprt const&)
  0.06     46.03     0.03  3365106     0.00     0.00  irept::hash() const

Gprof after (top results up to has_byte_operator + irept::hash()) -

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 27.40      3.11     3.11 52261322     0.00     0.00  Minisat::SimpSolver::addClause_(Minisat::vec<Minisat::Lit>&)
 13.74      4.67     1.56 52261322     0.00     0.00  Minisat::Solver::addClause_(Minisat::vec<Minisat::Lit>&)
  5.73      5.32     0.65 10600597     0.00     0.00  Minisat::SimpSolver::newVar(Minisat::lbool, bool)
  5.29      5.92     0.60    58881     0.00     0.00  has_byte_operator(exprt const&)
  0.18     10.51     0.02  3424407     0.00     0.00  irept::hash() const
  • Each commit message has a non-empty body, explaining why the change was made.
  • Methods or procedures I have added are documented, following the guidelines provided in CODING_STANDARD.md.
  • The feature or user visible behaviour I have added or modified has been documented in the User Guide in doc/cprover-manual/
  • Regression or unit tests are included, or existing tests cover the modified code (in this case I have detailed which ones those are in the commit message).
  • My commit message includes data points confirming performance improvements (if claimed).
  • My PR is restricted to a single feature or bugfix.
  • White-space or formatting changes outside the feature-related changed lines are in commits of their own.

@codecov
Copy link

codecov bot commented Oct 27, 2020

Codecov Report

Merging #5539 (2bf69bf) into develop (076b72e) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff            @@
##           develop    #5539   +/-   ##
========================================
  Coverage    68.53%   68.53%           
========================================
  Files         1187     1187           
  Lines        98259    98259           
========================================
  Hits         67345    67345           
  Misses       30914    30914           
Flag Coverage Δ
cproversmt2 43.09% <100.00%> (ø)
regression 65.71% <100.00%> (ø)
unit 32.26% <100.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/analyses/custom_bitvector_analysis.h 65.38% <ø> (ø)
src/analyses/escape_analysis.h 68.75% <ø> (ø)
src/analyses/local_bitvector_analysis.h 89.47% <ø> (ø)
src/analyses/local_may_alias.h 100.00% <ø> (ø)
src/pointer-analysis/value_set_fi.h 97.56% <ø> (ø)
src/solvers/flattening/arrays.h 90.00% <ø> (ø)
src/solvers/flattening/boolbv.h 78.57% <ø> (ø)
src/solvers/flattening/pointer_logic.h 50.00% <ø> (ø)
src/util/irep_hash_container.h 100.00% <ø> (ø)
src/pointer-analysis/value_set_fi.cpp 63.49% <100.00%> (ø)
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 076b72e...2bf69bf. Read the comment docs.

Copy link
Collaborator

@martin-cs martin-cs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. One thing to have a look at is the status of caching of the hashes in dt. IIRC @tautschnig and @peterschrammel worked on this some time ago and I'm not sure if all of the options were enabled.

@thomasspriggs thomasspriggs force-pushed the tas/arrays_optimisation branch from de45b2d to f91394b Compare October 27, 2020 18:27
@kroening
Copy link
Member

It's odd that irep.h is a dependency of numbering.h.

I'd prefer an 'unordered_numbering' and an 'ordered_numbering', following the naming style of the STL. There may not be a use-case for the ordered numbering.

@hannes-steffenhagen-diffblue
Copy link
Contributor

I agree with @kroening here, this can probably be done without special casing ireps. I believe @thomasspriggs is investigating if we can basically just use hash_numbering everywhere we can use numbering right now (I'd imagine so?).

@thomasspriggs thomasspriggs force-pushed the tas/arrays_optimisation branch from f91394b to c425d29 Compare October 28, 2020 18:13
@thomasspriggs
Copy link
Contributor Author

I have now refactored this PR to avoid either union_find or numbering depending on irep.h. This PR removes the possibility of using an ordered map as the backing for numbering on the basis that this this variation is both unneeded and a potential performance foot-gun. This has the added benefit that we no longer need 3 different names for different numbering variations.

@thomasspriggs thomasspriggs force-pushed the tas/arrays_optimisation branch from c425d29 to 7d75607 Compare November 6, 2020 15:20
@tautschnig
Copy link
Collaborator

@thomasspriggs What is the source of those irept::operator< calls? Is #1997 perhaps a way to actually address the root cause?

@thomasspriggs
Copy link
Contributor Author

@tautschnig For the particular example I have been examining, much of the overall runtime is spent in src/solvers/flattening/arrays.cpp. The arrayst class has a union_find<exprt> arrays; field. This instance of union_find<exprt> was based on a numbering<exprt>, which in turn uses a std::map<exprt, std::size_t>, which in turn uses the irept::operator<.

From looking at the code I would say that the improvement in this PR looks unrelated to the improvement in #1997. As far as I can see, arrays.cpp is not using sort_and_join.

@tautschnig
Copy link
Collaborator

@thomasspriggs Thank you for clarifying!

Removing the duplicated types will enable changing the type of numbering
which backs union find, with fewer lines of code changed.
Because this matches the current coding standard and because it improves
readability.
This commit defines a new `numberingt`, which is equivalent to the
existing `hash_numbering` type. This new name and its template
parameters match the cbmc coding standards by ending the type names with
a letter 't'. It also supports using `std::hash` by default, which will
saves passing a hash function in the case where there is a standard hash
function defined for a given type of `keyt`.

The intention is to add this new type in this commit, swap over
the existing usages of numbering in the subsequent commits and then
finally remove the old variations of numbering. This will improve
performance in one instance and will improve maintainability in others.
Maintainability will be improved with a single version, because there
won't be multiple versions to maintain and it will take away the
possibility of accidentally choosing a variation which uses `irept` as
the key in the backing `std::map`.
This avoids slow lookups into `std::map` caused by long running
`irept::operator<` calls.

The type of hash to use is included as a template parameter in order to
make the data structure used to implement `union_find` more explicit.

Run time on example exercising arrays code before this commit -
real	2m39.018s
user	2m37.394s
sys	0m1.608s

Run time on example exercising arrays code after this commit -
real	0m30.897s
user	0m29.476s
sys	0m1.404s
Because the naming follows our coding standards and it reduces the
number of numbering variations to maintain.
Because the naming follows our coding standards and it reduces the
number of numbering variations to maintain. This does change the type of
backing container used in these places, however the performance should
be at least equivalent.
There is no need for separate a separe type alias and implementaion if
numbering is always done using the same kind of map. Therefore we can
use a single name instead of two, which means less to maintain.
@thomasspriggs thomasspriggs marked this pull request as ready for review November 9, 2020 21:34
@thomasspriggs thomasspriggs force-pushed the tas/arrays_optimisation branch from 7d75607 to 2bf69bf Compare November 10, 2020 10:45
@thomasspriggs thomasspriggs changed the title Performance optimisation - Make numbering use unordered_map if numbering ireps Performance optimisation - Use unordered_map in numbering instead of map Nov 10, 2020
@tautschnig tautschnig merged commit 610761e into diffblue:develop Nov 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants