Performance optimisation - Use `unordered_map` in `numbering` instead of `map` #5539

thomasspriggs · 2020-10-27T15:28:27Z

The commit which specifically gives the performance improvement is titled "Make union_find use hashes for numbering". The other commits are related refactoring.

Using unordered_map in numbering instead of map, changes union_find, which changes arrays, where the performance issue was seen. This change avoids slow lookups into std::map caused by long running irept::operator< calls. I have an example where cbmc spent approximately 70% of runtime in irept::operator< before this change and approximately 1% of runtime in hashing/ordering after this change.

Run time on example exercising arrays code before this PR -
real 2m39.018s
user 2m37.394s
sys 0m1.608s

Run time on example exercising arrays code after this PR -
real 0m30.897s
user 0m29.476s
sys 0m1.404s

Gprof before (top results up to has_byte_operator + irept::hash()) -

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 44.90     21.08    21.08  2025113     0.00     0.00  irept::compare(irept const&) const
 26.84     33.68    12.60 4483321688     0.00     0.00  irept::number_of_non_comments(std::map<dstringt, irept, std::less<dstringt>, std::allocator<std::pair<dstringt const, irept> > > const&)
  6.54     36.75     3.07 52261322     0.00     0.00  Minisat::SimpSolver::addClause_(Minisat::vec<Minisat::Lit>&)
  3.43     38.36     1.61 52261322     0.00     0.00  Minisat::Solver::addClause_(Minisat::vec<Minisat::Lit>&)
  2.26     39.42     1.06    58881     0.00     0.00  has_byte_operator(exprt const&)
  0.06     46.03     0.03  3365106     0.00     0.00  irept::hash() const

Gprof after (top results up to has_byte_operator + irept::hash()) -

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 27.40      3.11     3.11 52261322     0.00     0.00  Minisat::SimpSolver::addClause_(Minisat::vec<Minisat::Lit>&)
 13.74      4.67     1.56 52261322     0.00     0.00  Minisat::Solver::addClause_(Minisat::vec<Minisat::Lit>&)
  5.73      5.32     0.65 10600597     0.00     0.00  Minisat::SimpSolver::newVar(Minisat::lbool, bool)
  5.29      5.92     0.60    58881     0.00     0.00  has_byte_operator(exprt const&)
  0.18     10.51     0.02  3424407     0.00     0.00  irept::hash() const

Each commit message has a non-empty body, explaining why the change was made.
Methods or procedures I have added are documented, following the guidelines provided in CODING_STANDARD.md.
The feature or user visible behaviour I have added or modified has been documented in the User Guide in doc/cprover-manual/
Regression or unit tests are included, or existing tests cover the modified code (in this case I have detailed which ones those are in the commit message).
My commit message includes data points confirming performance improvements (if claimed).
My PR is restricted to a single feature or bugfix.
White-space or formatting changes outside the feature-related changed lines are in commits of their own.

codecov · 2020-10-27T16:05:49Z

Codecov Report

Merging #5539 (2bf69bf) into develop (076b72e) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff            @@
##           develop    #5539   +/-   ##
========================================
  Coverage    68.53%   68.53%           
========================================
  Files         1187     1187           
  Lines        98259    98259           
========================================
  Hits         67345    67345           
  Misses       30914    30914

Flag	Coverage Δ
cproversmt2	`43.09% <100.00%> (ø)`
regression	`65.71% <100.00%> (ø)`
unit	`32.26% <100.00%> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
src/analyses/custom_bitvector_analysis.h	`65.38% <ø> (ø)`
src/analyses/escape_analysis.h	`68.75% <ø> (ø)`
src/analyses/local_bitvector_analysis.h	`89.47% <ø> (ø)`
src/analyses/local_may_alias.h	`100.00% <ø> (ø)`
src/pointer-analysis/value_set_fi.h	`97.56% <ø> (ø)`
src/solvers/flattening/arrays.h	`90.00% <ø> (ø)`
src/solvers/flattening/boolbv.h	`78.57% <ø> (ø)`
src/solvers/flattening/pointer_logic.h	`50.00% <ø> (ø)`
src/util/irep_hash_container.h	`100.00% <ø> (ø)`
src/pointer-analysis/value_set_fi.cpp	`63.49% <100.00%> (ø)`
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 076b72e...2bf69bf. Read the comment docs.

martin-cs

Good catch. One thing to have a look at is the status of caching of the hashes in dt. IIRC @tautschnig and @peterschrammel worked on this some time ago and I'm not sure if all of the options were enabled.

kroening · 2020-10-28T10:17:12Z

It's odd that irep.h is a dependency of numbering.h.

I'd prefer an 'unordered_numbering' and an 'ordered_numbering', following the naming style of the STL. There may not be a use-case for the ordered numbering.

hannes-steffenhagen-diffblue · 2020-10-28T14:37:12Z

I agree with @kroening here, this can probably be done without special casing ireps. I believe @thomasspriggs is investigating if we can basically just use hash_numbering everywhere we can use numbering right now (I'd imagine so?).

thomasspriggs · 2020-10-28T18:19:44Z

I have now refactored this PR to avoid either union_find or numbering depending on irep.h. This PR removes the possibility of using an ordered map as the backing for numbering on the basis that this this variation is both unneeded and a potential performance foot-gun. This has the added benefit that we no longer need 3 different names for different numbering variations.

tautschnig · 2020-11-07T21:48:04Z

@thomasspriggs What is the source of those irept::operator< calls? Is #1997 perhaps a way to actually address the root cause?

thomasspriggs · 2020-11-09T18:00:16Z

@tautschnig For the particular example I have been examining, much of the overall runtime is spent in src/solvers/flattening/arrays.cpp. The arrayst class has a union_find<exprt> arrays; field. This instance of union_find<exprt> was based on a numbering<exprt>, which in turn uses a std::map<exprt, std::size_t>, which in turn uses the irept::operator<.

From looking at the code I would say that the improvement in this PR looks unrelated to the improvement in #1997. As far as I can see, arrays.cpp is not using sort_and_join.

tautschnig · 2020-11-09T19:02:21Z

@thomasspriggs Thank you for clarifying!

Removing the duplicated types will enable changing the type of numbering which backs union find, with fewer lines of code changed.

Because this matches the current coding standard and because it improves readability.

This commit defines a new `numberingt`, which is equivalent to the existing `hash_numbering` type. This new name and its template parameters match the cbmc coding standards by ending the type names with a letter 't'. It also supports using `std::hash` by default, which will saves passing a hash function in the case where there is a standard hash function defined for a given type of `keyt`. The intention is to add this new type in this commit, swap over the existing usages of numbering in the subsequent commits and then finally remove the old variations of numbering. This will improve performance in one instance and will improve maintainability in others. Maintainability will be improved with a single version, because there won't be multiple versions to maintain and it will take away the possibility of accidentally choosing a variation which uses `irept` as the key in the backing `std::map`.

This avoids slow lookups into `std::map` caused by long running `irept::operator<` calls. The type of hash to use is included as a template parameter in order to make the data structure used to implement `union_find` more explicit. Run time on example exercising arrays code before this commit - real 2m39.018s user 2m37.394s sys 0m1.608s Run time on example exercising arrays code after this commit - real 0m30.897s user 0m29.476s sys 0m1.404s

Because the naming follows our coding standards and it reduces the number of numbering variations to maintain.

Because the naming follows our coding standards and it reduces the number of numbering variations to maintain. This does change the type of backing container used in these places, however the performance should be at least equivalent.

There is no need for separate a separe type alias and implementaion if numbering is always done using the same kind of map. Therefore we can use a single name instead of two, which means less to maintain.

thomasspriggs added the do not review label Oct 27, 2020

martin-cs approved these changes Oct 27, 2020

View reviewed changes

thomasspriggs force-pushed the tas/arrays_optimisation branch from de45b2d to f91394b Compare October 27, 2020 18:27

thomasspriggs force-pushed the tas/arrays_optimisation branch from f91394b to c425d29 Compare October 28, 2020 18:13

thomasspriggs force-pushed the tas/arrays_optimisation branch from c425d29 to 7d75607 Compare November 6, 2020 15:20

thomasspriggs added 7 commits November 9, 2020 21:11

Reuse existing type definitions in union_find

6610eef

Removing the duplicated types will enable changing the type of numbering which backs union find, with fewer lines of code changed.

Use using instead of typedef in union_find

3d99edf

Because this matches the current coding standard and because it improves readability.

Use numberingt in place of hash_numberingt

30957a4

Because the naming follows our coding standards and it reduces the number of numbering variations to maintain.

Use numberingt in place of numbering

d3f5cda

Because the naming follows our coding standards and it reduces the number of numbering variations to maintain. This does change the type of backing container used in these places, however the performance should be at least equivalent.

Define numberingt as the implementation

2bf69bf

There is no need for separate a separe type alias and implementaion if numbering is always done using the same kind of map. Therefore we can use a single name instead of two, which means less to maintain.

thomasspriggs marked this pull request as ready for review November 9, 2020 21:34

thomasspriggs requested review from chrisr-diffblue, danpoe, kroening, peterschrammel, pkesseli, smowton and tautschnig as code owners November 9, 2020 21:34

thomasspriggs removed the do not review label Nov 9, 2020

thomasspriggs force-pushed the tas/arrays_optimisation branch from 7d75607 to 2bf69bf Compare November 10, 2020 10:45

thomasspriggs changed the title ~~Performance optimisation - Make numbering use unordered_map if numbering ireps~~ Performance optimisation - Use unordered_map in numbering instead of map Nov 10, 2020

peterschrammel approved these changes Nov 10, 2020

View reviewed changes

tautschnig approved these changes Nov 10, 2020

View reviewed changes

tautschnig merged commit 610761e into diffblue:develop Nov 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance optimisation - Use `unordered_map` in `numbering` instead of `map` #5539

Performance optimisation - Use `unordered_map` in `numbering` instead of `map` #5539

thomasspriggs commented Oct 27, 2020 •

edited

Loading

codecov bot commented Oct 27, 2020 •

edited

Loading

martin-cs left a comment

kroening commented Oct 28, 2020

hannes-steffenhagen-diffblue commented Oct 28, 2020

thomasspriggs commented Oct 28, 2020

tautschnig commented Nov 7, 2020

thomasspriggs commented Nov 9, 2020

tautschnig commented Nov 9, 2020

Performance optimisation - Use unordered_map in numbering instead of map #5539

Performance optimisation - Use unordered_map in numbering instead of map #5539

Conversation

thomasspriggs commented Oct 27, 2020 • edited Loading

codecov bot commented Oct 27, 2020 • edited Loading

Codecov Report

martin-cs left a comment

Choose a reason for hiding this comment

kroening commented Oct 28, 2020

hannes-steffenhagen-diffblue commented Oct 28, 2020

thomasspriggs commented Oct 28, 2020

tautschnig commented Nov 7, 2020

thomasspriggs commented Nov 9, 2020

tautschnig commented Nov 9, 2020

Performance optimisation - Use `unordered_map` in `numbering` instead of `map` #5539

Performance optimisation - Use `unordered_map` in `numbering` instead of `map` #5539

thomasspriggs commented Oct 27, 2020 •

edited

Loading

codecov bot commented Oct 27, 2020 •

edited

Loading