WIP: New Lookahead sampling method #1367

acomodi · 2020-06-16T15:16:24Z

Description

This PR adds a new node sampling method, taken from #1351.

It adds a region-based sampling, with each region of the grid containing a set of representative grid points, which on their behalf have a set of nodes to be expanded using dijkstra.

Related Issue

#1325

Motivation and Context

The current sampling method in the router_lookahead_map.cpp code does not provide enough entries in the cost map for Series 7 devices.

The sampling method proposed in this PR has been extensively tested in SymbiFlow, and proved to have the lookahead generator to produce a correct and efficient lookahead.

The main issue is performance. In fact, the number of samples to expand is much greater than the previous method, hence the lookahead computation takes longer time, hence the addition of the per-region parallelization.
Unfortunately, even with a parallel lookahead computation, all the regression tests take longer time to compute.
However, the idea behind the lookahead is to have it computed once only per device. While this is ok for already designed devices (such as Series 7), this may be impossible for devices that are being generated using VtR, where the lookahead cannot be precomputed.

It may be possible to switch the sampling method and the lookahead computation with a command line argument, so that the faster original lookahead generation can still be used.

How Has This Been Tested?

Basic regression tests

Types of changes

Bug fix (change which fixes an issue)
New feature (change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My change requires a change to the documentation
I have updated the documentation accordingly
I have added tests to cover my changes
All new and existing tests passed

Signed-off-by: Dusty DeWeese <[email protected]> Signed-off-by: Alessandro Comodi <[email protected]> Signed-off-by: Keith Rothman <[email protected]>

Signed-off-by: Alessandro Comodi <[email protected]>

litghost · 2020-06-16T15:57:51Z

vpr/src/route/router_lookahead_map.cpp

@@ -193,7 +199,7 @@ struct t_reachable_wire_inf {

 /* used during Dijkstra expansion to store delay/congestion info lists for each relative coordinate for a given segment and channel type.
 * the list at each coordinate is later boiled down to a single representative cost entry to be stored in the final cost map */
-typedef vtr::Matrix<Expansion_Cost_Entry> t_routing_cost_map; //[0..device_ctx.grid.width()-1][0..device_ctx.grid.height()-1]
+typedef vtr::NdMatrix<Expansion_Cost_Entry, 4> t_routing_cost_map; //[0..1][0..num_segments-1][0..device_ctx.grid.width()-1][0..device_ctx.grid.height()-1]


What is the first index modeling?

0 for CHANX and 1 for CHANY

Please update the comment to be clearer.

kmurray

At a high level this seems promising. However its not very clear how the proposed sampling method actually works. I think this will need some careful commenting to outline how this sampling method works, and why it is constructed the way it is. At the moment this is not very easy to follow.

I've provided detailed comments below.

kmurray · 2020-06-17T12:05:43Z

vpr/src/route/router_lookahead_map.cpp

-                                 &dijkstra_data);
-                }
+#if defined(VPR_USE_TBB)
+        all_costs_mutex.lock();


Rather than manually locking/unlocking the mutex which is error prone. We should probably be using a lock guard to ensure it is automatically unlocked in all scenarios.

kmurray · 2020-06-17T12:09:29Z

vpr/src/route/router_lookahead_sampling.cpp

+#include "vtr_math.h"
+#include "vtr_geometry.h"
+#include "vtr_time.h"
+


I think we need a file-scope comment here which outlines at a high-level how this sampling method works, as it is currently not clear.

For instance, I get that there is some kind of spatial partitioning and statistics collection about number of segment instances going on, which guides the selection of sample points. But it is not clear how these different ideas fit together, or what the motivation/intuition is behind this approach.

Since I wrote this, here's a high level overview:

For each segment type:

Compute bounding box for that segment type. Within this bounding box:

Count the number of nodes at each location.

For each of NxN subsections:
a. Compute a histogram of counts for each location.
b. Choose a specific point within a quantile, e.g. 50-70%. This is to select a representative location while avoiding extremities.

Sort the points. I'm using a Z-curve for spatial locality, see the comments for details.

Find all the nodes that fall within each sample point of the given type.

Maybe the inline comments should be expanded, but the code in find_sample_regions() seems pretty readable, although reading the datatypes in the header first might help.

kmurray · 2020-06-17T12:10:35Z

vpr/src/route/router_lookahead_sampling.cpp

+    uint64_t i = x;
+    i = (i ^ (i << 16)) & 0x0000ffff0000ffff;
+    i = (i ^ (i << 8)) & 0x00ff00ff00ff00ff;
+    i = (i ^ (i << 4)) & 0x0f0f0f0f0f0f0f0f;
+    i = (i ^ (i << 2)) & 0x3333333333333333;
+    i = (i ^ (i << 1)) & 0x5555555555555555;


This is pretty mysterious and so needs some detailed comments about how it works, and where the magic numbers come from.

Those aren't magic numbers, they are just a bit pattern like this:

00001111 00110011 01010101

This is a method to interleave bits with 0, so given bits xyzw you get 0x0y0z0w. Then interleaving two words together is just using this twice, shifting one of them left, and then ORing them.

https://lemire.me/blog/2018/01/08/how-fast-can-you-bit-interleave-32-bit-integers/

This is used to sort points on a Morton (Z)-order curve - It interleaves the two spatial dimensions by literally interleaving their bits.

kmurray · 2020-06-17T12:12:18Z

vpr/src/route/router_lookahead_sampling.cpp

+// the smallest bounding box containing a node
+static vtr::Rect<int> bounding_box_for_node(int node_ind) {
+    auto& device_ctx = g_vpr_ctx.device();
+    auto& rr_graph = device_ctx.rr_nodes;
+    int x = rr_graph.node_xlow(RRNodeId(node_ind));
+    int y = rr_graph.node_ylow(RRNodeId(node_ind));
+
+    return vtr::Rect<int>(vtr::Point<int>(x, y));
+}


The naming of this isn't clear. This appears to just be the start point of the node, not its actual bounding box.

It might be best to consistently use low or high coordinates when sampling - the original code used annotations that were just points instead of rectangles.

kmurray · 2020-06-17T12:12:51Z

vpr/src/route/router_lookahead_sampling.cpp

+static vtr::Rect<int> sample_window(const vtr::Rect<int>& bounding_box, int sx, int sy, int n) {
+    return vtr::Rect<int>(sample(bounding_box, sx, sy, n),
+                          sample(bounding_box, sx + 1, sy + 1, n));
+}


Not clear what this is doing, needs comments.

This returns a sub-rectangle from bounding_box split into n by n sub-rectangles, where sx and sy select the specific sub-rectangle e.g. for the bounding box (0,0) - (2,2) and n = 2:

sx = 0 sx = 1

sy = 0 (0,0) - (1,1) (1,0) - (2,1)

sy = 1 (0,1) - (1,2) (1,1) - (2,2)

kmurray · 2020-06-17T12:23:03Z

vpr/src/route/router_lookahead_sampling.cpp

+                     * This means subsequent expansions on the same thread are likely
+                     * to cover a similar set of nodes, so they are more likely to be
+                     * cached. This improves performance by about 7%, which isn't a lot,
+                     * but not a bad improvement for a few lines of code. */


If we're going to put performance numbers in the comments, we likely need to give more context: what benchmarks, what FPGA architecture, what system were they run on?

kmurray · 2020-06-17T12:27:28Z

vpr/src/route/router_lookahead_sampling.cpp

+                    vtr::Point<int> location = region.points[0].location;
+
+                    // interleave bits of X and Y for a Z-curve ordering.
+                    region.order = interleave(location.x()) | (interleave(location.y()) << 1);


Again, this is somewhat mysterious.

I'd suggest wrapping the bit-wise operations inside interleave (or perhaps a new function like get_region_order(location)).

kmurray · 2020-06-17T12:31:57Z

vpr/src/route/router_lookahead_map.cpp

+    tbb::mutex all_costs_mutex;
+    tbb::parallel_for_each(sample_regions, [&](const SampleRegion& region) {


This appears to parallelize over the sample regions. What is a sample region?

How many are there? Is it a fixed number, or does it scale with device size? These are important considerations for how scalable this parallelization will be.

There are a (mostly) fixed number of sample regions: SAMPLE_GRID_SIZE^2 * num_segments.

kmurray · 2020-06-17T12:36:20Z

vpr/src/route/router_lookahead_sampling.h

+    // used to sort the regions to improve caching
+    uint64_t order;


Since the ordering is an implementation detail only used by find_sample_regions(), would it be possible to just store it as a local vector in find_sample_regions()? That would avoid exposing it to outside code through this struct.

These structs shouldn't be exposed, and I doubt it will be faster or save memory.

kmurray · 2020-06-17T12:38:19Z

vpr/src/route/router_lookahead_sampling.cpp

+// for each segment type, find the nearest nodes to an equally spaced grid of points
+// within the bounding box for that segment type


This seems to be the key routine which ties everything about this sampling method together. This needs higher level comments which describe the overall approach it takes (i.e. the why), and how it all fits together.

kmurray · 2020-06-17T12:47:45Z

The main issue is performance. In fact, the number of samples to expand is much greater than the previous method, hence the lookahead computation takes longer time, hence the addition of the per-region parallelization.

Its not clear to me from reading the code exactly how many samples this new method produces. What are the parameters which control that (both in the code and with respect to architectural parameters like the number of segment types)? (This would also be good to include in the code comments).

How much benefit is there from the additional samples? Has anyone done a study to figure out how lookahead quality varies with the number of samples? If not that might be worth investigating to see if it could be reduced.

acomodi · 2020-08-04T16:26:46Z

Closing, superseeded by #1449

probot-autolabeler bot added lang-cpp C/C++ code VPR VPR FPGA Placement & Routing Tool labels Jun 16, 2020

acomodi added 2 commits June 16, 2020 17:17

lookahead: enhanced node sampling method

f8245ce

Signed-off-by: Dusty DeWeese <[email protected]> Signed-off-by: Alessandro Comodi <[email protected]> Signed-off-by: Keith Rothman <[email protected]>

lookahead: add new sampling method to router_lookahead

b80858c

Signed-off-by: Alessandro Comodi <[email protected]>

acomodi force-pushed the lookahead-sampling branch from ff8b91c to b80858c Compare June 16, 2020 15:17

acomodi requested review from kmurray, litghost and HackerFoo June 16, 2020 15:19

acomodi mentioned this pull request Jun 16, 2020

WIP: Enhancing VPR lookahead map #1351

Closed

7 tasks

litghost reviewed Jun 16, 2020

View reviewed changes

acomodi mentioned this pull request Jun 17, 2020

New Lookahead map status SymbiFlow/vtr-verilog-to-routing#517

Open

kmurray reviewed Jun 17, 2020

View reviewed changes

acomodi closed this Aug 4, 2020

	sx = 0	sx = 1
sy = 0	(0,0) - (1,1)	(1,0) - (2,1)
sy = 1	(0,1) - (1,2)	(1,1) - (2,2)

		tbb::mutex all_costs_mutex;
		tbb::parallel_for_each(sample_regions, [&](const SampleRegion& region) {

		// used to sort the regions to improve caching
		uint64_t order;

		// for each segment type, find the nearest nodes to an equally spaced grid of points
		// within the bounding box for that segment type

WIP: New Lookahead sampling method #1367

WIP: New Lookahead sampling method #1367

Uh oh!

Conversation

acomodi commented Jun 16, 2020

Description

Related Issue

Motivation and Context

How Has This Been Tested?

Types of changes

Checklist:

Uh oh!

Choose a reason for hiding this comment

Uh oh!

acomodi Jun 16, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kmurray left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HackerFoo Jun 18, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kmurray commented Jun 17, 2020

Uh oh!

acomodi commented Aug 4, 2020

Uh oh!

Uh oh!

acomodi Jun 16, 2020 •

edited

Loading

HackerFoo Jun 18, 2020 •

edited

Loading