Skip to content

WIP: New Lookahead sampling method #1367

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

acomodi
Copy link
Collaborator

@acomodi acomodi commented Jun 16, 2020

Description

This PR adds a new node sampling method, taken from #1351.

It adds a region-based sampling, with each region of the grid containing a set of representative grid points, which on their behalf have a set of nodes to be expanded using dijkstra.

Related Issue

#1325

Motivation and Context

The current sampling method in the router_lookahead_map.cpp code does not provide enough entries in the cost map for Series 7 devices.

The sampling method proposed in this PR has been extensively tested in SymbiFlow, and proved to have the lookahead generator to produce a correct and efficient lookahead.

The main issue is performance. In fact, the number of samples to expand is much greater than the previous method, hence the lookahead computation takes longer time, hence the addition of the per-region parallelization.
Unfortunately, even with a parallel lookahead computation, all the regression tests take longer time to compute.
However, the idea behind the lookahead is to have it computed once only per device. While this is ok for already designed devices (such as Series 7), this may be impossible for devices that are being generated using VtR, where the lookahead cannot be precomputed.

It may be possible to switch the sampling method and the lookahead computation with a command line argument, so that the faster original lookahead generation can still be used.

How Has This Been Tested?

Basic regression tests

Types of changes

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed

@probot-autolabeler probot-autolabeler bot added lang-cpp C/C++ code VPR VPR FPGA Placement & Routing Tool labels Jun 16, 2020
acomodi added 2 commits June 16, 2020 17:17
Signed-off-by: Dusty DeWeese <[email protected]>
Signed-off-by: Alessandro Comodi <[email protected]>
Signed-off-by: Keith Rothman <[email protected]>
@acomodi acomodi force-pushed the lookahead-sampling branch from ff8b91c to b80858c Compare June 16, 2020 15:17
@acomodi acomodi mentioned this pull request Jun 16, 2020
7 tasks
@@ -193,7 +199,7 @@ struct t_reachable_wire_inf {

/* used during Dijkstra expansion to store delay/congestion info lists for each relative coordinate for a given segment and channel type.
* the list at each coordinate is later boiled down to a single representative cost entry to be stored in the final cost map */
typedef vtr::Matrix<Expansion_Cost_Entry> t_routing_cost_map; //[0..device_ctx.grid.width()-1][0..device_ctx.grid.height()-1]
typedef vtr::NdMatrix<Expansion_Cost_Entry, 4> t_routing_cost_map; //[0..1][0..num_segments-1][0..device_ctx.grid.width()-1][0..device_ctx.grid.height()-1]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the first index modeling?

Copy link
Collaborator Author

@acomodi acomodi Jun 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 for CHANX and 1 for CHANY

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the comment to be clearer.

Copy link
Contributor

@kmurray kmurray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a high level this seems promising. However its not very clear how the proposed sampling method actually works. I think this will need some careful commenting to outline how this sampling method works, and why it is constructed the way it is. At the moment this is not very easy to follow.

I've provided detailed comments below.

&dijkstra_data);
}
#if defined(VPR_USE_TBB)
all_costs_mutex.lock();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than manually locking/unlocking the mutex which is error prone. We should probably be using a lock guard to ensure it is automatically unlocked in all scenarios.

#include "vtr_math.h"
#include "vtr_geometry.h"
#include "vtr_time.h"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a file-scope comment here which outlines at a high-level how this sampling method works, as it is currently not clear.

For instance, I get that there is some kind of spatial partitioning and statistics collection about number of segment instances going on, which guides the selection of sample points. But it is not clear how these different ideas fit together, or what the motivation/intuition is behind this approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I wrote this, here's a high level overview:

For each segment type:

  1. Compute bounding box for that segment type. Within this bounding box:
  2. Count the number of nodes at each location.
  3. For each of NxN subsections:
    a. Compute a histogram of counts for each location.
    b. Choose a specific point within a quantile, e.g. 50-70%. This is to select a representative location while avoiding extremities.
  4. Sort the points. I'm using a Z-curve for spatial locality, see the comments for details.
  5. Find all the nodes that fall within each sample point of the given type.

Maybe the inline comments should be expanded, but the code in find_sample_regions() seems pretty readable, although reading the datatypes in the header first might help.

Comment on lines +105 to +110
uint64_t i = x;
i = (i ^ (i << 16)) & 0x0000ffff0000ffff;
i = (i ^ (i << 8)) & 0x00ff00ff00ff00ff;
i = (i ^ (i << 4)) & 0x0f0f0f0f0f0f0f0f;
i = (i ^ (i << 2)) & 0x3333333333333333;
i = (i ^ (i << 1)) & 0x5555555555555555;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty mysterious and so needs some detailed comments about how it works, and where the magic numbers come from.

Copy link
Contributor

@HackerFoo HackerFoo Jun 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those aren't magic numbers, they are just a bit pattern like this:

00001111
00110011
01010101

This is a method to interleave bits with 0, so given bits xyzw you get 0x0y0z0w. Then interleaving two words together is just using this twice, shifting one of them left, and then ORing them.

https://lemire.me/blog/2018/01/08/how-fast-can-you-bit-interleave-32-bit-integers/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used to sort points on a Morton (Z)-order curve - It interleaves the two spatial dimensions by literally interleaving their bits.

Comment on lines +23 to +31
// the smallest bounding box containing a node
static vtr::Rect<int> bounding_box_for_node(int node_ind) {
auto& device_ctx = g_vpr_ctx.device();
auto& rr_graph = device_ctx.rr_nodes;
int x = rr_graph.node_xlow(RRNodeId(node_ind));
int y = rr_graph.node_ylow(RRNodeId(node_ind));

return vtr::Rect<int>(vtr::Point<int>(x, y));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming of this isn't clear. This appears to just be the start point of the node, not its actual bounding box.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be best to consistently use low or high coordinates when sampling - the original code used annotations that were just points instead of rectangles.

Comment on lines +33 to +36
static vtr::Rect<int> sample_window(const vtr::Rect<int>& bounding_box, int sx, int sy, int n) {
return vtr::Rect<int>(sample(bounding_box, sx, sy, n),
sample(bounding_box, sx + 1, sy + 1, n));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not clear what this is doing, needs comments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This returns a sub-rectangle from bounding_box split into n by n sub-rectangles, where sx and sy select the specific sub-rectangle e.g. for the bounding box (0,0) - (2,2) and n = 2:

sx = 0 sx = 1
sy = 0 (0,0) - (1,1) (1,0) - (2,1)
sy = 1 (0,1) - (1,2) (1,1) - (2,2)

Comment on lines +179 to +182
* This means subsequent expansions on the same thread are likely
* to cover a similar set of nodes, so they are more likely to be
* cached. This improves performance by about 7%, which isn't a lot,
* but not a bad improvement for a few lines of code. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to put performance numbers in the comments, we likely need to give more context: what benchmarks, what FPGA architecture, what system were they run on?

vtr::Point<int> location = region.points[0].location;

// interleave bits of X and Y for a Z-curve ordering.
region.order = interleave(location.x()) | (interleave(location.y()) << 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, this is somewhat mysterious.

I'd suggest wrapping the bit-wise operations inside interleave (or perhaps a new function like get_region_order(location)).

Comment on lines +441 to +442
tbb::mutex all_costs_mutex;
tbb::parallel_for_each(sample_regions, [&](const SampleRegion& region) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to parallelize over the sample regions. What is a sample region?

How many are there? Is it a fixed number, or does it scale with device size? These are important considerations for how scalable this parallelization will be.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a (mostly) fixed number of sample regions: SAMPLE_GRID_SIZE^2 * num_segments.

Comment on lines +29 to +30
// used to sort the regions to improve caching
uint64_t order;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the ordering is an implementation detail only used by find_sample_regions(), would it be possible to just store it as a local vector in find_sample_regions()? That would avoid exposing it to outside code through this struct.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These structs shouldn't be exposed, and I doubt it will be faster or save memory.

Comment on lines +114 to +115
// for each segment type, find the nearest nodes to an equally spaced grid of points
// within the bounding box for that segment type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be the key routine which ties everything about this sampling method together. This needs higher level comments which describe the overall approach it takes (i.e. the why), and how it all fits together.

@kmurray
Copy link
Contributor

kmurray commented Jun 17, 2020

The main issue is performance. In fact, the number of samples to expand is much greater than the previous method, hence the lookahead computation takes longer time, hence the addition of the per-region parallelization.

Its not clear to me from reading the code exactly how many samples this new method produces. What are the parameters which control that (both in the code and with respect to architectural parameters like the number of segment types)? (This would also be good to include in the code comments).

How much benefit is there from the additional samples? Has anyone done a study to figure out how lookahead quality varies with the number of samples? If not that might be worth investigating to see if it could be reduced.

@acomodi
Copy link
Collaborator Author

acomodi commented Aug 4, 2020

Closing, superseeded by #1449

@acomodi acomodi closed this Aug 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lang-cpp C/C++ code VPR VPR FPGA Placement & Routing Tool
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants