-
Notifications
You must be signed in to change notification settings - Fork 415
WIP: New Lookahead sampling method #1367
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Dusty DeWeese <[email protected]> Signed-off-by: Alessandro Comodi <[email protected]> Signed-off-by: Keith Rothman <[email protected]>
Signed-off-by: Alessandro Comodi <[email protected]>
ff8b91c
to
b80858c
Compare
@@ -193,7 +199,7 @@ struct t_reachable_wire_inf { | |||
|
|||
/* used during Dijkstra expansion to store delay/congestion info lists for each relative coordinate for a given segment and channel type. | |||
* the list at each coordinate is later boiled down to a single representative cost entry to be stored in the final cost map */ | |||
typedef vtr::Matrix<Expansion_Cost_Entry> t_routing_cost_map; //[0..device_ctx.grid.width()-1][0..device_ctx.grid.height()-1] | |||
typedef vtr::NdMatrix<Expansion_Cost_Entry, 4> t_routing_cost_map; //[0..1][0..num_segments-1][0..device_ctx.grid.width()-1][0..device_ctx.grid.height()-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the first index modeling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0 for CHANX and 1 for CHANY
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the comment to be clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At a high level this seems promising. However its not very clear how the proposed sampling method actually works. I think this will need some careful commenting to outline how this sampling method works, and why it is constructed the way it is. At the moment this is not very easy to follow.
I've provided detailed comments below.
&dijkstra_data); | ||
} | ||
#if defined(VPR_USE_TBB) | ||
all_costs_mutex.lock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than manually locking/unlocking the mutex which is error prone. We should probably be using a lock guard to ensure it is automatically unlocked in all scenarios.
#include "vtr_math.h" | ||
#include "vtr_geometry.h" | ||
#include "vtr_time.h" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a file-scope comment here which outlines at a high-level how this sampling method works, as it is currently not clear.
For instance, I get that there is some kind of spatial partitioning and statistics collection about number of segment instances going on, which guides the selection of sample points. But it is not clear how these different ideas fit together, or what the motivation/intuition is behind this approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I wrote this, here's a high level overview:
For each segment type:
- Compute bounding box for that segment type. Within this bounding box:
- Count the number of nodes at each location.
- For each of NxN subsections:
a. Compute a histogram of counts for each location.
b. Choose a specific point within a quantile, e.g. 50-70%. This is to select a representative location while avoiding extremities. - Sort the points. I'm using a Z-curve for spatial locality, see the comments for details.
- Find all the nodes that fall within each sample point of the given type.
Maybe the inline comments should be expanded, but the code in find_sample_regions()
seems pretty readable, although reading the datatypes in the header first might help.
uint64_t i = x; | ||
i = (i ^ (i << 16)) & 0x0000ffff0000ffff; | ||
i = (i ^ (i << 8)) & 0x00ff00ff00ff00ff; | ||
i = (i ^ (i << 4)) & 0x0f0f0f0f0f0f0f0f; | ||
i = (i ^ (i << 2)) & 0x3333333333333333; | ||
i = (i ^ (i << 1)) & 0x5555555555555555; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty mysterious and so needs some detailed comments about how it works, and where the magic numbers come from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those aren't magic numbers, they are just a bit pattern like this:
00001111
00110011
01010101
This is a method to interleave bits with 0, so given bits xyzw
you get 0x0y0z0w
. Then interleaving two words together is just using this twice, shifting one of them left, and then ORing them.
https://lemire.me/blog/2018/01/08/how-fast-can-you-bit-interleave-32-bit-integers/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is used to sort points on a Morton (Z)-order curve - It interleaves the two spatial dimensions by literally interleaving their bits.
// the smallest bounding box containing a node | ||
static vtr::Rect<int> bounding_box_for_node(int node_ind) { | ||
auto& device_ctx = g_vpr_ctx.device(); | ||
auto& rr_graph = device_ctx.rr_nodes; | ||
int x = rr_graph.node_xlow(RRNodeId(node_ind)); | ||
int y = rr_graph.node_ylow(RRNodeId(node_ind)); | ||
|
||
return vtr::Rect<int>(vtr::Point<int>(x, y)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming of this isn't clear. This appears to just be the start point of the node, not its actual bounding box.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be best to consistently use low or high coordinates when sampling - the original code used annotations that were just points instead of rectangles.
static vtr::Rect<int> sample_window(const vtr::Rect<int>& bounding_box, int sx, int sy, int n) { | ||
return vtr::Rect<int>(sample(bounding_box, sx, sy, n), | ||
sample(bounding_box, sx + 1, sy + 1, n)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not clear what this is doing, needs comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This returns a sub-rectangle from bounding_box
split into n
by n
sub-rectangles, where sx
and sy
select the specific sub-rectangle e.g. for the bounding box (0,0) - (2,2) and n = 2
:
sx = 0 | sx = 1 | |
---|---|---|
sy = 0 | (0,0) - (1,1) | (1,0) - (2,1) |
sy = 1 | (0,1) - (1,2) | (1,1) - (2,2) |
* This means subsequent expansions on the same thread are likely | ||
* to cover a similar set of nodes, so they are more likely to be | ||
* cached. This improves performance by about 7%, which isn't a lot, | ||
* but not a bad improvement for a few lines of code. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to put performance numbers in the comments, we likely need to give more context: what benchmarks, what FPGA architecture, what system were they run on?
vtr::Point<int> location = region.points[0].location; | ||
|
||
// interleave bits of X and Y for a Z-curve ordering. | ||
region.order = interleave(location.x()) | (interleave(location.y()) << 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, this is somewhat mysterious.
I'd suggest wrapping the bit-wise operations inside interleave
(or perhaps a new function like get_region_order(location)
).
tbb::mutex all_costs_mutex; | ||
tbb::parallel_for_each(sample_regions, [&](const SampleRegion& region) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to parallelize over the sample regions. What is a sample region?
How many are there? Is it a fixed number, or does it scale with device size? These are important considerations for how scalable this parallelization will be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a (mostly) fixed number of sample regions: SAMPLE_GRID_SIZE^2 * num_segments
.
// used to sort the regions to improve caching | ||
uint64_t order; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the ordering is an implementation detail only used by find_sample_regions()
, would it be possible to just store it as a local vector in find_sample_regions()
? That would avoid exposing it to outside code through this struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These structs shouldn't be exposed, and I doubt it will be faster or save memory.
// for each segment type, find the nearest nodes to an equally spaced grid of points | ||
// within the bounding box for that segment type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be the key routine which ties everything about this sampling method together. This needs higher level comments which describe the overall approach it takes (i.e. the why), and how it all fits together.
Its not clear to me from reading the code exactly how many samples this new method produces. What are the parameters which control that (both in the code and with respect to architectural parameters like the number of segment types)? (This would also be good to include in the code comments). How much benefit is there from the additional samples? Has anyone done a study to figure out how lookahead quality varies with the number of samples? If not that might be worth investigating to see if it could be reduced. |
Closing, superseeded by #1449 |
Description
This PR adds a new node sampling method, taken from #1351.
It adds a region-based sampling, with each region of the grid containing a set of representative grid points, which on their behalf have a set of nodes to be expanded using dijkstra.
Related Issue
#1325
Motivation and Context
The current sampling method in the
router_lookahead_map.cpp
code does not provide enough entries in the cost map for Series 7 devices.The sampling method proposed in this PR has been extensively tested in SymbiFlow, and proved to have the lookahead generator to produce a correct and efficient lookahead.
The main issue is performance. In fact, the number of samples to expand is much greater than the previous method, hence the lookahead computation takes longer time, hence the addition of the per-region parallelization.
Unfortunately, even with a parallel lookahead computation, all the regression tests take longer time to compute.
However, the idea behind the lookahead is to have it computed once only per device. While this is ok for already designed devices (such as Series 7), this may be impossible for devices that are being generated using VtR, where the lookahead cannot be precomputed.
It may be possible to switch the sampling method and the lookahead computation with a command line argument, so that the faster original lookahead generation can still be used.
How Has This Been Tested?
Basic regression tests
Types of changes
Checklist: