verilog-to-routing · MohamedElgammal · Dec 4, 2024 · Dec 22, 2024 · Dec 17, 2024 · Dec 23, 2024
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -47,6 +47,64 @@ _The following are changes which have been implemented in the VTR master branch
 
 ### Removed
 
+
+## v9.0.0 - 2024-12-23
+
+### Added
+  * Support for Advanced Architectures:
+    * 3D FPGA and RAD architectures.
+    * Architectures with hard Networks-on-Chip (NoCs).
+    * Distinct horizontal and vertical channel widths and types.
+    * Diagonal routing wires and other complex wire shapes (L-shaped, T-shaped, etc.).
+
+  * New Benchmark Suites:
+    * Koios: A deep-learning-focused benchmark suite with various design sizes.
+    * Hermes: Benchmarks utilizing hard NoCs.
+    * TitanNew: Large benchmarks targeting the Stratix 10 architecture.
+
+  * Commercial FPGA Architecture Captures:
+    * Intel’s Stratix 10 FPGA architecture.
+    * AMD’s 7-series FPGA architecture.
+
+  * Parmys Logic Synthesis Flow:
+    * Better Verilog language coverage.
+    * More efficient hard block mapping.
+
+  * VPR Graphics Visualizations:
+    * New interface for improved usability, with the underlying graphics rewritten using EZGL/GTK to allow more UI widgets.
+    * Algorithm breakpoint visualizations for placement and routing algorithm debugging.
+    * User-guided (manual) placement optimization features.
+    * Live socket connection between client graphical applications and the VTR engines (server mode).
+    * Interactive timing path analysis (IPA) client using server mode.
+
+  * Performance Enhancements:
+    * Parallel router for faster inter-cluster routing or flat routing.
+
+  * Re-clustering API to modify packing decisions during the flow.
+  * Support for floorplanning and placement constraints.
+  * Unified intra- and inter-cluster (flat) routing.
+  * Comprehensive web-based VTR utilities and API documentation.
+
+### Changed
+  * The default values of many command line options (e.g. inner_num is 0.5 instead of 1.0).
+  * Changes to placement engine
+    * Smart centroid initial placement algorithm.
+    * Multiple smart placement directed moves.
+    * Reinforcement learning-based placement algorithm.
+  * Changes to routing engine
+    * Faster lookahead creation.
+    * More accurate lookahead for large blocks.
+    * More efficient heap and pruning strategies.
+    * Maximum `pres_fac` capped to avoid possible numeric issues.
+
+
+### Fixed
+  * Many algorithmic and coding bugs are fixed in this release.
+
+### Removed
+  * Breadth-first (non-timing-driven) router.
+  * Non-linear congestion placement cost.
+
 ## v8.0.0 - 2020-03-24
 
 ### Added

diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -62,8 +62,8 @@ option(ODIN_SANITIZE "Enable building odin with sanitize flags" OFF)
 option(WITH_PARMYS "Enable Yosys as elaborator and parmys-plugin as partial mapper" ON)
 option(YOSYS_F4PGA_PLUGINS "Enable building and installing Yosys SystemVerilog and UHDM plugins" OFF)
 
-set(VTR_VERSION_MAJOR 8)
-set(VTR_VERSION_MINOR 1)
+set(VTR_VERSION_MAJOR 9)
+set(VTR_VERSION_MINOR 0)
 set(VTR_VERSION_PATCH 0)
 set(VTR_VERSION_PRERELEASE "dev")
 

diff --git a/README.developers.md b/README.developers.md
@@ -637,6 +637,10 @@ They can be used for FPGA architecture exploration for DL and also for tuning CA
 
 A typical approach to evaluating an algorithm change would be to run `koios_medium` (or `koios_medium_no_hb`) tasks from the nightly regression test (vtr_reg_nightly_test4), the `koios_large` (or `koios_large_no_hb`) and the `koios_proxy` (or `koios_proxy_no_hb`) tasks from the weekly regression test (vtr_reg_weekly). The nightly test contains smaller benchmarks, whereas the large designs are in the weekly regression test. To measure QoR for the entire benchmark suite, both nightly and weekly tests should be run and the results should be concatenated.
 
+Three of the `koios_large` circuits require special settings because they have long DSP chains, so they are split into separate tasks as follows:
+  * `bwave_like.float.large.v` and `bwave_like.fixed.large.v` are in the `vtr_reg_weekly/koios_bwave_large` task
+  * `dla_like.large.v` is in the `vtr_reg_weekly/koios_dla_large` task
+
 For evaluating an algorithm change in the Odin frontend, run `koios_medium` (or `koios_medium_no_hb`) tasks from the nightly regression test (vtr_reg_nightly_test4_odin) and the `koios_large_odin` (or `koios_large_no_hb_odin`) tasks from the weekly regression test (vtr_reg_weekly).
 
 The `koios_medium`, `koios_large`, and `koios_proxy` regression tasks run these benchmarks with complex_dsp functionality enabled, whereas `koios_medium_no_hb`, `koios_large_no_hb` and `koios_proxy_no_hb` regression tasks run these benchmarks without complex_dsp functionality. Normally, only the `koios_medium`, `koios_large`, and `koios_proxy` tasks should be enough for QoR.
@@ -651,6 +655,8 @@ The following table provides details on available Koios settings in VTR flow:
 | Nightly       | Medium designs     | k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml | &#10003; | vtr_reg_nightly_test4_odin/koios_medium | Odin | |
 | Nightly       | Medium designs     | k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml |          | vtr_reg_nightly_test4_odin/koios_medium_no_hb | Odin | |
 | Weekly        | Large designs      | k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml | &#10003; | vtr_reg_weekly/koios_large | Parmys | |
+| Weekly        | Large designs      | k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml | &#10003; | vtr_reg_weekly/koios_dla_large | Parmys | |
+| Weekly        | Large designs      | k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml | &#10003; | vtr_reg_weekly/koios_bwave_large | Parmys | |
 | Weekly        | Large designs      | k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml |          | vtr_reg_weekly/koios_large_no_hb | Parmys | |
 | Weekly        | Large designs      | k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml | &#10003; | vtr_reg_weekly/koios_large_odin | Odin | |
 | Weekly        | Large designs      | k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml |          | vtr_reg_weekly/koios_large_no_hb_odin | Odin | |
@@ -661,7 +667,15 @@ The following table provides details on available Koios settings in VTR flow:
 
 For more information refer to the [Koios benchmark home page](vtr_flow/benchmarks/verilog/koios/README.md).
 
-The following steps show a sequence of commands to run the `koios` tasks on the Koios benchmarks:
+To make running all the Koios benchmarks easier, especially with those circuits scattered across different tasks, there is an overall task list that runs all 40 Koios circuits. It runs every circuit with complex DSP functionality enabled; to disable complex DSP, edit the file to point to the `koios_*_no_hb` tasks. Run it as follows:
+
+```shell
+$ ../scripts/run_vtr_task.py -l koios_task_list.txt
+
+#Several hours later... they complete
+```
+
+If you want to run a subset of the koios benchmarks or run them without hard DSP blocks, you can run lower-level 'koios' tasks as follows:
 
 ```shell
 #From the VTR root
@@ -681,17 +695,6 @@ $ ../scripts/run_vtr_task.py regression_tests/vtr_reg_weekly/koios_sv_no_hb &
 
 #Several hours later... they complete
 
-#Parse the results
-$ ../scripts/python_libs/vtr/parse_vtr_task.py regression_tests/vtr_reg_nightly_test4/koios_medium
-$ ../scripts/python_libs/vtr/parse_vtr_task.py regression_tests/vtr_reg_weekly/koios_large
-$ ../scripts/python_libs/vtr/parse_vtr_task.py regression_tests/vtr_reg_weekly/koios_proxy
-$ ../scripts/python_libs/vtr/parse_vtr_task.py regression_tests/vtr_reg_weekly/koios_sv
-
-$ ../scripts/python_libs/vtr/parse_vtr_task.py regression_tests/vtr_reg_nightly_test4/koios_medium_no_hb
-$ ../scripts/python_libs/vtr/parse_vtr_task.py regression_tests/vtr_reg_weekly/koios_large_no_hb
-$ ../scripts/python_libs/vtr/parse_vtr_task.py regression_tests/vtr_reg_weekly/koios_proxy_no_hb
-$ ../scripts/python_libs/vtr/parse_vtr_task.py regression_tests/vtr_reg_weekly/koios_sv_no_hb
-
 #The run directory should now contain a summary parse_results.txt file
 $ head -5 vtr_reg_nightly_test4/koios_medium/<latest_run_dir>/parse_results.txt
 arch	  circuit	  script_params	  vtr_flow_elapsed_time	  vtr_max_mem_stage	  vtr_max_mem	  error	  odin_synth_time	  max_odin_mem	  parmys_synth_time	  max_parmys_mem	  abc_depth	  abc_synth_time	  abc_cec_time	  abc_sec_time	  max_abc_mem	  ace_time	  max_ace_mem	  num_clb	  num_io	  num_memories	  num_mult	  vpr_status	  vpr_revision	  vpr_build_info	  vpr_compiler	  vpr_compiled	  hostname	  rundir	  max_vpr_mem	  num_primary_inputs	  num_primary_outputs	  num_pre_packed_nets	  num_pre_packed_blocks	  num_netlist_clocks	  num_post_packed_nets	  num_post_packed_blocks	  device_width	  device_height	  device_grid_tiles	  device_limiting_resources	  device_name	  pack_mem	  pack_time	  placed_wirelength_est	  place_mem	  place_time	  place_quench_time	  placed_CPD_est	  placed_setup_TNS_est	  placed_setup_WNS_est	  placed_geomean_nonvirtual_intradomain_critical_path_delay_est	  place_delay_matrix_lookup_time	  place_quench_timing_analysis_time	  place_quench_sta_time	  place_total_timing_analysis_time	  place_total_sta_time	  min_chan_width	  routed_wirelength	  min_chan_width_route_success_iteration	  logic_block_area_total	  logic_block_area_used	  min_chan_width_routing_area_total	  min_chan_width_routing_area_per_tile	  min_chan_width_route_time	  min_chan_width_total_timing_analysis_time	  min_chan_width_total_sta_time	  crit_path_routed_wirelength	  crit_path_route_success_iteration	  crit_path_total_nets_routed	  crit_path_total_connections_routed	  crit_path_total_heap_pushes	  crit_path_total_heap_pops	  critical_path_delay	  geomean_nonvirtual_intradomain_critical_path_delay	  setup_TNS	  setup_WNS	  hold_TNS	  hold_WNS	  crit_path_routing_area_total	  crit_path_routing_area_per_tile	  router_lookahead_computation_time	  crit_path_route_time	  crit_path_total_timing_analysis_time	  crit_path_total_sta_time	 

diff --git a/README.md b/README.md
@@ -36,15 +36,15 @@ See the [full license](LICENSE.md) for details.
 ## How to Cite
 The following paper may be used as a general citation for VTR:
 
-K. E. Murray, O. Petelin, S. Zhong, J. M. Wang, M. ElDafrawy, J.-P. Legault, E. Sha, A. G. Graham, J. Wu, M. J. P. Walker, H. Zeng, P. Patros, J. Luu, K. B. Kent and V. Betz "VTR 8: High Performance CAD and Customizable FPGA Architecture Modelling", ACM TRETS, 2020.
+M. A. Elgammal, A. Mohaghegh, S. G. Shahrouz, F. Mahmoudi, F. Kosar, K. Talaei, J. Fife, D. Khadivi, K. Murray, A. Boutros, K. B. Kent, J. Goeders, and V. Betz, "VTR 9: Open-Source CAD for Fabric and Beyond FPGA Architecture Exploration", ACM TRETS, 2025.
 
 Bibtex:
 ```
-@article{vtr8,
-  title={VTR 8: High Performance CAD and Customizable FPGA Architecture Modelling},
-  author={Murray, Kevin E. and Petelin, Oleg and Zhong, Sheng and Wang, Jai Min and ElDafrawy, Mohamed and Legault, Jean-Philippe and Sha, Eugene and Graham, Aaron G. and Wu, Jean and Walker, Matthew J. P. and Zeng, Hanqing and Patros, Panagiotis and Luu, Jason and Kent, Kenneth B. and Betz, Vaughn},
+@article{vtr9,
+  title={VTR 9: Open-Source CAD for Fabric and Beyond FPGA Architecture Exploration},
+  author={Elgammal, Mohamed A. and Mohaghegh, Amin and Shahrouz, Soheil G. and Mahmoudi, Fatemehsadat and Kosar, Fahrican and Talaei, Kimia and Fife, Joshua and Khadivi, Daniel and Murray, Kevin and Boutros, Andrew and Kent, Kenneth B. and Goeders, Jeff and Betz, Vaughn},
   journal={ACM Trans. Reconfigurable Technol. Syst.},
-  year={2020}
+  year={2025}
 }
 ```
 

diff --git a/doc/src/vpr/command_line_usage.rst b/doc/src/vpr/command_line_usage.rst
@@ -1284,13 +1284,17 @@ VPR uses a negotiated congestion algorithm (based on Pathfinder) to perform rout
 
     This option attempts to verify the minimum by routing at successively lower channel widths until two consecutive routing failures are observed.
 
-.. option:: --router_algorithm {parallel | timing_driven}
+.. option:: --router_algorithm {timing_driven | parallel | parallel_decomp}
 
-    Selects which router algorithm to use.
+    Selects which router algorithm to use.
 
-    .. warning::
+    * ``timing_driven`` is the default single-threaded PathFinder algorithm.
+
+    * ``parallel`` partitions the device to route non-overlapping nets in parallel. Use with the ``-j`` option to specify the number of threads.
+
+    * ``parallel_decomp`` decomposes nets for aggressive parallelization :cite:`kosar2024parallel`. This imposes additional constraints and may result in worse QoR for difficult circuits.
 
-        The ``parallel`` router is experimental. (TODO: more explanation)
+    Note that both ``parallel`` and ``parallel_decomp`` are timing-driven routers.
 
     **Default:** ``timing_driven``
 

diff --git a/doc/src/z_references.bib b/doc/src/z_references.bib
@@ -430,3 +430,9 @@ @inproceedings{koios_benchmarks
   year={2021}
 }
 
+@inproceedings{kosar2024parallel,
+  title={Parallel FPGA Routing with On-the-Fly Net Decomposition},
+  author={Kosar, Fahrican and Stojilovic, Mirjana and Betz, Vaughn},
+  booktitle={The 23rd International Conference on Field-Programmable Technology},
+  year={2024}
+}
diff --git a/vpr/src/base/vpr_types.h b/vpr/src/base/vpr_types.h
@@ -1641,7 +1641,7 @@ typedef t_routing_status<AtomNetId> t_atom_net_routing_status;
 
 /** Edge between two RRNodes */
 struct t_node_edge {
-    t_node_edge(RRNodeId fnode, RRNodeId tnode)
+    t_node_edge(RRNodeId fnode, RRNodeId tnode) noexcept
         : from_node(fnode)
         , to_node(tnode) {}
 
@@ -1654,10 +1654,18 @@ struct t_node_edge {
     }
 };
 
-///@brief Non-configurably connected nodes and edges in the RR graph
+/**
+ * @brief Groups of non-configurably connected nodes and edges in the RR graph.
+ * @note Each group is represented by a node set and an edge set, stored at the same index.
+ *
+ * For example, in an architecture with L-shaped wires formed by an x- and y-directed segment
+ * connected by an electrical short, each L-shaped wire corresponds to a new group. The group's
+ * index provides access to its node set (containing two RRNodeIds) and edge set (containing two
+ * directed edges in opposite directions).
+ */
 struct t_non_configurable_rr_sets {
-    std::set<std::set<RRNodeId>> node_sets;
-    std::set<std::set<t_node_edge>> edge_sets;
+    std::vector<std::set<RRNodeId>> node_sets;
+    std::vector<std::set<t_node_edge>> edge_sets;
 };
 
 ///@brief Power estimation options
@@ -1669,11 +1677,11 @@ struct t_power_opts {
  * @param max= Maximum channel width between x_max and y_max.
  * @param x_min= Minimum channel width of horizontal channels. Initialized when init_chan() is invoked in rr_graph2.cpp
  * @param y_min= Same as above but for vertical channels.
- * @param x_max= Maximum channel width of horiozntal channels. Initialized when init_chan() is invoked in rr_graph2.cpp
+ * @param x_max= Maximum channel width of horizontal channels. Initialized when init_chan() is invoked in rr_graph2.cpp
  * @param y_max= Same as above but for vertical channels.
  * @param x_list= Stores the channel width of all horizontal channels and thus goes from [0..grid.height()]
  * (imagine a 2D Cartesian grid with horizontal lines starting at every grid point on a line parallel to the y-axis)
- * @param y_list= Stores the channel width of all verical channels and thus goes from [0..grid.width()]
+ * @param y_list= Stores the channel width of all vertical channels and thus goes from [0..grid.width()]
  * (imagine a 2D Cartesian grid with vertical lines starting at every grid point on a line parallel to the x-axis)
  */