Commit 1c5de6e

1. Moving complex_dsp macro out from benchmark verilog files into a separate include file.
2. Updated task files to include this new include file.
3. Added new tests that run these benchmarks without the complex_dsp macro defined.
4. Updated documentation to cleanly describe the usage of these benchmarks (including measuring QoR)
1 parent 35f0068 commit 1c5de6e

File tree

27 files changed

+245
-20
lines changed

ODIN_II/regression_test/benchmark/task/koios/task.conf

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
########################
2-
# large benchmarks config
2+
# Koios benchmarks config
33
########################
44

55
regression_params=--disable_simulation --disable_parallel_jobs --verbose
@@ -8,13 +8,13 @@ script_simulation_params=--limit_ressource --time_limit 14400s
88

99
# setup the architecture
1010
archs_dir=../vtr_flow/arch/COFFE_22nm
11-
12-
# one arch allows it to run faster given it is single threaded
1311
arch_list_add=k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml
1412

13+
# setup the benchmarks
1514
circuits_dir=../../../../vtr_flow/benchmarks/verilog/koios
15+
includes_dir=benchmarks/verilog/koios
16+
include_list_add=complex_dsp_include.v
1617

17-
# glob the large benchmark and the vtr ones to prevent duplicate run
1818
circuit_list_add=tpu_like.small.v
1919
circuit_list_add=dla_like.small.v
2020
circuit_list_add=bnn.v

README.developers.md

Lines changed: 30 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -626,6 +626,36 @@ stratixiv_arch.timing.xml stereo_vision_stratixiv_arch_timing.blif 0208312
626626
stratixiv_arch.timing.xml cholesky_mc_stratixiv_arch_timing.blif 0208312 success 140214 108592 67410 5444 121 90 -1 111 151 -1 -1 5221059 8.16972 -454610 -8.16972 1518597 15 0 0 2.38657e+08 21915.3 9.34704 -531231 -9.34704 0 0 211.12 364.32 490.24 6356252 -1 -1
627627
```
628628

629+
### Example: Koios Benchmarks QoR Measurement
630+
631+
The [Koios benchmarks](https://github.com/verilog-to-routing/vtr-verilog-to-routing/tree/master/vtr_flow/benchmarks/verilog/koios) are a group of Deep Learning benchmark circuits distributed with the VTR project.
632+
They are provided as synthesizable Verilog and can be re-mapped to VTR-supported architectures.
633+
They consist mostly of medium to large sized circuits from Deep Learning (DL).
634+
They can be used for FPGA architecture exploration for DL and also for tuning CAD tools.
635+
636+
A typical approach to evaluating an algorithm change would be to run the `koios` or `koios_no_complex_dsp` task from the nightly regression test (vtr_reg_nightly_test4) and the `koios` or `koios_no_complex_dsp` task from the weekly regression test (vtr_reg_weekly). The nightly test contains the smaller benchmarks, whereas the large designs are in the weekly regression test. The following steps show an example sequence of commands.
637+
638+
```shell
639+
#From the VTR root
640+
$ cd vtr_flow/tasks
641+
642+
#Run the VTR benchmarks
643+
$ ../scripts/run_vtr_task.py regression_tests/vtr_reg_nightly_test4/koios
644+
645+
#Several hours later... they complete
646+
647+
#Parse the results
648+
$ ../scripts/python_libs/vtr/parse_vtr_task.py regression_tests/vtr_reg_nightly_test4/koios
649+
650+
#The run directory should now contain a summary parse_results.txt file
651+
$ head -5 vtr_reg_nightly_test4/koios/latest/parse_results.txt
652+
arch circuit script_params vtr_flow_elapsed_time error odin_synth_time max_odin_mem abc_depth abc_synth_time abc_cec_time abc_sec_time max_abc_mem ace_time max_ace_mem num_clb num_io num_memories num_mult vpr_status vpr_revision vpr_build_info vpr_compiler vpr_compiled hostname rundir max_vpr_mem num_primary_inputs num_primary_outputs num_pre_packed_nets num_pre_packed_blocks num_netlist_clocks num_post_packed_nets num_post_packed_blocks device_width device_height device_grid_tiles device_limiting_resources device_name pack_time placed_wirelength_est place_time place_quench_time placed_CPD_est placed_setup_TNS_est placed_setup_WNS_est placed_geomean_nonvirtual_intradomain_critical_path_delay_est place_delay_matrix_lookup_time place_quench_timing_analysis_time place_quench_sta_time place_total_timing_analysis_time place_total_sta_time min_chan_width routed_wirelength min_chan_width_route_success_iteration logic_block_area_total logic_block_area_used min_chan_width_routing_area_total min_chan_width_routing_area_per_tile min_chan_width_route_time min_chan_width_total_timing_analysis_time min_chan_width_total_sta_time crit_path_routed_wirelength crit_path_route_success_iteration crit_path_total_nets_routed crit_path_total_connections_routed crit_path_total_heap_pushes crit_path_total_heap_pops critical_path_delay geomean_nonvirtual_intradomain_critical_path_delay setup_TNS setup_WNS hold_TNS hold_WNS crit_path_routing_area_total crit_path_routing_area_per_tile router_lookahead_computation_time crit_path_route_time crit_path_total_timing_analysis_time crit_path_total_sta_time
653+
k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml tpu_like.small.v common 2871.10 9.36 235096 5 619.21 -1 -1 159760 -1 -1 1119 355 14 -1 success v8.0.0-4161-g8f4b3e9ca release IPO VTR_ASSERT_LEVEL=2 GNU 7.5.0 on Linux-4.15.0-124-generic x86_64 2021-05-28T23:09:34 jupiter0 /export/aman/vtr_aman/vtr-verilog-to-routing/vtr_flow/tasks 2568860 355 289 50215 41827 2 23224 2053 136 136 18496 dsp_top auto 1233.72 457725 91.70 0.38 7.24742 -105267 -7.24742 2.59789 14.13 0.101267 0.0738583 24.91 18.6865 -1 561916 17 5.92627e+08 1.03195e+08 4.09037e+08 22114.9 16.37 32.3744 25.1979 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
654+
k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml dla_like.small.v common 7527.41 42.24 729876 5 3941.31 -1 -1 630244 -1 -1 5545 194 828 -1 success v8.0.0-4161-g8f4b3e9ca release IPO VTR_ASSERT_LEVEL=2 GNU 7.5.0 on Linux-4.15.0-124-generic x86_64 2021-05-28T23:09:34 jupiter0 /export/aman/vtr_aman/vtr-verilog-to-routing/vtr_flow/tasks 4409476 194 13 217044 174718 1 91037 6708 164 164 26896 memory auto 1604.22 969627 663.41 2.84 5.61569 -424718 -5.61569 5.61569 21.49 0.584073 0.385993 104.796 73.1698 -1 1450542 14 8.6211e+08 3.01197e+08 5.93540e+08 22068.0 53.97 132.203 96.049 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
655+
k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml bnn.v common 2028.52 40.37 577472 3 240.94 -1 -1 513656 -1 -1 5695 260 0 -1 success v8.0.0-4161-g8f4b3e9ca release IPO VTR_ASSERT_LEVEL=2 GNU 7.5.0 on Linux-4.15.0-124-generic x86_64 2021-05-28T23:09:34 jupiter0 /export/aman/vtr_aman/vtr-verilog-to-routing/vtr_flow/tasks 2195980 260 122 231647 179602 1 86181 6140 83 83 6889 clb auto 613.32 940951 503.35 2.87 6.4402 -131403 -6.4402 6.4402 5.41 0.753268 0.564332 85.331 60.8639 -1 1224690 16 2.13666e+08 1.74902e+08 1.51359e+08 21971.1 50.49 114.382 84.8538 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
656+
k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml attention_layer.v common 1330.99 11.83 1095592 7 59.16 -1 -1 560612 -1 -1 1248 1058 161 -1 success v8.0.0-4161-g8f4b3e9ca release IPO VTR_ASSERT_LEVEL=2 GNU 7.5.0 on Linux-4.15.0-124-generic x86_64 2021-05-28T23:09:34 jupiter0 /export/aman/vtr_aman/vtr-verilog-to-routing/vtr_flow/tasks 1180420 1058 16 47407 39134 1 26605 2588 86 86 7396 dsp_top auto 728.70 234151 118.11 0.71 5.89837 -78343.6 -5.89837 5.89837 6.64 0.181478 0.146942 31.9659 24.5807 -1 366899 17 2.32446e+08 8.36361e+07 1.62201e+08 21930.9 16.25 40.6352 32.1556 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
657+
```
658+
629659
## Comparing QoR Measurements
630660
Once you have two (or more) sets of QoR measurements they now need to be compared.
631661

doc/src/vtr/benchmarks.rst

Lines changed: 1 addition & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -97,17 +97,12 @@ These designs use many precisions including binary, different fixed point types
9797
eltwise_layer Matrix elementwise add/sub/mult
9898
================= ======================================
9999

100-
Koios benchmarks are fully compatible with the full VTR flow. Some Koios benchmarks use advanced DSP features that are available in only a few FPGA architectures provided with VTR. This is because they instantiate DSP macros to implement native FP16 multiplications or use the hard dedicated chains, and these are architecture-specific. If users want to use a different FPGA architecture file, they can replace the macro instantiations in the benchmarks with their equivalents from the FPGA architectures they wish to use.
101-
102-
Alternatively, users can disable these advanced features. The macro ``complex_dsp`` can be used for this purpose. If complex_dsp is defined in a benchmark file (using ```define complex_dsp`` in the beginning of the benchmark file), then advanced DSP features mentioned above will be used. If a user wants to run a Koios benchmark with FPGA architectures that don't have these advanced DSP features (for example, the flagship architectures: ``$VTR_ROOT/vtr_flow/arch/timing/k6_frac_N10_*_mem32K_40nm*``), then they can remove the line defining the complex_dsp macro. This enables the same functionality with behavioral Verilog that is mapped to the FPGA soft logic when an architecture without the required macro definitions is used.
103-
104100
The VTR benchmarks are provided as Verilog (enabling full flexibility to modify and change how the designs are implemented) under: ::
105101

106102
$VTR_ROOT/vtr_flow/benchmarks/verilog/koios
107103

108-
The FPGA architectures with advanced DSP that work out-of-the-box with Koios benchmarks are available here: ::
104+
To use these benchmarks, please see the documentation in the README file at: https://github.com/verilog-to-routing/vtr-verilog-to-routing/tree/master/vtr_flow/benchmarks/verilog/koios
109105

110-
$VTR_ROOT/vtr_flow/arch/COFFE_22nm/k6FracN10LB_mem20K_complexDSP_customSB_22nm.*
111106

112107
MCNC20 Benchmarks
113108
-----------------

vtr_flow/benchmarks/verilog/koios/README.md

Lines changed: 33 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,39 @@ Koios benchmarks are a set of Deep Learning (DL) benchmarks for FPGA architectur
66
## Documentation
77
A brief documentation of Koios benchmarks is available [here](https://docs.verilogtorouting.org/en/latest/vtr/benchmarks/#koios-benchmarks).
88

9+
## How to Use
10+
Koios benchmarks are fully compatible with the full VTR flow. They can be used using the standard VTR flow described [here](https://docs.verilogtorouting.org/en/latest/vtr/running_vtr/).
11+
12+
Some Koios benchmarks use advanced DSP features that are available in only a few FPGA architectures provided with VTR. These benchmarks instantiate DSP macros to implement native FP16 or BF16 multiplications or use the hard dedicated chains, and these are architecture-specific. However, these advanced/complex DSP features can be enabled or disabled with the `complex_dsp` macro. If `complex_dsp` is defined in a benchmark file (using `` `define complex_dsp ``), the advanced DSP features mentioned above are used. If `complex_dsp` is not defined, equivalent functionality is obtained through behavioral Verilog that gets mapped to the FPGA soft logic.
13+
14+
From a flow perspective, a feature added to VTR in June 2021 makes it easy to enable or disable a macro such as `complex_dsp`. It allows a separate Verilog header file to be specified when running a flow or task, so a benchmark's Verilog file doesn't have to be modified. `run_vtr_flow` users need to add `-include <filename>`. `run_vtr_task` users need to specify `includes_dir` and `include_list_add` in the task file. An example task file can be seen [here](https://github.com/verilog-to-routing/vtr-verilog-to-routing/blob/master/vtr_flow/tasks/regression_tests/vtr_reg_basic/hdl_include/config/config.txt).
15+
16+
Using such advanced DSP features is common in modern designs used with contemporary FPGAs. When using these benchmarks and enabling these advanced features, an FPGA architecture that supports these features must be provided. Supporting these features implies that the architecture XML file provided to VTR must describe such features (e.g. by defining a hard block macro DSP slice). We provide such architectures with Koios. The FPGA architectures with advanced DSP that work out-of-the-box with Koios benchmarks are available here: ::
17+
18+
$VTR_ROOT/vtr_flow/arch/COFFE_22nm/k6FracN10LB_mem20K_complexDSP_customSB_22nm.*
19+
20+
21+
When disabling these advanced features (by not defining `complex_dsp` as mentioned above), users can run these benchmarks with FPGA architectures that don't have these advanced DSP features. That is, an architecture XML file without the required hard macro definitions can be used. For example, the flagship architectures available here: ::
22+
23+
$VTR_ROOT/vtr_flow/arch/timing/k6_frac_N10_*_mem32K_40nm*
24+
25+
If users want to use a different FPGA architecture file, they can replace the macro instantiations in the benchmarks with their equivalents from the FPGA architectures they wish to use.
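
To make the two modes concrete, here is a minimal sketch that assembles a `run_vtr_flow` command line with or without the include file. The helper `build_flow_command` is hypothetical, and the script name `run_vtr_flow.py` and the circuit-then-architecture argument order are assumptions; only the `-include <filename>` option and the file paths come from the text in this commit.

```python
import shlex


def build_flow_command(circuit, arch, include_file=None, script="run_vtr_flow.py"):
    """Assemble an argument list for the flow script (hypothetical helper).

    Passing include_file adds `-include <file>`, which is how complex_dsp gets
    defined; omitting it leaves the macro undefined, so the benchmark falls
    back to behavioral Verilog mapped to soft logic.
    """
    cmd = [script, circuit, arch]
    if include_file is not None:
        cmd += ["-include", include_file]
    return cmd


if __name__ == "__main__":
    circuit = "benchmarks/verilog/koios/attention_layer.v"

    # Complex DSP enabled: needs an architecture that describes the hard DSP macros.
    with_dsp = build_flow_command(
        circuit,
        "arch/COFFE_22nm/k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml",
        include_file="benchmarks/verilog/koios/complex_dsp_include.v",
    )
    # Complex DSP disabled: a flagship architecture is sufficient.
    without_dsp = build_flow_command(
        circuit,
        "arch/timing/k6_frac_N10_frac_chain_depop50_mem32K_40nm.xml",
    )
    print(shlex.join(with_dsp))
    print(shlex.join(without_dsp))
```

Check the script's own help output for the exact invocation before relying on this form.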
26+
27+
## Regressions
28+
Koios benchmarks are tested by the following regression tests in VTR:
29+
| Suite |Test Description | Config file | Wall-clock time |
30+
|---------------|----------------------|---------------|-------------------|
31+
| Strong | A test circuit. Goal is to check the architecture files. | tasks/regression_tests/vtr_reg_strong/koios | 6 seconds |
32+
| Strong | Same test circuit without enabling complex dsp features | tasks/regression_tests/vtr_reg_strong/koios_no_complex_dsp | 6 seconds|
33+
| Nightly | Small-to-medium sized designs from Koios run with one arch file | tasks/regression_tests/vtr_reg_nightly_test4/koios | 2 hours with -j3 |
34+
| Nightly | Small-to-medium sized designs from Koios run with an arch file without enabling complex dsp features | tasks/regression_tests/vtr_reg_nightly_test4/koios_no_complex_dsp | 2 hours with -j3 |
35+
| Nightly | A small design from Koios run with various flavors of the arch file that enables complex dsp features | tasks/regression_tests/vtr_reg_nightly_test4/koios_multi_arch | 2 hours with -j3 |
36+
| Weekly | Large designs from Koios run with one arch file | tasks/regression_tests/vtr_reg_weekly/koios | a little over 24 hours with -j4 |
37+
| Weekly | Large designs from Koios run with an arch file without enabling complex dsp features | tasks/regression_tests/vtr_reg_weekly/koios_no_complex_dsp | a little over 24 hours with -j4 |
38+
39+
## Collecting QoR measurements
40+
For collecting QoR measurements on Koios benchmarks, follow the instructions [here](https://docs.verilogtorouting.org/en/latest/dev/developing/#collecting-qor-measurements).
41+
942
## How to Cite
1043
The following paper may be used as a citation for Koios:
1144

vtr_flow/benchmarks/verilog/koios/attention_layer.v

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,7 +4,7 @@
44

55
//`define SIMULATION_MEMORY
66
//`define SIMULATION_addfp
7-
`define complex_dsp
7+
88
`define VECTOR_DEPTH 64 //Q,K,V vector size
99
`define DATA_WIDTH 16
1010
`define VECTOR_BITS 1024 // 16 bit each (16x64)
Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
`define complex_dsp

vtr_flow/benchmarks/verilog/koios/conv_layer_hls.v

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@
1818
// Abridged for VTR by: Daniel Rauch
1919
//////////////////////////////////////////////////////////////////////////////
2020

21-
`define complex_dsp
21+
2222
module dpram (
2323

2424
    clk,

vtr_flow/benchmarks/verilog/koios/dla_like.medium.v

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,7 @@
1515
//4. Double-buffering after each layer.
1616
///////////////////////////////////////////////////////////////////////////////
1717

18-
`define complex_dsp
18+
1919
module DLA (
2020
input clk,
2121
input i_reset,

vtr_flow/benchmarks/verilog/koios/dla_like.small.v

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,7 @@
1515
//4. Double-buffering after each layer.
1616
///////////////////////////////////////////////////////////////////////////////
1717

18-
`define complex_dsp
18+
1919
module DLA (
2020
input clk,
2121
input i_reset,

vtr_flow/benchmarks/verilog/koios/eltwise_layer.v

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -56,7 +56,7 @@
5656
//section by section. The number of rows will be programmed
5757
//in the "iterations" register in the design.
5858

59-
`define complex_dsp
59+
6060
`define BFLOAT16
6161

6262
// IEEE Half Precision => EXPONENT = 5, MANTISSA = 10

vtr_flow/benchmarks/verilog/koios/gemm_layer.v

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,7 @@
1919
// with a simpler DSP (just a fixed point multiplier) like in the
2020
// flagship arch timing/k6_frac_N10_frac_chain_depop50_mem32K_40nm.xml
2121
/////////////////////////////////////////////////////////////////////////
22-
`define complex_dsp
22+
2323
`define BFLOAT16
2424

2525
// IEEE Half Precision => EXPONENT = 5, MANTISSA = 10

vtr_flow/benchmarks/verilog/koios/softmax.v

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@
1414
//////////////////////////////////////////////////////////////////////////////
1515

1616
//softmax_p8_smem_rfloat16_alut_v512_b2_-0.1_0.1.v
17-
`define complex_dsp
17+
1818
`ifndef DEFINES_DONE
1919
`define DEFINES_DONE
2020
`define EXPONENT 5

vtr_flow/benchmarks/verilog/koios/test.v

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,6 @@
1212
/////////////////////////////////////////////////////////
1313

1414

15-
`define complex_dsp
1615
`define BFLOAT16
1716

1817
// IEEE Half Precision => EXPONENT = 5, MANTISSA = 10

vtr_flow/benchmarks/verilog/koios/tiny_darknet_like.small.v

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,7 @@
1616
//////////////////////////////////////////////////////////////////////////////
1717

1818
`timescale 1 ns / 1 ps
19-
`define complex_dsp
19+
2020
module td_fused_top_Block_entry_proc_proc392 (
2121
ap_clk,
2222
ap_rst,

vtr_flow/tasks/regression_tests/vtr_reg_nightly_test4/koios/config/config.txt

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,9 @@ circuits_dir=benchmarks/verilog/koios
99
# Path to directory of architectures to use
1010
archs_dir=arch/COFFE_22nm
1111

12+
# Directory containing the verilog includes file(s)
13+
includes_dir=benchmarks/verilog/koios
14+
1215
# Add circuits to list to sweep
1316
circuit_list_add=tpu_like.small.v
1417
circuit_list_add=dla_like.small.v
@@ -26,6 +29,9 @@ circuit_list_add=softmax.v
2629
# Add architectures to list to sweep
2730
arch_list_add=k6FracN10LB_mem20K_complexDSP_customSB_22nm.xml
2831

32+
# Add include files to the list
33+
include_list_add=complex_dsp_include.v
34+
2935
# Parse info and how to parse
3036
parse_file=vpr_standard.txt
3137

vtr_flow/tasks/regression_tests/vtr_reg_nightly_test4/koios_multi_arch/config/config.txt

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,9 @@ circuits_dir=benchmarks/verilog/koios
99
# Path to directory of architectures to use
1010
archs_dir=arch/COFFE_22nm
1111

12+
# Directory containing the verilog includes file(s)
13+
includes_dir=benchmarks/verilog/koios
14+
1215
# Add circuits to list to sweep
1316
circuit_list_add=conv_layer.v
1417

@@ -25,6 +28,9 @@ arch_list_add=k6FracN10LB_mem20K_complexDSP_customSB_22nm.clustered.xml
2528
arch_list_add=k6FracN10LB_mem20K_complexDSP_customSB_22nm.clustered.densest.xml
2629
arch_list_add=k6FracN10LB_mem20K_complexDSP_customSB_22nm.clustered.denser.xml
2730

31+
# Add include files to the list
32+
include_list_add=complex_dsp_include.v
33+
2834
# Parse info and how to parse
2935
parse_file=vpr_standard.txt
3036

Lines changed: 38 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,38 @@
1+
#
2+
############################################
3+
# Configuration file for running experiments
4+
##############################################
5+
6+
# Path to directory of circuits to use
7+
circuits_dir=benchmarks/verilog/koios
8+
9+
# Path to directory of architectures to use
10+
archs_dir=arch/timing
11+
12+
# Add circuits to list to sweep
13+
circuit_list_add=tpu_like.small.v
14+
circuit_list_add=dla_like.small.v
15+
circuit_list_add=bnn.v
16+
circuit_list_add=attention_layer.v
17+
circuit_list_add=conv_layer_hls.v
18+
circuit_list_add=conv_layer.v
19+
circuit_list_add=eltwise_layer.v
20+
circuit_list_add=robot_rl.v
21+
circuit_list_add=reduction_layer.v
22+
circuit_list_add=spmv.v
23+
circuit_list_add=softmax.v
24+
25+
# Add architectures to list to sweep
26+
arch_list_add=k6_frac_N10_frac_chain_depop50_mem32K_40nm.xml
27+
28+
# Parse info and how to parse
29+
parse_file=vpr_standard.txt
30+
31+
# How to parse QoR info
32+
qor_parse_file=qor_standard.txt
33+
34+
# Pass requirements
35+
pass_requirements_file=pass_requirements.txt
36+
37+
#Script parameters
38+
script_params=-track_memory_usage -crit_path_router_iterations 100 --route_chan_width 300
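
The task files in this commit differ mainly in whether they set `includes_dir` and `include_list_add`. The sketch below shows one way to read such a file and check that difference. It is not VTR's own parser: `parse_task_config` is a hypothetical helper, and it assumes simple `key=value` lines, `#` comments, and that every `*_list_add` key accumulates values.

```python
from collections import defaultdict


def parse_task_config(text):
    """Split a task config into scalar settings and accumulated lists.

    Assumption: keys ending in `_list_add` may repeat and each occurrence
    appends one value; any other key holds a single value.
    """
    scalars = {}
    lists = defaultdict(list)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.endswith("_list_add"):
            lists[key[: -len("_list_add")]].append(value)
        else:
            scalars[key] = value
    return scalars, dict(lists)


if __name__ == "__main__":
    config = """
    # Directory containing the verilog includes file(s)
    includes_dir=benchmarks/verilog/koios
    circuit_list_add=tpu_like.small.v
    circuit_list_add=bnn.v
    include_list_add=complex_dsp_include.v
    """
    scalars, lists = parse_task_config(config)
    uses_complex_dsp = "complex_dsp_include.v" in lists.get("include", [])
    print(scalars, lists, uses_complex_dsp)
```

Run against the new 38-line config above, `lists` would contain no `include` entry, which is what leaves `complex_dsp` undefined for that task.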
