Submitting Koios benchmarks #1753
Conversation
Thanks Aman. Yes, a CI test should be added to vtr_reg_nightly. @ArashAhmadian is currently beefing up vtr_reg_nightly and will likely refactor it into two sub-tests (so we can run on two kokoro machines in parallel), so I'm adding him to the conversation.
Regarding adding tests: there is one main architecture that we've added in this PR (all others are just variations of that architecture), and there are 20 designs. From a wall clock time perspective, the designs take anywhere from 2 minutes to 24 hours to run. I am thinking that we add 3 tests - one in vtr_reg_strong, one in vtr_reg_nightly and one in vtr_reg_weekly. We just need to ensure that each design gets tested with one arch variation. So, we can distribute the design/arch combinations across these tests, keeping in mind that the time taken by each test doesn't exceed the set limit.
Yes, that makes sense. You could test all designs on one arch, and then pick one or two interesting designs and test them on all archs. Catches almost as many errors as the full cross product, but with much less cpu time. @ArashAhmadian is in the middle of restructuring the nightly regtests to both increase their coverage and reduce their wall clock time to about four hours. So we should structure the nightly tests for this suite to also fit within four hours wall clock (using multiple cpus, and probably in a new nightly test so they get their own kokoro machine in CI). We'll have to be pragmatic about the CAD settings / circuit combinations to keep the runtime down to that. Weekly can run any really big/slow designs, but it won't get nearly as much attention as it isn't run on each pull request.
I've added 4 tests:
Some designs in the suite use advanced features of the DSP slice in the architecture and instantiate the DSP slice hard macro in the RTL. But to be able to run these designs on a simpler arch (like the Stratix IV based flagship arch), we have provided behavioral code that replaces the instantiation of the DSP slice. This is handled using a macro (`define complex_dsp) in the Verilog file. So effectively, for some designs, we have two variations - one with complex_dsp and one without complex_dsp. I think we want to protect both variations by adding tests that enable and disable the `define. How can we do that? It won't be a good idea to statically copy the design file, keeping the `define complex_dsp in one copy and removing it from the other. @sdamghan, @KennethKent, is it possible to add a feature to provide a `define from the odin config file?
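As a sketch of the mechanism being described (the module and port names here are hypothetical, not the actual Koios names), the guard typically looks like:

```verilog
// Hypothetical sketch of the `define-guarded DSP instantiation described
// above; dsp_slice, mac_unit, and the port names are illustrative only.
`define complex_dsp  // removed/commented out for the simpler arch variant

module mac_unit (
    input         clk,
    input  [17:0] a, b,
    output [36:0] out
);
`ifdef complex_dsp
  // Instantiate the architecture's DSP slice hard macro directly.
  dsp_slice u_dsp (.clk(clk), .a(a), .b(b), .o(out));
`else
  // Behavioral replacement so the design maps to simpler architectures
  // (e.g. the Stratix IV based flagship arch).
  reg [36:0] out_r;
  always @(posedge clk) out_r <= a * b;
  assign out = out_r;
`endif
endmodule
```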
Thanks Aman. Adding @kmurray and @ArashAhmadian to the conversation. I think we need to cut down what goes in vtr_reg_strong. The entire test is intended to take 15 minutes or less, rather than individual subtests taking less than 15 minutes. See https://docs.verilogtorouting.org/en/latest/dev/developing/#adding-tests -- the recommendation is to keep each test to 1 minute or less. We could just use one (smallest) koios design, run with a fixed channel width, etc. I think the key in that test is to check one of the more complex architectures/DSP blocks; we shouldn't need a big circuit for that. The nightly and weekly additions sound good. @ArashAhmadian is refactoring nightly into multiple sub-tests so we can run on multiple kokoro machines for CI and get higher throughput. I suggest we add the koios designs into a new nightly test (e.g. vtr_reg_nightly4) so we get another kokoro machine for them, and they don't slow down nightly. Alternatively, @ArashAhmadian: you can figure out how to update the various kokoro scripts to split up the runs in the vtr_reg_nightly task list across multiple machines (run one job per task, or run the first N on different machines and the last M on the final machine if we only have N+1 kokoro machines). This is basically Tim's suggestion from last week, except it ensures we never run out of kokoro machines (we just slow down). In this case we'd update the documentation to say slow tests should be listed first (and we'd reverse the task list processing so first gets run first).
Ah, I thought each test in vtr_reg_strong was supposed to be 15 mins. I realize that none of the Koios benchmarks runs in 1 minute or less. The shortest time I see is 5 mins (on an Intel(R) Xeon(R) CPU E5-2430 v2 @ 2.50GHz with 64GB RAM). It may run a bit faster (say 4 min) on one of the other arch variations, but it won't come down to 1 min. So, it looks like we have to have only nightly and weekly regression tests for Koios. What we can do is create a test.v benchmark under koios that only instantiates, say, the DSP slice from the architecture, and run it in vtr_reg_strong. That gives us some protection of the arch file, but not of any real koios benchmark. What do you think about this, @vaughnbetz?
Regarding this point... @ArashAhmadian, @vaughnbetz, is there something I need to do? I don't see vtr_reg_nightly1/2/3 folders in the repo. This "vtr_reg_nightly4" abstraction is something in the CI system and not in the repo, I guess?
@vaughnbetz , while we sort the regression aspects out, can you please review and provide feedback on the other aspects of the benchmarks, like the file names, folder paths, documentation, etc.? I've provided a link to the koios benchmark folder (https://github.com/verilog-to-routing/vtr-verilog-to-routing/tree/master/vtr_flow/benchmarks/verilog/koios) in the paper; we need that to be finalized before I submit the camera-ready version of the paper.
@aman26kbm I haven't made the PR for the refactoring of vtr_reg_nightly yet (vtr_reg_nightly is separated into three smaller suites only on the vtr_task_reorganization branch in my fork). I will make the request soon, most likely within the next two days. However, you could still add it in as "vtr_reg_nightly_test4" alongside vtr_reg_nightly, but for it to run on a kokoro machine, a .cfg file has to be added under https://github.com/verilog-to-routing/vtr-verilog-to-routing/tree/master/.github/kokoro/presubmit. The new .cfg file, say nightly_test4.cfg, will be almost identical to nightly.cfg, with the only difference being the value for VTR_TEST. Alternatively, when finalized, you could add the discussed regression tests under the current suites (vtr_reg_nightly and/or vtr_reg_weekly), and I will refactor them into vtr_reg_nightly_test4 before making my PR. Please let me know your thoughts. @aman26kbm @vaughnbetz
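For reference, a sketch of what such a .cfg might contain, assuming it mirrors nightly.cfg as described above (the exact keys and paths should be copied from the existing nightly.cfg rather than from this sketch):

```
# Hypothetical nightly_test4.cfg -- assumed to be identical to nightly.cfg
# except for the VTR_TEST value. The build_file path is illustrative.
build_file: "vtr-verilog-to-routing/.github/kokoro/run-vtr.sh"

env_vars {
  key: "VTR_TEST"
  value: "vtr_reg_nightly_test4"
}
```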
@aman26kbm In my opinion, there are a couple of options here. First, since we can read multiple Verilog files (although without having multiple top modules), I can add a terminal argument to read a `define from another input file. In this way, you would handle the usage of the `define by passing a second Verilog file. Another option is to provide a Verilog specifier like |
Thanks, @ArashAhmadian , I'd prefer the second option because I'm not familiar with the regression setup. But your PR will likely get submitted before this one; this PR's review will take some time. When this PR's other aspects have been reviewed, and if your PR has been checked in by then, I can change the files (to move the tests to the new regression folders) to avoid extra work for you.
I didn't know we could read in multiple Verilog files for one benchmark/design. That's great. I guess then I just refactor the "`define complex_dsp" line into another Verilog file and then no change should be required from you, right? I am confused about this: "I can add a terminal argument to read define from another input file". Having said that, option #2 is better, I think. I can imagine this being a common use case. Typically in designs, there are multiple `defines that can change the behavior of the design, so the ability to add a `define from the odin config file will be good. @vaughnbetz may have some more input here.
Thanks all.
We could add another option like: Seyed, you'd then modify run_vtr_task/run_vtr_flow to properly paste that define into a verilog file or to include that additional verilog file. I don't have a strong opinion about the implementation, although adding a new file using the script_params setting might be more powerful (you can do anything in that file). Does this look doable? Is one easier or more maintainable than the other?
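To make the proposal concrete, a task config.txt using such an option might look like the sketch below. The `-include` flag is the *proposed* option being discussed, not an existing one, and the circuit, arch, and defines-file names are illustrative:

```
# Hypothetical task config.txt sketch. circuit/arch names are illustrative,
# and -include is the proposed (not yet implemented) script_params option.
circuit_list_add=dla_like.small.v
arch_list_add=k6FracN10LB_mem20K_complexDSP_customSB_22nm.clustered.denser.xml
script_params=-include complex_dsp_defines.vh
```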
The vtr_reg_strong tests failed in CI. The error messages are: Error code 134 looks to be SIGABRT. I suspect you hit a timeout limit on vtr_reg_strong, which is further motivation to cut (a lot) the cpu time added to vtr_reg_strong.
Sounds good, thanks.
Looks good, and the directory structure looks fine. A few suggested commenting changes in the arch file and students to consider adding as benchmark contributors.
vtr_flow/arch/COFFE_22nm/k6FracN10LB_mem20K_complexDSP_customSB_22nm.clustered.denser.xml
Since we can pass multiple Verilog files to Odin, I think this is solvable without any changes to the scripts. For instance, we can keep the Verilog file (.v) fixed, and when an extra `define is needed we can pass it to Odin using a header file (.vh). In the following, I wrote an example of this scenario (you can find more in the keywords test of Odin's regression tests).
In this way, you would specify the complex_dsp macro in a header file, then using Odin's config file you would add it to the circuit list when it is needed. A script_params scenario is a reasonable option as well; please let me know if the above-mentioned approach does not work.
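The inline example referenced above did not survive in this thread. A minimal sketch of the idea (the file name and include guard are hypothetical) is a header that only turns the macro on, passed to Odin alongside the unmodified design file:

```verilog
// complex_dsp_defines.vh -- hypothetical header listed in Odin's config
// together with the unmodified design file; it only enables the macro.
`ifndef COMPLEX_DSP_DEFINES_VH
`define COMPLEX_DSP_DEFINES_VH
`define complex_dsp
`endif
```

Omitting the header from the circuit list would then give the behavioral (non-complex_dsp) variant without touching the design file itself.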
Adding @andrewboutros. @sdamghan : Thanks for the description of .vh files above. The part I am not sure about is how this goes through the regression test infrastructure, without Aman having to duplicate the benchmark circuit for fancy-DSP and non-DSP versions. The regtest infrastructure takes a circuit list, not a list of files within each circuit. Currently all the vtr tests I can see have one .blif or .v file per benchmark (we could generalize this, but someone would have to write/update the scripts). Can you go further in your example to show how the extra file gets included in the benchmark run from a config.txt -- i.e. exactly how is the benchmark organized and the config.txt file written to make this happen? I am not sure of how to make this happen without adding a new option to run_vtr_task/run_vtr_flow as detailed above. I might be missing something, but I (and I think Aman) need more of a tutorial of exactly how to do it through the regtest infrastructure in that case. |
Based on my recent look at the Koios benchmark Verilog files, the definition of the complex_dsp macro differs slightly from test to test. Concerning adding -add_define or -add_verilog_file to script_params, we have two different scenarios for run_vtr_task and run_vtr_flow.
To recap: I think that by adding an option like the one mentioned to script_params or to the Odin config file, we will still have to create more files to hold the definition of the complex_dsp macro. In addition, running a VTR task would be more complicated, since we would need to specify how the extra Verilog files (the complex_dsp definitions) relate to the original circuits. Link to the PR: https://github.com/aman26kbm/vtr-verilog-to-routing/pull/4/files
Sorry for the delay. I was tied up in some other work. @sdamghan , thanks for the details. Basically the idea (let's call it method #2) is to have a set of "generic" circuits and then have a set of "complex_dsp" circuits that just define the complex_dsp macro and `include the "generic" circuit file. Right? A few things to mention...
In the approach Dr. Betz suggested (let's call it method #1), I agree that if complex_dsp is to be defined, it will be defined for all designs in a task. But that's okay: `defining complex_dsp for a design that doesn't use the complex_dsp macro is fine; it's just a NOP and doesn't do anything. So, method #1 also satisfies our requirements. The main difference between the two methods is that method #1 is more dynamic. In method #1, we will have multiple "generic" circuit files and an additional file that will just have the `define (it can have many other `defines or other stuff). This additional file will not have the `include of the design file. We could potentially have more than one additional file with different `define settings. So, when running, we would just provide the additional file in the task config through the additional command line arg (to script_params). This file will apply to all files in the task. Method #2 is more static. If we have different `defines to provide, we will need to create more files - one file per benchmark (even if the same `define is to be applied to multiple files). It will also involve more changes to the task file (changing the file name in multiple places). So, it's a little more work. I prefer method #1 over method #2. Please provide your thoughts. @vaughnbetz , acking the review comments. I am planning to work on them today, including adding a dummy/test file in vtr_reg_strong. Most likely, I will submit a commit today to address these.
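For comparison, a method #2 wrapper file (file names hypothetical) is just a static two-line shim created per benchmark:

```verilog
// dla_like.small.complex_dsp.v -- hypothetical method #2 wrapper: it
// defines the macro and pulls in the unmodified generic circuit file.
`define complex_dsp
`include "dla_like.small.v"
```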
@aman26kbm thanks for the clarification. |
Thanks for the detailed explanation @sdamghan . I agree that approach (#2) can also work. We'd wind up with longer circuit lists in the config.txt so we test both the with-complex_dsp and without-complex_dsp cases, but that's OK. It would let us test both variants in one config.txt if we wanted, and if not, we'd just list the changed circuits in the second (complex_dsp) run. If we use the `define passed to run_vtr_task/run_vtr_flow, we should still cut down the circuit list in the run that sets the define, as it is a waste of cpu to run the designs that don't have complex dsps again (we'd just get the same answer as in the generic run). So I'm OK with either approach. Approach #2 doesn't require infrastructure changes (I think), while approach #1 does, but gives us a dynamic behaviour that could be useful if we later wind up with combinations of defines. I'll let those doing the actual work make the decision :).
2. Adding a new test design that we can use in the vtr_reg_strong regression test. 3. Moved all tests from vtr_reg_strong to vtr_reg_nightly; the time taken by the tests in vtr_reg_strong was more than we want. 4. Updated the list of contributors to include Helen and Zach, who worked on the DLA and CLSTM designs in the past.
Hi @ArashAhmadian , I guess the PR you're working on is this: #1728, right?
Hi @aman26kbm. Yes, that is the PR. I have made a PR (#1776) to update the doc to mention the parallelism strategy and give detailed steps for adding new nightly suites (waiting for review). Feel free to refer to it once #1728 has landed. Let me know your thoughts.
Adding step to gzip results from vtr_reg_nightly_test4 to vtr-test.sh file.
I have added changes that move the koios benchmark nightly tests to vtr_reg_nightly_test4. @ArashAhmadian , can you please review my changes?
Just to make sure, does koios_multi_arch take longer than koios? We should have the task with the longest flow runs at the bottom of the list so it starts at the beginning of the CI run.
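A hypothetical task list illustrating this ordering rule (the paths shown are illustrative, not the actual entries):

```
# Hypothetical task_list.txt: the list is consumed from the bottom, so the
# longest-running task (koios) goes last to start at the beginning of CI.
regression_tests/vtr_reg_nightly_test4/koios_multi_arch
regression_tests/vtr_reg_nightly_test4/koios
```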
Oh.. koios_multi_arch is shorter than koios. Lemme change the order and submit.
…e task with the longest flow runs at the bottom of the list so it starts at the beginning of the CI run. In this case, koios task takes much longer than koios_multi_arch.
Sounds good. Files (configs and task lists) are read in backwards by |
…k is now in the Koios benchmarks (named gemm_layer.v). Also, removing the synthesis results for 3 benchmarks (tpu.16x16, tpu.32x32 and matmul_8x8_fp16) from the ODIN-II synthesis regression golden results. These benchmarks are no longer present in the vtr_flow/benchmarks/verilog directory. They are now in the Koios suite under different names.
@sdamghan , my question is: do we want to add separate ODIN regression tests for the Koios benchmarks in |
@aman26kbm It would be great if we had a task specifically for the Koios benchmarks in the ODIN_II regression tests as well. The ODIN_II task config file is pretty much like the VTR flow one.
Ok. Thanks. I've added the koios benchmarks (only the ones that are in the VTR nightly) to ODIN's heavy suite (the same suite which had the tpu and gemm designs).
@vaughnbetz , can you please approve this PR? The only pending item is to add tests for the designs without "complex_dsp"; I am going to do that as a separate PR.
@aman26kbm : can you summarize the wall clock time and cpu time that's been added to the strong, nightly, and weekly tests? It's good to summarize to make sure it won't cause us any problem, and in case we have any throughput issues we can look it up in this PR. |
Just looked at the logs. |
Thanks. Strong definitely sounds fine :). |
For weekly, I used "-j 4" as mentioned in the developer README. So, the time reported is the wall clock time taken when running 4 jobs in parallel (on 4 cpus on a multi-cpu machine). |
Thanks Aman. I think that'll be OK, as the weekly throughput isn't as critical. This is worth posting as a table in the Thursday meeting, and @ArashAhmadian , if you could give the numbers for the current nightly and weekly for comparison, that would be ideal.
I ran nightly again. It finished in 2 hours. The large difference compared to the older result of 5 hours is because I had moved one long test into the weekly suite a few commits ago. So, we are good. The regression time will increase when I add the tests for designs without complex_dsp, but I think we will still be under the prescribed limits for each suite.
I'll post the data for the various suites used for Koios in the README of these benchmarks. I also need to add some information to the README about the new usage of complex_dsp (adding the additional defines file). I'll do that in a new PR.
Description
Adding files related to Koios benchmarks. Here's a summary of the files:
Motivation and Context
Adding a set of DL benchmarks for FPGA arch and CAD research
How Has This Been Tested?
I need to add tests. Will work with Vaughn on identifying the tests to add.
Types of changes
Checklist: