Skip to content

Unexpected timing constraint behavior on global clock buffers #1013

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
litghost opened this issue Oct 23, 2019 · 16 comments
Closed

Unexpected timing constraint behavior on global clock buffers #1013

litghost opened this issue Oct 23, 2019 · 16 comments

Comments

@litghost
Copy link
Collaborator

Expected Behaviour

Constraining the input clock pin should propagate through global clock buffers

Current Behaviour

Constraining the input clock pin results in all other clock buffers being ignored.

Possible Solution

  1. Do not supply an SDC file at all
  2. Provide VPR constraints for all clocks (including dependent clocks)
  3. ???

Steps to Reproduce

Context

We are introducing global clock buffers into the 7-series arch, and VPR sees this as two clocks, the clock from the input clock pin, and the clock from the output of the global buffer. Initial testing with global buffers went well, however no sdc file was supplied. When an sdc was supplied that only created a clock on the input pin, but not the global clock buffer net, it caused all timing analysis on BELs that were connected to the global buffer to report no launch clock. This resulted in both placement and routing having a NaN critical path delay.

Your Environment

  • VTR revision used:
  • Operating System and version:
  • Compiler version:
@litghost
Copy link
Collaborator Author

@vaughnbetz / @kmurray I'm guessing this behavior is caused by how VPR constraints clocks. If no SDC is supplied, does VPR generate constraints? And then if an SDC is supplied, VPR generates no constraints? Should VPR always generate the "default" constraints for unconstrained clocks, even in the presence of an SDC file? Should VPR generate better diagnostics if generating the "default" clock constraint is undesirable?

@vaughnbetz
Copy link
Contributor

I strongly believe generating clock constraints by default when there is an SDC is highly undesirable. We constrain clocks to go as fast as possible when there is no sdc as a convenience feature. Once there is an sdc, we obey it. [FYI: this approximately matches what we did in Quartus, and from what I can observe approximately matches what Xilinx does, although both those tools do a "light" optimization to save cpu time with no sdc]. Mixing in default (go as fast as possible on other, unconstrained clocks) when you have an sdc constraining some is very counter-intuitive. Cross-clock domain paths would be another random default (once you have an sdc, whether or not those paths are optimized and how much is defined by exactly how you define and relate your clocks.).

So the immediate workaround for this is to define your sdc to make a single base clock, and a derived clock that is 1x related to it, and apply the derived clock to the global buffer in your sdc. Or if the original input clock doesn't do anything interesting, you could just constrain the output of the global buffer.

To make this more convenient and to capture the delay of the global clock buffer in the clock delay (useful if you are also setting timing constraints between I/Os and this clock), it would be best for the timing graph builder to understand that when clocks go through buffer blocks they remain part of the clock tree and the clock hasn't hit leafs yet. Kevin, how hard would that be to do in a data-driven way? It's possible the code already does this if the clock buffer is defined as a combinational input->output block (not as a clock port input that would indicate there is a register in it).

@litghost
Copy link
Collaborator Author

litghost commented Oct 23, 2019

I do not disagree with anything you have written, however an error in sdc constraints resulting in:

Initial placement cost: 1 bb_cost: 2249.61 td_cost: 0
Initial placement estimated Critical Path Delay (CPD): nan ns
Initial placement estimated setup Total Negative Slack (sTNS): 0 ns
Initial placement estimated setup Worst Negative Slack (sWNS): 0 ns

Initial placement estimated setup slack histogram:
Placement contains 10 placement macros involving 47 blocks (average macro size 4.700000)

------- ------- ---------- ---------- ------- ---------- -------- ------- ------- ------ -------- --------- ------
      T Av Cost Av BB Cost Av TD Cost     CPD       sTNS     sWNS Ac Rate Std Dev  R lim Crit Exp Tot Moves  Alpha
------- ------- ---------- ---------- ------- ---------- -------- ------- ------- ------ -------- --------- ------
Warning 109: Starting t: 0 of 395 configurations accepted.
0.0e+00   0.785    1039.44 0              nan          0    0.000   0.204  0.0826  161.0     1.00       2897   -nan

BB estimate of min-dist (placement) wire length: 165277

Completed placement consistency check successfully.

Swaps called: 3292

Placement estimated critical path delay: nan ns
Placement estimated setup Total Negative Slack (sTNS): 0 ns

With the only indication of failure is the CPD = NaN and the following message:

Warning 107: 18 timing startpoints were not constrained during timing analysis
Warning 108: 4884 timing endpoints were not constrained during timing analysis

There ought to have a been a message akin to:

4884 timing endpoints were not constrained because clock X was not defined/constrainted

or something. The only reason I figured out how to get past this issue is the fact that no SDC => everything was fine, SDC only on input clock + global clock buffer => borked, SDC on input clock w/o global clock buffer => everything was fine.

@mithro
Copy link
Contributor

mithro commented Oct 23, 2019

I think there are a couple of things here;

  1. It would be good for VPR to transfer timing constraints through "simple" black box devices like clock buffers.
  2. If providing a timing constraints file (IE An SDC file) then you should provide constraints for all clocks in your system. It seem reasonable that this is an error not just a warning. If you want a clock to be unconstrained or "as fast as possible" it makes sense to explicitly require you to say that.
  3. The documentation should explain how to apply a default "as fast as possible" to all clock paths and then override specific clock paths with specific frequencies.
  4. VPR should explain which clocks are missing timing constraints.

@vaughnbetz
Copy link
Contributor

vaughnbetz commented Oct 24, 2019

Agreed on 1 and 4. Both will take some development, which this issue can track.

3 is done: see https://docs.verilogtorouting.org/en/latest/vpr/timing_constraints/#default-timing-constraints (explains the defaults and the equivalent SDC constraints)

2: I disagree that this should be an error. Not specifying a constraint on a clock, or on paths between clocks, or on some I/O paths means you do not care about their speed. This happens on synchronizer-based cross-clock domain paths for example, and is perfectly reasonable for them (there is no timing constraint -- they will work at any speed). It is the common CAD tool behaviour to allow this. I agree that we should have a missing constraint report / unconstrained path report though (#4 above), as that helps flag if you unintentionally left some paths out.

@vaughnbetz
Copy link
Contributor

vaughnbetz commented Oct 24, 2019

1 is the most important, and shouldn't be that hard as Kevin's code already traces through .names buffers (he thinks). Keith, can you attach a small test case (arch file, small example design, and point us at the exact global clock buffer syntax / definition in both the architecture and .blif files).
I think -- and Kevin seems to agree although he's less sure :) -- we should flip our clock propagation logic. Instead of propagating a clock only when we have a .names that we know is a buffer, we propagate through anything that is combinational, assuming it is a buffer (with an appropriate message). That will usually be correct, and if it isn't, the end user will have to set more detailed sdc constraints on the net coming out of that block (and the user can do that).

@vaughnbetz
Copy link
Contributor

vaughnbetz commented Oct 24, 2019

4 is a good ramp up or summer student task. Will look for an owner. Asking @kmurray (nicely!) if he can squeeze in the solution to 1 above.

@mithro
Copy link
Contributor

mithro commented Oct 25, 2019

@vaughnbetz - I agree that with #2 most CAD tools let you forget a constraint on a clock it'll just continue happily. I have lost many hours to this behaviour. This also frequently trips up beginners. It would be perfectly fine to have a --ignore-missing-timing-constraints type flag.

@kmurray
Copy link
Contributor

kmurray commented Oct 29, 2019

I've implemented (1). It was somewhat more involved as it required some changes to how the timing graph is constructed and how Tatum handles propagating clocks.

The new behaviour is as follows.
For a white-box .names, we track back from the output pin to the input pin (this captures .names buffers if they weren't already swept from the netlist).
For black-box primitives we trace back to the clock pins of the primitives, this allows the user to control which pins are considered during the traceback (by specifying the relevant inputs as clock pins).

For instance consider a black-box clock buffer with an enable:

        .subckt clkbufce \
            in=clk3 \
            enable=clk3_enable \
            out=clk3_buf

Which would have a corresponding model:

        <model name="clkbufce">
          <input_ports>
            <port name="in" combinational_sink_ports="out" is_clock="1"/>
            <port name="enable" combinational_sink_ports="out"/>
          </input_ports>
          <output_ports>
            <port name="out"/>
          </output_ports>
        </model>

and pb_type:

        <pb_type name="clkbufce" blif_model="clkbufce" num_pb="1">
          <clock name="in" num_pins="1"/>
          <input name="enable" num_pins="1"/>
          <output name="out" num_pins="1"/>
          <delay_constant max="10e-12" in_port="clkbufce.in" out_port="clkbufce.out"/>
          <delay_constant max="5e-12" in_port="clkbufce.enable" out_port="clkbufce.out"/>
        </pb_type>

where the in port is noted to be a clock port, which is combinationally connected to the out port.

When tracing back the clock net 'clk3_buf', VPR will now look at the
upstream (combinationally connected) clock port 'in' and correctly identify the corresponding net 'clk3' as the same logical clock.

Notably, since only the in port has been marked as a clock in the architecture, the other combinationally connected upstream port (enable) is not detected as an upstream clock net.
This is a simple way for the end-user to specify how clocks should propagate through black boxes, while avoiding having to teach VPR about the exact logical behaviour of each black box.

The timing graph builder, delay calculators and tatum needed some small tweaks to propagate clocks during analysis between CPIN and OPIN nodes in the timing graph. It all appears to be working, and you can see more details of how it is used in the associated regression test.

@vaughnbetz
Copy link
Contributor

vaughnbetz commented Oct 29, 2019

Thanks Kevin! This looks great.
I think your example above should go in the architecture file documentation somewhere -- what do you think?

@litghost
Copy link
Collaborator Author

litghost commented Oct 29, 2019

Thanks Kevin! This will be very useful.

There is one more point of subtlety that is worth discussing. The 7-series global clock buffers can act as clock muxes, but they are rarely use that way. If there are two clock inputs and one output, what does your new logic do? And if only one of the two clock inputs are provided, what do you do?

@kmurray
Copy link
Contributor

kmurray commented Oct 29, 2019

If there are two clock inputs and one output, what does your new logic do?

For something like:

.subckt clkmux \
    clk1=clka \
    clk2=clkb \
    sel=select \
    clk_out=clk_downstream

It would detect clka and clkb as separate clocks. Any downstream paths using clk_downstream would be analyzed for both clka and clkb.

For automatically generated constraints this means clocks would be created for both clka and clkb, and the downstream logic would be analyzed for both clocks. The automatically generated constraints do not analyze cross-domain paths so you wouldn't see any paths clocked by clk_downstream crossing between clka/clkb (or vice versa).

If you manually specified an SDC file with create_clock commands for clka and clkb you would again see the downstream logic analyzed for both clocks. However you'd also want to avoid analyzing clka->clkb and clkb->clka paths (since only one can propagate through the clkmux at a time). This is exactly what the set_clock_groups -exclusive SDC command is for. This also matches standard behaviour in other tools.

And if only one of the two clock inputs are provided, what do you do?

The tracing is driven by the netlist connectivity, so if only one of the clock inputs is left disconnected:

.subckt clkmux \
    clk2=clkb \
    sel=select \
    clk_out=clk_downstream

VPR would see that clk_downstream and clk_b are logically equivalent, so clkb would propagate to the downstream logic. Since the clk1 port is disconnected no additional clock would be inferred.

@kmurray
Copy link
Contributor

kmurray commented Oct 29, 2019

I think your example above should go in the architecture file documentation somewhere -- what do you think?

Agreed, I'll work on getting that in the docs.

@vaughnbetz
Copy link
Contributor

That clock muxing behaviour all sounds good. Thanks Kevin.

@kmurray
Copy link
Contributor

kmurray commented Oct 29, 2019

I've updated the timing modelling tutorial to include these examples.

Closing as I believe the primary issue is now addressed.

@kmurray kmurray closed this as completed Oct 29, 2019
@vaughnbetz
Copy link
Contributor

Looks great; thanks Kevin!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants