
Synthesis Puzzles

The more I study the behavior of the Verilog synthesis tool, the less I understand it. I decided to go back to square one with my design, and start by implementing a basic 8-bit counter that can be reset, loaded, incremented, or decremented. Here’s the source:


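// 8-bit counter with an active-low asynchronous reset.
// The 4-bit state input selects load, increment, or decrement;
// any other state value leaves q unchanged.
module counter
  (input clk,
  input reset,             // active low, asynchronous
  input [3:0] state,       // command code, see the localparams below
  input [7:0] d,           // value loaded when state == load
  output reg [7:0] q);
  
  localparam [3:0] load = 4'b0000,
      inc = 4'b0101,
      dec = 4'b1111;
  
  always @(posedge clk or negedge reset) begin
    if (!reset)
      q <= 0;              // asynchronous clear
    else if (state == load)
      q <= d;
    else if (state == inc)
      q <= q + 1'b1;
    else if (state == dec)
      q <= q - 1'b1;
    // no final else: q holds its value for all other state codes
  end
endmodule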

If I synthesize that using Altera Quartus II v9.0, with a MAX 7000S series target device (EPM7128SLC84-15) and the optimization set for minimum area, it uses 10 macrocells. 10 macrocells for an 8-bit counter. The “Timing Closure Floorplan” view shows which macrocells were used and the equations implemented in each one. It turns out that the low 6 bits of the counter each fit in a single macrocell, using a T-type flip-flop with four product terms per bit, and the equations follow a recognizable pattern. But for unknown reasons, the software implements the top two bits completely differently, using D-type flip-flops, an extra macrocell per bit for additional product terms, and one shared expander term. If you’ve got Quartus II Web Edition installed, you can easily confirm this yourself.

I couldn’t understand why the software didn’t just follow the pattern of the first 6 bits for the 7th and 8th bits too. As far as I could tell, there was no limit on the number of product terms or inputs that it would run into. I decided to try it myself, using Altera primitives to explicitly specify a T-type flip-flop for all 8 bits and writing out the exact logic equations for each bit. You can view the Verilog code here: counter_v2.v
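For readers who don't want to open the file, here is a rough sketch of the idea in portable Verilog. It is not the contents of counter_v2.v: it models the T-type flip-flops behaviorally (q <= q ^ t) instead of instantiating Altera primitives, and the module name, the is_load/is_inc/is_dec wires, and the all_ones/all_zeros helpers are names made up for illustration. The toggle equation is one plausible way to get four product terms per bit, assuming a bit toggles on load when it differs from d, on increment when all lower bits are 1, and on decrement when all lower bits are 0.

```verilog
// Illustrative sketch only, not the author's counter_v2.v.
// Same ports and state codes as the original counter module.
module counter_tff_sketch
  (input clk,
   input reset,              // active low, asynchronous
   input [3:0] state,
   input [7:0] d,
   output reg [7:0] q);

  localparam [3:0] load = 4'b0000,
      inc = 4'b0101,
      dec = 4'b1111;

  wire is_load = (state == load);
  wire is_inc  = (state == inc);
  wire is_dec  = (state == dec);

  reg [7:0] t;               // per-bit toggle enables
  reg all_ones, all_zeros;   // running AND of the lower bits
  integer i;

  // Four product terms per bit:
  //   load and d=1, q=0   (bit must rise)
  //   load and d=0, q=1   (bit must fall)
  //   inc  and all lower bits are 1 (carry reaches this bit)
  //   dec  and all lower bits are 0 (borrow reaches this bit)
  always @* begin
    all_ones  = 1'b1;
    all_zeros = 1'b1;
    for (i = 0; i < 8; i = i + 1) begin
      t[i] = (is_load &  d[i] & ~q[i])
           | (is_load & ~d[i] &  q[i])
           | (is_inc  & all_ones)
           | (is_dec  & all_zeros);
      all_ones  = all_ones  &  q[i];
      all_zeros = all_zeros & ~q[i];
    end
  end

  // Behavioral T flip-flops with asynchronous clear.
  always @(posedge clk or negedge reset) begin
    if (!reset)
      q <= 8'b0;
    else
      q <= q ^ t;
  end
endmodule
```

Whether a synthesis tool maps this onto T-type flip-flops the way hand-instantiated primitives would is not guaranteed; the sketch is only meant to show the shape of the per-bit equations.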

The new version worked fine. In fact, it worked better than fine. It fit in 8 macrocells instead of 10, and used no shared expanders or other magic. And it was not only smaller, it was also faster. The software computed a maximum speed of 76.92MHz, compared to only 45.45MHz for the first version.
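One way to check that a rewritten counter still behaves like the original is a small self-checking testbench. The sketch below is an assumption about how such a check could be done, not something from the original project: the testbench name, the step task, and the expected reference register are all hypothetical. It instantiates the counter module shown earlier; to test a replacement, change the module name on the dut line.

```verilog
`timescale 1ns/1ps
// Hypothetical self-checking testbench for the counter module.
module counter_tb;
  localparam [3:0] LOAD = 4'b0000,
      INC  = 4'b0101,
      DEC  = 4'b1111,
      IDLE = 4'b0001;          // any unused code: counter should hold

  reg clk = 1'b0;
  reg reset = 1'b0;            // start with reset asserted (active low)
  reg [3:0] state = IDLE;
  reg [7:0] d = 8'h00;
  wire [7:0] q;

  reg [7:0] expected = 8'h00;  // simple reference model
  integer errors = 0;
  integer n;

  counter dut (.clk(clk), .reset(reset), .state(state), .d(d), .q(q));

  always #5 clk = ~clk;        // 10 ns clock period

  // Apply one command for one clock cycle, then compare q to the model.
  task step(input [3:0] s, input [7:0] din);
    begin
      @(negedge clk);
      state = s;
      d = din;
      case (s)
        LOAD: expected = din;
        INC:  expected = expected + 8'd1;
        DEC:  expected = expected - 8'd1;
        default: ;             // hold
      endcase
      @(posedge clk);
      #1;
      if (q !== expected) begin
        errors = errors + 1;
        $display("MISMATCH state=%b q=%h expected=%h", s, q, expected);
      end
    end
  endtask

  initial begin
    #12 reset = 1'b1;          // release reset
    step(LOAD, 8'hFE);
    step(INC,  8'h00);         // FE -> FF
    step(INC,  8'h00);         // FF -> 00 (wraps)
    step(DEC,  8'h00);         // 00 -> FF (wraps)
    step(IDLE, 8'hAA);         // unused code: value must hold
    for (n = 0; n < 300; n = n + 1)
      step(INC, 8'h00);        // run through every carry pattern
    for (n = 0; n < 300; n = n + 1)
      step(DEC, 8'h00);        // and every borrow pattern

    reset = 1'b0;              // asynchronous clear, no clock edge needed
    #1;
    if (q !== 8'h00) begin
      errors = errors + 1;
      $display("MISMATCH after reset: q=%h", q);
    end

    if (errors == 0)
      $display("PASS");
    else
      $display("FAIL: %0d mismatches", errors);
    $finish;
  end
endmodule
```

This only checks functional behavior in simulation. It says nothing about macrocell count or timing, which still have to come from the fitter report.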


6 Comments so far

  1. Tom - April 2nd, 2010 3:59 pm

    It could be that the synthesis tools are tripping over the “if … else if … else if …” statement – that kind of statement contains priority. The tools may be trying to preserve that command priority.

    Another possibility is the reset – your newer version has an asynchronous reset and the previous version only resets on a negative reset edge.

  2. Steve - April 2nd, 2010 5:21 pm

    Hmm, I don’t think so. The reset behavior is the same in both: asynchronous reset. The if/else logic creates a mux in both cases, but doesn’t explain why bits 7 and 8 are implemented differently from bits 1-6 in the first case. It’s still a mystery to me.

  3. David - April 2nd, 2010 8:08 pm

    Try re-writing it using a “case” statement. Case statements use less logic than “if..else” statements. This is due to the fact that “if..else if” constructs imply priority and are synthesized to enforce this. In contrast, “case” statements assume mutually exclusive control inputs – this tends to create more multiplexer style logic compared to logic trees for “if..else”.

    This link from Xilinx (http://www.xilinx.com/itp/xilinx4/data/docs/sim/coding5.html) describes this. A similar thing probably applies to Altera.

  4. Erik Petrich - April 3rd, 2010 12:30 am

    I was concerned about the apparent disparity of the edge vs level sensitive reset too, but after thinking about it a while you have convinced me it really is level sensitive despite the “negedge reset”. I’m looking at your Verilog from the perspective of someone who mostly works in VHDL, so having the edges specified in the sensitivity list is already a bit unsettling. (But I’m not taking sides on which is “better”.)

    I have two stab-in-the-dark suggestions. The first is to try changing the 0 to 1’b0 so that all of the values assigned to q will be unsigned. As an integer constant, the 0 would normally be considered a signed datatype. I don’t think this should make a difference, but it might affect which optimization opportunities the compiler recognizes.

    The other thought is to increase the size of the counter and see if the complication stays with the two most significant bits or all the bits from bit 6 and higher. If it’s always related to bits 6 and higher then I would look through the options to see if there are any that relate to product term complexity or macrocell fan-in and experiment with them. At least with the Xilinx tools, optimizing for area doesn’t actually guarantee optimal area but instead activates heuristics that normally lead to smaller area, but the exact outcome can still be dependent on other optimization settings.

  5. Steve - April 3rd, 2010 4:22 pm

    Thanks for all the thoughtful comments. Some more info:

    – To be clear, the synthesis result for the low 6 bits is the same for either case. My hand-made solution in case 2 is exactly what the synthesis software does for case 1, for those bits.

    – The flip-flops have a hardwired async clear input, and in both cases the software correctly recognizes that the reset signal should be connected to this. Just to be sure I tried changing the q <= 0 assignment, but it makes no difference.

    – If..else implies a priority encoder when the clauses aren’t mutually exclusive. In this example, the reset input is definitely supposed to have priority. The other three tests of state are mutually exclusive, though, so either case or if..else work equally well. From the Xilinx doc that David referenced: “Most current synthesis tools can determine if the if-elsif conditions are mutually exclusive, and will not create extra logic to build the priority tree.”

    – Just to be sure, I tried changing the last three if clauses into a case statement, and it produced an even worse result: 15 macrocells. Ugh. Not sure why.

    – If you’re keeping score, when I optimize for:
    * speed: 10 mc’s, 6 shared exp, 2 parallel exp, 47.6MHz
    * balanced: 8 mc’s, 4 shared exp, 76.9MHz
    * area: 10 mc’s, 1 shared exp, 45.5MHz
    * my custom version: 8 mc’s, 76.9MHz

    – Following Erik’s second suggestion, it seems that optimizing for area activates a heuristic that tries to keep the number of signals per product term from exceeding 10. I was able to demonstrate this by making the counter larger as he suggested, and also by changing the number of bits in the state code. I can think of no good reason for this heuristic, other than that it may help reduce routing congestion in larger designs. Generally speaking, the three optimization settings seem to do a poor job, since “speed” is nearly the slowest and “area” is tied for largest size.

  6. Steve - April 3rd, 2010 4:40 pm

    I also tried the same Verilog code using the Xilinx tools, targeting a similar device (XC95108-7-PC84), and it used 23 macrocells. Urp!
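
For anyone following the thread, the case-statement variant discussed in comments 3 and 5 might look roughly like the sketch below. The exact code that was tried is not shown in the comments, so this is an assumption: the module name is hypothetical, and the explicit default branch is one possible choice (leaving it out would also hold q in a clocked block). The reset test stays outside the case because it is meant to have priority.

```verilog
// Illustrative sketch of a case-based version; not the code from the thread.
module counter_case_sketch
  (input clk,
   input reset,              // active low, asynchronous
   input [3:0] state,
   input [7:0] d,
   output reg [7:0] q);

  localparam [3:0] load = 4'b0000,
      inc = 4'b0101,
      dec = 4'b1111;

  always @(posedge clk or negedge reset) begin
    if (!reset)
      q <= 8'b0;             // reset keeps priority over everything else
    else begin
      case (state)
        load:    q <= d;
        inc:     q <= q + 1'b1;
        dec:     q <= q - 1'b1;
        default: q <= q;     // all other codes: hold the current value
      endcase
    end
  end
endmodule
```

In simulation this behaves the same as the if/else version, since the three state comparisons are mutually exclusive. How many macrocells a given tool produces for it is a separate question that only a fitter run can answer.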
