Verilog Examples Synthesized
I decided to take the advice I gave myself in the comments of my previous post and actually synthesize the three Verilog adder examples to see what would happen. I tried each example under Quartus II Web Edition 9.0, set to optimize for area. I made a, b, c, d0, d1, and d2 each 8 bits wide.
1. 44 macrocells. Yes, it created 3 separate dedicated adders. The RTL showed three registers for d0, d1, d2, each with a mux leading into it, as well as the adders and a single decoder for state. The Technology Map Viewer showed 24 mc’s used by the registers, and 20 mc’s total by the three adders.
2. This design is broken. By not specifying default values for in1 and in2, the software inferred a latch for them in the hypothetical s3 state. After fixing that, the design consumed 52 macrocells. Again the RTL showed three registers for d0, d1, d2, each with a mux leading into it. It showed a single adder, with 2 cascaded muxes at each adder input. It also showed a decoder and a stray OR gate. The Technology Map Viewer showed 24 mc’s used by the registers, 27 by the single adder, and 1 more that I couldn’t exactly account for, maybe part of one of the muxes.
3. This design is also broken in the same way as #2. There’s also a copy-paste error in the enable signal in the s2 state. After fixing those mistakes, the design consumed 52 macrocells. The RTL looked very similar to #2, and the Technology Map Viewer output was identical to #2.
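The latch in #2 and #3 comes from a combinational block that doesn't assign in1 and in2 on every path. Here is a minimal sketch of the usual fix, which is to assign defaults at the top of the block. Only in1, in2, a, b, c and the idea of three states come from the examples. The module name, port list, state encoding, and the a + c and b + c pairings are assumptions for illustration, not the code from the previous post.

```verilog
// Hypothetical sketch: module name, ports, and state encoding are
// assumptions, not the code from the previous post.
module adder_input_select (
    input  wire [1:0] state,
    input  wire [7:0] a, b, c,
    output reg  [7:0] in1, in2
);
    localparam S0 = 2'd0, S1 = 2'd1, S2 = 2'd2;

    always @(*) begin
        // Defaults first: every path through the block now assigns
        // in1 and in2, so the unused fourth state cannot infer a latch.
        in1 = a;
        in2 = b;
        case (state)
            S0: begin in1 = a; in2 = b; end  // a + b
            S1: begin in1 = a; in2 = c; end  // a + c (assumed pairing)
            S2: begin in1 = b; in2 = c; end  // b + c (assumed pairing)
            default: ;                       // keep the defaults
        endcase
    end
endmodule
```

An explicit default branch that assigns both signals would work just as well. The point is only that no state leaves in1 or in2 unassigned.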
There’s a lot to investigate further here, such as how the single adder in #2 could require 27 mc’s when the three adders in #1 only require a combined total of 20 mc’s. But the major conclusion is that all my attempts at “improving” the design only made the results worse.
3 Comments so far
I have some ideas why the examples with a single adder required more macrocells than the example with three adders. The single adder designs are actually more general, and could theoretically do other operations beyond the three specific add-and-stores that are needed.
For example, with an A/B mux at one adder input, and a B/C mux at the other, they can compute B+B, which the three-adder design cannot. They also have more flexibility in routing the sum to any of the D registers. The four possible sums can be stored in any of the three D registers, allowing for 12 different operations total. The three-adder design is hard-wired for 3 specific operations, and that’s it. Because they are unintentionally more general than was required, the one-adder designs require more macrocells to implement.
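To make the "more general than required" point concrete, here is a rough sketch of a single-adder datapath with an A/B mux on one input and a B/C mux on the other. The module name and the sel1, sel2, and load signals are hypothetical, not the names from the original examples.

```verilog
// Hypothetical sketch of the single-adder datapath described above.
// Select and load signal names are assumptions.
module shared_adder (
    input  wire       clk,
    input  wire       sel1,      // 0: a, 1: b
    input  wire       sel2,      // 0: b, 1: c
    input  wire [2:0] load,      // one load strobe per d register
    input  wire [7:0] a, b, c,
    output reg  [7:0] d0, d1, d2
);
    wire [7:0] in1 = sel1 ? b : a;
    wire [7:0] in2 = sel2 ? c : b;
    wire [7:0] sum = in1 + in2;  // a+b, a+c, b+b, or b+c

    always @(posedge clk) begin
        if (load[0]) d0 <= sum;
        if (load[1]) d1 <= sum;
        if (load[2]) d2 <= sum;
    end
endmodule
```

With one destination loaded at a time, the 4 possible sums times 3 destination registers give the 12 operations counted above. Unless the synthesizer can prove that some select and load combinations never occur, it has to build logic for all of them.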
Looking at the details, in the three-adder example, computing D0 <- a + b, part of the a+b logic is actually implemented inside the D0 macrocells. That's a nice space savings, and is possible only because D0 is never loaded with anything other than a+b in that example. The single-adder designs can't use that trick.
Mostly unrelated, but I've also noticed that the Altera synthesis software never seems to make use of the load enable input on flip-flops. Instead, it always creates a mux at the flip-flop input, where one of the mux possibilities is the FF's current value. Then the FF is loaded unconditionally on every clock edge. Yes, loading the FF with the same value is functionally the same as not loading it at all, but it seems strange to me. Why have that ENA control signal on FF's at all then? I wonder if this is actually better than using the load enable in some way, maybe shorter clock-to-Q times? It certainly makes the RTL diagrams more cluttered though.
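For reference, here are the two register styles side by side. This is an illustrative sketch with made-up module and signal names. Which hardware structure a given tool actually builds from each description depends on the tool and the target device.

```verilog
// Two ways to describe the same registered behavior.
// All names here are hypothetical.
module ff_enable_styles (
    input  wire       clk,
    input  wire       load,
    input  wire [7:0] sum,
    output reg  [7:0] q_ena,
    output reg  [7:0] q_mux
);
    // Style 1: clock-enable form. The register holds when load is low.
    always @(posedge clk)
        if (load) q_ena <= sum;

    // Style 2: explicit feedback mux, loaded on every clock edge.
    wire [7:0] d_mux = load ? sum : q_mux;
    always @(posedge clk)
        q_mux <= d_mux;
endmodule
```

Starting from the same value, q_ena and q_mux hold the same value after every clock edge, which is why a synthesizer is free to pick either implementation.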
I don’t recall seeing you mention which chip you are synthesizing for. I’m guessing something in the MAX 3000A family? I haven’t used them, but here’s my guess from looking at the datasheet: the enable only has a single product term available to it. If there are multiple load conditions (requiring multiple product terms) and you forced it to use the enable, then it would have to use another macrocell to simplify the load/no-load condition first. On the other hand, using the mux with an extra input for the no-load condition(s) would just add more product terms to the equation for the FF input, and the architecture is already set up to allocate those efficiently (if need be, extra terms can be allocated from neighboring macrocells without necessarily losing the donating macrocell’s usefulness).
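A one-bit sketch of that argument, with two hypothetical load conditions (load_x and load_y are invented names), shows how the mux-feedback form expands into a plain sum of products. The product-term counts are illustrative only and are not taken from any datasheet.

```verilog
// Hypothetical one-bit register with two load conditions.
module two_load_conditions (
    input  wire clk,
    input  wire load_x, load_y,
    input  wire x, y,
    output reg  q
);
    // Mux-feedback form as a sum of products: three product terms
    // feeding the flip-flop's D input, with no enable used.
    wire d = ( load_x & x)
           | (~load_x &  load_y & y)
           | (~load_x & ~load_y & q);

    always @(posedge clk)
        q <= d;

    // The enable form would need ena = load_x | load_y, which is
    // two product terms rather than one.
endmodule
```

If the enable really is limited to a single product term, the OR of the two load conditions would not fit in it directly, while the three-term D equation is ordinary work for the macrocell's product-term array.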
Good observation, that sounds right. I’m targeting the Altera Max EPM7128S.