Archive for April, 2010
Back-Annotation
I think I’m close to having this CPU fit the 128 macrocell CPLD, but I’m running into some problems with the final details. Soon I’m going to work on a physical board layout for this CPU, at which point all the pin assignments need to be finalized. Any further design tweaks need to maintain those pin assignments, or else they could not use the same board.
Altera’s Quartus II software meets this need with a tool called back-annotation. Once you’ve synthesized the Verilog design and computed a fit for the particular device you’re using, you can back-annotate the original design with the pin placements determined by the fitter. Then if you later change the design, the software will attempt to keep those same placements, or report failure if it can’t.
That sounds great, except when I use back-annotation, it always causes fitting to fail. Starting with no constraints, I can synthesize and fit my design successfully (currently 117 macrocells), then back-annotate the device and pin assignments, and run synthesis and fitting again. Since I made no changes to the design, the result the second time should be identical to the first, and should match the back-annotation constraints perfectly. But what actually happens is that the second run fails, complaining that it’s unable to pack the cells into LABs. This is proving very frustrating, since it’s a showstopper if I can’t find a solution.
Success?
I think I’ve succeeded at cramming a decently useful CPU into this 128 macrocell device! I threw away my first design and started over from scratch, abandoning almost all the 6502-related elements, and working closer to a minimal instruction set design. I also changed the Verilog structure to explicitly specify the internal datapaths between CPU registers, rather than the more behavior-oriented design I tried originally.
The major caveat is that none of this is tested yet. No testing whatsoever. It’s virtually certain that the design contains many mistakes, some of which may be letting it fit into fewer macrocells than a correct design would need. As of now, the design occupies 121 of 128 macrocells. Due to routability constraints, however, I can’t really make use of the last seven.
My primary goal was to make a working 8-bit CPU, with a stack, index register, and indexed addressing mode. I’ve accomplished that, but a lot of other nice stuff had to be tossed out. Here are the features that the CPLD CPU supports, and doesn’t support:
Addressing Modes
Supported: immediate, absolute, absolute with index
Not supported: indirect, indirect with index, read-modify-write
Indirect addressing would be nice, but I’m not going to lose sleep over its absence. Self-modifying code can be used as a sort of poor man’s indirect addressing where needed.
Program Flow
Supported: jump, branch if carry set, branch if zero flag set, call, return
Not supported: indirect jump, branch if carry not set, branch if zero flag not set, negative flag, overflow flag
The overflow flag is rarely used, and a subroutine can be written to do the same thing if needed. The negative flag would be more useful, but not critical. Branch on flag not set would be very useful, but can always be avoided by modifying the branch test.
Math and Logical Operations
Supported: add A, sub A, compare A, nor A, load/store/push/pull A, load/store/push/pull X, increment/decrement X
Not supported: add/subtract with carry-in, compare X, and, or, xor, shift, rotate, test, register-to-register transfers, direct set/clear of flags
I decided the lack of a carry-in wasn’t a big deal. You can always test the carry out, and manually add 1 to the next stage of a multi-byte addition or subtraction.
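As a rough illustration of that workaround, here is a small Python model of a 16-bit addition built from 8-bit adds that have no carry-in. The helper names `add8` and `add16` are made up for this sketch and are not part of the CPU design. The only assumption is that ADD produces an 8-bit result plus a carry-out flag.

```python
def add8(a, b):
    """8-bit add with no carry-in; returns (result, carry_out)."""
    total = (a & 0xFF) + (b & 0xFF)
    return total & 0xFF, total >> 8


def add16(x, y):
    """Add two 16-bit values using only add8, propagating the carry by hand."""
    lo, carry = add8(x & 0xFF, y & 0xFF)
    hi = (x >> 8) & 0xFF
    if carry:                       # stands in for "branch if carry set"
        hi, _ = add8(hi, 1)         # manually add 1 to the next stage
    hi, _ = add8(hi, (y >> 8) & 0xFF)
    return (hi << 8) | lo
```

The carry is tested immediately after the low-byte add, before any later add can overwrite the flag. The result wraps at 16 bits, as it would in a two-byte sum.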
Having NOR as the only logical operation seems strange, but is surprisingly powerful:
not A = A nor 0
A or B = (A nor B) nor 0
A and B = (A nor 0) nor (B nor 0)
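A quick way to convince yourself of these identities is to check them by brute force over every pair of 8-bit values. The Python below is only a sanity check of the logic, assuming an 8-bit NOR; the function names are invented for the sketch.

```python
def nor(a, b):
    """Bitwise NOR of two 8-bit values."""
    return ~(a | b) & 0xFF


def not_(a):
    return nor(a, 0)                    # not A = A nor 0


def or_(a, b):
    return nor(nor(a, b), 0)            # A or B = (A nor B) nor 0


def and_(a, b):
    return nor(nor(a, 0), nor(b, 0))    # A and B = (A nor 0) nor (B nor 0)


def identities_hold():
    """Compare each NOR-built operation with Python's own operators."""
    return all(
        not_(a) == (~a & 0xFF)
        and or_(a, b) == (a | b)
        and and_(a, b) == (a & b)
        for a in range(256)
        for b in range(256)
    )
```

Note that OR costs two NOR operations and AND costs three, so on the real CPU these would also need a temporary location for the intermediate results.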
For the common task of AND-ing a number with an immediate value to check if a particular bit is set, this can be done with NOR in a single step if you use the bitwise-complement of the immediate value, and also reverse the sense of the test:
NOR #$7F
BZ highBitIsOne
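To see why the branch sense is reversed, here is a small Python model of that two-instruction sequence. It assumes NOR leaves an 8-bit result and that BZ branches when that result is zero; `high_bit_is_one` is just a name for this sketch.

```python
def nor(a, b):
    """Bitwise NOR of two 8-bit values."""
    return ~(a | b) & 0xFF


def high_bit_is_one(a):
    """Model of NOR #$7F followed by BZ; True means the branch is taken."""
    result = nor(a, 0x7F)    # $7F is the bitwise complement of the mask $80
    zero_flag = (result == 0)
    return zero_flag         # BZ branches when the zero flag is set
```

When bit 7 of A is set, `A or $7F` is `$FF`, so the NOR result is zero and the branch is taken. When bit 7 is clear, the NOR result is `$80` and the branch falls through.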
Other missing operations can be easily simulated:
clear carry = ADD #0
set carry = SUB #0
left shift = STA temp, ADD temp
transfer A to X = PHA, PLX
transfer X to A = PHX, PLA
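These substitutions can be tried out in a toy Python model of the registers. This is only a sketch: the `Cpu` class, its method names, and the `temp` address are hypothetical, and it assumes SUB sets the carry flag when no borrow occurs (the 6502-style convention that "set carry = SUB #0" implies).

```python
class Cpu:
    """Toy model of A, X, the carry flag, a stack, and memory."""

    def __init__(self, a=0, x=0):
        self.a = a & 0xFF
        self.x = x & 0xFF
        self.carry = 0
        self.stack = []
        self.mem = {}

    def add(self, value):
        total = self.a + (value & 0xFF)
        self.a, self.carry = total & 0xFF, total >> 8

    def sub(self, value):
        diff = self.a - (value & 0xFF)
        self.a = diff & 0xFF
        self.carry = 1 if diff >= 0 else 0   # assumed: carry means "no borrow"

    def sta(self, addr):
        self.mem[addr] = self.a

    def pha(self):
        self.stack.append(self.a)

    def phx(self):
        self.stack.append(self.x)

    def pla(self):
        self.a = self.stack.pop()

    def plx(self):
        self.x = self.stack.pop()


def clear_carry(cpu):
    cpu.add(0)                   # ADD #0 never carries out


def set_carry(cpu):
    cpu.sub(0)                   # SUB #0 never borrows


def shift_left(cpu, temp=0x10):
    cpu.sta(temp)                # STA temp
    cpu.add(cpu.mem[temp])       # ADD temp: A + A, old bit 7 lands in carry


def transfer_a_to_x(cpu):
    cpu.pha()
    cpu.plx()


def transfer_x_to_a(cpu):
    cpu.phx()
    cpu.pla()
```

In this model `ADD #0` and `SUB #0` leave A unchanged and only affect the carry flag, and the shift leaves the old high bit in carry just as a real shift instruction would.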
Of all the missing functions, the only ones I really wish I could squeeze in are the branch if not set, and compare X. Not having compare X means that any loop over X has to start at some number and go down to zero, instead of starting at zero and going up. Most of the time that’s probably OK. In an emergency, compare X could be simulated as:
PHA, PHX, PLA, CMP, PLA
But that’s pretty ugly, and it also assumes that PLA doesn’t modify the flags (still undecided).
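The dependence on PLA’s flag behavior is easy to see in a toy Python model. Everything here is hypothetical: the `Cpu` class, the `pla_updates_flags` switch, and the guess that a flag-updating PLA would set the zero flag from the pulled value. Only the zero flag is modeled.

```python
class Cpu:
    """Toy model with A, X, a stack, and a zero flag."""

    def __init__(self, a, x, pla_updates_flags):
        self.a = a & 0xFF
        self.x = x & 0xFF
        self.zero = False
        self.stack = []
        self.pla_updates_flags = pla_updates_flags

    def pha(self):
        self.stack.append(self.a)

    def phx(self):
        self.stack.append(self.x)

    def pla(self):
        self.a = self.stack.pop()
        if self.pla_updates_flags:       # the still-undecided design choice
            self.zero = (self.a == 0)

    def cmp(self, value):
        self.zero = (self.a == (value & 0xFF))


def compare_x(cpu, value):
    """PHA, PHX, PLA, CMP, PLA; returns the zero flag seen afterwards."""
    cpu.pha()         # save A
    cpu.phx()
    cpu.pla()         # A now holds X
    cpu.cmp(value)    # the flags reflect X compared with value
    cpu.pla()         # restore A, possibly clobbering the flags
    return cpu.zero
```

With `pla_updates_flags=False` the zero flag after the sequence reflects the comparison with X. With `pla_updates_flags=True` it reflects whether the restored A is zero, so the comparison result is lost.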
I will post the Verilog code once it’s tidied up a bit more, and I’m confident I’ve squeezed it as much as possible.
Synthesis Puzzles
The more I try to understand the behavior of the Verilog synthesis tool, the less I understand it. I decided to go back to square one with my design, and start by implementing a basic 8-bit counter that can be reset, loaded, incremented, or decremented. Here’s the source:
module counter
    (input            clk,
     input            reset,    // asynchronous, active low
     input      [3:0] state,
     input      [7:0] d,
     output reg [7:0] q);

    // state encodings that act on the counter; any other value holds q
    localparam [3:0] load = 4'b0000,
                     inc  = 4'b0101,
                     dec  = 4'b1111;

    always @(posedge clk or negedge reset) begin
        if (!reset)
            q <= 0;
        else if (state == load)
            q <= d;
        else if (state == inc)
            q <= q + 1'b1;
        else if (state == dec)
            q <= q - 1'b1;
    end
endmodule
If I synthesize that using Altera Quartus II v9.0, with a MAX 7000S series target device (EPM7128SLC84-15), and the optimization set for minimum area, it uses 10 macrocells. 10 macrocells for an 8-bit counter. The “Timing Closure Floorplan” view will show which macrocells were used, and the equations implemented in each one. It turns out that the first 6 bits of the counter each fit in a single macrocell, and use a T-type flip-flop with four product terms each, which follow a recognizable pattern. But for unknown reasons, the software implements the last two bits completely differently, using D-type flip-flops, an extra macrocell per bit for additional product terms, and one shared expander term. If you’ve got Quartus II Web Edition installed, you can easily confirm this yourself.
I couldn’t understand why the software didn’t just follow the pattern of the first 6 bits for the 7th and 8th bits too. There didn’t seem to be any limit on the number of product terms or inputs that it would run into, as far as I could tell. I decided to try it, by using Altera primitives to explicitly specify a T-type flip-flop for all 8 bits, and listing out the exact logic equations for each bit. You can view the Verilog code here: counter_v2.v
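For readers who haven’t written a counter in terms of toggle flip-flops, the idea can be sketched in Python. This is not the contents of counter_v2.v, and it makes no claim to match the terms Quartus generated. It is one plausible set of per-bit toggle equations, checked against a behavioral model of the counter above. The function names are invented for the sketch.

```python
LOAD, INC, DEC = 0b0000, 0b0101, 0b1111   # state encodings from the module above


def behavioral_next(q, d, state):
    """Next value of q as described by the if/else chain in the Verilog."""
    if state == LOAD:
        return d & 0xFF
    if state == INC:
        return (q + 1) & 0xFF
    if state == DEC:
        return (q - 1) & 0xFF
    return q                               # any other state holds the value


def toggle_next(q, d, state):
    """Next value of q computed bit by bit as T flip-flops (q_i ^= t_i)."""
    nxt = 0
    for i in range(8):
        q_i = (q >> i) & 1
        d_i = (d >> i) & 1
        mask = (1 << i) - 1                # selects the bits below bit i
        lower = q & mask
        t = (
            (state == LOAD and q_i != d_i)     # toggle only if the bit must change
            or (state == INC and lower == mask)  # all lower bits are 1
            or (state == DEC and lower == 0)     # all lower bits are 0
        )
        nxt |= (q_i ^ int(t)) << i
    return nxt
```

Written out as sum-of-products, the load condition splits into two terms (`q_i` and not `d_i`, or `d_i` and not `q_i`), which together with the increment and decrement terms gives four product terms per bit.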
The new version worked fine. In fact, it worked better than fine. It fit in 8 macrocells instead of 10, and used no shared expanders or other magic. And it was not only smaller, it was also faster. The software computed a maximum speed of 76.92 MHz, compared to only 45.45 MHz for the first version.