Hello Nibbler!

August 22nd, 2013 | Category: Nibbler | Author: Steve

Say hello to Nibbler, the 4-bit homemade CPU! Ever since I built BMOW1, people have written to me asking how to make their own homebrew computers. BMOW is a complex design that can be difficult to comprehend, so I decided it was time to create a minimal CPU that’s easy to understand, easy to build, but still capable of running interesting programs. Ideas for Nibbler began percolating in my brain, and after a few weeks of pencil sketches and hand simulation, it’s finally ready to share. And if you’ve forgotten, a nibble is half a byte or 4 bits, so the name fits the CPU.

Some of you may be thinking: “4-bit CPU? BORING!” I agree that many of the 4-bit CPU designs on the web aren’t very exciting, though that’s not an inherent problem with their 4-bitness, but is caused by shortcomings in the computer that surrounds the CPU. Most designs are limited to 256 nibbles of memory, which just isn’t enough to fit a program that does anything very interesting. I/O is often limited to basic LEDs and switches, further reducing the scope of what’s possible.

My goals for Nibbler are:

Only use commonly-available 7400 series chips and RAM/ROM. No programmable logic or other goodies.
Keep the total number of chips as few as possible.
Employ a simple, straightforward design that’s easy to understand.
Maintain a clean logical separation between the CPU and the computer surrounding it.
Run interesting, interactive programs involving several I/O devices.
NOT: Be the most powerful CPU, or the easiest to write programs for.

Design/Architecture

The architecture of Nibbler is shown above. The CPU core is just eleven 7400 series chips, plus the clock crystal. RAM and ROM add two more chips, and peripheral I/O in “the computer” adds three more, for a total of sixteen chips overall. Compared to BMOW’s 65 chips and multiple clocks, that’s very lightweight.

Instruction opcodes are 4 bits wide, which allows for 16 possible types of instructions. All instructions require exactly two clock cycles to execute. During the first clock cycle, called phase 0, the instruction opcode and operand are retrieved from memory and stored in a register called Fetch. The second clock cycle, called phase 1, performs the calculation or operation needed to execute the instruction.

A pair of microcode ROMs is used to generate the sixteen internal control signals needed to load, enable, and increment the other chips in the CPU at the appropriate times. The microcode ROM address is formed from the instruction opcode, the phase, and the CPU carry and equal flags. Each microcode ROM outputs a different group of eight of the sixteen total control signals.
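
As a rough illustration of how such a microcode lookup could work, here is a minimal Python sketch. The post only says that the address is formed from the opcode, the phase, and the two flags; the bit ordering below, the function names, and the choice of which ROM supplies the upper eight signals are assumptions made for the example, not the actual Nibbler layout.

```python
def microcode_address(opcode, phase, carry, equal):
    """Pack opcode (4 bits), phase (1 bit) and two flags into a 7-bit address.

    The bit order used here is an assumption for illustration only.
    """
    if not 0 <= opcode <= 0xF:
        raise ValueError("opcode must fit in 4 bits")
    if phase not in (0, 1):
        raise ValueError("phase must be 0 or 1")
    return (opcode << 3) | (phase << 2) | (int(bool(carry)) << 1) | int(bool(equal))


def control_signals(rom0, rom1, address):
    """Combine one byte from each ROM into a single 16-bit control word."""
    return (rom1[address] << 8) | rom0[address]
```

With four opcode bits, one phase bit, and two flag bits, each ROM would need only 128 entries under this assumed packing.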

A load-store design is used, with all arithmetic and logical computation results being stored into the single 4-bit accumulator register named “A”. Data can be moved between A and memory locations in RAM, but otherwise all the CPU instructions operate only on A. This greatly simplifies the hardware requirements, at the cost of some decrease in flexibility when writing programs.

In contrast to most modern CPUs, the Nibbler design uses a Harvard Architecture. That means programs and data are stored in separate address spaces, and travel on separate busses. The data bus is 4 bits wide, as one should expect for a 4-bit CPU. The program bus is 8 bits wide: 4 bits for the instruction opcode, and 4 bits for an immediate operand.

Program and data addresses are both 12 bits wide, resulting in total addressable storage of 4096 bytes for programs and 4096 nibbles for data. A 12 bit program counter holds the current instruction address. Since instruction opcodes are 4 bits wide, that makes instructions involving absolute memory addresses 4 + 12 = 16 bits in size, or two program bytes.
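
The instruction format can be sketched in a few lines of Python. This assumes the opcode sits in the high nibble of the first program byte and the operand in the low nibble, and that a two-byte instruction uses the operand nibble as address bits 8-11 with the second byte as bits 0-7. The function names are hypothetical.

```python
def decode(program, pc):
    """Split one program byte into (opcode, operand) nibbles.

    Assumes the opcode occupies the high nibble of the byte.
    """
    first = program[pc]
    return first >> 4, first & 0xF


def decode_absolute(program, pc):
    """Decode a two-byte instruction carrying a 12-bit absolute address."""
    opcode, high_nibble = decode(program, pc)
    address = (high_nibble << 8) | program[pc + 1]
    return opcode, address
```

For example, under these assumptions the byte pair 0xA3 0x7F would decode to opcode 0xA with address 0x37F.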

Nibbler is notable for a few things it does NOT have. There’s no address decoder, because there’s not more than one chip mapped into different regions of the same address space. Program ROM occupies all of the program address space, and RAM occupies all of the data address space. As you’ll see later, I/O peripherals aren’t memory-mapped, but instead use port-specific IN and OUT instructions to transfer data.

Nibbler also lacks any address registers, which means it can’t support any form of indirect addressing, nor a hardware-controlled stack. All memory references must use absolute addresses. That’s a significant limitation, but it’s in keeping with the project’s K.I.S.S. design goals. With the use of jump tables and dedicated memory locations, a simple call/return mechanism can be implemented without a true stack.

Up to sixteen distinct I/O devices can be supported by the CPU, but the planned I/O devices require just one IN port and two OUT ports. The computer’s input comes from four momentary pushbuttons, arranged in a left/right/select/back cross configuration, and connected to the IN port. Output utilizes one of the two OUT ports, and includes the obligatory LEDs used for debugging, as well as a piezo speaker for software-controlled sound, and a two-line character-based LCD display.

The specific 7400 logic family and chips to be used aren’t yet finalized, but in back-of-the-envelope calculations, it looks like the CPU should support a speed of just over 4 MHz. The longest path is for a write to RAM during phase 1: Clock-to-Q delay for the Fetch register, plus propagation delay for the microcode ROMs, ALU, and bus driver, plus data setup time for the RAM. At two clock cycles per instruction, 4 MHz operation would result in 2 MIPS, which is the same as or better than BMOW.

I’ll write more about the instruction set and programming model next time. Until then, if you have any comments or questions, I’d love to hear them!


17 Comments so far

Steve Chamberlin - August 22nd, 2013 8:46 pm

Whoops, it looks like the max speed is actually under 3 MHz. The RAM address and write-enable signal must be valid before the second half of the clock cycle, when the RAM is enabled, or else incorrect writes may occur. That means the clock period must be 2x the 185 ns needed for those signals, or 370 ns, which is a 2.7 MHz clock.
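
The arithmetic behind that figure, as a small Python sketch. The 185 ns value and the two-clocks-per-instruction rule come from the post; the function names are made up for the example.

```python
def max_clock_mhz(signal_valid_ns):
    """Clock period must be twice the time needed for the RAM signals to settle."""
    period_ns = 2 * signal_valid_ns
    return 1000.0 / period_ns  # 1000 / ns gives MHz


def mips(clock_mhz, cycles_per_instruction=2):
    """Every instruction takes the same number of clock cycles."""
    return clock_mhz / cycles_per_instruction


clock = max_clock_mhz(185)  # 370 ns period, roughly 2.7 MHz
throughput = mips(clock)    # roughly 1.35 MIPS at two clocks per instruction
```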
Erik Petrich - August 24th, 2013 9:52 pm

Looking forward to seeing this project unfold. The lack of indirect memory access seems like it will be a big limitation. With the Harvard architecture, you can’t even work around this using self-modifying code.
Steve Chamberlin - August 25th, 2013 4:43 pm
Thanks Erik! I agree no indirect memory access is a big limitation, and would never fly for a “real” CPU, but for this project I think it’s the best compromise of power vs complexity, since I really want to keep things simple. I used paper simulation to convince myself that I could still do most of what I wanted through jump tables, switch statements, unrolling loops, etc. It will be similar to programming in BASIC (though even BASIC has arrays).

I spent a long time searching for a good way to add indirect memory access, without adding too much additional design complexity. One of the nice things about the current design is that it all hangs together just so: all instructions require the same number of clocks, and all the control signals and data go straight to their destinations without needing any glue logic or decoders or multiplexers. Introducing indirect memory access wrecks all that, so it’s not just a matter of adding extra chips for an address register.

The best solution I could find involves reducing the data address width to 8 bits (256 bytes addressable RAM), and specifying the addressing mode in the operand field – what would otherwise be bits 8-11 of a 12 bit address. Two explicitly-loaded memory registers, OUT2 and OUT3, make the low and high nibbles of the indirect address. This would work OK, but has several shortcomings I’m not happy with:
- Addressable RAM area shrinks significantly.
- Increases the total chip count by 20%.
- Using OUT registers for indirect addressing feels a little clunky – it blurs the line between internal CPU functions and external I/O.
- To implement pointers efficiently, a new instruction is probably needed to load a nibble directly from RAM to an OUT register, without involving the accumulator. Otherwise it’s difficult to store A to a pointer address without destroying the value to be stored while setting up the memory registers.
- It just makes the whole system more difficult to explain.
It’s not terrible, and maybe it’s worth those issues in order to gain indirect addressing, but I’m leaning against it. I’ll sleep on it and give it some more thought before deciding for sure.
Erik Petrich - August 25th, 2013 8:38 pm

“Using OUT registers for indirect addressing feels a little clunky – it blurs the line between internal CPU functions and external I/O.” I agree, but it’s not totally without precedent. The 8051 architecture has its 16-bit pointer register in the SFR address space, which is where it has all its I/O ports and other peripheral subsystem registers.

You should probably just stick to your original plan, though, see how it turns out, and apply the lessons learned to version 2.
Steve Chamberlin - August 26th, 2013 9:17 am

I thought about this for a good while longer, and I’m going with the original plan and omitting indirect addressing. I really want to keep everything as simple and obvious as possible, so the design will be easy to grasp for anyone. My goal is to be able to run programs like Simon or Mastermind, and I’ll be able to do that – though the programs will be long and ugly to look at.
Hans Franke - August 30th, 2013 4:02 am

I wouldn’t so much think about a line between CPU and IO.

(Assumption: there is no room for an additional indirect load/store instruction.) With two OUT ports holding the lower and middle address nibbles, the high nibble does not have to be fixed, but could be supplied in part by the instruction as usual. The topmost bit acts as an identifier for indirect (calculated) addressing (read: selecting the right address source), while the remaining 3 form address bits 10:8. A solution often found in old systems.

This gives 2K addressable RAM organized in 8 pages of 256 Bytes. I doubt that this machine will ever call for a single data structure of more than 256 Bytes, no matter if stack or otherwise.
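
A minimal Python sketch of how this paged scheme might compute an effective address. It assumes the top bit of the instruction's high nibble selects indirect mode, the remaining three bits give the page (address bits 10:8), and two OUT registers supply the low and middle nibbles. The function and parameter names are hypothetical.

```python
def effective_address(operand_nibble, direct_low_byte, out_low, out_mid):
    """Return the RAM address selected by the paged direct/indirect scheme."""
    indirect = bool(operand_nibble & 0x8)  # topmost bit picks the address source
    page = operand_nibble & 0x7            # address bits 10:8
    if indirect:
        # Low byte comes from the two OUT registers (middle and low nibble).
        low_byte = ((out_mid & 0xF) << 4) | (out_low & 0xF)
    else:
        # Low byte comes from the instruction itself, as usual.
        low_byte = direct_low_byte & 0xFF
    return (page << 8) | low_byte
```

The highest address this can produce is 0x7FF, which matches the 8 pages of 256 locations described above.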

A different way would be to give certain RAM cells special functions, something DEC did on their early machines to add functionality. Of course such tricks are limited to word size (unless we add multiple cycles to handle combined words). So a pointer can only stretch over a range of 16 Bytes – and without a tour through the ALU (or some other calculation circuit), the address base would be restricted to 16 Byte ‘pages’ (xx0..xxF).

To handle the access, a second tour through the RAM has to be made – for simplicity by adding a special cycle. Let’s assume we use xy0 as the indicator for indirect (so a 4 input NAND can select the indirect mode), and look at a read access first. The value output in the original RAM access cycle is buffered in a new 4 bit latch Z (for simplicity this is always done here). The newly added cycle now puts the buffered nibble (z) out as xyz. Now the original operation (read or write) takes place.

To explicitly access the indexed cell we need to give up another address (for example xy1) and 1/16th of our address space.

With this system the RAM becomes organized as 3.5k word-size (nibble) cells, or 256 ‘registers’ of 14 words (plus index). Sounds quite like early calculators, doesn’t it? 3.5k of (absolutely addressable) RAM and 256 complex variables are more than enough for Mastermind, Simon or I’d say even Pong and Space Invaders or a really nice classy pocket calculator.

The only missing thing here is a free pointer, but anything involving multiple word size data items does require multiple access AND multiple buffers. Here again OUT controlled access is the way to go. … hmm thinking of it, we now reduced the needed external address to 8 Bit ….. SOMEONE PLEASE STOP ME!

:))

When thinking about very small systems, we have to do away with the simple beauty of a linear address space (as already done by using Harvard) or the ‘all cells are equal’ dogma.
Hans Franke - August 30th, 2013 4:26 am

Since no one stopped me… here’s more :))

On Speed: To me this little system looks perfect for asynchronous operation, having the clock generator pushed at least according to each stage, if not each operation separately. A load or store (operation C=A or C=B) can be handled differently from an arithmetical op – and a RAM-accessing instruction again differently from a RAM-less one. We do not even have to add a timing circuit to each path (though this would really be a neat idea), just form 4 or 8 timing groups.

On Design: While it is not strictly necessary to have a way to write the program ‘ROM’ from within, it might be a good idea to define such an interface – so a self-sufficient computer could be created, for example with a little loader from a keypad or the like. Since this is outside the basic scope, even a more complex interface is suitable, with the advantage of occupying only one or two OUT addresses.

One way could be a protocol using one port, outputting 6 nibbles for each byte to write, in the form cxyzab (c – Command; xyz – Address; ab – Value).
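
A small Python sketch of what encoding one such write might look like. The command value, the high-nibble-first ordering, and the function name are assumptions for illustration; the comment above only fixes the overall cxyzab shape.

```python
WRITE_COMMAND = 0x1  # hypothetical command code; no values are defined above


def encode_write(address, value, command=WRITE_COMMAND):
    """Encode one program-byte write as six nibbles in the order c, x, y, z, a, b."""
    if not 0 <= address < 4096:
        raise ValueError("address must fit in 12 bits")
    if not 0 <= value < 256:
        raise ValueError("value must fit in 8 bits")
    return [
        command & 0xF,
        (address >> 8) & 0xF,  # x: address bits 11:8
        (address >> 4) & 0xF,  # y: address bits 7:4
        address & 0xF,         # z: address bits 3:0
        (value >> 4) & 0xF,    # a: high nibble of the byte to write
        value & 0xF,           # b: low nibble of the byte to write
    ]
```

Each nibble in the returned list would then be sent with one OUT instruction to the loader port.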

And yes, I think I’m going to build one (eventually with the/a register scheme)
Steve Chamberlin - August 30th, 2013 8:21 am

Hans, thanks for the great feedback! You’re thinking along the same lines I did when I built BMOW, looking for places where a clever encoding trick or bit of extra hardware could enable more CPU capabilities. My goals for Nibbler are a little different. It’s meant to be an “example CPU”. My #1 goal is to keep the design as simple as possible, while still being able to do something more exciting than blink LEDs.

Your first idea of using A11 as an address mode select sounds good, and is similar to what I mentioned in the comment to Erik Petrich. If I had to support indirect addressing, your method is the way I would do it. I spent a while looking at implementing it, and unfortunately it’s not as simple as it first seems. That address mode bit is only valid for data memory instructions – for other instructions it’s part of the immediate operand or the jump destination address – so more logic is needed to know when it’s valid. Or the address mode bit could be stored in microcode, but that requires finding space for more microcode signals. Then there’s the question of how to actually mux the two address sources for the RAM. You’d need three ‘157 mux chips, or an address bus with separate output enables for the two sources. But as one source (the program ROM) is also used for instruction fetches and jumps, that gets a little complicated too. I spent quite a while going around in circles on this, and concluded that it just wasn’t worth it, and KISS prevailed. But maybe I need to take another look at it.
Yves Legault - September 8th, 2013 10:01 am

I did something quite close to this project back in 1982. I only had an “ADD” and a “JUMP” microcoded with diodes and the thing was drawing 3 Amps while running at 25 MHz. It was way faster than a 6800 or a 6502, but I found the project to be quite instructive back then.
Rando - September 10th, 2013 4:30 am

So so cool!

Do you see this in a Xilinx something someday?
Damilare - November 14th, 2016 11:05 am

Good day, I’ve read through the tutorial several times but couldn’t understand the microcode aspect. I don’t have a PC, nor do I have access to one.
So I’ve made a programmer with some ICs and switches, but couldn’t work out at which byte I should store those codes. I would like someone (Steve?) to explain more or give me a plain text file containing the bytes and their memory locations. I’d be glad to get a response as soon as possible.
Steve - November 14th, 2016 11:20 am

You’ll find the microcode binary files in the Simulator subfolder of the Nibbler file archive, which you can download here: http://www.bigmessowires.com/nibbler/ They are named microcode_0.bin and microcode_1.bin. Program those bytes to your microcode flash ROMs.
Damilare - November 14th, 2016 12:17 pm

Thanks for the response sir, that file is in pure binary representation, but I don’t have any app to convert it to ASCII that I can understand.
Damilare - November 14th, 2016 12:37 pm

Is there any solution Sir?
Steve - November 14th, 2016 2:03 pm

The microcode files are raw data bytes, not ASCII, so no conversion is required or possible. The first byte of the data file should be programmed at the first byte of the microcode flash ROM, etc.
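
For anyone who wants to read those raw bytes as 1s and 0s, here is a minimal Python sketch that lists each byte with its ROM address. The function names and the output format are just one possible choice, and the file name in the usage note is the one mentioned above.

```python
def dump_lines(data):
    """Return one text line per byte: hex address, binary value, hex value."""
    return [
        f"{address:04X}: {byte:08b} (0x{byte:02X})"
        for address, byte in enumerate(data)
    ]


def dump_file(path):
    """Read a raw binary file and print a line for every byte in it."""
    with open(path, "rb") as handle:
        for line in dump_lines(handle.read()):
            print(line)
```

Calling `dump_file("microcode_0.bin")` would print the byte for address 0000 first, which is the byte to program at the first location of the ROM.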
Damilare - November 18th, 2016 3:46 am

Thank you sir, I understand your concept, but if I open that file with any text editor, instead of seeing the 1s & 0s that I can understand, I can only see those weird characters. However, I’ve found an Android tool to do the conversion. Again, thank you very much for this idea.
bill Rowe - July 7th, 2017 12:56 pm

Did you use any particular technique to minimize your chip count with the 7400 gates?
