Tiny CPU Update
Slow progress continues on the assembly and first boot-up of Tiny CPU. The photo shows a sample program written in Tiny CPU assembly language, drawing a color pattern on the LCD. The serial port and keyboard interface haven’t yet been added to the board, but the Max II CPLD, SRAM, and Flash ROM are all working fine from both an electrical and a design standpoint. After the initial round of swearing at the soldering iron, there were no further electrical problems at all, so all the work has been on configuration and software. The goal of designing a novel CPU architecture and implementing it in hardware has been met successfully.
Of course it’s not all roses and sunshine, and several smaller problems plus one big one have slowed progress and cast doubt on future direction. Debugging has been a major challenge. When things don’t work as expected, or do nothing at all, there aren’t many good tools to help diagnose the problem. The best tool I’ve found thus far is TopJTAG Probe, a $100 software program that lets you examine the current state of any pin, and display continuously updating state data in a waveform-style window. It’s great for examining external signals at the pins, but the internal machine state remains invisible. It’s also limited to about 400 samples per second due to its use of JTAG boundary scan, which requires slowing the CPU clock to around 100 Hz to do any debugging. My free trial expires in 17 more days, and I’m undecided whether I’ll purchase it.
Altera also offers system debugging tools, including a scriptable Tcl console, In-System Sources and Probes, and a virtual JTAG interface. Not surprisingly, these tools all require on-chip logic resources, and Tiny CPU has few LEs to spare. The most promising tool appears to be their SignalTap II logic analyzer, but it requires on-chip RAM, and the Max II has none. Altera doesn’t appear to offer any tools like TopJTAG Probe that work purely through JTAG boundary scan without consuming on-chip logic resources. I thought the jtag_debug interface of the scriptable Tcl console might be what I was looking for, but I was unable to get it to work.
When the CPU is running, it’s pretty slow. It took four seconds to fill the LCD with the color test pattern shown in the photo. Much of that is due to the inefficiency of the bit-banged SPI code I wrote to communicate with the LCD, but the 2.6 MHz clock speed is also a factor. The 2.6 MHz is provided by the Max II’s on-chip oscillator, whose frequency is fixed. It can be divided down using logic if a slower clock is needed, but it’s not possible to go faster than 2.6 MHz. According to the timing analysis report, the CPU should run at up to 40 MHz.
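As an aside, dividing that fixed oscillator down is just a small counter in logic. Here’s a minimal Verilog sketch of the idea, assuming a toggle-style divider and an arbitrary ratio of 26,000 to get roughly the 100 Hz needed for boundary-scan debugging; it’s illustrative only, not the actual Tiny CPU source.

// Minimal sketch: divide the fixed on-chip oscillator down to a slow debug clock.
// The divide ratio and port names here are assumptions for illustration.
module clock_divider #(
    parameter DIVIDE_BY = 26000            // 2.6 MHz / 26000 = 100 Hz
) (
    input      osc_clk,                    // ~2.6 MHz internal oscillator
    output reg slow_clk = 1'b0             // divided clock for debugging
);
    reg [15:0] count = 16'd0;

    always @(posedge osc_clk) begin
        if (count == DIVIDE_BY/2 - 1) begin
            count    <= 16'd0;
            slow_clk <= ~slow_clk;         // toggle for a 50% duty cycle
        end else begin
            count <= count + 16'd1;
        end
    end
endmodule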
Design Disaster
The biggest problem by far is the bank-switching design. Tiny CPU has a 10-bit address space, enabling 1K to be addressed directly. The companion module Tiny Device performs bank switching, mapping 128 possible 512-byte banks of RAM and ROM into the lower and upper halves of the CPU’s address space. When I first described Tiny CPU’s bank switching design, it seemed a clever and elegant way to expand the address space. After working with it in real programs, however, it feels like a complete disaster. It’s confusing and cumbersome. It complicates the design of the programs, the assembler, and Tiny Device. It makes simple things hard. In short, it needs to be taken out back and shot.
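For illustration, the mapping amounts to something like the sketch below; the bank register’s port address, the fixed upper-half RAM block, and the exact widths shown are simplified placeholders rather than the actual Tiny Device code.

// Simplified sketch of the bank-switching idea, not the actual Tiny Device code.
// A 7-bit memory-mapped bank register selects one of 128 banks of 512 bytes for
// the lower half of the 10-bit address space; the upper half is shown mapped to
// a fixed RAM block. The port address 0x3FF is a placeholder.
module bank_mapper (
    input         clk,
    input         we,                      // CPU write strobe
    input  [9:0]  cpu_addr,                // 10-bit address from Tiny CPU
    input  [7:0]  cpu_data,                // data bus from Tiny CPU
    output [15:0] ext_addr                 // address sent to external RAM/ROM
);
    reg [6:0] bank = 7'd0;                 // current bank for the lower half

    always @(posedge clk)
        if (we && cpu_addr == 10'h3FF)     // write to the bank register port
            bank <= cpu_data[6:0];

    assign ext_addr = cpu_addr[9] ? {7'h7F, cpu_addr[8:0]}   // upper half: fixed RAM
                                  : {bank,  cpu_addr[8:0]};  // lower half: banked
endmodule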
Jumping between routines in different banks requires locating the code so that the first instruction of the routine in the target bank sits at the next consecutive address, modulo 512, after the instruction in the source bank that alters the bank register. That way, when the bank register changes, execution “falls through” into the target bank transparently to the CPU. In practice I’ve found it very difficult to line up the addresses of entry and exit points in different banks. There’s probably some way to abstract this into a general jump table in each bank, but I haven’t found it yet. Adding a new “far call” CPU instruction might help, but I’m very reluctant to embed knowledge of the bank register in the CPU itself, since at the moment it’s just a memory-mapped port handled by Tiny Device.
Given time, the bank-switching procedure may come to seem more intuitive and less onerous, but I’m skeptical. Unfortunately, 1K is small enough that programs need to deal with bank switching a lot. It’s even more common than it first appears: the upper half of the CPU address space is always mapped to a fixed block of RAM, so programs running from ROM really only have 512 bytes of space to work with before they need to worry about switching banks.
Ideally I’d like to increase the address space to something larger, but that would force major changes all over. The 16-bit instruction encoding uses 6 bits for opcode and 10 bits for address, so a larger address space would mean larger instructions. The assembler would need to be substantially altered. And of course the Verilog source for Tiny CPU and Tiny Device would need major alterations as well. My enthusiasm for such a large refactoring right now is pretty low.
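For reference, splitting an instruction word into those fields is trivial in Verilog; the bit ordering shown here is just an assumption for illustration.

// Illustrative only: assumed field layout with the opcode in the high bits.
module instr_fields (
    input  [15:0] instr,                   // 16-bit instruction word
    output [5:0]  opcode,                  // 6-bit opcode
    output [9:0]  addr                     // 10-bit address operand
);
    assign opcode = instr[15:10];
    assign addr   = instr[9:0];
endmodule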
Maybe the best use of Tiny CPU is as a small soft-core to incorporate into larger FPGA designs, where a simple microprocessor is needed and the 1K address space limit is not a problem. It would offer an even smaller alternative to soft-cores like PicoBlaze, and be easily portable to any vendor’s FPGA hardware. In this scenario, Tiny CPU would be used alone without Tiny Device, and the RAM and ROM would likely be FPGA logic resources rather than actual external components.
I leave tomorrow for a 10-day trip, so I’ll think it over while I’m away and decide how to proceed with Tiny CPU development when I return.
3 Comments so far
This sounds like a case for thunks! 16-bit Windows used them to manage its paging, and, in a slightly different form, so did early Macs.
The idea is to put a branch table in non-paged space. In this system, that would have to be RAM. Each entry in the table should have the minimum instructions needed to change the ROM page and call a subroutine in that page, then flip the ROM page back and return to its original caller. You can load the table from a ROM copy at startup.
If you don’t want to spend the RAM for one thunk per subroutine, how about a trampoline? Put a little bit of code in your scratchpad RAM that stores A in the paging register, then does a RETURN. Your code can then make far calls by pushing the destination on the stack, loading the page into A, and jumping to the trampoline code. It would cost just two instructions in your valuable non-paged RAM. A trampoline for far CALLs is a little more complicated, because of the need to return through the trampoline again to swap the page back, but it is doable.
Don’t give up! This is a neat project. Thanks for sharing it!
Thanks, those are good ideas! I’ll definitely give that a try. Another thought that occurred to me is to add assembler pseudo-instructions for far calls and returns, which would look mostly like a regular call instruction but actually assemble into a series of instructions like you described.
The slow SPI speed can be improved by making an SPI interface in “hardware” (in Verilog) inside Tiny Device, instead of relying on Tiny CPU to twiddle the serial clock and data pins through software. That would probably speed up SPI about 10x. And if I get really desperate, I could put an external oscillator on a little board connected to the expansion header, and use that instead of the internal 2.6 MHz oscillator.
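A rough sketch of what that hardware SPI shifter might look like is below; the register names, handshaking, and clock ratio are placeholders I made up for illustration, not code from Tiny Device.

// Hypothetical SPI shifter that Tiny Device could expose as a memory-mapped port.
// SCK runs at half the system clock; data is shifted out MSB first (SPI mode 0).
module spi_master (
    input        clk,                      // Tiny Device system clock
    input        start,                    // pulse high to begin shifting tx_byte
    input  [7:0] tx_byte,                  // byte written by Tiny CPU
    output reg   busy = 1'b0,              // high while a transfer is in progress
    output reg   spi_sck = 1'b0,
    output       spi_mosi
);
    reg [7:0] shifter = 8'd0;
    reg [3:0] bit_cnt = 4'd0;

    assign spi_mosi = shifter[7];          // present the MSB on MOSI

    always @(posedge clk) begin
        if (start && !busy) begin
            shifter <= tx_byte;            // load the byte to transmit
            bit_cnt <= 4'd8;
            busy    <= 1'b1;
        end else if (busy) begin
            spi_sck <= ~spi_sck;           // toggle SCK each system clock
            if (spi_sck) begin             // on the falling edge, advance one bit
                shifter <= {shifter[6:0], 1'b0};
                bit_cnt <= bit_cnt - 4'd1;
                if (bit_cnt == 4'd1)
                    busy <= 1'b0;          // all 8 bits sent
            end
        end
    end
endmodule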
Actually, the 1K address space is not a bad limit for the sorts of things that you NEED a tiny CPU for. Just for fun, I worked out on paper a method of encoding instructions into groups of 3-bit pieces that allows for an extremely simple state machine that still has some efficiency. Compared to the ‘minimal’ one that Stephen Wolfram talks about, it’s much more practical for small CPUs. Small in this case meaning the number of gates needed, not the physical size of the core! 😉
The following is vaguely what I remember and might not be exactly what I worked out. The first bit of three indicates whether it’s a normal instruction or a management instruction. Management instructions let you define or run special opcodes at the microcode level. These codes are by definition not simple to cache, but this isn’t designed to be cacheable anyway. The 4 normal instructions are stuff like NAND/XOR/Push/Pull. The memory was actually not 1D like a tape machine, but more like 2 separate 1D tapes (code + data) plus registers. A third tape could hold the microcode for the precoded opcodes, to save a little effort while programming. Remember, I wasn’t exactly going for speed, nor was I going for the record of the simplest ‘Turing-complete’ CPU. This means that very little space would be needed in theory for such a device, but the assembler would at least make sense. Also, it didn’t use absolute addresses, so in theory it could handle huge amounts of memory. The address bus was abstracted away in order to reduce the size of the actual CPU. In practice, this architecture is essentially Harvard-style?
What’s funny is that if you design something like Tiny CPU correctly, you can put an array of them inside the FPGA and use a boundary system to exchange data between them. Whether this is an efficient multiprocessor is another matter entirely. (If it were that simple, it wouldn’t have taken so long to get stuff like PhysX/CUDA cards!)