BMOW title
Floppy Emu banner

Archive for December, 2007

ADC $NNNN,X

Uh oh. I was cranking through the microcode implementation for BMOW’s instruction set, making good progress. I’d finished about two-thirds of the microcode, when I came to ADC absolute, X-indexed. That’s when the wheels came off the cart.

This instruction (add with carry) is supposed to take an absolute memory address, adjust the address by adding the contents of the X register, then add the value at the effective address to the accumulator, plus whatever value was already in the carry bit of the condition code register. Unfortunately, I don’t see how I can reasonably implement it with my current hardware design.

The reason that ADC $NNNN,X is problematic is that it modifies the condition codes (to test and propagate a carry from the low byte to the high byte of the effective address when adding X), but it also depends on the current value of the condition codes (for the carry flag). The effective address computation destroys the carry bit that’s needed for the add step. SBC $NNNN,X (subtract with carry, absolute, X-indexed) has the same problem.

I see two possible ways to fix this, neither one great:

  • Store the old condition codes somewhere before computing the effective address, then restore them for the add step. That would make this instruction unreasonably slow, however. It might not even fit within the 16-clock limitation for a single instruction.
  • Add a fifth pseudo-flag to the condition code register for handling the carry propagation during effective address computation. This flag would be invisible to the programmer, and only accessible at the microcode level. But that would give me 5 bits in my 4-bit condition code register, and also require compensatory changes in the control subsystem and microcode assembler.

Poop.  I’m half hoping that if I stare at it for a while, I’ll think of a better solution.

Read 2 comments and join the conversation 

Optimization, or Distraction?

I think I may be getting too concerned about potential optimizations to the hardware design, before I’ve even built or simulated the initial design. A couple of possible optimizations occurred to me recently.

Improved CC Register: I described earlier how the condition codes are stored in a 4-bit shift register with parallel output, enabling the control circuitry to read all the flags simultaneously. Copying the condition codes to/from a register is done serially, requiring 4 clock cycles of bit shifting.

My latest improvement idea is to use a GAL to make a custom 4-bit register with two independent parallel load inputs, one connected to the ALU and one connected to the low 4 bits of the data bus. That would permit loading of all the condition codes from a stored value in a single clock cycle. The outputs of this GAL-register could also drive the data bus, either directly or through the ALU, making it possible to store all the condition codes in a single clock cycle as well.

These improvements would only benefit instructions that load/store the condition codes, though, which really only happens during interrupt processing, so maybe it’s not worth the effort. There are also a few problems regarding how to connect the GAL-register output to the data bus that I would need to resolve.

An Extra Data Register: A while ago, I considered adding a Y register to the machine, but ran into the limit of 8 possible load destinations. Now that I’ve increased the limit to 16, adding a Y register would be as simple as adding 2 more chips to store and drive the data. The required load enable and output enable lines are already there.

Unlike the A and X registers, the Y register would be connected to the right ALU input, which presents some problems. The T (temporary) register is connected to the right ALU input as well, which means it would be impossible to directly compute any functions of Y and T. For example, to add a constant value to X, the machine can load the constant into T, add X and T, and store the result where needed. But to add a constant value to Y, it would first need to copy X to T, load the constant into X, add X and Y, then restore the old value of X from T.

That would give the machine the odd property that operations involving Y are slower than those involving X and A. To be most useful, the proposed Y register really needs to be connected to the left ALU input, but that input is already “full”, and I don’t think I can relocate or remove any of the existing left inputs.

Too Many GALs? In my drive to optimize the design and reduce chip count, I’m beginning to wonder if I’m using too many GALs. I keep finding more and more places where two or more 7400-series parts could be replaced by a single GAL. For example, every combination of a ‘377 register and ‘244 output driver could be replaced by a single GAL, and the 16-bit pointer registers could be built from two GALs each, instead of four ‘569 counters. In fact, I could replace almost every chip with a GAL, except for the ALU, memory, field decoders, and a few drivers. In that case, looking at the schematic wouldn’t tell you anything at all about how the machine worked, and you’d have to read the GAL program data to understand anything.

I’m not sure I like that idea. Although it would still be a real hardware implementation, it would feel more like a simulation in many ways. It would also once again raise the question of why not just implement the bulk of the machine as a single FPGA?

For the sake of comparison, I estimate that using GALs as much as possible would reduce the component count to about 40, while not using GALs at all would increase the count to about 60.

Be the first to comment! 

Interrupts and Condition Codes

I’ve finished the design changes for interrupts. It worked out pretty much how I’d outlined earlier, using a GAL to implement the special OP register that has both a load enable and a clear input, since no such 7400-family part exists. I also added an interrupt enable bit that can be set and cleared from the microcode. When OP is about to be loaded with the next instruction, and interrupts are enabled, and an interrupt is pending, then the OP register will be synchronously cleared upon the next clock edge. This will force the machine to execute opcode 0, IRQ, which saves the processor state and jumps to an interrupt service routine.

I also made a necessary hardware change to support multitasking: making the stack pointer loadable from microcode, so it can be saved and restored across processes. The machine now has 10 possible destinations for storing data, instead of the 8 it had previously. That change required me to spend my last unused control ROM bit to make the store destination a 4-bit field rather than 3 bits. If I end up needing another control ROM bit for something else later, it’s going to be a problem. I also had to add a second 74LS138 to decode the additional store destinations, so now the component count is up to 51.

Condtion Code Optimization: I originally planned to use 8K control ROMs, with 13 address lines: 8 for the opcode, 4 for the phase, and 1 for the current condition code. That’s why the condition codes are stored in a shift register. If the desired condition code isn’t already in the least-significant bit of the shift register, then it must be right-shifted until it’s in the right spot. Since it takes one extra clock cycle for each shift of the condition codes, this design results in the somewhat bizarre property of testing the carry flag being faster than testing the negative flag, which itself is faster than the other flags.

It turns out that the most commonly-available (and cheapest) ROMs now are 64K or 128K. A 64K ROM has 16 address lines, providing enough width for the 8-bit opcode, 4-bit phase, and all 4 condition codes simultaneously. No more shifting! That simplifies the microcode, and also puts all the flags on even par with one another with regards to testing time.

Although bits are no longer shifted out of the condition codes in order to test them, the shift register is still needed for a different reason. When restoring the condition codes after an interrupt, the old values are shifted in, one at a time. Since the parallel load inputs of the register are connected directly to the ALU, there’s no other way to set the condition code values without adding extra multiplexing hardware.

Be the first to comment! 

Parts List

I made an accounting of all the components needed by my design as it stands today. It totals up to 50 components altogether, with a combined cost of $114.32.

Count Part Purpose Price Each Price Total
ALU Module
2 74LS181 core ALU functions $2.58 $5.16
1 22V10 GAL computation of Z and V condition code flags $4.49 $4.49
1 74LS244 bus driver for ALU output $0.53 $0.53
Data Registers
3 74LS377 A, X, and T registers $0.43 $1.29
7 74LS244 bus drivers for ALU inputs $0.53 $3.71
Address Registers
12 74LS569A 4-bit counter, 4 each for 16-bit PC, AR, SP $1.25 $15.00
Control System
3 29F010-70 128KByte 70ns Flash ROM for control ROMs $6.09 $18.27
1 74LS163A 4-bit phase counter $0.46 $0.46
1 22V10 GAL custom OP register $4.49 $4.49
1 74LS138 3-to-8 line mux, for load enable signals $0.52 $0.52
2 74LS139 dual 2-to-4 line mux, for ALU input and address drive enable signals $0.37 $0.74
1 74LS194A 4-bit shift register, for condition code flags $1.23 $1.23
Memory System
1 TI BQ4013MA-85 128KByte 85ns SRAM $13.29 $13.29
1 29F010-70 128KByte 70ns Flash ROM for OS $6.09 $6.09
1 Futurlec USBMOD4 USB I/O $24.90 $24.90
1 22V10 GAL address decoding, generation of enable signals $4.49 $4.49
2 7-segment LED data display $0.95 $1.90
1 74LS377 data display register $0.43 $0.43
1 8-wide DIP switch console input/debugging $1.25 $1.25
1 momentary pushbutton console input/debugging $1.18 $1.18
2 74LS244 bus drivers for switch/button $0.53 $1.06
1 74LS245 bidirectional bus driver to/from ALU data bus $0.64 $0.64
Miscellaneous
1 crystal oscillator 1.0MHz clock oscillator $1.49 $1.49
1 74LS244 clock signal buffer $0.53 $0.53
1 momentary pushbutton reset button $1.18 $1.18
Grand Total
50   $114.32
Be the first to comment! 

Multitasking

Buddy Betts had some interesting thoughts about multitasking on BMOW. That’s so far off in the future that it’s probably useless to speculate about it now, but what the heck. Running several programs at once would require:

  • A timer interrupt that fires periodically, to initiate task switching.
  • Some way to clear the timer interrupt, so exiting the interrupt service routine doesn’t immediately jump back into it.
  • An ISR (dare I call it an operating system?) that saves the processor state for one task, and restores the processor state for the next task.
  • Some dedicated place to store the processor state of tasks.
  • Relocatable code, or segmented memory, or virtual memory, so that the address spaces of programs won’t collide.

Unfortunately, restoring the complete processor state of a task is impossible with the current hardware design. If you look at the block diagram way down at the bottom of the page, you’ll see that there’s no way to load the stack pointer registers with an arbitrary value. That was intentional, because I didn’t need it before, and it meant the machine had exactly 8 possible load destinations, which could be encoded into 3 control ROM bits. Servicing an interrupt involves pushing more data onto the existing stack, then popping it off at the end of the ISR, but never setting the stack pointer to a completely different value. Task switching requires switching between totally different stacks, however. I could make the SP loadable, but then I’d have 10 load destinations (with SPLO and SPHI), requiring 4 control ROM bits (could be a problem), and a larger mux.

I’m beginning to realize that control ROM outputs are precious commodities, since new hardware capabilities tend to require new control inputs, but there’s a hard limit on the number of outputs from the control ROM (8 bits per ROM, and I have 3 ROMs).

A dedicated place for stored processor state isn’t any problem in theory, although it implies an OS-reserved area of memory, and some way of preventing programs from writing to that memory (or just trusting the programs to play nice).

Relocatable code and/or virtual memory seems like the trickiest issue. With a single program machine, the programs can contain absolute memory addresses and jump locations with no problems. But if the OS can potentially load the program anywhere into memory, the programs will need to account for that. One approach might be to have the OS loader modify the program’s code as it was loading it, adjusting any absolute memory references as needed. That could be difficult. Another possibility might be to add a base pointer register, managed by the OS, and to make memory references interpreted relative to the base pointer. That would probably involve extra hardware or extra cycles to adjust each memory reference.

It might be possible to provide a powerful-enough set of relative addressing instructions that programs wouldn’t need to use absolute addressing at all. After all, I am designing the instruction set, so I can do whatever I want. That would be easy enough for relative branches or jumps (although computing the branch destination would be slower than a branch to an absolute address), but I’m not sure how global program variables would be handled. You might still need a base pointer anyway.

A simpler approach might be to use a larger SRAM, and segment memory into a fixed number of blocks, one per process. For example, if I used a 512K SRAM, address lines 0-15 would be the address as the process sees it, and address lines 16-18 would be the 3-bit process ID. The ROM could be configured so that it was mapped into the same location in every processes’ address space. One big drawback of this approach is that each process would get a fixed size address space, regardless of how much memory it actually needed, putting a hard limit on the total number of processes. A more practical drawback is that 5V TTL-compatible SRAMs larger than 128K generally aren’t available, so the machine would be limited to a maximum of 2 or 4 processes, depending on the size of the process address space. Maybe DRAM or other memory technologies could be used instead. I need to investigate how difficult that would be.

Follow up: It seems that André Fachat has already invented a relocatable executable file format (specifically for the 6502) called .o65. See http://www.6502.org/users/andre/o65/fileformat.html for the spec. It embeds extra information in the executable file that enables the OS loader to modify the program’s code as it’s loaded, similar to how I was considering. Given that it was designed for a pre-existing processor, however, inventing a base pointer register or segmented address space out of thin air obviously wasn’t possible, and so a load-time fix-up was the only possible solution.

Be the first to comment! 

Random Hardware Musings

Here come the GALs: I’ve decided to use GALs in a few places, to make my life easier. GALs (gate array logic) are a simple kind of programmable logic device. Each output pin can be programmed to compute an arbitrary boolean function of the input pins, and can also optionally be clocked like a flip-flop. You could accomplish the same thing with an assortment of individual NAND or NOR gates and flip-flops, but using a GAL really helps reduce the number of required components. A GAL is ideal when you need to compute lots of functions like: /ADRSELECT = /BUSENABLE * CLK2 * /LOAD + /BUSENABLE * /STORE + SOMETHING * /ELSE

Initially I resisted the thought of using GALs, since incorporating programmable logic felt a bit like cheating. Once a few programmable elements are incorporated, why not go all the way and implement the entire machine as a single FPGA? But GALs are far simpler than FPGAs, and have been around since 1978. They seem appropriate for the time period of the machine I intend to build. A few GALs will also help me avoid a proliferation of 7400 Quad-NAND gates and similar chips that I’d need otherwise.

More ALU musings: I’ve decided to keep my 74LS181 ALU, rather than switch to the ‘381 as I’d been contemplating earlier. The ‘181 can do increment, decrement, and left shift of a single operand, which are all too valuable to give up. The ‘381 would have saved me the XOR gates needed to compute the overflow flag, but that seems like a small advantage compared to the loss of those handy ‘181 functions. I can easily use a GAL output to compute the overflow flag.

Interrupts: I’ve nearly finished the interrupt design, and the required hardware changes are minimal. The heart of the change is to use a register with a “clear” input as the opcode register. /LDOP and /IRQ will be fed to an AND gate, and the output will be connected to the register’s /CLR input. In this way, whenever the machine is about to load a new opcode, if the IRQ input is asserted, then it will clear the opcode register instead. Opcode 0 will be a special instruction that saves the processor state on the stack and jumps to the interrupt service routine.

That sounds easy enough, but there are a couple of minor wrinkles. First, there is no 7400-family register with load enable and clear inputs. I could use a pair of 4-bit counters instead, configured so that they never actually count, but that feels a bit ugly. Or I could implement a custom register in a GAL.

The second wrinkle is that I need a way to disable interrupts during the interrupt service routine (or any time really). My current thinking is to add a flip-flop for the interrupt enable bit, /IE. The logic for the opcode register would then be /CLR = /LDOP * /IRQ * /IE. Opcode 0 would need to clear the interrupt enable bit, and the interrupt return opcode would need to set it. Setting and clearing the bit would probably require my last unused control ROM output, unless I can think of a clever way to implement it using the existing bits. Rather than use a separate chip for the flip-flop, I’d probably hide it inside (you guessed it) a GAL.

Read 1 comment and join the conversation 

« Newer PostsOlder Posts »