Archive for January, 2021
FPGA Block RAM Packing
In an earlier blog post, I was lamenting how one-ninth of an FPGA block RAM was wasted when storing 8-bit ROM data, because there’s no simple way to make use of the 9th parity bit in each word of a block RAM. Horrors! To fight this injustice, I’ve developed a solution that I call packed ROM. It stores nine 8-bit bytes in eight 9-bit words of block RAM, and provides an interface to read the data as if it were an 8-bit memory with a larger depth. Using this method, I’m able to store 1152 bytes of read-only data per block RAM instead of only 1024. The solution relies on the fact that the block RAMs are dual port – you can read from two different addresses simultaneously. Compared with using the same number of block RAMs as a standard 8-bit wide ROM, this solution consumes an extra 54 LUT4s in a MachXO2-1200 FPGA – about 4 percent of the total. It increases the MachXO2-1200’s effective capacity for this type of 8-bit ROM data from 7168 to 8064 bytes.
Here’s the Verilog code, as well as a Python program that reads a plain binary file and writes a “packed” file in .mem format. The code assumes 7 block RAMs, but should be easily adaptable to other numbers.
module packedROM #(parameter NUM_BLOCK_RAMS = 7) ( input [12:0] addr, input clk, output reg [7:0] Q ); // packs 1152*NUM_BLOCK_RAMS 8-bit data bytes into 1024*NUM_BLOCK_RAMS 9-bit words // uses 54 LUT4s of the MachXO2 // may need to change addr width depending on NUM_BLOCK_RAMS. Use $clog2()? // nine bytes A-I are packed into eight 9-bit words as follows: // 0: I3 I2 I1 I0 A4 A3 A2 A1 A0 // 1: I7 I6 I5 I4 B4 B3 B2 B1 B0 // 2: F7 F6 E7 E6 C4 C3 C2 C1 C0 // 3: H7 H6 G7 G6 D4 D3 D2 D1 D0 // 4: A7 A6 A5 E5 E4 E3 E2 E1 E0 // 5: B7 B6 B5 F5 F4 F3 F2 F1 F0 // 6: C7 C6 C5 G5 G4 G3 G2 G1 G0 // 7: D7 D6 D5 H5 H4 H3 H2 H1 H0 // bytes A-H are sequental in the byte-oriented address space below addr 1024*NUM_BLOCK_RAMS // byte I is one of the "extra" bytes, in byte-oriented address space beyond addr 1024*NUM_BLOCK_RAMS reg [12:0] wordAddressA; reg [12:0] wordAddressB; wire [8:0] QA; wire [8:0] QB; // dualPortROM is a wrapper for the MachXO2 block RAMs, created by the Lattice IP Express tool. // it is actually a dual port RAM with the write input unused dualPortROM myDualPortROM( .DataInA(9'b000000000), .DataInB(9'b000000000), .AddressA(wordAddressA), .AddressB(wordAddressB), .ClockA(clk), .ClockB(clk), .ClockEnA(1'b1), .ClockEnB(1'b1), .WrA(1'b0), .WrB(1'b0), .ResetA(1'b0), .ResetB(1'b0), .QA(QA), .QB(QB) ); wire [12:0] overflowAddr = addr - (NUM_BLOCK_RAMS * 1024); always @* begin if (addr < NUM_BLOCK_RAMS * 1024) begin // packed area, bytes A-H wordAddressA <= addr; // word address for the upper bits depends on low three bits of the byte address case (addr[2:0]) 0: begin // A wordAddressB <= { addr[12:3], 3'b100 }; Q <= { QB[8:6], QA[4:0] }; end 1: begin // B wordAddressB <= { addr[12:3], 3'b101 }; Q <= { QB[8:6], QA[4:0] }; end 2: begin // C wordAddressB <= { addr[12:3], 3'b110 }; Q <= { QB[8:6], QA[4:0] }; end 3: begin // D wordAddressB <= { addr[12:3], 3'b111 }; Q <= { QB[8:6], QA[4:0] }; end 4: begin // E wordAddressB <= { addr[12:3], 3'b010 }; Q <= { QB[6:5], QA[5:0] }; end 5: begin // F wordAddressB <= { addr[12:3], 3'b010 }; Q <= { QB[8:7], QA[5:0] }; end 6: begin // G wordAddressB <= { addr[12:3], 3'b011 }; Q <= { QB[6:5], QA[5:0] }; end 7: begin // H wordAddressB <= { addr[12:3], 3'b011 }; Q <= { QB[8:7], QA[5:0] }; end endcase end else begin // overflow area, byte I // word address is byte overflow address times 8 for the lower bits, and times 8 plus 1 for the upper bits wordAddressA <= { overflowAddr[9:0], 3'b000 }; wordAddressB <= { overflowAddr[9:0], 3'b001 }; Q <= { QB[8:5], QA[8:5] }; end end endmodule import os from array import array infile = "coderom.bin" outfile = "coderom.mem" inputData = array('B') insize = os.path.getsize(infile) with open(infile, 'rb') as f: inputData.fromfile(f, insize) out = open(outfile,"w") num_block_rams = 7 outsize = 1024 * num_block_rams for x in range(0,outsize): baseAddr = x & ~7 if x & 7 == 0: out.write('{:02X}\n'.format( (((inputData[outsize+baseAddr//8])&0xF)<<5) | ((inputData[baseAddr])&0x1F))) elif x & 7 == 1: out.write('{:02X}\n'.format( (((inputData[outsize+baseAddr//8])&0xF0)<<1) | ((inputData[baseAddr+1])&0x1F))) elif x & 7 == 2: out.write('{:02X}\n'.format( (((inputData[baseAddr+5])&0xC0)<<1) | (((inputData[baseAddr+4])&0xC0)>>1) | ((inputData[baseAddr+2])&0x1F))) elif x & 7 == 3: out.write('{:02X}\n'.format( (((inputData[baseAddr+7])&0xC0)<<1) | (((inputData[baseAddr+6])&0xC0)>>1) | ((inputData[baseAddr+3])&0x1F))) elif x & 7 == 4: out.write('{:02X}\n'.format( (((inputData[baseAddr])&0xE0)<<1) | ((inputData[baseAddr+4])&0x3F))) elif x & 7 == 5: out.write('{:02X}\n'.format( (((inputData[baseAddr+1])&0xE0)<<1) | ((inputData[baseAddr+5])&0x3F))) elif x & 7 == 6: out.write('{:02X}\n'.format( (((inputData[baseAddr+2])&0xE0)<<1) | ((inputData[baseAddr+6])&0x3F))) elif x & 7 == 7: out.write('{:02X}\n'.format( (((inputData[baseAddr+3])&0xE0)<<1) | ((inputData[baseAddr+7])&0x3F)))Read 4 comments and join the conversation
When 64 Kbits Is Not 8 Kbytes
This FPGA-based disk controller project is going to need every byte of on-chip memory that I can scrounge up. The datasheet says my Lattice MachXO2-1200 has 64 Kbits of embedded RAM (EBR). See the shaded column in the table above. 64 Kbits is 8 Kbytes, and I plan to store 8 KB of 6502 program code, so that looks perfect. Except that I misinterpreted the table in two different ways.
Looking more closely at the table, there are 7 EBR blocks and each block is 9 Kbits. That’s a total of 63 Kbits, not 64. The datasheet is just wrong here, or they’re using some very liberal rounding method. I just lost 1 Kbit!
That’s not the worst of it. Later in the datasheet, it mentions that each EBR block can be configured as 8192 x 1, 4096 x 2, 2048 x 4, or 1024 x 9. Only the last of those configurations represents 9 Kbits of data. If you want to store 8-bit wide data, you have to use the 1024 x 9 configuration and throw away the extra bit. So that’s another 1 Kbit lost from each bank.
When you add everything up, if you’re storing 8-bit wide data, the MachXO2-1200 can only store 56 Kbits in EBR rather than the advertised 64 or 63 Kbits. That may sound like a small difference, but it will have a big impact on my design.
Sure I could add an external SRAM, and maybe I’ll do that eventually, but I really want to squeeze all the advertised memory space from this chip. The wasted area offends my engineering sensibilities. So I’ve been brainstorming a couple of crazy solutions, and wondering if anyone else has ever tried something similar.
I could pack the 8-bit byte data like a bitstream into consecutive 9-bit words of an EBR block, so nine bytes A through I would be stored in eight words:
EBR0[0]: B0 A7 A6 A5 A4 A3 A2 A1 A0 EBR0[1]: C1 C0 B7 B6 B5 B4 B3 B2 B1 EBR0[2]: D2 D1 D0 C7 C6 C5 C4 C3 C2 EBR0[3]: E3 E2 E1 E0 D7 D6 D5 D4 D3 EBR0[4]: F4 F3 F2 F1 F0 E7 E6 E5 E4 EBR0[5]: G5 G4 G3 G2 G1 G0 F7 F6 F5 EBR0[6]: H6 H5 H4 H3 H2 H1 H0 G7 G6 EBR0[7]: I7 I6 I5 I4 I3 I2 I1 I0 H7
To read a byte would require reading two separate words, and then doing some bit shifting and masking with the results. I’d need some kind of state machine and an appropriate clock to handle the two separate reads. And I’d need some kind of divide by nine logic (or eight-ninths?) to convert a byte-oriented address to the corresponding 9-bit word address.
A second idea is to leverage the fact that these are seven separate EBR blocks, and to read from them all in parallel, combining one or two bits from each to reassemble the byte:
EBR0[0]: I0 H0 G0 F0 E0 D0 C0 B0 A0 EBR1[0]: I1 H1 G1 F1 E1 D1 C1 B1 A1 EBR2[0]: I2 H2 G2 F2 E2 D2 C2 B2 A2 EBR3[0]: I3 H3 G3 F3 E3 D3 C3 B3 A3 EBR4[0]: I4 H4 G4 F4 E4 D4 C4 B4 A4 EBR5[0]: I5 H5 G5 F5 E5 D5 C5 B5 A5 EBR6[0]: E6 D7 D6 C7 C6 B7 B6 A7 A6
The intended advantage here was that I would only need to do one read from each EBR block, but since there are one fewer EBR blocks than bits in a byte, one of the blocks must perform double-duty and this advantage is lost. And I would still need some kind of divide by nine address logic, with different logic for some EBRs than others. I’m actually not sure whether this approach would even work.
I feel like there should be some not-too-complex scheme to store the full 63 Kbits of data in a way allowing for 8-bit byte retrieval, but I can’t quite find it.
Read 9 comments and join the conversationUDC: The Next Generation
I’m still working on development of a disk controller card for the Apple II. As part of that effort, I’m still trying to understand the design of the UDC disk controller. My hope is to combine what I learn about the UDC, the Liron disk controller, and the standard Disk II controller in order to build something with the best qualities of all three. But for now I’m deep in the weeds with the UDC, and today I was pleased to finally confirm my long-held theory about UDC support for intelligent Smartport drives!
In the Apple II world, there are three primary types of disk drives: 5.25 inch drives, dumb 3.5 inch drives, and intelligent drives using the Smartport protocol. The best-known example of an intelligent Smartport drive is the Unidisk 3.5, but others include Floppy Emu’s emulated Smartport Hard Disk.
For quite a while now, I’ve been chasing after this question of whether the UDC supports intelligent Smartport drives. The two available UDC manuals are slightly vague, but could be interpreted as saying there’s no support. Several sources on the web say Smartport drives aren’t supported, and you’ll damage your drive or UDC card if you try. But a contemporary print ad for the UDC advertises Unidisk 3.5 compatibility. I looked at ROM dumps from several different UDC versions, and found what looked like Smartport-related code in some older ones but not in newer ones. It’s all very confusing.
Fun With ROM Hacking
After a month of theory and research, I finally got to try some hands-on tests. The original UDC card is called the “long” version, and the follow-up card with a custom ASIC is the “short” version. I’ve been unable to find anyone with a long UDC (call me!), but I was able to borrow a short UDC. I spent a while mapping all the connections on the card and constructing a schematic, until I was satisfied there were no lurking dangers from merely connecting a Smartport drive to it. I connected a Floppy Emu configured for Smartport emulation mode, booted it up, and… it did nothing. Then I connected a Unidisk 3.5 and tried again, and it also did nothing. Blah.
This UDC short card came with version 4.0 of the ROM, which I’d previously analyzed and concluded didn’t contain any code for Smartport drive support. So it’s not surprising that it didn’t work, but I’d hoped maybe I’d missed something in my ROM analysis. My earlier analysis of ROM version 2.3 from the long UDC found that it did contain code for Smartport drives. I strongly suspect the short UDC is functionally identical to the long, despite its different physical appearance. So what happens if you take ROM 2.3 from a long UDC, and stick it in a short UDC card? Let’s find out!
The ROM is stored in a 27C64 EPROM and is socketed, so no modifications were necessary on the UDC card. I used a 28C64 EEPROM as a drop-in replacement, and programmed it with ROM version 2.3 using my EasyPro universal programmer. (Coincidence: this EasyPro 90B programmer was first mentioned on this blog precisely 13 years ago today. I feel very, very old.) I unsocketed the original ROM chip, popped in the new one, booted up, and… it did nothing. Worse than nothing, it actually froze up the computer. I checked the disk I/O signals with a logic analyzer, and there was no activity at all. Blah again.
Feeling disappointed, I gave up. The next morning, I noticed that the logic analyzer wasn’t actually connected. Oops.
After I’d fixed that, I could clearly see evidence of a Smartport reset and initialization sequence on the disk I/Os, but it still wasn’t working. Eventually I concluded that the code in the ROM was crashing or getting stuck in an infinite loop somewhere. But where, and why? I would have tried single-stepping through the code, but quickly discovered that’s extremely difficult for code in ROM on an Apple IIe.
By carefully comparing the ROM 4.0 and ROM 2.3 code, eventually I was able to guess the problem. A key feature of the UDC is that it sometimes pulls the 6502’s RDY input low to temporarily halt the CPU. This is a sort of flow control mechanism, and happens whenever the code tries to read a byte from the disk, but a new byte isn’t available yet. It appears that the details surrounding use of RDY changed between the long and short UDC, and the short version will halt the CPU on reads to some memory locations that the long UDC ignores. By comparing the two ROMs and making some educated guesses, I was able to modify the version 2.3 ROM so that it no longer caused the short UDC to freeze the computer. But it still didn’t work. Blah for a third time.
After more tinkering and head-scratching, I disconnected the Floppy Emu and tried the Unidisk 3.5 again. I was amazed when it booted right up! But now I had a new puzzle to solve, to explain why the Unidisk 3.5 worked but Floppy Emu’s Smartport emulation didn’t, even when using the exact same disk.
The logic analyzer eventually revealed the answer. Something like 5-10% of the data packets from the Unidisk 3.5 were NAK’d by the UDC card, forcing them to be retransmitted. Some packets had to be retransmitted multiple times. In all my time tinkering with the Smartport protocol over the years, I’ve never before seen a NAK of valid data, and Floppy Emu doesn’t even implement retransmitting a packet in case of a NAK. I implemented the missing retransmit logic, and the Floppy Emu worked! Hooray!
The mystery still wasn’t completely solved, however. What was causing some packets to be NAK’d, from both the Floppy Emu and a real Unidisk 3.5? The NAK rate was also much higher with the Floppy Emu than with the Unidisk 3.5, approaching 30-50%. Acting on a hunch, I experimented with small changes to the Floppy Emu’s bit rate, and found that it caused some changes in the NAK rate, but the results weren’t conclusive.
The “correct” bit rate is either one bit every 4.0 microseconds, or every 3.96 microseconds, depending on your reference source and your assumptions. Floppy Emu with the latest firmware does one bit every 3.95 us. I found that at 4.2 us I got a 100% NAK rate, but at 4.1 us the NAK rate suddenly dropped to about 10%, and at 4.05 us and 4.0 us the NAK rate grew worse again. That doesn’t really make sense, and I didn’t repeat the tests enough times to be very confident in those results. My suspicion is that either the UDC short card is very sensitive to small changes in bit rate, or else that substituting the long UDC ROM is negatively affecting the behavior somehow. I’m not sure it’s worth chasing this mystery further, since it’s quite possibly caused by my weird hybrid card setup.
So the $64000 question is finally answered, sort of. The standard short UDC card does not support Smartport drives, but connecting one won’t damage anything. But it does support Smartport drives when using an appropriately hacked-up ROM, proving that it’s entirely a question of software rather than hardware. As for the long UDC, I still can’t say for certain without examining one directly, but all the evidence points to it supporting Smartport drives.
Where to Next?
This probably marks the end of my very long detour into “how does the UDC work?” and a return to actual development on my disk controller card. Next step: see if I can duplicate the UDC behavior (or something close to it) with my FPGA-based approach.
Read 7 comments and join the conversationSqueezing FPGA Memory
I’m developing an Apple II disk controller that’s based on the UDC disk controller design. The original UDC card had 8K of ROM and 2K of RAM, so it needs 10K of combined memory. The FPGA device I’m using for prototyping, a Lattice MachXO2-1200, has 8K of embedded block RAM and 1.25K of distributed RAM. It also has 8K of “user flash memory”. So will the UDC design fit? It’s close, but I think the answer is no.
At first I thought I could store the ROM data in the FPGA’s UFM section, but that doesn’t look promising. I can store the data there, but compared to embedded block RAM, accessing UFM is inconvenient and probably impractical for live execution of 6502 code. Accessing the UFM requires setting up a Wishbone interface in the FPGA’s Verilog code, starting a memory transaction, and reading out an entire page of flash (16 bytes). It’s also pretty slow. I don’t think it’ll be possible to read an arbitrary byte of UFM and return it to the CPU within ~500 ns, as would be required for directly executing code from it.
OK, so no UFM. Maybe I can store the 8K of ROM data in EBR, using RAM to hold what’s technically ROM? That would work, but it would leave only 1.25K of distributed FPGA RAM remaining to implement the required 2K of RAM for the disk controller. It’s 768 bytes short. No good.
I could switch to a larger FPGA with more memory, or add a separate RAM or ROM chip. But that would increase cost and complexity, and anyway wouldn’t help with my prototype board that’s already built.
Stupid Idea #1
From my analysis of the UDC ROM, I think the upper half of the card’s RAM is only used when communicating with Smartport drives. So I might be able to reduce the RAM from 2K to 1K, and at least I’d be able to test whether 3.5 inch and 5.25 inch drive support works. Using 8K of EBR and 1K of distributed RAM, I’d have a whopping 256 bytes of RAM left. Will it work? I think distributed RAM just means using the FPGA’s logic resources as RAM, so this approach would use 80% of the FPGA’s logic resources and only leave 20% remaining for the actual card functionality, like the IWM model and other logic. It might work, it might not.
Stupid Idea #2
The 8K of ROM isn’t one large chunk. It’s divided into 1K banks that can be mapped into a single 1K region of the computer’s address space. There’s already a small code routine to facilitate the bank switching. What if I could somehow make this routine copy the desired 1K block from UFM to EBR at the moment it’s needed? Then I’d have 8K in UFM, with a 1K cache in EBR, and the 2K of RAM also in EBR.
This would definitely fit, but there would be a delay every time code execution moved to a different 1K ROM page. How long does it take to move 1024 bytes from UFM to EBR? I’m not sure, but I’ll guess it’s tens to hundreds of microseconds. Will that cause problems? Maybe. Will this approach be a pain to implement? Definitely.
Stupid Idea #3
From what I’ve observed of the ROM code, bank 1 contains 5.25 inch functions plus 3.5 inch formatting. Bank 3 is exclusively for 3.5 inch stuff, and bank 7 is exclusively for Smartport drives. Maybe I could temporarily remove some parts of the ROM, in order to make it all fit? Then I might be able to test all the different types of supported disk drives, just not all at the same time.
Stupid Idea #4
Maybe I can modify the ROM code to use 2K of the Apple II’s own RAM instead of 2K of onboard RAM? Then everything would fit in the FPGA. But there must be a good reason the UDC designers didn’t do this. What 2K region of Apple II RAM is safe to use, and wouldn’t get overwritten by running software? I’m not sure.
Stupid Idea #5
Maybe I can modify the prototype board somehow, and graft an extra RAM or ROM chip on there for testing purposes? Maybe I can add a second peripheral card and somehow use its RAM or ROM? Now these ideas are getting crazy.
What’s the Long-Term Solution?
None of these ideas except #2 are workable as a long-term solution, if I eventually move ahead with manufacturing this disk controller card. So what path makes the most sense in the long-term?
Stepping up to the MachXO2-2000 would add about $2 in parts cost, which maybe doesn’t sound like much, but it’s significant. The XO2-2000 has 9.25K EB RAM and 2K distributed RAM, so the design should fit with a small amount of room to spare. That’s surely the least-effort solution.
I could keep the MachXO2-1200 and add a separate 2K RAM chip. The 8K of ROM would fit in the 1200’s EBR. The combined cost might be slightly lower than the MachXO2-2000, but the design and layout would become more complex, and I’m not sure it’s worth it.
I could step down to the MachXO2-640 (2.25K EBR, 640 bytes distributed RAM) and add a separate ROM chip. Total cost would be slightly less than a MachXO2-1200, and I’d also gain lots of extra ROM space for implementing extra features or modes. That would be great. Like adding a separate RAM chip, the extra ROM would make the board design and layout somewhat more complex. But the biggest drawback would be for manufacturing or reprogramming, because both the FPGA and the ROM would need to be programmed separately before the card could be used. Or maybe the FPGA could program the ROM somehow, but it would still be cumbersome and far less attractive than a single-chip programming process.
I never imagined a shortage of just 768 bytes could make such a difference. What an adventure!
Read 14 comments and join the conversationBMOW Product Updates
The Mac ROM-inator II is back in stock – get yours now at the BMOW Store. The ROM-inator II is a replacement ROM SIMM for Macintosh II series computers and the Mac SE/30, adding a bootable ROM disk, 32-bit cleanliness, HD20 hard disk support, and more. Read more about it at the project’s home page.
The instruction manual for the BMOW Floppy Emu disk emulator is now available in Japanese. Thanks to Kay Koba for the translation work. Floppy Emu is a floppy and hard disk emulator for classic Apple II, Macintosh, and Lisa computers. It uses an SD memory card and custom hardware to mimic an Apple floppy disk and drive, or an Apple hard drive. The Emu behaves exactly like a real disk drive, requiring no special software or drivers.
Read 1 comment and join the conversation