Memory Bandwidth
I did some preliminary memory bandwidth calculations for 3D Graphics Thingy, based upon the discussion in the comments of the previous post, and the numbers aren’t encouraging. Even for the simplest possible case, I don’t think there will be enough bandwidth to do what I’m imagining, let alone any more complex cases involving more interesting rendering effects.
For every pixel, every frame, this is the minimum that must be done:
- Clear the z-buffer, at the start of the frame
- Clear the frame buffer, at the start of the frame
- Read the z-buffer, when a new pixel is being drawn
- Write the z-buffer, if the Z test passes
- Write the frame buffer, if the Z test passes
- Read the frame buffer, when the display circuit paints the screen
That’s 6 memory operations, per pixel, per frame. Assuming a pixel is 3 bytes (one byte each for red, green, and blue), a z-buffer entry is also 3 bytes, the frame buffer is 640 x 480, and the refresh rate is 60 Hz, then that’s:
6 * 3 * 640 * 480 * 60 = 316MB/sec
If the DRAM datapath is 16 bits wide, as is common, then that’s 158 million memory transactions per second, so the DRAM must run at 158MHz. That’s perhaps within the realm of possibility, but only barely.
A more realistic estimate would involve steps 3, 4, and 5 happening many times, as fragments of different triangles overlap the same pixel. Performing alpha blending would add an additional “read the frame buffer” step for each pixel. And drawing textured triangles rather than flat-shaded ones would involve one or more additional texture memory reads for each pixel. A more realistic memory bandwidth estimate for a scene with an average depth complexity of 4, with alpha blending and texturing, is probably about 1.4 GB/second, requiring a memory speed of 700MHz. That’s definitely out of reach.
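To keep the arithmetic straight, here’s the estimate as a small Python sketch. The “realistic” operation mix below is my own guess at numbers that land near the 1.4 GB/sec figure, not an exact accounting:

```python
# Back-of-the-envelope frame buffer bandwidth, per second of video.

def traffic(width, height, hz, bytes_per_entry, ops_per_pixel):
    """Total bytes/sec for ops_per_pixel memory operations per pixel."""
    return width * height * hz * bytes_per_entry * ops_per_pixel

# Minimal case: the 6 operations listed above (2 clears, z read,
# z write, color write, scan-out read), 3 bytes each, 640x480 @ 60Hz.
simple = traffic(640, 480, 60, 3, 6)
print(f"{simple / 2**20:.0f} MB/sec")        # 316 MB/sec

# One guessed "realistic" mix: 3 fixed ops (clears plus scan-out),
# plus depth complexity 4, where each fragment does a z read, z write,
# color read (alpha blend), color write, and 2 texture fetches.
realistic = traffic(640, 480, 60, 3, 3 + 4 * 6)
print(f"{realistic / 2**30:.2f} GB/sec")     # 1.39 GB/sec
```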
There are certainly some tricks I could use to improve things, starting with using several DRAMs in parallel. Unfortunately each new DRAM requires about 40 FPGA pins to interface with it, and since I need to limit myself to low pin-count FPGAs that can be hand-soldered, realistically I probably can’t do more than two DRAMs in parallel. Using a smaller frame buffer or fewer bits per pixel would also help, but that’s trading away image quality, which I’d like to avoid.
Caching seems like it should play a role here, but I’m not sure exactly how. If there are many shader units operating in parallel, they all must keep their caches in sync somehow, or share a single cache. And even if there’s only one shader unit, so cache coherency isn’t an issue, it’s not obvious to me that a traditional cache would actually speed things up. A pixel shader won’t spend lots of time manipulating the same few bytes of memory over and over, the way a CPU does when executing a small loop. Instead, it traverses the interior of a triangle, visiting each pixel exactly once. The next triangle it processes is unlikely to have any overlap with the previous one. Given these patterns, a cache probably won’t help.
14 Comments so far
Your current board only offers you 4 bits per channel so you are down to 1.5 bytes per pixel.
Many things still look passably good at a 30 Hz frame rate. You’ve got RAM enough to double buffer.
The caches help you do burst reads and writes (there’s spatial locality) even if the hit rate might be low (low temporal locality). And you don’t have to hit the same pixels to hit in the cache, you just have to draw a triangle sufficiently nearby. I don’t know if you intend to sort the triangles before drawing them but that might be an option. Perhaps also clip/split them along tile boundaries?
In any case, Doom works great on my 4MB 40 MHz 386DX. I remember measuring its framebuffer write bandwidth at about 40 MB/s. Admittedly, Doom only used 320×200 with a single byte per pixel :)
By the way, DDR2 RAM has the option of having several active banks at a time (access to an active bank is faster) and the activation seems to take a little while. That means that reads and writes can be sort of overlapped and pipelined. Having a few more shader contexts available than the FPGA can actually run might be beneficial. At least that’s what /I/ think. I never actually used any kind of DRAM at that level of detail :)
Haven’t gone through all the stuff in my head yet… but I feel like we’re missing something…
Anyhow, the board you’re using is capable of reaching DDR400 performance with that RAM. It requires a really good memory controller, but it’s possible. That’s 400Mbps per I/O, running at a 200MHz memory clock, and of course, being DDR, transferring on both clock edges.
This may help some.
http://www.xilinx.com/support/documentation/application_notes/xapp458.pdf
Hmm, well, I looked up the specs on the GeForce 2 and 3, because I was sure they had just started using similar RAM to what’s on your board. However, I’m starting to get that sinking feeling too. They used similar-speed RAM, but with 64-bit and 128-bit interfaces, which I assume means parallel DDR chips.
This is about the time when someone should chime in with some good news. >.>
The Spartan 3A board clocks its DDR2 SDRAM at 133MHz by default (DDR2-266), but as you said, it’s theoretically capable of reaching DDR2-400 speeds. It’s a 16-bit wide interface, so every transfer is two bytes, and it’s double-pumped, so a transfer happens on both the rising and falling edge of the bus clock. (The internal memory array actually runs at half the bus clock, which is the main difference between DDR and DDR2, but that doesn’t add any bandwidth.) Add it all up, and you’ve got 133M * 2 * 2 = 532MB/sec of theoretical memory bandwidth. I don’t think you’ll ever reach that in practice, because some cycles are wasted transferring addresses and waiting for results, but maybe you could get half that in real-world use, or about 266MB/sec. If you store data in 12 bits per pixel instead of 24 as Peter suggested (I’d originally planned to build a full 8-8-8 DAC), that would cover my first usage scenario, though my second, more demanding one would still be out of reach. If you’re building your own board, you could add a second one in parallel and close most of the gap.
What scares me after reading the DDR2 specs is the prospect of eventually building my own board using DDR2 memory. It only comes in BGA parts, and the electrical requirements are pretty exacting. I think it’s likely outside the realm of what I could hope to do on my own, by hand.
Instead, I’m looking at plain SDRAM, which I guess you could call SDR. It’s not double-pumped, so it gets only half the bandwidth of DDR2 at the same bus clock rate. Yet it looks a lot easier to use, runs at 3.3V, comes in a hand-solderable TSSOP package, and generally seems like something I might actually be able to do.
I’m strongly considering reselling my Spartan 3A kit, and buying a $150 Altera Cyclone II starter kit instead. It just happens to have SDR SDRAM, and also has a bit of plain SRAM. Since the Altera tools generally seem more comprehensible anyway, that may be a good move. Plus the Cyclone II and Cyclone III families have more options in TQFP144 packages than the Spartan series, which is about the most I think I could hope to solder by hand.
The Altera kit’s SDRAM also has a 16-bit interface, and assuming it’s clocked at 133MHz, that would give me 266MB/sec of memory bandwidth. I’d probably need to drop to 12 bits per pixel AND lower the resolution to 320×240 in order to make that work, which is a shame, but I have more confidence I could actually get SOMETHING to work this way. One thing that BMOW taught me over and over is that having something simple that works is much better than something fancy and complex that doesn’t.
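To convince myself the reduced mode actually fits in that budget, a quick sketch. The 27-operations-per-pixel figure is my own guess at a scene with depth complexity 4 plus blending and texturing, so treat these as rough numbers:

```python
# Peak SDR bandwidth: 16-bit bus, one transfer per 133MHz clock.
sdr_peak = 133_000_000 * 2          # 266MB/sec (decimal megabytes)

# Traffic at 320x240 @ 60Hz, 12 bits (1.5 bytes) per pixel.
# 6 ops/pixel is the minimal case; 27 is a guessed heavy-scene mix
# (clears and scan-out, plus 6 ops per fragment at depth complexity 4).
def traffic(ops_per_pixel):
    return 320 * 240 * 60 * 1.5 * ops_per_pixel

print(traffic(6) / 1e6)    # ~41 MB/sec: easily within budget
print(traffic(27) / 1e6)   # ~187 MB/sec: still under the 266 peak
```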
BGA soldering can apparently be done by hobbyists (I have never tried anything like it myself, though):
http://newsgroups.derkeiler.com/Archive/Comp/comp.arch.fpga/2006-03/msg00726.html
http://hubpages.com/hub/BGA-Ball-Grid-Array-Repairing-and-Soldering-BGA
http://www.sparkfun.com/commerce/tutorial_info.php?tutorials_id=59
Take a look at the $30 skillet they use for hot plate reflow at the bottom of the last link.
There’s of course also the option of using a PCxxxx connector and a DIMM. It’s up to you how many bit lanes you actually want to use on such a beast. You may have an old PC lying around which you can cannibalize; use a hot air gun for desoldering.
One thing that BMOW taught me over and over is that having something simple that works is much better than something fancy and complex that doesn’t.
:)
As I see it, you already have a debugged, mounted, and soldered board with VGA connector, D/A converters, FPGA, RAM, clock, etc…
Good idea about using a DIMM and just ignoring some of the data bits if I run out of FPGA I/Os. I thought of something similar after posting my earlier comment. Looks like there’s a fixed overhead of about 30 pins for control and address, and then it’s just as many data bits as you want.
Yes, the starter kit is already a debugged, soldered, etc board, but is not the final goal. I definitely want to make a custom-built part, so I can gain some experience designing a PCB, and make something that has exactly the parts I want and nothing I don’t. I also want to add a CPU (real, not soft) to the system. Putting all this together in a custom PCB that I build myself is a big part of the goal, aside from the logic implemented within the FPGA.
I had an idea based on Peter’s first comment (sorting). In the scene to be rendered, all the surfaces break down into triangles or maybe quadrilaterals. After doing the math for the projective geometry, we know the 2-D coordinates on the screen for these primitives. Now instead of immediately filling these primitives with a color or texture (using steps 3-5), we break them up into a series of 1-D objects parallel to the video scanlines and store them in a data structure indexed by scanline.
With this setup, you could process steps 1-5 for each scanline, handling only the 1-D objects on the current scanline. The big benefit is that the z-buffer and the pixel buffer need only be large enough to hold a single scanline’s worth of pixels, so they would probably fit in the FPGA’s internal RAM. Thus your data bus could be much wider, and your clock rate much faster, than with the external RAM. Once the entire scanline has been processed, you could copy the pixel data to the external RAM for use as a more permanent frame buffer (this would still be needed in case the scene complexity grew too much to process everything in the time of a single video frame).
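A sketch of the idea in Python (names and structure are my own invention, and spans here carry a single constant z for simplicity, where a real rasterizer would interpolate z across each span):

```python
# After projection, each triangle is sliced into horizontal spans and
# filed under its scanline.  Rendering then z-buffers one scanline at
# a time, so the z and color buffers are only one line wide.

WIDTH, HEIGHT = 640, 480

def render(spans_by_line):
    """spans_by_line: one list per scanline of (x0, x1, z, color) spans."""
    frame = []
    for y in range(HEIGHT):
        zline = [float("inf")] * WIDTH   # one line of z: fits on-chip
        cline = [0] * WIDTH              # one line of color
        for x0, x1, z, color in spans_by_line[y]:
            for x in range(x0, x1):
                if z < zline[x]:         # z-test against the line buffer
                    zline[x] = z
                    cline[x] = color
        frame.append(cline)              # copy finished line to ext. RAM
    return frame
```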
The number of objects for a scene will blow up under such a scheme. You can of course iterate through the triangle geometry more than once per frame, or only “explode” a small fraction of the triangles at a time (after sorting them), or any combination, but you might not want the complexity. If you want the code that implements it to be really fast, it gets more complicated than it looks. Quake actually did something like this for clipping parts of the frame. Granted, Quake’s scheme was more complicated in that it didn’t use a Z-buffer for this part of the scene (other parts did use Z-buffering), so it had to support full 3D clipping. Michael Abrash wrote about it in his series on Quake in DDJ back in the nineties (collected in his Graphics Programming Black Book, which you can find on the net somewhere).
I know you said you didn’t want to go under 640×480. But why not go for a simple 3d system for a handheld? The PSP screen http://www.sparkfun.com/commerce/product_info.php?products_id=8335 is 480×272. Hell, even simple 3d on a portable anything is cool…
One thing that came to my mind is that you could use interlacing, and refresh every second row on the screen/frame. That way you could cut the memory operations by ~half.
Interesting idea… I’m not sure that would look good enough, but it would be worth a try. You’d actually still need to draw the whole screen every frame, but you could update only every other line in the frame buffer each frame.
I was reading a 3D graphics textbook the other day that had some other good suggestions for reducing memory bandwidth needs. One of the more interesting ones was to divide the screen up into many small regions, and cache the minimum and maximum Z-buffer value in each region. Then if a new pixel’s Z were greater than the region’s maximum Z, you could reject it without ever reading the actual Z from memory. And if it were less than the region’s minimum Z, you could know to draw it without a memory read either.
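Here’s a little sketch of how that per-region test might work (illustrative code of my own, not from the textbook, using the convention that smaller z means nearer):

```python
FAR = 1.0   # z-buffer clear value

class ZTile:
    """Cached z range for one small screen region (e.g. 8x8 pixels)."""
    def __init__(self):
        self.zmin = FAR   # nearest depth present in the tile
        self.zmax = FAR   # farthest depth present in the tile

def classify(tile, z):
    """Settle a fragment's z-test using only the cached range."""
    if z > tile.zmax:
        return "reject"   # behind everything here: skip the memory read
    if z < tile.zmin:
        return "accept"   # in front of everything: draw, no read needed
    return "test"         # ambiguous: do the normal per-pixel z read

def update(tile, z):
    # After an accepted draw, the tile certainly contains depth z, so
    # zmin can tighten immediately.  Shrinking zmax safely requires
    # knowing the tile is fully covered, which this sketch omits.
    tile.zmin = min(tile.zmin, z)
```

Note that right after a clear, every fragment nearer than FAR is an immediate “accept”, so the scheme pays off most on the first layer of geometry.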
Also, I realized my usage scenario is probably too pessimistic. If you draw objects in front to back order, then many pixels will fail the z-test in step 3, and can skip steps 4 and 5.
Bottom line, memory bandwidth is still a concern, but I’m more confident now that I’ll be able to find some reasonable optimizations to get acceptable performance.
I’m making slow progress towards actually using the DDR2 memory on the Spartan 3A starter kit board, using a memory interface generated by the Xilinx MIG tool. What seems painfully confusing is how to actually *use* the interface that MIG generates. It spits out a bunch of .v files, a .ucf user constraints file, and lots of scripts. The scripts run synthesis, map, and par from the command line, without going through the ISE Project Navigator.
I’m uncertain whether these scripts are optional, or whether they contain important command-line switches for the tools, such that the memory interface would be built incorrectly without them. Ideally I would just create a new ISE Project Navigator project, add the .v and .ucf files generated by MIG, and build the project, ignoring the scripts entirely.