Son of 3D Graphics Thingy – A Homebrew Graphics Processor
An old project of mine may return from the dead. A few years ago, I started an ambitious project to build a 3D graphics processor using an FPGA. The goal was to create a simple version of the GPU you might find in your PC or video game console – something able to work in tandem with a standard CPU, handling the 3D transform math and pixel rasterization needed to draw pictures of spinning cubes and teapots and things. At the time I was doing a lot of 3D software programming for my job as a game developer, so the idea of building 3D hardware was exciting.
Unfortunately, 3DGT quickly turned into How to Shrink Memory Bandwidth Requirements, How to Build a High Performance DDR2 Memory Controller, and then How to Debug the Xilinx Development Tools, none of which were any fun. I eventually gave up, without ever getting to any of the interesting graphics-related stuff.
Yesterday I happened to re-read my summary of the project, which concluded “Lesson learned: start with a small project and add to it incrementally, instead of beginning with grandiose plans.” That started me thinking about what kind of small and simple graphics system I could build quickly, in order to get something working I could iterate on and improve. Almost all my difficulties with 3DGT were related to DDR2 RAM and the RAM bandwidth requirements, so if I could avoid those problems, I’d be in good shape. The solution seemed simple: use standard SRAM, and shrink the frame buffer size and color bits per pixel, until the memory bandwidth requirements are reduced to an acceptable level.
Specs
For this “Son of 3D Graphics Thingy”, I’m envisioning something like this:
- 512KB of SRAM used for video memory, with a 16-bit wide memory interface
- 20 MHz system clock rate
- 8 bits per pixel, indexed color
- One 640 x 480 frame buffer, or two 546 x 480 frame buffers with double-buffering
- VGA output
- No depth buffer (Z buffer)
- Rasterization only; no 3D transform math
This is a much more modest goal than the original 3D Graphics Thingy. Without a depth buffer or 3D transform support, it’s really more of a 2D triangle rasterizer coprocessor than a 3D GPU. The CPU will be responsible for doing the 3D matrix transformations in software, and drawing objects in back-to-front order to ensure proper depth sorting. It won’t compete with the GeForce, but if I recall correctly, it’s very similar to how the original 1995 PlayStation worked.
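As a rough sketch of what that CPU-side split implies (placeholder Python, with my own names rather than any real API), the draw loop amounts to the painter’s algorithm: sort the triangles far-to-near, then hand each one to the rasterizer.

```python
# Hypothetical CPU-side draw loop: with no depth buffer, visibility is
# resolved purely by draw order (the painter's algorithm).

def average_depth(tri):
    # Mean camera-space z of the triangle's vertices; +z points away
    # from the camera in this sketch.
    return sum(z for (x, y, z) in tri) / 3.0

def draw_scene(triangles, rasterize):
    # Sort farthest-first so nearer triangles overwrite farther ones.
    # Average-z sorting is the classic cheap approximation; it can
    # mis-order interpenetrating triangles.
    for tri in sorted(triangles, key=average_depth, reverse=True):
        rasterize(tri)  # hand the triangle to the graphics hardware

# Two overlapping triangles as (x, y, z) vertex lists:
far_tri = [(0, 0, 9.0), (5, 0, 9.0), (0, 5, 9.0)]
near_tri = [(1, 1, 2.0), (6, 1, 2.0), (1, 6, 2.0)]
draw_scene([far_tri, near_tri], rasterize=lambda t: print("draw", t))
```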
A 16-bit memory interface running at 20 MHz has a max theoretical throughput of 40 MB/s. So what can we do with that? Let’s assume each pixel is cleared to black at the start of each video frame. Then the pixel is written to four times, by four overlapping triangles (the scene’s depth complexity is 4). Finally the pixel is read by the VGA circuit, to generate the video signal output. That’s 6 memory operations per pixel, at 8 bits per pixel (one byte), so 6 bytes per pixel per frame.
Assuming a 640 x 480 frame buffer, each frame will involve 640 x 480 x 6 = 1.84 MB of memory I/O. Dividing that into the 40 MB/s of available video memory bandwidth results in a top speed of 40 / 1.84 = 21.7 frames per second. With only a single buffer, you’ll see screen tearing while objects are animating, which isn’t ideal, but it works.
Plan B is to use two 546 x 480 buffers, and draw objects into one buffer while the VGA circuit generates a video signal from the other buffer. This rather strange frame buffer size was chosen because two buffers fit almost exactly into 512 KB. Probably the VGA circuit will add black bars on the left and right of the image, pillar-boxing the 546 x 480 image inside a standard 640 x 480 video signal. With a 546 x 480 frame buffer, each frame will involve 546 x 480 x 6 = 1.57 MB of memory I/O, resulting in a top speed of 40 / 1.57 = 25.4 frames per second. 25.4 FPS isn’t exactly speedy, but it’s fast enough to draw animated objects. And thanks to double-buffering, you won’t see any screen tearing.
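For anyone who wants to replay these numbers, here’s the same back-of-the-envelope math as a small Python sketch; the constants simply restate the spec list above, and “MB” means 10^6 bytes throughout.

```python
# Back-of-the-envelope frame rate check for both frame buffer options.

BANDWIDTH = 2 * 20_000_000    # 16-bit bus at 20 MHz = 40 MB/s
OPS_PER_PIXEL = 1 + 4 + 1     # clear + 4 overdraws + VGA scanout, 1 byte each
SRAM_BYTES = 512 * 1024       # 512 KB of video memory

for width, height, buffers in [(640, 480, 1), (546, 480, 2)]:
    frame_bytes = width * height * OPS_PER_PIXEL
    print(f"{width}x{height}, {buffers} buffer(s): "
          f"{frame_bytes / 1e6:.2f} MB of I/O per frame, "
          f"{BANDWIDTH / frame_bytes:.1f} fps max, "
          f"using {buffers * width * height} of {SRAM_BYTES} bytes of SRAM")
```

This prints 21.7 fps for the single 640 x 480 buffer and 25.4 fps for the double-buffered 546 x 480 setup, with the two 546 x 480 buffers occupying 524,160 of the 524,288 available bytes.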
Building It
Now I need to design a board with a CPU, an FPGA, and some SRAM, right? Actually, no I don’t. The Altera DE1 board that I used for Plus Too already has everything I need. Its FPGA is large enough to implement both a soft-core CPU and the custom graphics core, and it’s got 512 KB of SRAM with a 16-bit wide interface. The SRAM has a 10 ns access time, so better performance than I described above is possible if I can boost the clock speed above 20 MHz. And the board also has 8 MB of SDRAM, if I ever get brave enough to make another attempt at writing a memory controller. It looks like other people already have working examples of SDRAM controllers for the DE1, so maybe it wouldn’t be that bad.
So that’s the plan. I’m not expecting to start building this tomorrow – I still have my Nibbler CPU project to finish, and other projects I’d like to pursue – but it’s an interesting idea. My problem is too many ideas, too little time!
6 Comments so far
Hi,
Check out some of the full system comparisons of 1990s SGI machines over at http://www.futuretech.blinkenlights.nl/sgi.html
Some of the machines with less capable graphics end up outperforming the machines with more capable graphics for specific tasks, or when a CPU upgrade becomes available. The SGI machines are interesting as they were built as fully integrated systems, so all the components are well balanced with respect to each other.
My point is that you probably want to take a similar approach: what are you going to use the CPU part of the system for? Work that out, then use all its remaining capacity to process graphics. Then design hardware which accelerates the rest of the workload. That way you only have to build hardware for the bits that the CPU is particularly bad at.
When you’ve done that small project, a bottleneck analysis will suggest future work. 🙂
The Nintendo 64 might be a particularly good study as it used a lot of technology from either the SGI Indy or O2 (I don’t recall which).
Good luck! 🙂
Thanks for the link, Andy! I plan to use the CPU both for simulation (what objects are in the scene, and where they are) and for transformation/lighting (all the per-vertex operations). The graphics processor will be used for filling the pixels in the interior of triangles, interpolating the vertex colors, and someday also for depth buffering, alpha blending, fogging, etc.
Like those SGI machines you mentioned, I might find it’s actually faster to do the whole thing in software on a fast CPU, than with a CPU/GPU combination using a slower CPU. But my interest is specifically in building the graphics processor, and a pure software CPU solution doesn’t interest me as much.
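To make that concrete, the per-pixel job being moved into hardware looks roughly like this toy Python model (purely illustrative and entirely my own sketch, with colors reduced to single intensity values):

```python
# Toy model of the rasterizer's job: fill a 2D triangle, interpolating a
# per-vertex color across the interior using barycentric weights.

def edge(ax, ay, bx, by, px, py):
    # Twice the signed area of triangle (a, b, p); its sign tells
    # which side of edge a->b the point p falls on.
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)

def fill_triangle(fb, width, v0, v1, v2):
    # Each vertex is ((x, y), color); fb is a flat row-major pixel list.
    (x0, y0), c0 = v0
    (x1, y1), c1 = v1
    (x2, y2), c2 = v2
    area = edge(x0, y0, x1, y1, x2, y2)
    if area == 0:
        return  # degenerate triangle, nothing to draw
    # Walk the bounding box and test every pixel inside it.
    for y in range(min(y0, y1, y2), max(y0, y1, y2) + 1):
        for x in range(min(x0, x1, x2), max(x0, x1, x2) + 1):
            w0 = edge(x1, y1, x2, y2, x, y)
            w1 = edge(x2, y2, x0, y0, x, y)
            w2 = edge(x0, y0, x1, y1, x, y)
            # Inside when all three weights share the winding's sign.
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or \
               (w0 <= 0 and w1 <= 0 and w2 <= 0):
                # Barycentric blend of the three vertex colors.
                fb[y * width + x] = (w0 * c0 + w1 * c1 + w2 * c2) // area

# Example: one shaded triangle in a 16 x 8 frame buffer.
W, H = 16, 8
fb = [0] * (W * H)
fill_triangle(fb, W, ((1, 1), 9), ((14, 2), 1), ((4, 7), 5))
for row in range(H):
    print("".join(str(fb[row * W + col]) for col in range(W)))
```

A real hardware version would step the edge functions incrementally instead of recomputing them per pixel, but the interpolation idea is the same.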
Hello! Have you made any progress on this? I’m making my own game console from scratch, and I want to make a GPU on an FPGA as well, but with performance similar to the one in the GameCube. I have a sort-of idea where to start, but I would like a few pointers. You can find my project here: https://hackaday.io/project/9180-the-dingo-console
Cool, that looks very ambitious, with multi-core programmable shaders. There’s a series of older posts on 3D Graphics Thingy that you can read here, which may be useful. I would especially recommend doing some “back of the envelope” memory bandwidth calculations for your design, and making sure it looks doable. I was surprised at the tremendous bandwidth needed for even a simple graphics engine.
http://www.bigmessowires.com/2009/06/20/memory-bandwidth/
http://www.bigmessowires.com/2009/06/27/more-on-memory/
Thanks! I’ve actually spent several days calculating memory bandwidth, and I decided I’m going to ditch the modular setup and make the RAM soldered on: 8 DDR3-800 chips, each with a 16-bit data bus, totalling a theoretical bandwidth of about 12.8 GB/s. That should be more than enough for a mid-range GPU and CPU.
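The arithmetic behind that figure, assuming DDR3-800 means 800 megatransfers per second on each pin, checks out:

```python
# Sanity check of the 12.8 GB/s theoretical peak.
chips, bits_per_chip = 8, 16
transfers_per_sec = 800e6                    # DDR3-800 = 800 MT/s per pin
bus_bytes = chips * bits_per_chip // 8       # 128-bit bus = 16 bytes/transfer
print(bus_bytes * transfers_per_sec / 1e9)   # -> 12.8 (GB/s)
```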
I read the links, and I gotta say, I still have much to learn about 3D graphics. These optimizations will help out a lot. Thanks for the tips!
This explanation is really out of the box. Loved it.