FC8 – Faster 68K Decompression
Data compression is fun! I’ve written a new compression scheme that’s designed to be as fast as possible to decompress on a 68K CPU, while still maintaining a decent compression density. I’m calling it FC8, and you can get the generic C implementation and optimized 68K decompressor from the project’s Github repository. I’ve probably reinvented the wheel with this, and in a non-optimal way too, but once I started I found that I couldn’t stop. My intended platform for FC8 is 68020 and 68030-based vintage Macintosh computers, but it should be easily portable to other classic computers, microcontrollers, and similar minimal systems.
The main loop of the 68K decompressor is exactly 256 bytes, so it fits entirely within the instruction cache of the 68020/030. Decompression speed on a 68030 is about 25% as fast as an optimized memcpy of uncompressed data, which is essentially an unrolled loop of 4-byte move.l instructions with no computation involved. Compared to that, I think 25% is pretty good, but I can always hope for more. 🙂
In the previous post, I described how I was using compression to squeeze a larger rom-disk image into a custom replacement Macintosh ROM that I’m designing. I began with a compression algorithm called LZG, written by Marcus Geelnard. It worked well, but the 68K decompression seemed disappointingly slow. I tried to contact the author to discuss it, but couldn’t find any email address or other contact info, so I eventually drifted towards creating my own compression method loosely based on LZG. This became FC8. On a 68030 CPU, FC8 compresses data equally as tightly as LZG and decompresses 1.5x to 2x faster. FC8 retains much of the compression acceleration code from LZG, as well as the idea of quantizing lengths, but the encoding and decompressor are new.
The algorithm is based on the classic LZ77 compression scheme, with a 128K sliding history window and with duplicated data replaced by (distance,length) backref markers pointing to previous instances of the same data. No extra RAM is required during decompression, aside from the input and output buffers. The compressed data is a series of tokens in this format:
- LIT = 00aaaaaa = next aaaaaa+1 bytes are literals
- BR0 = 01baaaaa = backref to offset aaaaa, length b+3
- EOF = 01×00000 = end of file
- BR1 = 10bbbaaa’aaaaaaaa = backref to offset aaa’aaaaaaaa, length bbb+3
- BR2 = 11bbbbba’aaaaaaaa’aaaaaaaa = backref to offset a’aaaaaaaa’aaaaaaaa, length lookup_table[bbbbb]
The encoding may look slightly strange, such as only a single bit for the backref length in BR0, but this produced the best results in my testing with sample data. The length lookup table enables encoding of backrefs up to 256 bytes in length using only 5 bits, though some longer lengths can’t be encoded directly. These are encoded as two successive backrefs, each with a smaller length.
The biggest conceptual changes vs LZG were the introductions of the LIT and EOF tokens. EOF eliminates the need to check the input pointer after decoding each token to determine if decompression is complete, and speeds things up slightly. LIT enables a whole block of literals to be quickly copied to the output buffer, instead of checking each one to see if it’s a backref token. This speeds things up substantially, but also swells the size of the data. In the worst case, a single literal would encode as 1 byte in LZG but 2 bytes in FC8, making it twice as expensive! All the other changes were needed to cancel out the compression bloat introduced by the LIT token, with the end result that FC8 compresses equally as compactly as LZG. Both compressed my sample data to about 63% of original size.
The 68K decompressor code can be viewed here.
Decompression on the Fly
Several people mentioned the possibility of on-the-fly decompression, since the intended use is a compressed disk image. That’s something I plan to explore, but it’s not as simple as it might seem at first. Disk sectors are 512 bytes, but there’s no way to decompress a specific 512 byte range from the compressed data, since the whole compression scheme depends on having 128K of prior data to draw on for backref matches. You could compress the entire disk image as a series of separate 512 byte blocks, but then the compression density would go to hell. A better solution would compress the entire disk image as a series of larger blocks, maybe 128K or a little smaller, and then design a caching scheme to keep track of whether the block containing a particular sector were already decompressed and available. This would still have a negative impact on the compression density, and it would make disk I/O slower, but would probably still be OK.
Ultimately I think the two decompression approaches each have strengths and weaknesses, so the best choice depends on the requirements.
Boot-Time Decompression:
Pros: Best compression density, fastest I/O speeds once the disk image is decompressed
Cons: 5-10 second wait for decompression at boot time, requires enough RAM to hold the entire disk image
On-the-Fly Decompression:
Pros: No wait at boot time, required amount of RAM is configurable (size of the decompressed block cache)
Cons: Worse compression density, slower I/O speeds, more complex implementation
Hardware Tests
I discovered that a Macintosh IIci in 8-bit color mode decompresses about 20% slower than in 1-bit color mode. But a IIsi decompresses at the same speed regardless of the color settings. Both machines are using the built-in graphics hardware, which steals some memory cycles from the CPU in order to refresh the display. I’m not sure why only the IIci showed a dependence on the color depth. Both machines should be faster when using a discrete graphics card, though I didn’t test this.
The original LZG compression showed a much bigger speed difference between the IIci and IIsi, closer to a 50% difference, which I assumed was due to the 32K cache card in the IIci as well as its faster CPU. It’s not clear why the discrepancy is smaller with FC8, or whether it means the IIci has gotten worse or the IIsi has gotten better, relatively speaking. Compared to the same machine with the LZG compression, FC8 is 1.57x faster on the IIci and 1.99x faster on the IIsi. Based on tests under emulation with MESS, I was expecting a 1.78x speedup.
Tradeoffs
While working on this, I discovered many places where compression compactness could be traded for decompression speed. My first attempt at FC8 had a minimum match size of 2 bytes instead of 3, which compressed about 0.7% smaller but was 13% slower to decompress due to the larger number of backrefs. At the other extreme, the introduction of a LIT token without any other changes resulted in the fastest decompression speed of all, about 7% faster than FC8, but the compressed files were about 6% larger, and I decided the tradeoff wasn’t worth it.
I explored many other ideas to improve the compression density, but everything I thought of proved to have only a tiny benefit at best, not enough to justify the impact on decompression speed. An algorithm based on something other than LZ77 would likely have compressed substantially more densely, or say a combination of LZ77 and Huffman coding. But decompression of LZ77-based methods are far easier and faster to implement.
Compression Heuristics
It eventually became obvious to me that defining the token format doesn’t tell you much about how to best encode the data in that format. A greedy algorithm seemed to work fairly well, so that’s what I used. At each point in the uncompressed data, the compressor substitutes the best match it can make (if any) between that data and previous data in the history window.
However, there are some examples where choosing a non-optimal match would allow for an even better match later, resulting in better overall compression. This can happen due to quirks in the quantizing of match lengths, or with long runs of repeated bytes which are only partially matched in the previous data. It’s a bit like sacrificing your queen in chess, and sometimes you need to accept a short-term penalty in order to realize a long-term benefit. Better compression heuristics that took this into account could probably squeeze another 1% out of the data, without changing the compression format or the decompressor at all.
Read 11 comments and join the conversation11 Comments so far
Leave a reply. For customer support issues, please use the Customer Support link instead of writing comments.
Thank you, a very interesting article. But there is a little piece of information missing: Under what license are you publishing the source code?
What about an MIT or 2-clause BSD license?
https://opensource.org/licenses/MIT
https://opensource.org/licenses/BSD-2-Clause
The liblzg I found at https://github.com/mbitsnbites/liblzg/ is licensed with the zlib/libpng license.
https://opensource.org/licenses/Zlib
> Data compression is fun!
Yes 🙂
I like your BR0 one byte backrefs, I’ll have to experiment with that and see how that compares to my explicit RLE tokens in LZMV.
(forgot to check subscribe)
I would like to change my license recommendation to zlib/libpng instead of MIT or BSD. To be compliant to the license of the original liblzg code you would have to keep the zlib/libpng license notice in the code anyhow.
I hadn’t considered it, but the zlib license would be fine.
I did a lot of profiling of my sample data to choose the encoding details, so other types of data may not compress quite as well. Not surprisingly, I found that backrefs to short match strings were more common than longer match strings. So given a fixed number of bits for encoding a backref, it worked better to assign most of them to storing the offset rather than the length. I even experimented with a backref that had no length bits at all, but an extra bit of offset:
BR0′ = 01aaaaaa = backref to offset aaaaaa, length 4
The result was nearly as good as what I ended up choosing. With my compressed sample data, BR1 and BR2 are used about 8x more often than BR0, which suggests that the breakdown between them still isn’t ideal.
I also experimented with changing the length quantization lookup table. In general I think the the quantization levels have too many levels in the mid-to-high 20s which are rarely used, and not enough levels between 48 and 256. I spent a while experimenting with this, but couldn’t extract more than a tiny improvement in compression density, so I ended up leaving it as-is.
A thought on on-the-fly-decompression: If I do this, I’ll probably choose a block size of 32K, since decompressing a 128K or larger block just to read a 512 byte sector would be painful. So there would be no point in having a 128K sliding window with 17 bits of offset in BR2. I could reduce BR2 to have only 15 bits of offset, and give the extra 2 bits to the length, which would enable it to store the length directly instead of using a lookup table. Or I could use the extra bits to decode additional BR3 and BR4 tokens that do something fancy, like encode 8 ASCII chars in 6 bytes. But that’s probably getting too far away from the fast-and-simple goal.
I should have posted it here instead:
I’m doing development on a 68000 based system (Megadrive) and I recently designed a derived version of LZ4 compression to take benefit of data word alignement on 68000 CPU as i really wanted to have “live GFX data decompression”.
I originally used the classic LZ4 implementation with the best optimized assembly code i could do for decompression and i obtained about 250-360 KB/second of decompressed data on a 7.7 Mhz 68000.
With my new LZ4W compression scheme i was able to obtain 600 up to 950 KB/second on a simple 7.7 Mhz 68000, i think there is no faster decompression algorithm on the 68000, at least i was not able to find one.
The compression level is almost as good than LZ4 for small packet of data but get worst for larger one.
Still you can give a try in your are interested in. You can find here the compressor (wrote in java) and the 68000 assembly decompressor :
https://www.dropbox.com/sh/smgwbi8g6y50n60/AAC47jpBZKxPqeY_NXo703vQa?dl=0
To obtain that decompression speed i had to heavily rely on lookup table so the decompression code is not really small (probably ~4 KB in binary format).
Thanks, I will check it out!
@Stephane.D very cool! After looking at the decompressor code, I can understand how it would be very fast, since there is almost no branching or calculations involved. It’s using a jump table to vector every possible permutation of literal length and match length to a unique block of straight-line code. It seems like it would work well for compressing small files around 1K in size. For my sample data (2304K disk image) the compression density was not so good: compressed file was 79% of original size, compared to 63% for FC8 or 58% for standard zip.
I think this is due to the much smaller window size of LZ4W vs FC8 (256 words vs 131072 bytes), and to a lesser extent to the smaller maximum length of matches/literal strings (16/16 words vs 256/64 bytes). Also making everything word-aligned will negatively impact the compression density, since all matches must be an even number of bytes in length, and must start on an even numbered byte boundary.
But what you lose in compression density, you gain in speed. I didn’t try the 68000 decompressor, but I would bet it’s twice as fast as the decompressor I wrote for FC8.
Thanks for checking it out and for your feedback 🙂
Indeed using forced word aligned data have a negative impact on compression density definitely. Still i’m surprised by the big difference you observed between LZ4W and FC8 compression ratio, i know LZ4W had worst results for bigger dataset but from my tests they weren’t “too bad” compared to classic LZ4 for instance still i admit i didn’t tested it overs > 300 KB data and designed it to really fit my needs (dataset around 8 KB and strictly < 64 KB) and indeed it works quite well for small dataset (up to 8 KB).
Yesterday i tried a modified version of LZ4W which offer better compression for larger data (using a 64KB window when needed), i guess it would fit better your needs but still won't offer as good ratio than FC8 (i guess about 70% for your dataset).
Again i really designed LZ4W for fast unpacking speed, that was my preliminary requirement, unfortunately we always have a trade-off between compression ratio and decompression speed 😉
How does this compare to LZSS? (Note: not LZ77.)
You mean compare LZSS vs LZ4 / LZ4W ?
I think LZSS offer better compression ratio but LZ4 is faster as unpacking data.
And LZ4W has been specifically designed to have very fast decompression speed on 68000 CPU, i made some improvements to it so it offers now better compression ratio while maintening almost as fast decompression speed.