Whoa - 10 minutes ago, out of the blue, I was poking around on Wikipedia reading about compression algorithms and thinking how it might be a fun side project to mess with something like that. Now I reload my RSS reader and find this!
Life is strange sometimes...
What I started wondering about is whether there's a chance of achieving better compression if the compressor and decompressor could communicate about the compression before it happened. For example (ignoring CPU constraints), what if, when I was downloading a large file over the web, the web server and the browser conspired to achieve a higher compression ratio customized just for me, using some kind of shared knowledge (like a history of what was recently downloaded, or whatever) to transfer less data? (Like, if I had a bunch of files on my client that the server also has, could it use small chunks of those known files and index their locations in the new file to be transferred - thus saving me download time?)
Obviously this is not a fully formed thought... but I think it might be worth considering that compression need not be shaped only by the type of data, but also by the receiver of that data and by the intention and method of its transmission.
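For a concrete flavour of "shared knowledge agreed on beforehand", zlib's preset-dictionary feature does something in this spirit: both ends hold the same dictionary out of band, and only the compressed payload crosses the wire. A minimal Python sketch (the dictionary and payload bytes are made up for illustration):

    import zlib

    # Pre-arranged shared knowledge (made-up sample data for the sketch); both
    # sides already hold a copy, so it never has to cross the wire.
    shared_dict = b'{"user": "", "history": [], "type": "application/json"}'

    payload = b'{"user": "alice", "history": [], "type": "application/json"}'

    # Sender: compress against the preset dictionary.
    comp = zlib.compressobj(zdict=shared_dict)
    wire = comp.compress(payload) + comp.flush()

    # Receiver: decompress with the same dictionary.
    decomp = zlib.decompressobj(zdict=shared_dict)
    assert decomp.decompress(wire) + decomp.flush() == payload

    print(len(zlib.compress(payload)), "bytes without the shared dictionary")
    print(len(wire), "bytes with it")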
A lot of that has already been done. In the web browsing context, when you ask for something and get a 304 Not Modified back because your cached copy is still current, that's basically compressing away the entire content of that object, which may be huge.
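You can see that at the protocol level with a plain conditional request; a rough sketch with Python's standard library, where the host, path, and ETag value are placeholders:

    import http.client

    # Hypothetical host and ETag; in reality the ETag comes from an earlier
    # response and is stored alongside the cached copy.
    conn = http.client.HTTPSConnection("example.com")
    conn.request("GET", "/big-file", headers={"If-None-Match": '"abc123"'})
    resp = conn.getresponse()

    if resp.status == 304:
        # No body was sent at all; the cached copy is still good.
        print("304 Not Modified - reuse the local cache")
    else:
        body = resp.read()
        print(resp.status, len(body), "bytes actually transferred")
    conn.close()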
Your other idea has been done, too, with "WAN optimization", but it generally isn't helpful for one person, because you already have good caching and proxying going on, even if you don't realize it. That works when multiple people are pulling down the same content.
Basically, already done.
At this point, if you start going "But what if we also did this...!", you are about 95% likely to start getting into the compression crackpot zone. Information can only be squeezed so hard and once you start proposing other exotic solutions the only explanation that works in response is usually "No, that won't work, but you won't believe me, so, prove it. Implement it and sell it for millions of dollars. Be sure to count all the bytes, not just the ones you like." And generally, that's the end of that.
I am far more likely to believe that you have built a working anti-gravity device, which is at least at the very edge of plausibility under certain not-yet-disproven physical theories, than that you have actually made a huge compression advance, which is effectively mathematically impossible. (Not because compression can't be advanced, but because it's only going to advance incrementally from now on. There's no room left for a breakthrough in the general case.)
There is something to that - due in Windows Server 2008 R2 and Windows 7 is an update to Offline Files and network copying, where the server keeps hashes of every file in smallish chunks.
The idea is that if you copy two similar files from the server, it can check the hashes and tell your client to use parts of a file it already has instead of transferring them (or only copy changed parts of a big file).
I don't have a source handy with more details about exactly which scenarios it will be used in at the moment, though.
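The general shape of a chunk-hash scheme is easy to sketch, though. Roughly something like the following, with the fixed chunk size and SHA-256 being arbitrary choices for illustration, not necessarily what Windows actually does:

    import hashlib

    CHUNK = 64 * 1024  # arbitrary chunk size for the sketch

    def chunk_hashes(data: bytes) -> set[str]:
        """Hashes of every fixed-size chunk of a file the client already holds."""
        return {hashlib.sha256(data[i:i + CHUNK]).hexdigest()
                for i in range(0, len(data), CHUNK)}

    def plan_transfer(new_file: bytes, client_hashes: set[str]) -> list:
        """For each chunk of the new file, send either the bytes or a reference."""
        plan = []
        for i in range(0, len(new_file), CHUNK):
            chunk = new_file[i:i + CHUNK]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest in client_hashes:
                plan.append(("reuse", digest))   # client reassembles from local data
            else:
                plan.append(("send", chunk))     # these bytes must cross the wire
        return plan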
Awesome. That's a lot like what I was imagining - keeping a library of known hashes and chunks of shared files in the hopes that, at least occasionally, having those chunks pays off when transferring future files. You could even use OS files and offsets within them that you could be sure were present on the client side. Stuff like that.
Most compression algorithms do consider the context of previously received data -- but only from the same file/unit/transmission/whatever, aside from predefined contexts and dictionaries for some algorithms (like the default quantization tables in JPEG or the fixed Huffman tables in H.263 decoders).
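That "same transmission only" context is easy to see with DEFLATE's sliding window; a quick Python sketch (using random bytes just so the data has no structure of its own):

    import os
    import zlib

    block = os.urandom(4096)  # random data, incompressible on its own

    alone = len(zlib.compress(block))
    twice_same_stream = len(zlib.compress(block + block))
    twice_separate_streams = 2 * alone

    # Within one stream, the second copy lands inside DEFLATE's 32 KB window and
    # becomes a handful of back-references, so it costs almost nothing extra.
    # Two independent streams share no context, so the cost simply doubles.
    print(alone, twice_same_stream, twice_separate_streams)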
but I think it might be worth considering that compression need not be shaped only by the type of data, but also by the receiver of that data and by the intention and method of its transmission.
It is. This is why we have various lossless coding schemes for data that has to arrive intact (arithmetic, Golomb, Huffman, LZW, Burrows-Wheeler/bzip2, RLE), and lossy compression for data that doesn't need to be perfect, such as audio, video, and imagery (mp3, h.264, jpg), which uses irreversible transforms to throw away detail the receiver won't notice and spend the remaining bits where they matter.
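For the lossless side, the floor is the entropy of the source; a back-of-the-envelope check in Python (a contrived memoryless source, so the order-0 figure really is the limit - zlib should land at or a shade above it):

    import math
    import random
    import zlib
    from collections import Counter

    def entropy_bits_per_byte(data: bytes) -> float:
        """Empirical order-0 Shannon entropy of a byte string."""
        counts = Counter(data)
        return -sum(c / len(data) * math.log2(c / len(data))
                    for c in counts.values())

    # A memoryless source with a skewed symbol distribution, so the order-0
    # entropy really is the floor for any lossless coder.
    random.seed(0)
    data = bytes(random.choice(b"aaaabbbccd") for _ in range(100_000))

    print("entropy limit :", round(entropy_bits_per_byte(data), 2), "bits/byte")
    print("zlib achieves :",
          round(8 * len(zlib.compress(data, 9)) / len(data), 2), "bits/byte")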
Yes, but you have to be honest about the amount of information contained in the conspiracy. In the limit, the server could send the entire file as part of the preprocessing step. The compressed version would then be "yeah, that one".
That example is deliberately silly and extreme, but it illustrates the point. In your parenthetical example, the costs to be accounted for include not only the index information but also the catalog of shared files.
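The honest accounting is just a sum over everything that has to cross the wire; a trivial sketch with made-up numbers:

    def bytes_actually_moved(delta, index, catalog_sync):
        """Everything that has to cross the wire for the scheme to work."""
        return delta + index + catalog_sync

    # Hypothetical numbers: a 10 MB file shrunk to a 1 MB delta only wins if the
    # chunk index and the catalog bookkeeping don't eat the savings back up.
    plain = 10 * 1024 * 1024
    clever = bytes_actually_moved(1 * 1024 * 1024, 200 * 1024, 500 * 1024)
    print("worth it:", clever < plain, "-", plain - clever, "bytes saved")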
Well sure, the algorithm would have to be smarter than that and take into account all the costs associated with transferring certain knowledge vs. downloading the data, etc. It was just a rough idea to get someone thinking, after all. :)
It seems to me that for backup you'd want a file format with redundancy, or one that could tolerate a few bit or sector errors while remaining mostly readable. That seems to conflict with most compression schemes.
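Most compressed formats do amplify damage rather than tolerate it; a quick illustration with zlib (a single flipped bit will usually take out the rest of the stream or trip the checksum):

    import zlib

    original = b"a line of text that repeats\n" * 1000
    packed = zlib.compress(original, 9)

    # Flip a single bit somewhere in the middle of the compressed stream.
    corrupted = bytearray(packed)
    corrupted[len(corrupted) // 2] ^= 0x01

    try:
        zlib.decompress(bytes(corrupted))
        print("somehow survived the flipped bit")
    except zlib.error as exc:
        print("decompression failed:", exc)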