Almost a perfect standard, but the prepended one-byte header is a mistake IMHO. It makes it impossible to encode when the input size is unknown. Better to encode whether the last chunk is one byte or two at the end of the stream.
Please, whoever is involved with this, revise the standard to not have a header and call the existing spec a beta. Otherwise, good work.
Edit: I have opened an issue for this: https://github.com/kevinAlbs/Base122/issues/3#issue-19188159...
It would be easy to do. That 3-bit word just needs to start at 1 (since it encodes fewer than 8 options). Then the fixed 1-bit can instead encode whether another 7-bit segment follows.
The use of codepoints below 32 (below space, the start of what's usually considered "printable") makes me a bit hesitant. A lot of systems won't preserve those characters. Base85 is a more efficient alternative to base64, and it doesn't use that lower range.
Base 85 also has the interesting property that 4 original bytes fit in 5 encoded bytes. Depending on your processor's memory model and the cost of multiplies compared to shifts, this can make it the best performer.
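For anyone who hasn't seen the mechanics, here is a rough sketch of that 4-in-5 packing in Python (an Ascii85-style alphabet starting at '!'; padding of short final groups and the 'z' shortcut are left out, and the function names are just for illustration; Python's base64.a85encode does the real thing):

    def b85_group_encode(four_bytes: bytes) -> str:
        """Encode exactly 4 bytes as 5 characters drawn from '!' (33) to 'u' (117)."""
        assert len(four_bytes) == 4
        n = int.from_bytes(four_bytes, "big")    # treat the group as one 32-bit value
        digits = []
        for _ in range(5):                       # base-85 digits, least significant first
            n, d = divmod(n, 85)
            digits.append(chr(d + 33))
        return "".join(reversed(digits))

    def b85_group_decode(five_chars: str) -> bytes:
        assert len(five_chars) == 5
        n = 0
        for c in five_chars:
            n = n * 85 + (ord(c) - 33)
        return n.to_bytes(4, "big")

    assert b85_group_decode(b85_group_encode(b"\xde\xad\xbe\xef")) == b"\xde\xad\xbe\xef"

The divmod-by-85 loop is where the multiply-versus-shift tradeoff shows up: hex and base64 can be done with pure shifts and masks, while base 85 needs real division or multiplication.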
This was true on VAX 8200 hardware back in the day. In the same software, for Huffman decoding of JPEGs it was also fastest to build a finite state machine with an 8-bit symbol size. I suspect that is no longer true, since it would kill your L1 cache and be well into your L2 cache on modern x86 machines. It is probably better to take the instruction-count hit and process bits or nibbles.
I was going to bring up base 85 as well; it's a better choice for a variety of reasons. A long time ago I wrote a base encoder class in Java[1], mostly so that we could write a netnews reader in Java, but also because I felt UUEncoding was not robust. The challenges of using unprintable characters are a lot more of a headache than anyone appreciates initially. Lots (and I mean quite a few here) of systems consider unprintable characters "safe" to repurpose for random uses. One display vendor, for example, had them change the color of subsequent characters on the display.
Stick with the characters that nearly everyone assumes could legitimately come up in a document, and your chances of running afoul of some "creative genius" who decided "hey, it's unprintable, so no one will try to print it, but when I do print it I want this thing to happen..." go way down.
It doesn't give you that much usable compactness over base 64, though you can easily encode a 360-degree angle in a single character with about two bits of precision lost. Also, 5 base-92 characters can fully encode 32 bits of binary data (not that this is saying much, since base 85 can already do it in 5 characters).
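For what it's worth, the arithmetic checks out, and 85 happens to be the smallest base where 5 digits cover a full 32 bits:

    # Quick sanity check of the "5 characters hold 32 bits" claim.
    assert 85**5 >= 2**32   # 4,437,053,125 >= 4,294,967,296
    assert 92**5 >= 2**32   # 6,590,815,232 >= 4,294,967,296
    assert 84**5 <  2**32   # 4,182,119,424: one base lower no longer fits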
I'm probably going to go to typed arrays of 32-bit values. Currently, I can encode an entire ship's data in 18 bytes, of which 4 characters are a hash id.
I've never come across this one, thanks for the tip. I like the way it doesn't use every character in the range 0x20 to 0x7f. That last one in particular (0x7f or 'DEL') has always seemed problematic to me because it's a weird exceptional case. I've always thought someone must have screwed up with that one way back in the day. Using only '!' to 'u' and hence avoiding ' ' feels right for some reason, and I also like the cute trick of getting just a little bit of compression by using 'z' to represent 32 zero bits.
A lot of people have experimented with a lot of different ways of encoding binary data as printable text. Wikipedia has a list of different encoding schemes[0].
The most efficient one is yEnc[1]. Still, the simplest ones, such as base64 or good old hex, may actually work better once compression comes into the picture.
It's crucial to evaluate an encoding's space usage in the context of compression. For instance, gzip(base16(data)) is often smaller than gzip(base64(data)) for practical data. Even though base64 is more efficient than base16, it breaks up data across byte boundaries, which then makes gzip significantly less efficient.
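If you want to check this on your own data, here is a throwaway Python harness (the outcome depends a lot on the input; the point is that hex maps every input byte to the same two characters regardless of position, while base64 encodes 3-byte groups, so identical input patterns can land at different group offsets and look different to gzip):

    import base64, binascii, gzip, sys

    # Compare compressed sizes of raw vs hex-encoded vs base64-encoded data.
    data = open(sys.argv[1], "rb").read() if len(sys.argv) > 1 else bytes(range(256)) * 64

    for label, blob in [
        ("raw", data),
        ("base16", binascii.hexlify(data)),
        ("base64", base64.b64encode(data)),
    ]:
        print(f"gzip({label}(data)): {len(gzip.compress(blob))} bytes")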
When would you gzip encoded data instead of encoding gzipped data? Doesn't gzip after encoding defeat the whole idea of encoding the data in a format that won't get mangled by systems that expect to be handling text?
He meant that you typically use base64 when the medium you use (e.g. email) doesn't support binary data. When you compress base64 encoded data you get back binary output. If binary output is ok to transfer, then why would you use base64 in the first place? Why not just compress the raw data?
When you're embedding encoded data in another file format that forces restrictions on it. The encoding in the article is very explicitly optimized to be embedded in HTML attributes, which have a limited character range. The full HTML document is later compressed for transport, over a protocol that a) is aware of the compression and b) can transport binary data.
I appreciate the author building a more efficient alternative to base64, but I had to laugh for a moment at the suggested use case.
Using an alternative to base64-encoded data for web pages and Node is a horrible idea. If I came across a codebase using it I would scream. A lot. None of my tools work with it, and now my browser has to run a bunch of JS for something that's normally native and very fast. It's a 1-2 kB savings per page that's going to make somebody jump off a bridge one day.
More terrifying is that the JavaScript community is so fascinated by shiny objects that thousands of people are going to use this. I'm not sure if it's more funny or terrifying.
Anyway, this encoding is still useful if you need to pass data between legacy systems. I've used HTML escapes and ASCII85 many times to get around annoying old stuff that doesn't use Unicode.
> None of my tools work with it, and now my browser has to run a bunch of js for something that's normally native and very fast.
It's not that I disagree with the general idea of your post, but I take issue with this part. If nobody ever tries something new, we'd still be in the stone age. I think base-122 is a bit silly, but if it had its uses and adoption became widespread, then browsers, tooling, etc. would follow, just as they now support base64.
As §3 shows, base-122 is not recommended for use on gzip-compressed pages, which are the majority of served web pages.
And those occur just a few lines from each other? I get that there are more use cases, like email and such, but if you're going to create something for the web and it can't be used on the majority of web pages, that seems like a fairly large oversight/caveat.
> Base-122 encoded strings contain characters which did not seem to play well with copy-pasting.
A very important part of web development is being able to manipulate text documents. It seems that using UTF-8 in more places can reveal cracks in implementations across browsers/DEs/editors/terminals/etc.
My problem with base-122 is simply that it's not an even power of 2.
It's very easy to write a cache-timing-safe version of base{16,32,64} encoding for use in encoding/decoding cryptographic keys in configuration files. To wit: https://github.com/paragonie/constant_time_encoding
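For the curious, the core trick in that kind of library is replacing table lookups indexed by secret data with plain arithmetic on the nibbles. Roughly, in Python (which itself makes no timing guarantees; this only illustrates the branch-free mapping, and the function name is mine):

    def ct_hex_encode(data: bytes) -> str:
        """Branch-free nibble-to-ASCII mapping (lowercase hex)."""
        out = []
        for byte in data:
            for nibble in (byte >> 4, byte & 0x0F):
                # (nibble - 10) >> 8 is all ones when nibble < 10 and zero otherwise,
                # so this picks '0' + nibble or 'a' + (nibble - 10) without a branch
                # or a secret-indexed table. Note -39 == ~38 and 87 - 39 == ord('0').
                out.append(chr(87 + nibble + (((nibble - 10) >> 8) & -39)))
        return "".join(out)

    assert ct_hex_encode(b"\x00\xabZ") == "00ab5a"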
I'm pretty sure that Bitcoin has dealt with this issue for its base58 encoding. It might be worth checking if their algorithm is generalizable to other radix sizes.
No, it hasn't. I sent a pull request that attempts to solve this problem for a C# implementation of Base58Check, but there's no way for me to reliably guarantee that data isn't leaking from the division.
> This uses one-byte characters to encode seven bits and two-byte characters to encode fourteen bits. Hence this attains the goal of encoding seven bits per byte, i.e. the 8 : 7 inflation ratio.
The magic is in the two-byte encoding: 110sss1x 10xxxxxx.
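If it helps to see the bits move, here is a small sketch of assembling that pattern into a valid two-byte UTF-8 character. My reading of the article is that the sss field indexes which of the handful of "illegal" 7-bit chunks was hit while the x bits carry the following 7-bit chunk (which is how two output bytes stand in for 14 input bits), but treat that as a paraphrase; the actual encoder is in the repo:

    def two_byte_char(sss: int, data7: int) -> bytes:
        """Pack a 3-bit code and 7 data bits into the pattern 110sss1x 10xxxxxx."""
        assert 0 <= sss < 8 and 0 <= data7 < 128
        lead = 0b11000000 | (sss << 2) | 0b10 | (data7 >> 6)  # 110 sss 1 x
        cont = 0b10000000 | (data7 & 0b00111111)              # 10 xxxxxx
        return bytes([lead, cont])

    # The fixed 1 bit keeps the code point at or above U+0080, so the result is
    # never an overlong encoding and always decodes as a single two-byte character.
    assert len(two_byte_char(0b010, 0b1010101).decode("utf-8")) == 1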
There are really 890 characters used in this encoding, so it should be called base890.
It's an interesting technical exercise, but I don't think this is the right approach for optimizing HTML load times.
If the goal is to reduce latency for small images, wouldn't it make more sense to extend data URIs so the same base64 string can be referenced in multiple places?
Actually, since HTTP/2 can effectively return multiple resources in response to one request, do we still need embedded images for latency reduction at all?
Also, there are still cases where compression cannot be applied, e.g. if a script naively queries innerHTML. (This wouldn't affect loading time, but it could inflate the page's RAM use unnecessarily.)
I would be interested to learn more about those concerns. My "default" setup these days is HTTPS & gzip everything, but I can't say I've read any white papers on the security implications of that.
If the attacker knows or controls any part of the data, then the compressed size leaks information about the unknown part: the output gets smaller whenever the known data shares byte sequences with the unknown data.
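This is the mechanism behind the CRIME/BREACH attacks on compressed HTTPS. A toy illustration (the secret and the page layout are made up; the size difference is the entire leak):

    import zlib

    # The attacker controls part of a page that also contains a secret, and can
    # only observe the compressed size.
    SECRET = "sessionid=7f3a9c51e2d48b90"

    def observed_size(attacker_text: str) -> int:
        page = f"{SECRET};reflected={attacker_text}"
        return len(zlib.compress(page.encode()))

    # A guess matching the secret shares a long run with it, so DEFLATE replaces
    # the repeat with a back-reference and the output is typically smaller.
    print(observed_size("sessionid=7f3a9c51e2d48b90"))  # correct guess
    print(observed_size("sessionid=x9k2mqv7w1zr05tb"))  # wrong guess, same length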
Something that isn't really discussed is that the need for base64 arose so that binary data could transit 7-bit email systems.
Base-122 does not have that property, as far as I could tell.
Whilst we are all, basically, living in an 8-bit world, I suspect it will be some time before people feel comfortable assuming that an 8-bit transport is viable over email.
Berkeley Unix (or was that SunOS?) used to come with btoa, which used base 85, packing 4 bytes into 5 UUCP/mail/Usenet-safe characters.
For some reason, base 64 became the standard, and I don't think there's space for anything else these days. (Most things do just fine with binary now, anyway!)
In UTF-8, these characters are all encoded as two bytes. So this encoding doubles the byte count, making it no more efficient (in bytes) than hex encoding.
If your property string is enclosed in double quotes, then a single quote in the payload is fine. (Otherwise, a lot of inline JS in onclick etc. attributes would break. JS allows both types of quotes for string literals for exactly this reason.)
Still, single quotes are somewhat asking for trouble.
You can probably get most of what you want by simply not using vowels. A few non-English ones will still get through, like knnbccb, and there's also shlt and so on...
Meanwhile, the Unicode Consortium has been hard at work since 1991 making it possible to encode up to 2.8 MB -- more than enough for most images, short videos, or many PDF files -- in a single character.
Are you referring to UTF-8? If so, this is misleading as you can encode up to 2^21 + 2^16 + 2^11 + 2^7 = 2,164,864 code points, which is not the same as encoding bytes in a single character.
I was making a joke about how ridiculously large Unicode is. Obviously it is not 2 million+ bytes per character! (But the fact that you didn't consider my joke obvious speaks volumes).