Base-122 – A space efficient alternative to base-64 (kevinalbs.com)
191 points by kevinAlbs on Nov 27, 2016 | 66 comments


http://blog.kevinalbs.com/base122#a_minor_note_on_the_last_c...

Almost a perfect standard, but the prepended one-byte header is a mistake IMHO. It makes it impossible to encode when the input size is unknown. Better to encode whether the last chunk is one byte or two at the end of the stream.

Please whoever is involved with this, revise the standard to not have a header and call this existing spec a beta. Otherwise, good work.

Edit: I have opened an issue for this: https://github.com/kevinAlbs/Base122/issues/3#issue-19188159...


It would be easy to do. That 3-bit word just needs to start at 1 (since it encodes fewer than 8 options). Then the fixed 1-bit can instead encode whether another 7-bit segment follows.
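
For illustration, a minimal sketch of that trailer idea (packChunk and trailerFor are hypothetical helpers, not the actual Base122 API): the encoder emits chunks as they arrive and appends the end marker only when the stream closes, so the input size never needs to be known up front.

    // Header-free streaming sketch; packChunk and trailerFor are
    // hypothetical helpers, not part of the Base122 implementation.
    function* encodeStream(chunks) {
        let lastLen = 0;
        for (const chunk of chunks) {
            lastLen = chunk.length;  // remember the final chunk's size
            yield packChunk(chunk);  // pack 8-bit bytes into 7-bit characters
        }
        yield trailerFor(lastLen);   // signals: was the last chunk 1 byte or 2?
    }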


Good point, I will change that.


The use of codepoints below 32 (space, start of what's usually considered "printable") makes me a bit hesitant. A lot of systems won't preserve those characters. Base85 is a more efficient alternative to base64, and doesn't use that lower range:

https://en.wikipedia.org/wiki/Ascii85


base 85 also has the interesting property that 4 original bytes fit in 5 encoded bytes. Depending on your processor's memory model and the cost of multiplies compared to shifts this can make it the best performer.

This was true on Vax 8200 hardware back in the day. In the same software, with Huffman decoding of JPEGs it was also fastest to create a finite state machine with an 8 bit symbol size. I suspect that is no longer true since it would kill your L1 cache and be well into your L2 cache on modern x86 machines. It is probably better to take the instruction count hit and process as bits or nibbles.


> base 85 also has the interesting property that 4 original bytes fit in 5 encoded bytes.

Still useful for Javascript, as the bit shift operators work on 32 bit "registers".
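
For reference, a sketch of that 4-to-5 step (plain Ascii85 alphabet '!'..'u', no 'z' shortcut): treat the four bytes as a single 32-bit integer and peel off five base-85 digits.

    // Encode one 4-byte group as five Ascii85 characters ('!' encodes 0).
    function encodeGroup(b0, b1, b2, b3) {
        let n = ((b0 << 24) >>> 0) + (b1 << 16) + (b2 << 8) + b3; // 32-bit value
        let out = '';
        for (let i = 0; i < 5; i++) {
            out = String.fromCharCode(33 + (n % 85)) + out; // lowest digit last
            n = Math.floor(n / 85);
        }
        return out;
    }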


I was going to bring up base 85 as well; it's a better choice for a variety of reasons. A long time ago I wrote a base encoder class in Java[1], mostly so that we could write a netnews reader in Java, but also because I felt UUEncoding was not robust. The challenges of using unprintable characters are a lot more of a headache than anyone pays attention to initially. Lots (and I mean quite a few here) of systems consider unprintable characters "safe" to re-purpose for random uses. One display vendor, for example, had them changing the color of subsequent characters on the display.

Stick with the characters that nearly everyone assumes could legitimately come up in a document, and you lower your chances of running afoul of some "creative genius" who decided "Hey, it's unprintable, so no one will try to print it, but when I do print it I want this thing to happen..."

[1] http://grepcode.com/file/repository.grepcode.com/java/root/j...


I wrote a base 92 encoder for the Javascript game I'm working on:

http://www.emergencevector.com/

It's pretty easy to write the decoder for the 0-91 integer in Javascript:

    // Hypothetical wrapper name; the logic is the original snippet.
    // '!' maps to 57 because 57 + 35 = 92 ('\') is excluded from the alphabet.
    function decodeBase92Char(ch) {
        if (ch == "!") {
            return 57;
        } else {
            return ch.charCodeAt(0) - 35; // alphabet starts at '#' (code 35)
        }
    }
It doesn't give you that much usable compactness over base 64, though you can easily encode a 360-degree angle with about two bits of precision lost. Also, 5 base-92 characters can fully encode 32 bits of binary data. (Of course they can, since even base 85 does it in 5 characters.)
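
A guess at what that angle encoding could look like, mirroring the decoder above (one character, 92 steps over 360°, roughly 3.9° of resolution, about two bits coarser than a byte's 256 steps):

    // Quantize an angle in degrees to a single base-92 character.
    function encodeAngle(deg) {
        var q = Math.round((deg % 360) / 360 * 91); // 0..91
        return q == 57 ? "!" : String.fromCharCode(q + 35); // '!' fills in for '\'
    }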

I'm probably going to go to typed arrays of 32 bit values. Currently, I can encode an entire ship's data in 18 bytes, of which 4 characters are a hash id.


I've never come across this one, thanks for the tip. I like the way it doesn't use every character in the range 0x20 to 0x7f. That last one in particular (0x7f or 'DEL') has always seemed problematic to me because it's a weird exceptional case. I've always thought someone must have screwed up with that one way back in the day. Using only '!' to 'u' and hence avoiding ' ' feels right for some reason, and I also like the cute trick of getting just a little bit of compression by using 'z' to represent 32 zero bits.


Holla for ASCII85. Used it many times when some of the base64 chars are unusable. A nice thing to know about when working with old systems.


A lot of people have experimented with a lot of different ways of encoding binary data as printable text. Wikipedia has a list of different encoding schemes[0].

The most efficient one is yEnc[1]. Still, the simplest ones, such as base64 or good old hex, may actually work better once compression comes into the picture.

[0]: https://en.wikipedia.org/wiki/Binary-to-text_encoding

[1]: https://en.wikipedia.org/wiki/YEnc


It's crucial to evaluate encoding space usage in the context of compression. For instance, gzip(base16(data)) is often smaller than gzip(base64(data)) for practical data. Even though base64 is more efficient than base16, it breaks up data across byte boundaries, which then makes gzip significantly less efficient.
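
A quick way to sanity-check this yourself, as a sketch (assuming Node.js and its built-in zlib; the exact numbers depend on the input file):

    // Compare gzip(base16(data)) against gzip(base64(data)).
    const zlib = require('zlib');
    const fs = require('fs');
    const data = fs.readFileSync('sample.bin'); // any realistic test file
    const hex = Buffer.from(data.toString('hex'));
    const b64 = Buffer.from(data.toString('base64'));
    console.log('gzip(base16):', zlib.gzipSync(hex).length);
    console.log('gzip(base64):', zlib.gzipSync(b64).length);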


When would you gzip encoded data instead of encoding gzipped data? Doesn't gzip after encoding defeat the whole idea of encoding the data in a format that won't get mangled by systems that expect to be handling text?


When serving gzip-compressed pages to browsers that support it.


He meant that you typically use base64 when the medium you use (e.g. email) doesn't support binary data. When you compress base64 encoded data you get back binary output. If binary output is ok to transfer, then why would you use base64 in the first place? Why not just compress the raw data?


If you're embedding encoded data in another file format that forces restrictions on it. The encoding in the article is very explicitly optimized to be embedded in HTML attributes, which have a limited character range. The full HTML document is later compressed for transport, over a protocol that a) is aware of the compression and b) can transport binary data.


Came here to post this... yEnc is about as compact as you can get, and has been around for over a decade.

Besides that, the Z85 encoding is the next runner up as a compact "string safe" encoding: https://rfc.zeromq.org/spec:32/Z85/


YEnc is unsuitable when the text will be transferred as UTF-8.


I appreciate the author building a more efficient alternative to base64, but I had to laugh for a moment at the suggested use case.

Using an alternative to base64 encoded data for web pages and node is a horrible idea. If I came across a code base using that I would scream. A lot. None of my tools work with it, and now my browser has to run a bunch of JS for something that's normally native and very fast. It's a 1-2 kB savings per page that's going to make somebody jump off a bridge one day.

More terrifying is that the JavaScript community is so fascinated by shiny objects that thousands of people are going to use this. I'm not sure if it's more funny or terrifying.

Anyways, this encoding is still useful if you need to pass data between legacy systems. I've used HTML escapes and ASCII85 many times to get around annoying old stuff that doesn't use Unicode.


> None of my tools work with it, and now my browser has to run a bunch of js for something that's normally native and very fast.

It's not that I don't agree with the general idea of your post, but I disagree with this part. If nobody ever tries something new, we'd still be in the stone age. I think base122 is a bit silly, but if it had its use and adoption became widespread, then browsers and tooling etc. would follow, just as they now support base64.


Does anyone else find it odd that both

> Base-122 was created with the web in mind.

and

> As §3 shows, base-122 is not recommended to be used on gzip compressed pages, which is the majority of served web pages.

occur just a few lines from each other? I get there are more use cases, like email and such, but if you're going to create something for the web and it can't be used on the majority of web pages, that seems like a fairly large oversight/caveat.


I agree. Including:

> Base-122 encoded strings contain characters which did not seem to play well with copy-pasting.

A very important part of web development is being able to manipulate text documents. It seems that using UTF-8 in more places can reveal cracks in implementations for browsers/DEs/editors/terminals/etc.


My problem with base-122 is simply that it's not an even power of 2.

It's very easy to write a cache-timing-safe version of base{16,32,64} encoding for use in encoding/decoding cryptographic keys in configuration files. To wit: https://github.com/paragonie/constant_time_encoding

Base-122? Not sure if it's even possible.
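
For context, a sketch of the branch-free idea that makes base16 easy (this is the idea behind such libraries, not their actual code, and JS engines make no real constant-time guarantees): each hex digit is computed arithmetically, with no secret-dependent table lookup or branch. Base-122's variable-length, non-power-of-2 structure offers no such even split of bits.

    // Branch-free nibble-to-hex: 0-9 -> '0'-'9', 10-15 -> 'a'-'f'.
    // ((n - 10) >> 8) is -1 for n < 10 and 0 otherwise, so the
    // correction term is applied without any data-dependent branch.
    function hexNibble(n) {
        return String.fromCharCode(87 + n + (((n - 10) >> 8) & ~38));
    }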


I'm pretty sure that Bitcoin has dealt with this issue for its base58 encoding. It might be worth checking if their algorithm is generalizable to other radix sizes.


Check out digit-array if you are interested in the generalized algorithm. The source code is commented with formal math notation for the operations.

https://github.com/deckar01/digit-array/blob/master/README.m...
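
The generalized algorithm is essentially repeated big-integer division, something like this sketch (using JS BigInt; leading zero bytes need separate handling, as base58check does with its '1' prefix). Note that the loop count and division timing depend on the data, which is exactly the constant-time problem raised above.

    // Convert a byte array to a string of digits in an arbitrary radix.
    function toRadix(bytes, alphabet) {
        let n = 0n;
        for (const b of bytes) n = (n << 8n) | BigInt(b); // bytes -> big integer
        const base = BigInt(alphabet.length);
        let out = '';
        while (n > 0n) {
            out = alphabet[Number(n % base)] + out; // least significant digit first
            n /= base;
        }
        return out || alphabet[0];
    }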


This doesn't appear to address the problem. At all.

https://paragonie.com/blog/2016/06/constant-time-encoding-bo...


I misread your original comment. Nothing I posted is related to constant time encoding. I apologize for the confusion.


It's OK. I just want to be clear that my objection to base-122 is the same as my objection to base-N where N != 2^m for some integer m.


No, it hasn't. I sent a pull request that attempts to solve this problem for a C# implementation of Base58Check, but there's no way for me to reliably guarantee that data isn't leaking from the division.

https://github.com/adamcaudill/Base58Check/pull/3


I think you are focusing on the wrong number.

Each byte of base64 produces 6 bits of data, so the boundary aligns at 24 bits: LCM(6,8) = (6•8)/2 = 24.

Each byte of base122 produces 7 bits of data, so the byte boundary aligns at 56 bits: LCM(7,8) = (7•8)/1 = 56.

Edit: Due to the variable length encoding, there is no guarantee of byte alignment.


If it were base 128, it would produce a full 7 bits per character. But as it is base 122, it produces log2(122) ≈ 6.93 bits, hence a problem.


> This uses one-byte characters to encode seven bits and two-byte characters to encode fourteen bits. Hence this attains the goal of encoding seven bits per byte, i.e. the 8:7 inflation ratio.

The magic is in the two-byte encoding 110sss1x 10xxxxxx.

There are really 890 characters used in this encoding, so it should be called base890:

(2^7 - 6) + (6 • 2 • 2^6) = 122 + 768 = 890
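
A quick check of that count (the two-byte form contributes 6 values of sss, times 2 for the free x bit in byte one, times 64 for byte two):

    const oneByte = Math.pow(2, 7) - 6; // the 122 legal one-byte characters
    const twoByte = 6 * 2 * 64;         // 768 distinct two-byte sequences
    console.log(oneByte + twoByte);     // 890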


A number system can have any natural number as radix, so why not? https://en.m.wikipedia.org/wiki/Radix

From the posted article: "This leaves us with 122 legal one-byte UTF-8 characters to use"

Seems legit to me.


> It's very easy to write a cache-timing-safe version of base{16,32,64} encoding ...

> Base-122? Not sure if it's even possible.

Parent is clearly referring to his previous statement, not that Base-122 itself is not possible.


Not obvious to me at all.


It's an interesting technical exercise, but I don't think this is the right approach for optimizing HTML load times.

If the goal is to reduce latency for small images, wouldn't it make more sense to extend data URIs so the same base64 string can be referenced in multiple places?

Actually, as HTTP2 can effectively return multiple resources in the response to one request, do we still need embedded images for latency reduction at all?


Compression takes care of exactly this.


Unless compression cannot be used: https://news.ycombinator.com/item?id=13049898

Also, there are still cases where compression cannot be applied, e.g. if a script naively queries innerHTML. (This wouldn't affect loading time, but it could inflate the page's RAM usage unnecessarily.)


Sure, but I'd consider those edge cases to be situations where the treatment is worse than the disease.


HTTPS is an edge case? Over 50 percent of the web is now transferred via HTTPS.


It's not HTTPS. It's HTTPS cross-site requests with secret cookies.


tl;dr: no revolution; base64+gzip is still far better than base122+gzip. But the article is worth reading.


Aren't there security concerns using gzip and HTTPS? Could this be a space efficient alternative when gzip is disabled?


I would be interested to learn more about those concerns. My "default" setup these days is HTTPS & gzip everything, but I can't say I've read any white papers on the security implications of that.


If the attacker knows or controls any part of the data, then the compressed size leaks information about the rest: it will be smaller whenever the known part shares byte sequences with the unknown part. This is the basis of the CRIME and BREACH attacks.
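
A toy illustration of the leak, as a sketch (assuming Node.js zlib): the output is smaller when the attacker-controlled guess matches the secret, because deflate encodes the repeat as a back-reference.

    const zlib = require('zlib');
    const secret = 'session=hunter2;'; // unknown to the attacker
    for (const guess of ['session=abcdefg;', 'session=hunter2;']) {
        const page = Buffer.from(secret + 'some page content ' + guess);
        console.log(guess, zlib.gzipSync(page).length); // matching guess is smaller
    }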



Something that isn't really discussed is that the need for base64 arose so that binary data could transit 7-bit email systems.

Base122 does not have that property - as far as I could tell.

Whilst we are all, basically, living in an 8-bit world, I suspect it will be some time before people feel comfortable assuming that an 8-bit transport is viable over email.

[edit: spelling ]


Base85 sits between base64 and base122, while using printable characters. Are there any statistics on compressed base85?


I don't have an easy way to try a test corpus but I fed this comment page into it.

  Raw size:     37409
  gzip(raw):    6170
  gzip(base16): 7482
  gzip(base64): 10675
  gzip(base85): 10549
So it works better than base64, and it has the advantage of working 4 bytes at a time rather than 3.

Also note that when you're feeding in binary data the gzip sizes for raw and base-xx data get a lot closer together.


There's also basE91:

http://base91.sourceforge.net/


Note that if you're not using UTF-8 to encode your characters, you can beat Base64, Base85, and Base122 with https://github.com/ferno/base65536


Berkeley Unix (or was that SunOS?) used to come with btoa, which used base 85, packing 4 bytes into 5 uucp/mail/Usenet-safe characters. For some reason, base 64 became the standard, and I don't think there's space for anything else these days. (Most things do just fine with binary now, anyway!)


Interesting, but please add a license :)


Why not embrace the Unicode world we live in now and just offset bytes to a Unicode region that has 256 non-space characters?

https://github.com/diafygi/Offset248

Pros: simple, no ratio or ending calculation, easy copy/paste, same character length as original bytes

Cons: has to be percent-encoded in URLs, some fonts might not render all characters
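
Something like this sketch of the offset idea (the region start below is an assumption for illustration; the actual Offset248 mapping may differ):

    // Map each byte to a code point in a contiguous block of
    // printable, non-space characters, and back again.
    const OFFSET = 0x100; // assumed start: Latin Extended-A and onward
    const encode = bytes =>
        Array.from(bytes, b => String.fromCharCode(OFFSET + b)).join('');
    const decode = str =>
        Uint8Array.from(str, c => c.charCodeAt(0) - OFFSET);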


In UTF-8, these characters are all encoded as two bytes. So this encoding makes the byte count twice as long; it's no more efficient (in bytes) than hex encoding.
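
E.g., a one-line check (Node.js):

    // Code points in that region each cost two UTF-8 bytes, so the
    // output is 2x the input size, the same ratio as hex:
    console.log(Buffer.byteLength('\u0101', 'utf8')); // 2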


Base64 is widespread and works fine, while nobody groks base122.

Base122 requires UTF-8, and while that's pretty common, it's not universal, so base64 can't ever go away in favor of base122.

Compressed base64 is more efficient than base122 (with or without compression).

Conclusion: Big nope.


Are single-quotes allowed in the spec?

There is an explicit call out for "double quote == bad", but single quotes are also valid property delimiters in HTML.


If your property string is enclosed in a double quote, then a single quote in the payload is fine. (Otherwise, a lot of inline JS in onclick etc. attributes would break. JS allows both types of quotes on string literals for exactly this reason.)

Still, single quotes are somewhat asking for trouble.


Base-122 encoded strings may contain single quotes but not double quotes. The choice between the two was arbitrary.


I always wondered why people didn't do this.


I'm looking for an efficient encoding that is guaranteed to not contain certain substrings, such as improper language.


You can probably get most of what you want by simply not using vowels. A few non-English ones will still get through, like knnbccb, and there's also shlt and so on...
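
One cheap way to do that, as a sketch: pick a power-of-two alphabet with the vowels stripped out, e.g. 32 symbols at 5 bits per character.

    // Ten digits, twenty consonants, '-' and '_': 32 symbols, no vowels,
    // so no standard English word can appear in the output.
    const ALPHABET = '0123456789bcdfghjklmnpqrstvwxz-_';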


Define improper language. Does your improper-language dictionary cover every possible translation of every possible "improper" word?


meanwhile the Unicode consortium has been hard at work since 1991 to make it possible to encode up to 2.8 MB -- more than enough for most images, short videos, or many PDF files -- in a single character.


Are you referring to UTF-8? If so, this is misleading as you can encode up to 2^21 + 2^16 + 2^11 + 2^7 = 2,164,864 code points, which is not the same as encoding bytes in a single character.


I was making a joke about how ridiculously large Unicode is. Obviously it is not 2 million+ bytes per character! (But the fact that you didn't consider my joke obvious speaks volumes).



