Crit-bit trees are great, but they're not a silver bullet: like every tree structure, even in the lucky case a search can incur O(log n) cache misses, while a good hash table needs just one or two.
Also, as pointed out in another comment, it is not necessarily balanced (unlike red-black trees), and there are some real-world datasets, URLs for example, where crit-bit (or Patricia) trees behave poorly. I myself have seen Patricia trees with height greater than 100 for sets of URLs where log n ~ 30. According to VTune, most of the search time was spent in cache misses.
That's a very good point. One way to circumvent this problem is to combine hashing and crit-bit trees,
by constructing an unordered tree on the hash values of the set elements. The nice thing here is that if the
hash function behaves like a uniform random variable, the resulting tree will be balanced with high probability (expected depth O(log n)). Furthermore,
without ordering the implementation is trivial.
At the same time you will not need rehashing once the table "fills up", and since you don't discard
bits from the hash function, the chance that a newly inserted element collides with one of the n already
present is about n / 2^32 with a 32-bit hash, essentially zero for reasonable values of n.
Additionally you can use clustering (e.g. build a crit-nibble tree with 16 pointers per node, which with 32-bit pointers fits one 64-byte cache line)
to reduce the number of cache misses. This can backfire spectacularly unless the data is essentially random,
so the hashing step remains important.
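The "hash first, then build an unordered tree on the hash values" idea can be sketched roughly like this (illustrative code, not anyone's production implementation; `mix32` is a stand-in for your favourite hash function, here the MurmurHash3 finalizer). A plain binary trie branches on one hash bit per level, so with a well-mixed hash the expected depth is O(log n):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct hnode {
    struct hnode *child[2];   /* children for bit = 0 / bit = 1 */
    uint32_t hash;            /* stored hash value (leaves only) */
    int is_leaf;
};

/* MurmurHash3 finalizer: a stand-in for a real hash function. */
static uint32_t mix32(uint32_t h) {
    h ^= h >> 16; h *= 0x85ebca6bu;
    h ^= h >> 13; h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

static struct hnode *hleaf(uint32_t h) {
    struct hnode *n = calloc(1, sizeof *n);
    n->hash = h;
    n->is_leaf = 1;
    return n;
}

/* Insert hash h into the subtree n, which branches on bit `depth`. */
static struct hnode *hinsert(struct hnode *n, uint32_t h, int depth) {
    if (n == NULL)
        return hleaf(h);
    if (n->is_leaf) {
        if (n->hash == h)
            return n;                          /* already present */
        struct hnode *in = calloc(1, sizeof *in);
        in->child[(n->hash >> depth) & 1] = n; /* push old leaf down */
        return hinsert(in, h, depth);          /* and retry the insert */
    }
    int b = (h >> depth) & 1;
    n->child[b] = hinsert(n->child[b], h, depth + 1);
    return n;
}

static int hcontains(const struct hnode *n, uint32_t h) {
    int depth = 0;
    while (n != NULL && !n->is_leaf)
        n = n->child[(h >> depth++) & 1];
    return n != NULL && n->hash == h;
}
```

Note there is no rebalancing and no ordering logic at all; the hash does the balancing, exactly as described above.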
It may not be a silver bullet, but there are some interesting trade-offs.
The data structure is not better than a normal hash table. It is a different trade-off.
For instance, you don't have to do a rehashing step, which is important for real-time applications. Memory usage is also very deterministic at 2*(n-1) words for an n-element set: a binary tree with n leaves has n-1 internal nodes, each holding two pointer words.
If none of this matters to you and you merely want a map with fast lookups then a simple hash table is probably a better choice.
> A crit-bit tree for a nonempty prefix-free set S of bit strings has an external node for each string in S; an internal node for each bit string x such that x0 and x1 are prefixes of strings in S; and ancestors defined by prefixes.
And that's a perfect (missing) example for "a picture is worth a thousand words". Then again... this coming from DJB - I should be surprised the description is as verbose as it is.
> that's a perfect (missing) example for "a picture is worth a thousand words".
I don't know if it helps, but here's the picture I drew at the top of my patricia trie code. (Crit-bit trees aren't the same as patricia tries, but they're close enough for expository purposes.)
/**
 * Our Patricia tree structure can be thought of as operating on strings of
 * 9-bit bytes, where 0x00 -- 0xFF are mapped to 0x100 -- 0x1FF and 0x00
 * represents the end-of-string character (note that NUL can occur inside
 * keys).  The field (struct pnode).mask is either 0 or a power of 2; if 0,
 * the left child, if non-NULL, is a pointer to the record associated with
 * the key thus far.  For example, the strings "hello", "hello colin",
 * "hello world", and "wednesday", each associated with pointers to
 * themselves, are stored in the following tree:
 *
 *                            [0x10, 0x60, 0, ""]
 *                           /                   \
 *        [0x00, 0x00, 5, "hello"]       [0x00, 0x00, 9, "wednesday"]
 *         /          \                     /            \
 *    "hello"   [0x10, 0x60, 1, " "]   "wednesday"      NULL
 *               /                \
 *    [0x00, 0x00, 5, "colin"]   [0x00, 0x00, 5, "world"]
 *     /             \             /              \
 *  "hello colin"   NULL      "hello world"      NULL
 *
 */
Edited to add: And here's the node structure:
/* Structure used to store a Patricia tree node. */
struct pnode {
    struct pnode * left;   /* Left child. */
    struct pnode * right;  /* Right child. */
    uint8_t mask;          /* Critical bit mask. */
    uint8_t high;          /* High bits of this node. */
    uint8_t slen;          /* Length of s[]. */
    uint8_t s[];           /* Bytes since parent's s[]. */
};
The particular words you've quoted have managed to condense a thousand pictures — an infinite number, really — into 45 words. It's kind of a counterexample.
Some examples would certainly help, and probably diagrams would help understanding the examples.
That's assuming you take "nonempty prefix-free set", "internal node", "bit string x such that x0 and x1" (not sure if they're supposed to be subscripts or are 0 and 1 bits) for granted. A picture solves the problem in many ways giving you real examples.
It's 45 words with lots of assumed knowledge that you could write books about ;)
Yeah, I gotta say, I knew exactly how it worked the moment I looked at the radix tree picture on wikipedia. After reading more than half the linked article though I still had very little understanding of what it really did.
> Another advantage is that a crit-bit tree guarantees good performance: it doesn't have any tricky slowdowns for unusual (or malicious) data.
Consider a crit-bit tree over variable-length bit strings. The set {1, 11, 111, ...} (with n bit strings) has depth n.
Depending on how you count, it can be argued that this is still substantially better than the worst-case of hash tables, but in all of the ways I think of counting, a crit-bit tree is worse on this input than a balanced search tree.
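That degenerate set is easy to reproduce. Every string in {1, 11, 111, ...} is a proper prefix of the next, so at each bit position the tree must branch between "the key ends here" and "the key continues"; every bit is critical and the crit-bit tree degenerates into a path. For this particular set a plain (uncompressed) binary trie has exactly the same shape, so the sketch below uses one for brevity (keys given as '0'/'1' strings; the code is mine, purely illustrative):

```c
#include <assert.h>
#include <stdlib.h>

struct tnode {
    struct tnode *child[2];
    int is_key;               /* a key ends at this node */
};

static void tinsert(struct tnode *n, const char *bits) {
    for (; *bits; bits++) {
        int b = *bits - '0';
        if (n->child[b] == NULL)
            n->child[b] = calloc(1, sizeof *n);
        n = n->child[b];
    }
    n->is_key = 1;
}

/* Depth (in edges) of the deepest node at which a key ends. */
static int key_depth(const struct tnode *n, int d) {
    if (n == NULL)
        return 0;
    int best = n->is_key ? d : 0;
    int l = key_depth(n->child[0], d + 1);
    int r = key_depth(n->child[1], d + 1);
    if (l > best) best = l;
    if (r > best) best = r;
    return best;
}
```

Inserting the 20 strings "1", "11", ..., "111...1" yields a search path of 20 edges to the deepest key, versus roughly log2(20), about 5, for a balanced tree over the same 20 elements.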
The difference doesn't show up in big-O notation; the big question is the cost of the different types of comparisons.
The most important thing to remember is that a crit-bit lookup does O(string-length) "critical" bit lookups, and then an O(string-length) final comparison - for a total of O(string-length).
Whereas a hash lookup usually does O(string-length) computation of the hash value, average case O(1) lookups, and then O(string-length) comparison of the key value - for a total of O(string-length).
A balanced tree does O(log-n) comparisons, of O(string-length) each (the final one giving the confirmation, which was an independent step for both hash and critbit trees).
This is true even on this specific input. It boils down to how the amortized single-character comparison compares to a cache miss.
If crit-bit tree leaves are packed in a cache-friendly way (they are constant size, so this might actually be easy to arrange), they are likely to perform at least as well as balanced trees.
Given a query string of length k and cache lines that store B bits, the crit-bit trie for the bad input uses Theta(k) computations and causes Theta(k) cache misses, I think. A balanced tree holding the same set causes Theta(k * lg n) computations and O(lg n + (k * lg n / B)) cache misses, I think. My understanding is that lg n / B is frequently much less than 1, even for small cache line sizes, so that the balanced tree will generally use less I/O than a crit-bit tree but more computation.
A crit-bit tree stores next-different-bit positions at branches and entire values at the leaves, and is searched by following branches depending on the value of the relevant bit in the search key (the "critical bit" which gives the data structure its name). Once a leaf is reached the entire search key is compared to the value at the leaf, which is necessary because a key that is not in the tree might differ at bit positions that are non-critical and would not have been tested during the tree descent.
This is a very different arrangement to a radix tree, which is essentially a form of trie.
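That search procedure can be sketched for fixed-width 32-bit keys rather than bit strings (the names and layout here are mine, not from any particular implementation): internal nodes store only a critical bit index, the descent inspects just those bits, and only the final comparison against the stored key confirms membership, since untested bits may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct cb {
    int is_leaf;
    int bit;                  /* critical bit index (internal nodes) */
    struct cb *child[2];      /* subtrees for bit = 0 / bit = 1 */
    uint32_t key;             /* stored value (leaves) */
};

static struct cb *cb_leaf(uint32_t key) {
    struct cb *n = calloc(1, sizeof *n);
    n->is_leaf = 1;
    n->key = key;
    return n;
}

static int cb_contains(const struct cb *n, uint32_t key) {
    while (n != NULL && !n->is_leaf)
        n = n->child[(key >> n->bit) & 1];  /* test critical bits only */
    return n != NULL && n->key == key;      /* mandatory full comparison */
}

static struct cb *cb_insert(struct cb *root, uint32_t key) {
    if (root == NULL)
        return cb_leaf(key);
    /* Walk to the leaf this key would reach. */
    struct cb *n = root;
    while (!n->is_leaf)
        n = n->child[(key >> n->bit) & 1];
    if (n->key == key)
        return root;                        /* already present */
    /* Highest bit where the new key differs from that leaf. */
    uint32_t diff = key ^ n->key;
    int crit = 31;
    while (((diff >> crit) & 1) == 0)
        crit--;
    struct cb *nn = calloc(1, sizeof *nn);
    nn->bit = crit;
    int dir = (key >> crit) & 1;
    nn->child[dir] = cb_leaf(key);
    /* Splice in where bit indices stop exceeding crit (ancestors test
     * higher bits than descendants). */
    struct cb **where = &root;
    while (!(*where)->is_leaf && (*where)->bit > crit)
        where = &(*where)->child[(key >> (*where)->bit) & 1];
    nn->child[1 - dir] = *where;
    *where = nn;
    return root;
}
```

Note the two-phase insert: first walk to the nearest leaf to discover the critical bit, then walk again to splice the new node at the right depth. That is what keeps internal nodes to one per critical bit, unlike a plain trie with one node per bit.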
Years ago we used this at Singshot to evenly distribute numerical assets across subdirectories. Very useful these days for shardy cloudbucket applications.
According to the paper, this code was actually extracted directly from qhasm, which djb released in 2006. This is notable because any code becomes more worthy of admiration once we know it was written by djb. As Mark-Jason Dominus used to say, "I'd drink poisoned Kool-Aid as long as djb told me he'd made it himself."
I'm not sure it's the best way to store sets. It seems like potentially doing a pointer dereference per bit is quite an overhead. It would be 8 times faster (for dense sets) to have a crit-byte tree. Though then, obviously, if you did that naively there would be a huge associated storage cost. To mitigate that, you can do what AMTs (array mapped tries) do and have a bitmap per node followed by an array of pointers.
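The bitmap trick is compact enough to show inline (an illustrative sketch of the AMT-style layout, not code from any particular library): a 16-way node stores a 16-bit bitmap of which nibble values have children, followed by a densely packed array with one pointer per set bit, instead of 16 pointer slots.

```c
#include <assert.h>
#include <stdint.h>

struct amt_node {
    uint16_t bitmap;            /* bit v set => child for nibble v exists */
    struct amt_node *child[];   /* popcount(bitmap) entries, packed */
};

/* Position of nibble v's child in the packed array (assumes bit v of
 * bitmap is set).  __builtin_popcount is GCC/Clang; use your compiler's
 * equivalent otherwise. */
static int amt_slot(uint16_t bitmap, int v) {
    return __builtin_popcount(bitmap & ((1u << v) - 1));
}
```

So a node with three children costs 3 pointers plus 2 bytes rather than 16 pointers, and the lookup still costs only one popcount per level.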
The minimal-acceptor approach is also best when you're considering dynamic programs that effectively involve some kind of FSA-intersection, including best-first lazy versions like "find the closest spelling correction to a sequence of words in your lexicon of some text", because your state (suffix set) representation is canonical.
I only skimmed the paper, but a simple O(n log n) approach to produce the same result is to first sort the lexicon, then build the trie while canonicalizing the suffix sets depth first (needing to hash/compare only one level deep, by induction).
The data structure for the mealy recognizer described in the paper can very easily be changed to support minimal perfect hash numbers for all elements, without using more space (my implementation uses less space because of it, because the same information can be used to calculate the number of strings with a prefix in O(1) time).
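The usual way to get both properties is the counting trick (a sketch of the standard technique, not necessarily the paper's exact scheme): store at every node the number of set elements in its subtree. The count at a prefix's node then answers "how many strings have this prefix" in O(1) after walking the prefix, and summing the counts of earlier siblings along a word's path yields its lexicographic rank, i.e. a minimal perfect hash. Shown on a plain trie for simplicity; the same counts work on a minimized automaton.

```c
#include <assert.h>
#include <stdlib.h>

struct wnode {
    struct wnode *child[26];  /* children for 'a'..'z' */
    int is_word;              /* a stored word ends here */
    int count;                /* number of words in this subtree */
};

/* Insert w (lowercase, assumed not already present). */
static void winsert(struct wnode *n, const char *w) {
    n->count++;
    for (; *w; w++) {
        int c = *w - 'a';
        if (n->child[c] == NULL)
            n->child[c] = calloc(1, sizeof *n);
        n = n->child[c];
        n->count++;
    }
    n->is_word = 1;
}

/* Lexicographic rank of w (assumed present): a minimal perfect hash
 * into [0, total number of words). */
static int wrank(const struct wnode *n, const char *w) {
    int r = 0;
    for (; *w; w++) {
        if (n->is_word)
            r++;                      /* the word ending here sorts first */
        for (int c = 0; c < *w - 'a'; c++)
            if (n->child[c])
                r += n->child[c]->count;  /* whole earlier subtrees */
        n = n->child[*w - 'a'];
    }
    return r;
}
```

The counts add one integer per node, which is why supporting perfect hashing need not cost extra space relative to a representation that already stores per-node sizes.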
I couldn't find any benchmarks on the web to substantiate the speed claims.
Still, it's a cool data structure, and I respect djb enough to trust that it's at least competitive with hash tables; I'll consider it when I want more operations than offered by a hash table (especially for a set that's very sparse compared to key length).
The article puts forward a tree as a better alternative to hashtables:
"I have become convinced that this strategy should change. The revised strategy is much simpler: there should be one fundamental set-storage type, namely a crit-bit tree. Here's how a crit-bit tree stacks up against the competition:
* A hash table supports insertion, deletion, and exact searches. A crit-bit tree supports insertion, deletion, exact searches, and ordered operations such as finding the minimum. Another advantage is that a crit-bit tree guarantees good performance: it doesn't have any tricky slowdowns for unusual (or malicious) data."
Oh, what's the big-O of a lookup in a tree vs lookup in a hash-table again? O(lg N) vs O(1) with bad locality of reference to boot, you say?
He's not arguing that crit-bit trees are better than hash tables; he's arguing that they are a better choice for the fundamental set datatype for languages like Python, Perl, etc.
When you write an application, you know its performance requirements and can pick the set representation most appropriate for it. A language designer doesn't have this information: their choice of fundamental datatype affects everyone.
Bernstein is arguing that they should choose a set representation with good performance for as wide a range of features as possible rather than one that performs well for a small set of requirements, and very poorly for others.
I can see arguments both ways here, but FAIL is a bit too strong IMO.
You weaken his rhetoric; he writes like it's crit-bit all the way, for all purposes.
And since there is not a queue of people complaining that Python's dict is not iterable in order, I'd say he's wrong. Random access in a dict is the most common use-case by a long chalk.
https://github.com/rkapsi/patricia-trie https://github.com/jfager/functional-critbit
The latter is my own, with a focus on memory efficiency (by my benchmarks, it uses about 40% less space than patricia-trie for the same data set). Would love to get some extra eyes on it to figure out how to trim it further or speed it up.