I think it is fascinating how big the differences can be depending on which metric you look at. Just goes to show that there is no one-size-fits-all solution, and that you really can squeeze out extra performance if you know your application and target hardware. I imagine that if you have some specific fixed-length inputs you could do even better.
What's problematic about this table is that some of the "quality problems" are just statistical noise. For example it lists quality problems for sha1-160/sha2-224/sha2-256, which should not have any weaknesses that can be detected by a statistical test. Others seem to be implementation limitations, not flaws in the algorithm itself (e.g. blake3_c's "no 32bit portability").
I wonder if there would be a difference in wall-clock time if you ran multiple instances of the hash across multiple threads. Would the vectorized ones scale differently than the scalar ones? With dynamic clocking and weird execution-unit arrangements, everything is variable.
If you enable SMT then you might end up using almost all of your possible aesenc issue slots on one of the two threads (e.g. for ahash or meowhash), so you would observe significantly less than 2x the throughput from running 2 threads on the core. If you don't enable SMT then you should get the same results that SMHasher gets. Workloads that are not mostly made of hashing probably won't suffer from this issue even with SMT enabled.
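If you want to poke at this yourself, here's a rough sketch of the measurement in Python. It only shows the 1-vs-N-threads comparison idea (hashlib releases the GIL for large buffers, so plain threads do scale); it won't show AES-NI port contention, since that would need a native hash like ahash or meowhash behind a C extension.

import hashlib, os, time
from concurrent.futures import ThreadPoolExecutor

BUF = os.urandom(1 << 20)   # 1 MiB of input per hash call
CALLS = 256                 # total work, split evenly across threads

def hash_many(n):
    for _ in range(n):
        hashlib.sha256(BUF).digest()

def throughput(threads):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        per_thread = CALLS // threads
        list(pool.map(hash_many, [per_thread] * threads))
    elapsed = time.perf_counter() - start
    return (CALLS * len(BUF)) / elapsed / 1e9   # GB/s

for t in (1, 2, 4):
    print(f"{t} thread(s): {throughput(t):.2f} GB/s")

If the per-core units (or SMT siblings) are the bottleneck, the numbers stop doubling well before you run out of logical cores.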
This is really good!! Thanks! The security discussion is particularly interesting; I had thought it was the randomization of the initial state of the hash functions that mitigated the hash-table collision attacks in the popular programming languages.
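For what it's worth, the mitigation I was thinking of looks roughly like this. This is only a toy sketch (CPython actually uses a keyed SipHash for str/bytes, seeded via PYTHONHASHSEED); here I'm just using blake2b's key parameter to show the idea of a per-process secret:

import hashlib, os

_SECRET = os.urandom(16)   # chosen once per process, never exposed

def table_hash(key: bytes) -> int:
    # Keyed hash: the same key hashes the same within one process,
    # but bucket assignments differ from run to run, so an attacker
    # can't precompute a set of colliding inputs offline.
    digest = hashlib.blake2b(key, key=_SECRET, digest_size=8).digest()
    return int.from_bytes(digest, "little")

bucket = table_hash(b"example") % 1024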
I am interested in finding a hash function that helps me fill an array in a more compact way, without too many empty spaces. I.E. if I have 10k words to hash I don't want to use a 20k array.
This may be more about the indexing strategy than the hash function. Meaning: once we have the hash(es), how do we decide what index to use for the key?
Open-addressing schemes like linear probing, quadratic probing, and certain variants of Robin Hood hashing (which is kind of on a different axis anyway) commonly have 50–70% expected space efficiency; performance gets bad above 90% load.
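To make the Robin Hood idea concrete, a toy insert looks something like this (the table size and the use of Python's built-in hash() are placeholders, not a real implementation):

N_RH = 16
table = [None] * N_RH      # each slot holds (key, probe_distance) or None

def rh_insert(key):
    h = hash(key) % N_RH
    entry, dist = key, 0
    for _ in range(N_RH):
        if table[h] is None:
            table[h] = (entry, dist)
            return
        other, other_dist = table[h]
        if other_dist < dist:
            # "Steal from the rich": the incumbent is closer to its home
            # slot, so it gets bumped and continues probing instead of us.
            table[h] = (entry, dist)
            entry, dist = other, other_dist
        h = (h + 1) % N_RH
        dist += 1
    raise RuntimeError("table full")

for word in ("alpha", "beta", "gamma"):
    rh_insert(word)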
But Cuckoo hash tables can get above 90% with great lookup performance (non-amortized constant time) and decent insert performance. Insert performance tends to suffer closer to 100%.
Whether or not cuckoo hash tables qualify as “open addressing” will depend on the implementation, but the main point is that the addressing/indexing strategy mainly determines the operable “load factor” of the hash table.
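A minimal cuckoo sketch, just to make the two-tables/two-hashes idea concrete. The hash choice, table size, and kick limit here are arbitrary; a real implementation would rehash or grow instead of giving up:

import hashlib

N_SLOTS = 1024
MAX_KICKS = 32
tables = [[None] * N_SLOTS, [None] * N_SLOTS]   # two tables, one hash each

def h(i, key):
    # Two independent hash functions, derived here by personalizing blake2b.
    digest = hashlib.blake2b(key, person=bytes([i]), digest_size=8).digest()
    return int.from_bytes(digest, "little") % N_SLOTS

def cuckoo_lookup(key):
    # A key can only ever live in one of exactly two slots.
    return any(tables[i][h(i, key)] == key for i in (0, 1))

def cuckoo_insert(key):
    i = 0
    for _ in range(MAX_KICKS):
        idx = h(i, key)
        if tables[i][idx] is None:
            tables[i][idx] = key
            return True
        # Slot taken: evict the occupant and try to place it in the other table.
        key, tables[i][idx] = tables[i][idx], key
        i = 1 - i
    return False   # would trigger a rehash/resize in a real implementation

for word in (b"alpha", b"beta", b"gamma"):
    cuckoo_insert(word)
print(cuckoo_lookup(b"beta"), cuckoo_lookup(b"delta"))   # True False

The non-amortized constant-time lookup comes from that two-slot invariant; all the variable work is pushed onto inserts, which is why insert performance is what degrades near 100% load.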
This is a great article on open addressing schemes (including cuckoo):
The choice depends on your circumstances. If you have 10K arbitrary words and want a 50% chance of collision then using a good hash means you'll need 20K entries (assuming you aren't using chaining).
You are thinking of a different calculation. I was referring to the average number of probes needed to find an unused slot, not the number needed to have a very low chance of collision for the 10,000 elements in the hash table.
If the slot is already unused, then only one probe is needed. If the slot is used, then either you have a match or you need to test the next slot(s). With 50% of the slots occupied and 50% free, you should expect 1 + 0.5 + 0.25 + ... = 2.0 probes until you find an empty slot.
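In other words it's the geometric series 1 / (1 - a) for load factor a, which you can sanity-check in a couple of lines:

# Expected probes with uniform probing at load factor a: 1 + a + a**2 + ... = 1/(1-a)
for a in (0.5, 0.7, 0.9):
    print(f"load {a:.0%}: {1 / (1 - a):.2f} expected probes to find a free slot")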
Here are my results from testing three different probing techniques, using 10,000 items inserted into 20,000 slots, followed by generating 10,000 new random words and seeing how many probes are needed to find an empty slot (#probes / num words):
This shows that with a really good probing algorithm (each probe uses a new sha256 calculation), on average you really do have only a 50% chance of collision, and even with linear probing (which is easy to implement, quick to re-probe, and has good cache locality) you don't take a big hit due to collisions.
Here's the code for the above:
import random, hashlib, itertools

rng = random.Random()

# Create a random 'word' - the details don't matter as I assume there's a good hash
def get_word():
    return rng.randbytes(10)

# Return a really good hash in the range [0, 2**256).
def get_hash(word, attempt):
    sha256 = hashlib.sha256(attempt.to_bytes(4, "little") + word)
    return int.from_bytes(sha256.digest(), "little")

N = 20_000

# The probe functions below require a module-level `slots` list,
# initialized as:
#   slots = [None] * N

# Use a new hash for each probe
def k_independent_probe(word):
    for attempt in itertools.count(1):
        h = get_hash(word, attempt) % N
        if slots[h] is None:
            return (attempt, h)

# Increment the hash by 1 for each probe (good cache locality)
def linear_probe(word):
    h = get_hash(word, 0) % N
    for attempt in itertools.count(1):
        if slots[h] is None:
            return (attempt, h)
        h = (h + 1) % N

# Increment the hash by attempt**2 for each probe
def quadratic_probe(word):
    h = get_hash(word, 0) % N
    for attempt in itertools.count(1):
        if slots[h] is None:
            return (attempt, h)
        h = (h + attempt ** 2) % N

# Add the word to the hash table.
# The module variable `probe` must be set to one of the above functions.
def insert(word):
    _, h = probe(word)
    slots[h] = word

# Test out the probe functions.
for probe in (linear_probe, quadratic_probe, k_independent_probe):
    slots = [None] * N
    # Add 10,000 entries
    for i in range(10_000):
        word = get_word()
        insert(word)
    # See how many probes are needed to find an empty slot
    num_tests = 10_000
    total_num_probes = 0
    for i in range(num_tests):
        word = get_word()
        num_attempts, _ = probe(word)
        total_num_probes += num_attempts
    print(f"{probe.__name__}: {total_num_probes} / {num_tests} = {total_num_probes/num_tests:.2f}")