Make sure to also read the post linked right at the beginning, http://0x80.pl/notesen/2023-04-30-lookup-in-strings.html, and the magic bitboards page it links to, https://www.chessprogramming.org/Magic_Bitboards
Though honestly this post really needed some numbers and benchmarks.
I never really finished the project, hence only the rough qualitative benchmarks at the bottom (measured mostly by profiling and size(1)). It wasn't enough of a win in the larger context where I needed it, so it made sense to stop early instead of making exhaustive benchmarks.
The blog post was mainly for curious readers, I'm surprised it hit HN at all. :-)
gperf is very limited in the number of keys it can handle as opposed to, say, https://burtleburtle.net/bob/hash/perfect.html
Well, again, different problem constraints, different solutions. Seemingly that tool can handle larger sets than gperf (although it claims gperf stops at a couple hundred keys, which is an exaggeration; try it with the first 1000 lines of /usr/dict/words and it's nearly instant, and with the first 10k it needs 35 seconds or so), but it also says the resulting hash is slower at runtime. My goal was faster runtime, not more keys. YMMV.
Not an exaggeration, just written when machines were a lot slower. Anyway, more work in this space is always welcome, so thanks.
There’s an explanation of how to implement a perfect hash generator in “Hacker’s Delight” that you can read to take a known set of keys and produce a consistent perfect hash at runtime.
It’s a very worthwhile thing to keep in your back pocket if you’re going to deal with hashes whose keys are 100% knowable at runtime but not before.
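For reference, here's a minimal sketch of that general idea, not the book's exact construction: at startup, brute-force a seed for a cheap string hash until the known key set lands collision-free in a power-of-two table. The hash function and limits below are made up for illustration.

    /* Hedged sketch: given keys known only at startup, search for a seed that
       makes a cheap string hash collision-free on a power-of-two table.
       Not the book's exact construction; hash and limits are illustrative. */
    #include <stdint.h>
    #include <stdio.h>

    static uint32_t hash(const char *s, uint32_t seed) {
        uint32_t h = seed;
        while (*s)
            h = h * 31u + (unsigned char)*s++;
        return h;
    }

    /* Returns a collision-free seed for `keys` on `size` slots (size must be a
       power of two, at most 1024 here), or 0 if none was found. */
    static uint32_t find_perfect_seed(const char **keys, int n, uint32_t size) {
        for (uint32_t seed = 1; seed < 1000000u; seed++) {
            uint8_t used[1024] = {0};
            int ok = 1;
            for (int i = 0; i < n && ok; i++) {
                uint32_t slot = hash(keys[i], seed) & (size - 1);
                if (used[slot]) ok = 0;
                used[slot] = 1;
            }
            if (ok) return seed;
        }
        return 0;
    }

    int main(void) {
        const char *keys[] = { "if", "else", "while", "for", "return" };
        uint32_t size = 8, seed = find_perfect_seed(keys, 5, size);
        for (int i = 0; seed && i < 5; i++)
            printf("%-6s -> slot %u\n", keys[i], hash(keys[i], seed) & (size - 1));
        return 0;
    }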
One thing not mentioned: very often "give up and allow a not-quite-perfect hash" is a reasonable solution.
I believe there are some high-concurrency hash tables out there where the data structure contains two tables, so each get takes a constant number of fetches, though the count is almost always greater than one. If that avoids concurrency issues, it ends up being acceptable.
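That sounds like a cuckoo-style layout. A minimal sketch of the lookup side, with everything here (names, sizes, hash functions) illustrative and insertion/displacement omitted:

    /* Two tables, two hashes: every get probes at most one slot per table, so
       the fetch count is constant but usually greater than one.
       Sketch only; insertion/displacement and concurrency control omitted. */
    #include <stdint.h>
    #include <string.h>

    #define TABLE_SIZE 1024

    struct entry { const char *key; int value; };   /* key == NULL: empty slot */

    static struct entry table1[TABLE_SIZE], table2[TABLE_SIZE];

    static uint32_t hash1(const char *s) {          /* FNV-1a */
        uint32_t h = 2166136261u;
        while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
        return h;
    }

    static uint32_t hash2(const char *s) {          /* djb2 */
        uint32_t h = 5381u;
        while (*s) h = h * 33u + (unsigned char)*s++;
        return h;
    }

    int lookup(const char *key, int *value_out) {
        const struct entry *e = &table1[hash1(key) % TABLE_SIZE];
        if (e->key && strcmp(e->key, key) == 0) { *value_out = e->value; return 1; }
        e = &table2[hash2(key) % TABLE_SIZE];
        if (e->key && strcmp(e->key, key) == 0) { *value_out = e->value; return 1; }
        return 0;                                   /* at most two fetches either way */
    }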
I remember Cliff Click presenting one that was lock-free and where, if I'm recalling correctly, capacity-growth operations could happen in parallel with reads. And possibly writes.
I am working on modern perfect hashing also. First for integer keys, and then also for millions of keys. Should be practical, not academic.
https://github.com/rurban/gperf/tree/autotools or some other branch. Forgot which.
https://github.com/rurban/cmph/tree/compressed_o (lotsa improvements)
https://github.com/rurban/pthash (compile to static C++)
I've also extended nbperf for those features.
Uhh, there is nothing at all "academic" about pthash (or phast, or ptrhash), apart from the fact that they also described their ideas in papers. All of those tools work on massive sets of data, support massively parallel construction, and support external-memory construction. They are all well-engineered libraries and not "academic" in any derogatory sense.
They are academic, as they are not practical. Most of the algorithms produce random graphs, i.e. random indices: the resulting index has nothing to do with the keys.
They give the hash sizes without the needed ordering tables; they don't even produce those separate tables, because of the academic competition. So in reality you need O(2n) more time and O(n) space, and without random key access it gets slow. There are also no checks for false positives, for which you'd need fast random access to the keys. And they cannot compile to static source code.
All this is pure fantasy for real products. gperf and nbperf are practical: they store the keys, provide a proper search function, and produce C code. pthash, ptrhash, and cmph do not. It's complicated because the keys can be in a file, a stream, or a database, with just linear access or with random access. I'm adding options to either store the ordering table or emit an ordered keys file, and I compile to C or C++ code. I also allow several hash algorithms, not just random seeds: the Jenkins hash is mostly good, but there are better ones, depending on the keys.
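To make the "proper search function" point concrete, this is roughly the shape of what such generators emit (a hand-written toy, not actual gperf output): the keys are stored and the candidate is verified, so a non-member can never be reported.

    /* Hand-written toy showing the *shape* of a gperf-style lookup, not actual
       gperf output: a (here trivial) perfect hash picks one candidate slot and
       the stored key is compared, so there are no false positives. */
    #include <string.h>

    /* Slots chosen so that (len + 2*first char) % 16 is collision-free
       for this particular key set. */
    static const char *wordlist[16] = {
        [3] = "while", [4] = "if", [10] = "return", [14] = "else", [15] = "for",
    };

    const char *in_word_set(const char *s, size_t len) {
        unsigned slot = (unsigned)(len + 2u * (unsigned char)s[0]) % 16u;
        const char *candidate = wordlist[slot];
        if (candidate && strcmp(candidate, s) == 0)
            return candidate;
        return 0;
    }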
What's most annoying with gperf and similar tools is that they aren't really suited to applications where the set of keys is known at runtime during initialization.
I've written code that, during initialization, after all keys have been collected, essentially called gperf to create the lookup function, compiled it, and then dynamically loaded it, so that the (long-running) program would be able to use it.
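Roughly like this (a hedged sketch of that approach; the paths, flags, and error handling are illustrative, and in_word_set is gperf's default lookup function name):

    /* Sketch of the "gperf at initialization" trick: dump the collected keys,
       run gperf and cc on them, then dlopen the result. Paths and flags are
       illustrative and all error handling is omitted. Link with -ldl. */
    #include <dlfcn.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef const char *(*lookup_fn)(const char *str, size_t len);

    lookup_fn build_lookup(const char **keys, int n) {
        FILE *f = fopen("/tmp/keys.gperf", "w");
        fprintf(f, "%%%%\n");                      /* keyword section starts after %% */
        for (int i = 0; i < n; i++)
            fprintf(f, "%s\n", keys[i]);
        fclose(f);

        system("gperf /tmp/keys.gperf > /tmp/lookup.c");
        system("cc -O2 -shared -fPIC -o /tmp/lookup.so /tmp/lookup.c");

        void *handle = dlopen("/tmp/lookup.so", RTLD_NOW);
        /* in_word_set() is gperf's default lookup function name. */
        return handle ? (lookup_fn)dlsym(handle, "in_word_set") : NULL;
    }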
That means you need to deploy with a compiler etc., not ideal.
Maybe this is one of those situations where compile time code execution wins out. Instead of needing one solution for runtime and one for a priori knowledge, you just run the runtime code during the build process and Bob's your uncle.
See https://burtleburtle.net/bob/hash/perfect.html
If I'm understanding this correctly, you're essentially hashing twice. You get a perfect hash on the second hash call, and don't worry about one on the first.
There are other hash tables that use double hashing but they don't guarantee a lack of collisions. They just make them highly improbable.
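Right, and a minimal sketch of that two-level reading looks like this (layout and hash function are illustrative; the build-time search for per-bucket seeds is omitted):

    /* Sketch of the two-level ("hash twice") idea: the first hash only picks a
       bucket and that bucket's seed, and may collide freely; the second hash,
       parameterized by the seed, is the one chosen at build time to be
       collision-free. Layout and hash function are illustrative. */
    #include <stdint.h>

    #define NBUCKETS 64
    #define NSLOTS   256

    static uint32_t bucket_seed[NBUCKETS];   /* filled in at build time */
    static int      slot_value[NSLOTS];

    static uint32_t hash(const char *s, uint32_t seed) {
        uint32_t h = seed ^ 2166136261u;
        while (*s) { h ^= (unsigned char)*s++; h *= 16777619u; }
        return h;
    }

    int lookup(const char *key) {
        uint32_t bucket = hash(key, 0) % NBUCKETS;                /* may collide */
        uint32_t slot = hash(key, bucket_seed[bucket]) % NSLOTS;  /* collision-free */
        return slot_value[slot];
    }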
I wrote a Python library that would build a perfect hash at run-time. It was basically the stupidest way you could do it: it would shell out to gperf to generate the lookup code, compile it into a shared library, then link the entry points in (I think with ctypes? I can't remember).
It was just for fun, but in the end even ignoring the startup costs of all of that, the default performance of python's dictionary was better.
Obligatory mention: https://github.com/pytries/marisa-trie (marisa-trie, a succinct data structure), which gives compression close to the information-theoretic lower bound
particularly marisa_trie.RecordTrie: https://marisa-trie.readthedocs.io/en/latest/tutorial.html#m...