This post is my inaugural post discussing technology, algorithms, etc., and is a response to
http://sna-projects.com/blog/2010/06/beating-binary-search/, showing how you can apply practical engineering to beat some of the best theoretical algorithms.
In the world of software engineering, I've seen a lot of elegant solutions to hard problems, and a lot of horrible solutions to simple problems. As an algorithm, binary search is straightforward to conceptualize, not too difficult to implement, and solves searching a sorted sequence very well. It's generally one of the first (and most correct) answers to a lot of questions, and its conceptual elegance is why it is used so often.
The general idea is that if you are looking for something in a sorted sequence, you check the middle item. If what you are looking for comes before the item you just checked, you throw out everything greater than that item, and have just cut your problem in half (or vice versa when what you are looking for is greater than the item you just checked). This repeated halving takes log2(n) comparisons, where 'n' is the number of items in your sequence. In simple terms, how good is this? Searching 1 million items takes 20 comparisons, and searching 1 trillion only takes 40. Not bad. But we can do better.
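As a quick sketch, that halving loop looks like this in Python (the names `binary_search`, `haystack`, and `needle` are mine, not from the original post):

```python
def binary_search(haystack, needle):
    """Return the index of needle in the sorted list haystack, or -1 if absent."""
    lo, hi = 0, len(haystack) - 1
    while lo <= hi:
        mid = (lo + hi) // 2          # check the middle item
        if haystack[mid] == needle:
            return mid
        elif haystack[mid] < needle:
            lo = mid + 1              # throw out the lower half
        else:
            hi = mid - 1              # throw out the upper half
    return -1

print(binary_search([2, 3, 5, 7, 11, 13], 7))   # -> 3
```

Each pass through the loop discards half of what remains, which is where the log2(n) bound comes from.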
If you know the distribution of your sorted data, you can write a function to guess near where you expect your item to be. This is generally how humans search for names in a phone book, and is what computer scientists call interpolation search. Analysis has shown that if your function reasonably approximates how the data is actually distributed, you can reduce that log2(n) to log2(log2(n)), dropping those 20 comparisons from binary search to 5, or 40 to 6. This is very powerful when the cost to compare is high, or when it is expensive to get the element to compare. This is the problem Jay had when he wrote the post mentioned earlier: the data they are dealing with is too big to fit in memory, so they need to go to disk before performing almost every comparison. If you have to do 20 comparisons, and it takes 10ms to get your data from disk, that's 200ms to do your search. But if you can reduce that to 5 reads from disk, you've just cut your search time to 50ms, a 4x improvement in performance.
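Here is a minimal sketch of interpolation search over an in-memory sorted list of numbers, assuming a roughly uniform distribution (the guess function is simple linear interpolation; all names are mine):

```python
def interpolation_search(haystack, needle):
    """Search a sorted list of numbers, guessing the position by linear interpolation."""
    lo, hi = 0, len(haystack) - 1
    while lo <= hi and haystack[lo] <= needle <= haystack[hi]:
        if haystack[lo] == haystack[hi]:
            mid = lo                  # all remaining keys are equal; avoid dividing by zero
        else:
            # Guess where needle falls, assuming keys are roughly evenly spread.
            mid = lo + (needle - haystack[lo]) * (hi - lo) // (haystack[hi] - haystack[lo])
        if haystack[mid] == needle:
            return mid
        elif haystack[mid] < needle:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```

The only change from binary search is how `mid` is chosen: instead of always splitting in the middle, it jumps close to where a uniformly distributed key ought to sit, which is what buys the log2(log2(n)) behavior when the guess is good.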
When dealing with the specific characteristics of disks (slow seeks, fast streaming), you can further improve interpolation search with a couple of tricks. When you read 1 byte from disk in your program, the underlying system isn't actually reading 1 byte; it is typically reading 4096 bytes (or whatever your block size is). You can exploit this: instead of reading just the specific data you want to compare against, read the entire block your data is in. And since you are already there, why not grab something like 32k before and 32k after the data you want to check? That extra context helps you recover when what you are looking for isn't where you guessed, which mitigates one of the major drawbacks of interpolation search: your function isn't always the best at predicting the distribution of your data. Even at a modest modern drive transfer speed of 100MB/second, that 64k only takes an additional 0.64ms on top of the 10ms seek, and helps to guarantee the 5 disk reads you really want (comparisons on in-memory data are effectively free compared to reading from disk). But can we do better?
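A rough sketch of that one-seek, 64k-window read, assuming fixed-size 32-byte records (a 16-byte id plus 16 bytes of pointer data, as in the example below; the function and constant names are hypothetical):

```python
import os

RECORD = 32            # 16-byte id + 16-byte pointer (assumed record layout)
WINDOW = 64 * 1024     # 32k before + 32k after the predicted position

def read_window(fd, predicted_offset, file_size):
    """One seek plus one streaming read centered on the predicted record position."""
    start = max(0, min(predicted_offset - WINDOW // 2, file_size - WINDOW))
    start -= start % RECORD            # align to a record boundary
    os.lseek(fd, start, os.SEEK_SET)
    data = os.read(fd, WINDOW)
    # Split the window into (id + pointer) records for free in-memory comparisons.
    return start, [data[i:i + RECORD] for i in range(0, len(data), RECORD)]
```

After one such read, the search either finds the record in memory or learns exactly how far off its guess was, which sharpens the next interpolation step.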
One very common search-engine trick is to build an index on top of your sorted data. I know what you are thinking, "Josiah, isn't an index just another binary search (tree)?" Yes and no. We'll take a large-scale example, so that I can explain what we do, and why it's better.
Let us say that we have 16 billion ids that we need to search. These ids are 16 bytes, and have 16 bytes of pointer data associated with them (to match the problem they were having in Project Voldemort). That's 512 gigs of data. No small task.
If we use a normal binary search, we need to perform 23 disk reads before we discover the 64k block we are looking for (log2(16 billion) ≈ 34, minus 11 for the 2048 entries in each 64k block, leaves 23), costing us 230ms. If we use interpolation search, that's only 5 reads, or 50ms. Not bad. But we can do better.
What if we wrote a second file, which contained just the first item (without pointer data) from every 64k block? We could represent an entire 64k block in 16 bytes of data, or our 512 gigs in just 128 megs. If you've got 128 megs to spare in your machine, you keep that file, and can binary search in-memory over those items in a few microseconds, followed by a single seek + 64k read, to find exactly the data you want in 10ms.
But what if you don't have that 128 megs to spare? Well, we can do the same thing to that 128 meg sequence to get it down to 32k. I know you've got 32k sitting around, which will get you 2 disk seeks with 64k reads, for a total of 20ms.
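The two-level idea above can be sketched in a few lines of Python. Here the "disk" is simulated with a plain in-memory list of keys, `BLOCK_KEYS` assumes 2048 records per 64k block as in the example, and all names are mine:

```python
import bisect

BLOCK_KEYS = 2048   # records per 64k block (64k / 32-byte records)

def build_sparse_index(keys):
    """Keep only the first key of each block: 512 gigs of data -> ~128 megs of index."""
    return keys[::BLOCK_KEYS]

def search(keys, index, needle):
    """Binary-search the in-memory index, then read a single block of the full data."""
    # Rightmost block whose first key is <= needle.
    block = bisect.bisect_right(index, needle) - 1
    if block < 0:
        return -1
    start = block * BLOCK_KEYS
    block_keys = keys[start:start + BLOCK_KEYS]   # in real life: one seek + 64k read
    i = bisect.bisect_left(block_keys, needle)
    if i < len(block_keys) and block_keys[i] == needle:
        return start + i
    return -1
```

The in-memory binary search over the sparse index costs microseconds; the only disk work left is the single block read, which is where the 10ms figure comes from. Applying `build_sparse_index` to its own output gives the 32k second level and the two-read variant.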
This is a full B+Tree with implicit pointers. And in the case of static data, beats pretty much anything else out there. This is why databases, filesystems, etc., all use B+Trees to improve search performance.
So what did we do? We took 34 comparisons and disk reads, showed how a simple trick (64k reads) reduces that to 23, then got it down to 5 reads with a better algorithm. With a tiny amount of extra space we can cut that to 2 reads, and with a bit more spare memory, just one. 230ms -> 50ms -> 20ms -> 10ms. Not bad.