Whenever possible at Adly, we like to use existing software to solve as many problems as we can. From PostgreSQL for our database, to syslog-ng and Flask for our logging, to Redis for search, caching, and other fast-data operations, ActiveMQ for our message broker, etc. But there are times when existing software just isn't fast enough, takes too much memory, or just isn't suited for the task. This is one of those times.
In our Adly Analytics product, we calculate a number for pairs of Twitter users called "Also Follows". For two twitter users, it is the number of followers that follow both of those users. We had tried using Redis for this directly, but we were looking at roughly 700 megs of memory to store one set of 7.5 million followers. Ugh! Straight Python did a bit better at 500 megs, but that was still far more than we could reasonably use to store the data. Combine that with Redis intersections taking multiple seconds to run, and the fact that we needed to intersect 1,000 sets against 10,000 sets every week (these were due to the nature of the kinds of updates we were performing), and we'd need to have 33 processors plus roughly 300 gigs of memory to just perform this operation fast enough. That's just too expensive (over $6k/month just for machines in EC2 for the memory).
When faced with seemingly insurmountable costs, we did what any tech-directed company would do; we wrote custom software to solve our problem. What did we do?
First, we got rid of hash tables. They are great for random lookups, but in the case for fixed-size data (we had 32 bit integers for Twitter user ids), a sorted array would be roughly 1/20th the size of the set in Redis, while offering a method for random lookup (if necessary).
When we are fetching the follower lists, we get them in chunks of 5000 at a time, unsorted. We write them to temporary files on disk while we are reading them, then when done, we read the list into memory, then sort it using the C standard library quicksort. After sorting, we scan the sequence for any duplicate values and discard them. Once we have a sorted sequence of unique integers, we write them to disk.
Once they are on disk, we signal our intersection daemon to memory map the file*, letting the operating system cache the file as necessary. When we need to perform the "also follows" calculation, we run a custom intersection algorithm that is tuned to rely on memory bandwidth, which allows us to intersect a set of 5 million and 7.5 million integers with roughly 2 million shared integers in roughly 6 milliseconds on a 3ghz Core 2 Duo (using one processor). That is roughly 2 billion integers examined every second. Compare that to Redis taking roughly 7 seconds to perform that same operation (due to the random hash table lookups, allocating and copying data, etc.), and we've improved our performance by 1000x.
For the data we need to store in memory, we went from over 300 gigs to under 15 gigs. We also went from taking around 45 minutes to intersect 1 item against 10,000, to taking 3 seconds**. While we still believe in using existing software as much as possible, sometimes it is just worth it to write what you need to in C/C++ to get the speed you need.
Along the way, we tried to use scipy.weave.inline to inline C into Python. Sadly, due to some threading issues, we got some nasty behavior and segfaults. I didn't dig too far into it and just wrote what we needed as a plain Python C extension. It's pretty easy.
* Memory mapped files are an interesting and crazy thing, available in Posix (Linux, BSD, OS X, etc.), Windows, and just about any other mature system out there. They allow you to open a file in such a way that you can access the contents of the file as if it were a C array. You can map the entire file or parts of the file. The operating system caches the data as necessary, and if you have multiple processes memory mapping the same file, the memory is shared between them.
** Theoretically it was dropped to 3 seconds, but due to processor caching behavior, Python overhead, etc., this is actually around 45-50 seconds. In an hour, we sustain roughly 2.4 trillion integers examined, or roughly 670 million integers intersected against each other every second on average.
If you want to see more posts like this, you can buy my book, Redis in Action from Manning Publications today!
Update: You can read why we didn't use a bloom filter in a followup post.