Thursday, September 15, 2011

Improving Performance by 1000x

Whenever possible at Adly, we like to use existing software to solve as many problems as we can. From PostgreSQL for our database, to syslog-ng and Flask for our logging, to Redis for search, caching, and other fast-data operations, ActiveMQ for our message broker, etc. But there are times when existing software just isn't fast enough, takes too much memory, or just isn't suited for the task. This is one of those times.


In our Adly Analytics product, we calculate a number for pairs of Twitter users called "Also Follows". For two twitter users, it is the number of followers that follow both of those users. We had tried using Redis for this directly, but we were looking at roughly 700 megs of memory to store one set of 7.5 million followers. Ugh! Straight Python did a bit better at 500 megs, but that was still far more than we could reasonably use to store the data. Combine that with Redis intersections taking multiple seconds to run, and the fact that we needed to intersect 1,000 sets against 10,000 sets every week (these were due to the nature of the kinds of updates we were performing), and we'd need to have 33 processors plus roughly 300 gigs of memory to just perform this operation fast enough. That's just too expensive (over $6k/month just for machines in EC2 for the memory).

When faced with seemingly insurmountable costs, we did what any tech-directed company would do; we wrote custom software to solve our problem. What did we do?


First, we got rid of hash tables. They are great for random lookups, but in the case for fixed-size data (we had 32 bit integers for Twitter user ids), a sorted array would be roughly 1/20th the size of the set in Redis, while offering a method for random lookup (if necessary).

When we are fetching the follower lists, we get them in chunks of 5000 at a time, unsorted. We write them to temporary files on disk while we are reading them, then when done, we read the list into memory, then sort it using the C standard library quicksort. After sorting, we scan the sequence for any duplicate values and discard them. Once we have a sorted sequence of unique integers, we write them to disk.

Once they are on disk, we signal our intersection daemon to memory map the file*, letting the operating system cache the file as necessary. When we need to perform the "also follows" calculation, we run a custom intersection algorithm that is tuned to rely on memory bandwidth, which allows us to intersect a set of 5 million and 7.5 million integers with roughly 2 million shared integers in roughly 6 milliseconds on a 3ghz Core 2 Duo (using one processor). That is roughly 2 billion integers examined every second. Compare that to Redis taking roughly 7 seconds to perform that same operation (due to the random hash table lookups, allocating and copying data, etc.), and we've improved our performance by 1000x.


Our results:
For the data we need to store in memory, we went from over 300 gigs to under 15 gigs. We also went from taking around 45 minutes to intersect 1 item against 10,000, to taking 3 seconds**.  While we still believe in using existing software as much as possible, sometimes it is just worth it to write what you need to in C/C++ to get the speed you need.

Along the way, we tried to use scipy.weave.inline to inline C into Python. Sadly, due to some threading issues, we got some nasty behavior and segfaults. I didn't dig too far into it and just wrote what we needed as a plain Python C extension. It's pretty easy.

* Memory mapped files are an interesting and crazy thing, available in Posix (Linux, BSD, OS X, etc.), Windows, and just about any other mature system out there. They allow you to open a file in such a way that you can access the contents of the file as if it were a C array. You can map the entire file or parts of the file. The operating system caches the data as necessary, and if you have multiple processes memory mapping the same file, the memory is shared between them.

** Theoretically it was dropped to 3 seconds, but due to processor caching behavior, Python overhead, etc., this is actually around 45-50 seconds. In an hour, we sustain roughly 2.4 trillion integers examined, or roughly 670 million integers intersected against each other every second on average.

If you want to see more posts like this, you can buy my book, Redis in Action from Manning Publications today!

12 comments:

  1. Small addition - if the data set to be sorted is extremely huge (i.e. exceeds available RAM by more than 100x times), then std::sort() may become slower than external sorting ( http://en.wikipedia.org/wiki/External_sorting ). In this case try STXXL ( http://stxxl.sourceforge.net/ ).

    ReplyDelete
  2. Nice article!

    Try also std::sort() + std::set_intersection() from STL algorithms ( http://www.cplusplus.com/reference/algorithm/ ) for intersecting user ids - this probably will be faster than your custom solution. Also the resulting code size may become smaller and cleaner.

    std::sort() implementation provided by gcc is based on Introsort ( http://en.wikipedia.org/wiki/Introsort ). It behaves quite well with huge arrays, which don't fit RAM, because it tends to access data sequentially. So you can directly sort ids in memory mapped file without splitting the file into small chunks.

    std::set_intersection() also works excellent with huge data sets, which don't fit RAM due to the same reason - it accesses data sequentially.

    Waiting for the next blog post, which reveals additional performance boost gained from STL algorithms :)

    ReplyDelete
  3. For some reason, your other posts were deleted. I didn't delete them, but I did get the emails.

    Size wasn't an issue for us, as our largest sets were 20 million elements, which at 4 bytes each maxed out at 80 megabytes. Those weren't a problem on the 16+ gig machine we were using.

    In terms of performance, it is unlikely that C++ would be any faster. Typically, C++ code generation will produce vtables for code execution, which even if the functions themselves are inlined, can bloat the actual core loops with extra code, slowing it down.

    I also started out with an algorithm similar to the one described on the STL page, but improved it by roughly 2x by making the simple observation that longer sets will skip over more items. By restructuring the code into 2 nested loops (instead of one loop), the inner loop is faster at skipping over the longer list, with the outer loop handling exact matches and skipping over the shorter list. This improves branch prediction and reduces core code size, getting us 2x better performance. Also, we don't care about actually getting the set of intersected users, we just cared about the count. So the STL is a guaranteed loss in terms of performance.

    Ignoring performance, using the STL is the worst experience I've ever had with any programming language. And using C++ with Python (which was what we used to control it) is at least a big of a pain as using C++ in the first place.

    So for the sake of performance, simplicity, and sanity, I don't believe I'll be rewriting it in C++ :P

    Regards.

    ReplyDelete
  4. If the "also follows" requires calculation of only the size of the intersection, it can be calculated with a bloom filter calibrated to the desired error rate. No sorting of inputs is necessary.

    ReplyDelete
  5. Yes, but using a bloom filter would be slower. But, why would a bloom filter be slower?

    Let's first start with a fact: RAM is actually really slow. When you make a request to read RAM, it returns a 64 byte cache line from that region in about 15 nanoseconds (in the best case). That's around 4.3 gigs/second random access. If you are accessing memory in a *predictable* pattern, the second block you request is returned in about 10 nanoseconds, and subsequent blocks can be returned in as fast as 3-4 nanoseconds. So, after the first few "slow" requests, you can read at a peak rate of roughly 21.3 gigs/second for predictable access patterns.

    Back to our case. We intersected a set of 5 million items against a set of 7.5 million items in 6 milliseconds. Doing the math... 4 bytes * (5m + 7.5m) / 6ms -> 8.3 gigs/second. So, we were within a factor of ~2-2.5x of maxing out the memory bandwidth available to the entire system with our 6ms intersection.

    But, let's pretend that we had already built a bloom filter, and let's even say that the math said that we only need 7 probe (the reality would probably be closer to 16 or so). At 15ns per probe (because bloom filters have unpredictably *random* access patterns), 7 probes per 5 million in our input set, my math says that the intersection would take 525 milliseconds, which is 87 times slower than our intersection method.

    But here's the other nasty bit of reality: constructing the bloom filter for that 5 million entry set (because we need to have bloom filter for all sets) takes at least twice as long as querying it. Why? Because to update 1 bit, you first need to read the cache line, then you need to update that one bit in the register, then you need to write the data back. I know what you are thinking, you are thinking that your write-through cache will help you. But you are wrong. Because you are writing randomly to your bloom filter, you run out of cache very quickly, and the caching system will quickly start blocking the processor from reading or writing to memory, because it's got a large backlog that is lagging at 15ns per write.

    Meanwhile, the "slow" O(nlogn) sorting algorithm finishes in 1.2 seconds, which is within spitting distance of the theoretical bloom filter we just described.

    Another drawback with the bloom filter is memory. I know what you are thinking, you are thinking "well, with a 7 probe bloom filter, maybe you only need 2-3 bytes per user, instead of the intset 4 bytes per", but in order to perform the intersection, you still need your list of users (sorted or not) in addition to the bloom filter, leaving you with 6-7 bytes per user.

    Conclusion: bloom filters would be (80-160 times) slower to intersect, would use more memory, don't offer a construction-time savings (probably closer to 2x longer for a 16 probe bloom filter), and are only *approximately* correct.

    Thanks, but I'll stick with sorted intsets.

    The reason why our custom solution was so much faster than Redis is primarily the memory access patterns and memory efficiency. Large sets in Redis have roughly a 60+ byte/entry overhead, most of which needs to be read during every intersection. And because of the bucketing + linked list construction method, could even read more than that to intersect one item with another. But we reduced that to 4 bytes per int, and ensured that all memory access patterns were sequential. With the practical realities of modern processors and memory, there is no better way.

    Regards.

    ReplyDelete
  6. I've expanded these answers in a new related blog post: http://dr-josiah.blogspot.com/2012/03/why-we-didnt-use-bloom-filter.html

    ReplyDelete
  7. Did you consider MinHashing? It's usually for approximating Jaccard coefficient (not intersection size), but maybe it could be adjusted. I think that with appropriate tuning you could get even better performance.

    Another question - are your intsets kept as int's, or are you using some kind of delta coding (as the sets are already sorted) to minimize the storage size (and thus the bandwidth)?

    ReplyDelete
    Replies
    1. Intsets were ints. Using deltas may help, but I suspect that the decoding complexity would eat up any reduced memory use bandwidth savings.

      I'd not heard of MinHashing before. Reading about it, it looks like it would have been able to give us a reasonable approximation of the number we were looking for. Pity that I'm no longer doing that work, no longer have access to the data, and the entire analytics backend was turned off a few months ago. Such is life.

      Delete
  8. There are cache friendly bloom filters designed for io bound apps
    It does something that put all bits of an item in the same word so each time one men access is enough to check

    ReplyDelete
    Replies
    1. Sounds a lot like an open-addressed hash table, without the chaining. You would still get O(n) random memory lookups, so it would still be slower than the iterative comparison, but at least it isn't so bad as the standard bloom filter.

      Delete
  9. Why didn't you pre-sort the short lists you fetched and then did a merge-sort on loading?

    ReplyDelete
  10. Christophe: it's a matter of effort vs. result.

    As is, I only needed to implement code that removed duplicates from a pre-sorted array. The sort itself was already done.

    For the merge sort, I'd have had to implement the merge sort, or included someone's library of code. Also, there are few (if any) in-place merge sorts, whereas the built-in quicksort is in-place as-is. Maybe that doesn't sound like a big deal, but with 40 meg arrays, that can cause some nasty memory fragmentation.

    Also, while it seems like performing a bunch of small sorts, followed by a big mergesort at the end is more efficient, it really isn't. Asymptotically, they are both O(nlogn), and the built-in quicksort is already amazingly efficient (see this Google tech talk for why: https://www.youtube.com/watch?v=aMnn0Jq0J-E )

    ReplyDelete