A Dash of Technology: March 2012

After having been featured on the High Scalability blog, there has been renewed interest in my 6 month old post about Improving Performance by 1000x.

Occasionally I've received a few questions and/or comments about that post in other forums, but today I got a couple posts from two well-intentioned commentators that highlighted some very interesting misunderstandings about how modern computers and compilers work, misunderstandings that are actually very common. Unfortunately, a couple of the author's comments have been removed (I didn't remove them; I only remove obvious spam comments), so I will summarize the questions after I briefly describe what I built.

What I built

Using C, I wrote a handful of lines of C code that took a sorted sequence of 4-byte integers that represented a Twitter user's followers, and I intersected that sequence against thousands of other similar sequences for other Twitter users. By being careful, we went from taking roughly 7 seconds to intersect that data in Redis, to taking 6 milliseconds to intersect it with our custom software. On the memory side of things, we reduced our memory requirements by roughly 20x with our method over the standard Redis method.

Let me be clear on this, Redis is a great tool. I still use it every day (even though I've left Adly for ChowNow, and I'm writing Redis in Action for Manning Publications (I should be working on it now, but I wanted to blog on a Friday night). But for some problems, like the one we had, custom software can be a huge win. On with the questions!

Question: Why not use the high-performance C++ STL set intersection code?

Updated Answer: The STL is a pain to work with, and my experience on the C++ side of things (6 months professionally), told me that I would find neither performance nor simplicity by using it.

On top of all of that, we didn't actually need the list, we only needed the count, so even if we had spent the pain of writing it in C++, we'd get a result that was unnecessary, and have to write our own intersection code in the end anyway.

I originally started with code very similar to the C++ STL set intersection code and was dissatisfied with the results. By observing that the longer sequences will skip more items than the shorter sequences, I wrote a 2-level loop that skips over items in the longer sequence before performing comparisons with the shorter sequence. That got me roughly a 2x performance advantage over the STL variant.

There have been a few comments about my STL concerns about performance, which I wanted to respond to.

My original post mentioned performance issues with using vtables in modern C++. My experience from ~4 years ago was using a combination of the STL and other C++ classes. I may have been incorrectly attributing the relative low performance of some operations to method dispatch, instead of some of the quality of containers we were using. Let me explain.

In 2004-2005, I had built a hash table as part of a search engine in C, which was used as a set (no values, only keys). It used 64 bit hashes and 64 bit values on a 32 bit machine. Using this hash set, I was able to add/update/remove items at a rate of 8 million items/second on a mobile 1.2 ghz Intel Core Solo.

In 2007-2008, I had used std::hash_set to store a set of 64 bit integers, and I was surprised to find that it could only add/update/remove 3 million items/second, despite running on a 3.0 ghz Intel P4 (hyperthreading enabled/disabled only affected performance by 5-10% on that problem). I had shrugged it off and called it a vtable problem*, which was a naive decision that I'd not revisited until now. Rethinking it, it was more likely the combination of both the efficiency of the P4 architecture (specifically with respect to pipeline depths; 11-13 with core solo vs. 20+ with P4) and the specifics of the hash table itself.

* I had been willing to quickly jump to the conclusion that it was the language/compiler because I was used to that on the Python side of things. No really. Why is standard Python slow compared to C/C++/Java? It's simply a matter of time and money not having been spent to research and implement JIT and/or static compilers. Look at Boo (which uses static types), PyPy (which uses a restricted version of Python), or Unladen Swallow (which uses an LLVM backed JIT); once time and money has been spent in an effort to improve Python performance, it happens. Incidentally, that's also the only reason why the JVM is so fast: $Billions have been spent on making it faster in the last 20 years.

Probably mostly wrong portion of the original answer:

Have you ever used the STL? I did for about 6 months professionally, and I'm glad that I could go back to C and Python. More seriously, object orientation in C++ doesn't come for free. Aside from the pain of the syntax of using C++ templates (and difficult to parse error messages), all of that object orientation hides the complexity of method dispatch in what are known as vtables. If your system has to call them to dispatch, you are wasting time in that processing when you could be doing something else. For most software, that doesn't matter. But when we are talking about nanoseconds (and we will shortly), every bit of work costs you.

Even ignoring the vtable cost, we'll pretend that all of the set code was inlined. But now, all of your manipulation and boundary checking is in your core intersection loop, which will introduce more branches, more branch mispredictions, which will ultimately slow down the intersection code.

Question: Why not use a bloom filter?

Answer: Bloom filters are slower and use more memory.

DRAM is actually really slow. When you make a request to read DRAM, it returns a 64 byte cache line from that region in about 15 nanoseconds (in the best case). That's around 4.3 gigs/second random access. If you are accessing memory in a predictable pattern, the second block you request is returned in about 10 nanoseconds, and subsequent blocks can be returned in as fast as 3-4 nanoseconds. So, after the first few "slow" requests, you can read at a peak rate of roughly 21.3 gigs/second for predictable access patterns on high end machines.

Back to our case. We intersected a set of 5 million items against a set of 7.5 million items in 6 milliseconds. Doing the math... 4 bytes * (5m + 7.5m) / 6ms -> 8.3 gigs/second. So, we were within a factor of ~2-2.5x of maxing out the memory bandwidth available to the entire system with our 6ms intersection.

Let's pretend that we had already built a bloom filter, and let's even say that the math said that we only need 7 probes (the reality would probably be closer to 16 or so). At 15ns per probe, because bloom filters have unpredictably random access patterns, 7 probes per 5 million users in our input set, my math says that the intersection would take 525 milliseconds, which is 87 times slower than our intersection method.

But here's the other nasty bit of reality; constructing the bloom filter for that 5 million entry set (because we need to have bloom filter for all sets) takes at least twice as long as querying it. Why? Because to update 1 bit in the filter, you first need to read the cache line, then you need to update the bit in the register, then you need to write the data back. I know what you are thinking, you are thinking that your write-through cache will help you. But you are wrong. Because you are writing randomly to your bloom filter, you run out of cache very quickly, and the caching layers within your processor will quickly start blocking the processor from reading or writing to memory, because it's got a backlog of data to flush to memory that is limited to 15ns per random write. So with a (lower-bounded) 525ms to intersect, that's roughly 1.05 seconds to construct the bloom filter in the best case.

Meanwhile, the slow O(nlogn) standard C sort() finishes in 1.2 seconds, which is within spitting distance of the theoretically optimal bloom filter we just described. We also only sort once, but intersect thousands of times, so even if it did offer a modest performance advantage, it wouldn't be worth the performance penalty during intersection.

Another drawback with the bloom filter is memory. I know what you are thinking, you are thinking "well, with a 7 probe bloom filter, maybe you only need 2-3 bytes per user, instead of the 4 bytes per user". But in order to perform the intersection, you still need your list of users (sorted or not) in addition to the bloom filter, leaving you with 6-7 bytes per user.

Conclusion: bloom filters would be (80-160 times) slower to intersect, would use more memory, don't offer a significant construction time performance advantage (probably closer to 2x slower for a 16 probe bloom filter), and are only approximately correct.

The lesson you should learn

The reason why our custom solution was so much faster than Redis is primarily the memory access patterns and memory efficiency. Large sets in Redis have roughly a 60-80 byte/entry overhead on a 64 bit machine, most of which needs to be read to intersect one item with another. And because of the bucket + linked list (which is how Redis builds sets), intersection in Redis could read more than 60-80 bytes to intersect one item with another. Ours did so well because we reduced our space to 4 bytes per user (from 60-80), only ever read that 4 bytes, and ensured that all of our memory access patterns were sequential. With the practical realities of modern processors and memory, there is no faster way for single-processor set intersections.

If you want to see more posts like this, you can buy my book, Redis in Action from Manning Publications today!

A Dash of Technology

Friday, March 16, 2012

Why we didn't use a bloom filter

What I built

Question: Why not use the high-performance C++ STL set intersection code?

Question: Why not use a bloom filter?

The lesson you should learn