Occasionally I receive questions and comments about that post in other forums, but today I got a couple of posts from two well-intentioned commentators that highlighted some very common misunderstandings about how modern computers and compilers work. Unfortunately, a couple of those comments have since been removed (I didn't remove them; I only remove obvious spam), so I will summarize the questions after I briefly describe what I built.
What I built
I wrote a handful of lines of C code that took a sorted sequence of 4-byte integers representing a Twitter user's followers, and intersected that sequence against thousands of similar sequences for other Twitter users. By being careful, we went from taking roughly 7 seconds to intersect that data in Redis to taking 6 milliseconds with our custom software. On the memory side of things, our method reduced memory requirements by roughly 20x compared to the standard Redis approach.
Let me be clear on this: Redis is a great tool. I still use it every day, even though I've left Adly for ChowNow, and I'm writing Redis in Action for Manning Publications (I should be working on it now, but I wanted to blog on a Friday night). But for some problems, like the one we had, custom software can be a huge win. On with the questions!
Question: Why not use the high-performance C++ STL set intersection code?
Updated Answer: The STL is a pain to work with, and my experience on the C++ side of things (6 months professionally) told me that I would find neither performance nor simplicity by using it.
On top of all of that, we didn't actually need the intersected list, we only needed the count; so even if we had endured the pain of writing it in C++, we'd have computed a result we didn't need, and we'd have had to write our own intersection code in the end anyway.
I originally started with code very similar to the C++ STL set intersection code and was dissatisfied with the results. By observing that intersection skips over far more items in the longer sequence than in the shorter one, I wrote a 2-level loop that skips over blocks of the longer sequence before performing comparisons against the shorter sequence. That got me roughly a 2x performance advantage over the STL variant.
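To make that concrete, here is a minimal sketch of the idea (the function name, the fixed block size of 16, and the other details are my illustrative assumptions, not the code we shipped):

    #include <stdint.h>
    #include <stddef.h>

    /* Count common values between two sorted arrays of 4-byte ids.
       The outer level skips whole blocks of the longer sequence while
       they fall below the current value from the shorter sequence;
       the inner level is a standard sorted-merge comparison. */
    static size_t intersect_count(const uint32_t *s, size_t slen,  /* shorter */
                                  const uint32_t *l, size_t llen)  /* longer  */
    {
        const size_t block = 16;  /* illustrative skip size */
        size_t count = 0, i = 0, j = 0;
        while (i < slen && j < llen) {
            /* Skip level: jump over blocks of the longer sequence. */
            while (j + block < llen && l[j + block - 1] < s[i])
                j += block;
            /* Compare level: ordinary intersection step. */
            if (l[j] < s[i])       j++;
            else if (l[j] > s[i])  i++;
            else { count++; i++; j++; }
        }
        return count;
    }

Note that both indexes only ever move forward, so every read is sequential; that turns out to matter a great deal in the memory discussion below.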
There have been a few comments about my performance concerns with the STL, which I wanted to respond to.
My original post mentioned performance issues with vtables in modern C++. My experience from ~4 years ago was with a combination of the STL and other C++ classes. I may have incorrectly attributed the relatively low performance of some operations to method dispatch, rather than to the quality of the containers we were using. Let me explain.
In 2004-2005, I built a hash table in C as part of a search engine, which was used as a set (no values, only keys). It used 64-bit hashes and 64-bit keys on a 32-bit machine. Using this hash set, I was able to add/update/remove items at a rate of 8 million items/second on a mobile 1.2 GHz Intel Core Solo.
In 2007-2008, I used std::hash_set to store a set of 64-bit integers, and I was surprised to find that it could only add/update/remove 3 million items/second, despite running on a 3.0 GHz Intel P4 (enabling or disabling hyperthreading only affected performance by 5-10% on that problem). I shrugged it off and called it a vtable problem*, a naive conclusion that I'd not revisited until now. Rethinking it, it was more likely a combination of the inefficiency of the P4 architecture (specifically with respect to pipeline depth: 11-13 stages on the Core Solo vs. 20+ on the P4) and the specifics of the hash table itself.
* I had been willing to quickly jump to the conclusion that it was the language/compiler because I was used to doing that on the Python side of things. No, really. Why is standard Python slow compared to C/C++/Java? It's simply a matter of time and money not having been spent researching and implementing JIT and/or static compilers. Look at Boo (which uses static types), PyPy (which is built on a restricted subset of Python), or Unladen Swallow (which uses an LLVM-backed JIT); once time and money are spent in an effort to improve Python performance, it happens. Incidentally, that's also the only reason the JVM is so fast: billions of dollars have been spent making it faster over the last 20 years.
Probably mostly wrong portion of the original answer:
Have you ever used the STL? I did for about 6 months professionally, and I'm glad I could go back to C and Python. More seriously, object orientation in C++ doesn't come for free. Aside from the pain of C++ template syntax (and the difficult-to-parse error messages), all of that object orientation hides the complexity of method dispatch in what are known as vtables. If your system has to go through them to dispatch a call, you are wasting time that could be spent doing something else. For most software, that doesn't matter. But when we are talking about nanoseconds (and we will shortly), every bit of work costs you.
Even ignoring the vtable cost, let's pretend that all of the set code was inlined. Now all of your iterator manipulation and boundary checking sits in your core intersection loop, which introduces more branches and more branch mispredictions, and that ultimately slows down the intersection code.
Question: Why not use a bloom filter?
Answer: Bloom filters are slower and use more memory.
DRAM is actually really slow. When you make a request to read DRAM, it returns a 64-byte cache line from that region in about 15 nanoseconds, in the best case. That's around 4.3 gigs/second for random access. If you are accessing memory in a predictable pattern, the second block you request is returned in about 10 nanoseconds, and subsequent blocks can be returned in as little as 3-4 nanoseconds. So, after the first few "slow" requests, you can read at a peak rate of roughly 21.3 gigs/second for predictable access patterns on high-end machines.
Back to our case. We intersected a set of 5 million items against a set of 7.5 million items in 6 milliseconds. Doing the math... 4 bytes * (5m + 7.5m) / 6ms -> 8.3 gigs/second. So, we were within a factor of ~2-2.5x of maxing out the memory bandwidth available to the entire system with our 6ms intersection.
Let's pretend that we had already built a bloom filter, and let's even say that the math says we only need 7 probes (the reality would probably be closer to 16 or so). Because bloom filter probes are unpredictably random accesses, each one costs the full ~15ns; at 7 probes for each of the 5 million users in our input set, my math says the intersection would take 525 milliseconds, which is 87 times slower than our intersection method.
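If you want to check those numbers yourself, here is the back-of-the-envelope arithmetic as a tiny program (the latencies and set sizes are the assumptions stated above, not measurements):

    #include <stdio.h>

    int main(void) {
        /* Sorted-array intersection: 4-byte ids, both sets read once in 6ms. */
        double bytes = 4.0 * (5e6 + 7.5e6);
        printf("intersection bandwidth: %.1f GB/s\n", bytes / 0.006 / 1e9); /* ~8.3 */

        /* Bloom filter: 7 random probes per input id at 15ns each. */
        double seconds = 7.0 * 5e6 * 15e-9;
        printf("bloom filter probe time: %.0f ms\n", seconds * 1e3);        /* ~525 */
        return 0;
    }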
But here's the other nasty bit of reality: constructing the bloom filter for that 5 million entry set (because we need a bloom filter for every set) takes at least twice as long as querying it. Why? Because to update 1 bit in the filter, you first need to read the cache line, then update the bit in a register, then write the data back. I know what you are thinking: you are thinking that your write-back cache will help you. But you are wrong. Because you are writing randomly all over your bloom filter, you run out of cache very quickly, and the caching layers within your processor will start blocking it from reading or writing to memory, because they have a backlog of data to flush that is limited to 15ns per random write. So with a (lower-bound) 525ms to probe, that's roughly 1.05 seconds to construct the bloom filter in the best case.
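To see why the access pattern is so hostile to the cache, here is a minimal 7-probe filter sketch (the mixing function and the double-hashing probe scheme are my illustrative assumptions, not anything from a particular library). Each probe lands on an effectively random bit, and therefore usually a different cache line, for both construction and queries:

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_PROBES 7

    /* A 64-bit finalizer-style mixer, used here only for illustration. */
    static uint64_t mix(uint64_t x) {
        x ^= x >> 33;  x *= 0xff51afd7ed558ccdULL;
        x ^= x >> 33;  x *= 0xc4ceb9fe1a85ec53ULL;
        x ^= x >> 33;
        return x;
    }

    static void bloom_add(uint8_t *bits, uint64_t nbits, uint32_t user) {
        for (int k = 0; k < NUM_PROBES; k++) {
            uint64_t bit = mix(user + 0x9e3779b97f4a7c15ULL * (uint64_t)k) % nbits;
            /* Read-modify-write: an entire cache line is fetched just
               to flip one bit, then must be written back. */
            bits[bit >> 3] |= (uint8_t)(1u << (bit & 7));
        }
    }

    static bool bloom_query(const uint8_t *bits, uint64_t nbits, uint32_t user) {
        for (int k = 0; k < NUM_PROBES; k++) {
            uint64_t bit = mix(user + 0x9e3779b97f4a7c15ULL * (uint64_t)k) % nbits;
            if (!(bits[bit >> 3] & (1u << (bit & 7))))
                return false;  /* definitely not in the set */
        }
        return true;           /* probably in the set */
    }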
Meanwhile, the "slow" O(n log n) C standard library qsort() finishes in 1.2 seconds, which is within spitting distance of the theoretically optimal bloom filter construction we just described. We also only sort once, but intersect thousands of times, so even if the bloom filter did offer a modest construction-time advantage, it wouldn't be worth the performance penalty during intersection.
Another drawback of the bloom filter is memory. I know what you are thinking: "well, with a 7-probe bloom filter, maybe you only need 2-3 bytes per user, instead of 4 bytes per user". But in order to perform the intersection, you still need your list of users (sorted or not) in addition to the bloom filter, leaving you with 6-7 bytes per user.
Conclusion: bloom filters would be 80-160 times slower to intersect, would use more memory, don't offer a significant construction-time advantage (and would probably be closer to 2x slower with a realistic 16-probe filter), and are only approximately correct.
The lesson you should learn
The reason our custom solution was so much faster than Redis comes down primarily to memory access patterns and memory efficiency. Large sets in Redis carry roughly a 60-80 byte per-entry overhead on a 64-bit machine, most of which needs to be read to intersect one item with another. And because Redis builds sets as hash buckets of linked lists, intersection in Redis can read even more than those 60-80 bytes per item. Ours did so well because we reduced our space to 4 bytes per user (from 60-80), only ever read those 4 bytes, and ensured that all of our memory access patterns were sequential. Given the practical realities of modern processors and memory, there is no faster way to perform single-processor set intersections.