Comments on A Dash of Technology: Why we didn't use a bloom filter

Email me with your requirements and I will tell yo...

2014-03-08T10:10:44.677-08:00

Email me with your requirements and I will tell you my consulting rate.

hi sir i wanted to implement Bloom filter based qu...

2014-03-08T08:40:37.160-08:00

hi sir
i wanted to implement Bloom filter based query search in Routing protocol such as MFLOOD in ns2. Kindly can you guide me as to how to proceed. I am unable to begin the task

It's been about 10 years since I picked up a b...

2014-03-03T13:25:18.446-08:00

It's been about 10 years since I picked up a book on computer/CPU architecture, but the MIT OpenCourseware course on the topic uses the same book and seems to cover more or less the same stuff I learned: http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-823-computer-system-architecture-fall-2005/

Other than that, my knowledge comes from a variety of sources including coursework in college, courses in grad school, building computers and paying attention to BIOS settings, and writing software in low-enough level languages to be able to measure actual memory and CPU performance.

Great post. I want to understand practical complex...

2014-03-03T10:16:01.464-08:00

Great post. I want to understand practical complexity on modern CPUs at this level. Are there any books you would recommend?

any sample of the code on GitHub or something?

2012-03-18T22:20:54.284-07:00

any sample of the code on GitHub or something?

"Meanwhile, the slow O(nlogn) standard C sort...

2012-03-18T10:52:28.478-07:00

"Meanwhile, the slow O(nlogn) standard C sort() finishes in 1.2 seconds, which is within spitting distance of the theoretically optimal bloom filter we just described. We also only sort once, but intersect thousands of times"

After constructing the unsorted intset on another machine, we would upload them to S3, spawn a task, which was then executed by the processes that handled this stuff, which would download the intsets from S3, load them into memory via mmap (typically 200-300 ms), sort them (under 1.5 seconds), then begin the process of intersecting that set with thousands of other previously sorted and mmap-loaded intsets.

I don't have access to the intsets anymore, as I no longer work for the company I did this for. Also, the data is publicly available: it's just Twitter follower lists.

I know your intset is sorted, already, but I don&#...

2012-03-18T09:02:40.110-07:00

I know your intset is sorted, already, but I don't see the cost of doing that calculated in, anywhere.

You know what would be cool? If you could post two example intsets for readers to play with. I'm pretty sure that, as a set of numbers, they're anonymized quite well.

I'm also curious about the distribution of the numbers. You do have to read them from disk at some point, but perhaps the sequential IO cost is minimal enough that it doesn't matter.

I don't have an STL implementation, I thought ...

2012-03-17T16:44:58.663-07:00

I don't have an STL implementation, I thought I made that clear in the article and subsequent comments. I did implement an algorithm very similar to the STL set intersection algorithm, but it only performed half as fast as the algorithm we ended up using.

Can you share your STL implementation? How about t...

2012-03-17T15:07:12.064-07:00

Can you share your STL implementation? How about this one?

https://gist.github.com/2065728

I'd be surprised if that wasn't efficient.

This is a fantastic article. :) I rarely get a cha...

2012-03-17T13:48:15.469-07:00

This is a fantastic article. :) I rarely get a chance to optimize hotspots to this degree anymore. Not since matrix multiplication in my masters program.

Also, having just read http://radiospiel.org/sorti...

2012-03-17T13:11:23.002-07:00

Also, having just read http://radiospiel.org/sorting-in-c-3-times-faster-than-c , you are probably right on the sorting side of things. I'll give it a shot if I ever need to build it again. Thank you.

I've updated the post to remove references to ...

2012-03-17T12:41:01.004-07:00

I've updated the post to remove references to vtables in the main portion of the answer, along with an explanation.

I've updated the post to remove references to ...

2012-03-17T12:40:53.337-07:00

I've updated the post to remove references to vtables in the main portion of the answer, along with an explanation.

I've answered it :)

2012-03-17T12:39:30.167-07:00

I've answered it :)

I've updated the post.

2012-03-17T12:38:36.625-07:00

I've updated the post.

I also want to use high level languages for everyt...

2012-03-17T11:51:56.121-07:00

I also want to use high level languages for everything ;) My original intersection code was in Python, and still managed to intersect those two sorted sets in 3.5 seconds, 2x faster than the C Redis! Of course the Python was iterating over the two sorted sequences and relied on heapq.merge() for most of the heavy lifting.

Twitter user A and B. a_followers = [fol1, fol2, ...

2012-03-17T11:47:25.744-07:00

Twitter user A and B.

a_followers = [fol1, fol2, fol3, ...]
b_followers = [fola, folb, folc, ...]

By "items" I meant users. So we are intersecting the sets a_followers and b_followers above.

Not even as interesting as clustering. We just wanted to find audience overlap so that we could say things like, "75% of Khloe Kardashian's followers also follow Kim Kardashian, so if you wanted to minimize your ad spend, you could just have Kim tweet for you. Or if you wanted to get double-coverage for 75% of Khloe's followers, you could add Khloe." This is the "Also Follows" metric described in http://dr-josiah.blogspot.com/2011/07/building-adlys-twitter-analytics.html

I thought I wasn't subtle about it. But yes, w...

2012-03-17T11:43:10.952-07:00

I thought I wasn't subtle about it. But yes, we only needed the count, not the actual items.

First, intsets are sorted arrays of integers; exac...

2012-03-17T11:42:29.356-07:00

First, intsets are sorted arrays of integers; exactly what we were using, but intsets weren't available in a stable Redis release when I was doing the intersection code.

Second, intsets are typically limited to small sets; on the order of a few hundred or few thousand elements, and were primarily meant as a way of minimizing overhead for small sets; not to improve performance for large sets. When you go past that limit to the *millions* of elements like we had, Redis switches to set-based intsets.

Third, Redis' algorithm for intersection iterates over one set, then searches in the other set using binary search. See my reply to an earlier comment about using binary search instead of the intersection method I describe and when it may pay off.

Fourth, if intsets had been available and if they had been efficient to intersect large sets, we may have tried them. But again, Redis still produces the result set, so would have been slower by at least a factor of 2 over what we were able to do with our custom solution.

2012-03-17T11:30:04.847-07:00

This comment has been removed by the author.

You are welcome :)

2012-03-17T11:29:18.258-07:00

You are welcome :)

The larger set would need to be at least 15/3 * lo...

2012-03-17T11:12:10.074-07:00

The larger set would need to be at least 15/3 * log2(n) times larger to make it worthwhile (because binary search behaves in practice like a random traversal). Basically, you would need a smaller sorted set of roughly 65k items vs. 7.5m items before binary search could theoretically win in terms of memory access times. You may or may not win on the algorithm efficiency. A good binary search actually has a pretty tight loop.

I had used this method about 7 years ago in a search engine, and IIRC, we were able to intersect lists of 100k items against each other in about 5 milliseconds on 2.4 ghz P4s (which were horrible from an efficiency standpoint). I wasn't as careful as I am now, so I wouldn't use that as a comparison to what we did, as that algorithm supported (A u B u C ...) ^ (D u E u F ...) ^ (G u H u I ...) ^ ...

And you would perform worse. A skiplist offers bas...

2012-03-17T11:04:55.581-07:00

And you would perform worse. A skiplist offers basically zero advantages over other binary structured trees, at the cost of more pointers. Even in the best case, with a binary tree of one kind or another, you still have 2 pointers per value stored, that would be 12 bytes per user on a 32 bit machine, compared to our 4 bytes.

You would need to examine effectively all of the data during the intersections (3x as much data), and the code for handling the structures itself would be slower than a simple tight loop.

Nope, we used C's qsort.

2012-03-17T11:00:38.157-07:00

Nope, we used C's qsort.

C++ syntax may be ugly but if you're worrying ...

2012-03-17T10:55:56.509-07:00

C++ syntax may be ugly but if you're worrying that the STL is slow due to vtables you're either doing it wrong or have no idea what you're talking about.