Thursday, June 9, 2011

Building an Ad Network Ready for Failure

A little over a year ago, in May of 2010, I was tasked with creating a self-serve location-targeted ad network for distributing ads to web, desktop, and mobile applications that typically displayed Twitter and Facebook status messages. This post describes the architecture and systems that our team put together for an ad network that ran 24/7 for over 6 months, serving up to 200 ad calls per second with a target response time of under 200ms, despite network hiccups and VPS failures, and how you can do the same with your own architecture.

Our Rambo Architecture

Before we started on our adventure to build an ad network, we had suffered hiccups with Amazon's EC2 systems. Most of the time the VPSes we had created months prior ran without issue, but occasionally a VPS would disappear from the network for a while, EBS volumes would hang on us, EBS volumes couldn't be re-assigned to another VPS, and instances would get stuck restarting: all of the things that people have come to expect from EC2. Given these intermittent failures, we understood that our entire system needed to be able to recover from a hang/failure of just about any set of machines. Some months after running the system, we read about Netflix's experience, their Rambo Architecture, and their use of Chaos Monkey. We never bothered to implement a Chaos Monkey.

Load Balancing

Initially, we had used Apache as a load balancer behind an Amazon elastic IP. Before switching live, we decided to go with HAProxy due to its far better performance characteristics, its ability to detect slow/failing hosts, and its ability to send traffic to different hosts based on weights (useful for testing a new configuration, pushing more to beefier hosts, etc.). The *only* unfortunate thing about HAProxy was that when you reloaded the configuration (to change host weights, etc.), all counters were reset. Being able to pass SIGHUP to refresh the configuration would have been nice. Other than that minor nit, HAProxy was phenomenal.
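
For reference, a minimal HAProxy configuration fragment along the lines of what we ran might look like the following (the host addresses, weights, and backend name are illustrative, not our actual settings):

    # illustrative haproxy.cfg fragment; addresses and weights are made up
    frontend ad_frontend
        bind *:80
        default_backend ad_workers

    backend ad_workers
        balance roundrobin
        # 'check' lets HAProxy pull slow/failed hosts out of rotation
        # 'weight' controls each host's share of the traffic
        server worker1 10.0.0.11:80 check weight 100
        server worker2 10.0.0.12:80 check weight 100
        # a box running an experimental configuration gets a small slice
        server worker3 10.0.0.13:80 check weight 10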

Worker Machines

After running through our expected memory use on each of our worker machines (I will describe their configurations in a bit), we determined that we could get by with a base of 750 megs + 300 megs/processor for main memory to max out the throughput on any EC2 instance. We were later able to reduce this to 450 + 300/processor, but either way, our ad serving scaled (effectively) linearly as a function of processor speed and number. In EC2, this meant that we could stick with Amazon's Hi-CPU Medium VPS instances, which offered the best bang for the $ in terms of performance. We had considered the beefier processors in Hi-Memory instances, but we had literally zero use for the extra memory. If our service took off, we had planned on moving to fewer High CPU Extra Large instances, but for the short term, we stuck with the flexibility and experimentation opportunities that Medium instances offered (thanks to HAProxy's ability to send different traffic volumes to different hosts).
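
To make the arithmetic concrete, here is a tiny sizing sketch; the instance numbers are Amazon's published c1.medium specs (2 virtual cores, 1.7 gigs of memory), not anything from our configs:

    # rough sizing sketch: base memory plus a per-processor allowance
    def worker_memory_megs(cores, base=750, per_core=300):
        return base + per_core * cores

    # a Hi-CPU Medium (c1.medium) has 2 virtual cores and ~1700 megs of memory
    print(worker_memory_megs(2))            # 1350 megs -> fits comfortably
    print(worker_memory_megs(2, base=450))  # 1050 megs after the later reduction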

The actual software stack running on each box was relatively slim: Redis (ad targeting and db cache), ActiveMQ (task queues), Apache + mod_wsgi + Python (to serve the ads and interact with everything), and a custom Geocode server written in Python (for handling ip -> location, lat,lon -> location, city/state/country -> location).

Software Stack

The geocoding server was the biggest memory user of them all, starting out at 600 megs when we first pushed everything out. Subsequent refinements to how data was stored dropped that by 300 megs. This was, in my opinion, the most interesting piece of technology we developed for the ad targeting system: it could take a latitude,longitude pair and determine what country, what state, and what zipcode that location was in (as applicable) in under 1 ms, yet do so in 300 megs of memory using pure Python. It had a simple threaded HTTP server interface. Each VPS ran an instance to minimize round-trip time and to minimize dependencies on other machines.
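
I won't reproduce the real data structures here, but a heavily simplified sketch of the general approach (a coarse latitude/longitude grid kept in plain Python dictionaries, with made-up cells and data) looks something like this:

    # heavily simplified reverse-geocoding sketch; cell size and data are made up
    GRID_STEP = 0.1  # degrees per grid cell

    # cell -> (country, state, zipcode); the real index held far more detail
    GRID = {
        (377, -1224): ('US', 'CA', '94103'),
        (407, -739):  ('US', 'NY', '10001'),
    }

    def _cell(lat, lon):
        return (int(lat / GRID_STEP), int(lon / GRID_STEP))

    def reverse_geocode(lat, lon):
        # a dictionary lookup keeps the call well under a millisecond
        return GRID.get(_cell(lat, lon))

    if __name__ == '__main__':
        print(reverse_geocode(37.77, -122.42))  # ('US', 'CA', '94103')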

We had originally run with RabbitMQ, but after running for a week or two, we ran into the same crashing bug that Reddit ran into. Our switch to ActiveMQ took a little work (ActiveMQ wasn't disconnecting some of its sockets, forcing us to write a Twisted server to act as a broker to minimize connections, though one of our engineers later sent a patch upstream to fix ActiveMQ), but we've been pretty happy with it. In addition to the queue, we also had a task processor written in Python that processed impressions and clicks, updating the database and the ad index as necessary if/when an ad ran out of money.
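
The impression/click accounting itself was straightforward. A stripped-down sketch of the kind of logic involved (the key names, schema, and function are hypothetical, and the broker hookup is omitted) might look like:

    # hypothetical impression/click accounting; key names and schema are invented
    import json

    def process_event(redis_conn, db_cursor, raw_message):
        event = json.loads(raw_message)   # e.g. {'ad_id': 17, 'type': 'click', 'cost': 0.25}
        ad_id, cost = event['ad_id'], event.get('cost', 0)

        # charge the ad's remaining budget (stored in cents) in the ad index
        remaining = redis_conn.decr('budget:%s' % ad_id, int(cost * 100))

        # persist the event for reporting
        db_cursor.execute(
            'INSERT INTO ad_events (ad_id, type, cost) VALUES (%s, %s, %s)',
            (ad_id, event['type'], cost))

        # if the ad just ran out of money, pull it from the targeting index
        if remaining <= 0:
            redis_conn.srem('active_ads', ad_id)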

Our Apache + mod_wsgi + Python server mostly acted as a broker: encoding/decoding requests/responses, making ad calls to Redis, pulling cache entries from Redis, issuing geocoding requests, and logging results. This is the part that scaled per processor, and its memory use was primarily the result of running so many threads to maximize IO between our servers and Redis. For a brief time, we were also performing content analysis for further targeting (simple Bayesian categorization across 25 categories, matched against pre-categorized ads, calculated in Python). This was consistently the slowest part of our calls, accounting for ~40ms, which we later dropped to 5ms with a little bit of numpy. Sadly, we found that content analysis was less effective for the bottom line than just paying attention to a calculated CPM on individual ads (calculated via CPC * CTR), so we tossed it for the sake of simplicity.
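
The numpy change was nothing exotic; the gist (with hypothetical names, and the model-building left out) was to replace per-category Python loops with a single matrix multiply over precomputed log-probabilities:

    # hypothetical sketch: score 25 categories with one matrix multiply
    import numpy

    def categorize(word_counts, log_probs, log_priors):
        # word_counts: length-V vector of term counts for the page being scored
        # log_probs:   V x 25 matrix of per-word log-likelihoods, built offline
        # log_priors:  length-25 vector of category log-priors, built offline
        scores = numpy.dot(word_counts, log_probs) + log_priors
        return int(numpy.argmax(scores))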

The real workhorse of our ad targeting platform was Redis. Each box slaved from a master Redis, and on failure of the master (which happened once), a couple "slaveof" calls got us back on track after the creation of a new master. A combination of set unions/intersections with algorithmically updated targeting parameters (this is where experimentation in our setup was useful) gave us a 1 round-trip ad targeting call for arbitrary targeting parameters. The 1 round-trip thing may not seem important, but our internal latency was dominated by network round-trips in EC2. The targeting was similar in concept to the search engine example I described last year, but with quite a bit more thought put into ad targeting specifically. It relied on the fact that you can write to Redis slaves without affecting the master or other slaves. Cute and effective. On the Python side of things, I optimized the redis-py client we were using for a 2-3x speedup in network IO for the ad targeting results.
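
A much-simplified version of that single round-trip call, written against redis-py (the key names and parameters are invented; the real call had many more of them), might look like:

    # simplified one round-trip targeting sketch against the local Redis slave
    import redis

    def target_ads(conn, location_keys, keyword_keys, result_key):
        pipe = conn.pipeline(False)   # no MULTI/EXEC, just one network round trip
        # union the candidate ads for every matching location (zip, city, state, ...)
        pipe.sunionstore(result_key, location_keys)
        # intersect with the ads matching the request's other targeting parameters
        pipe.sinterstore(result_key, [result_key] + keyword_keys)
        # grab a candidate and clean up; writing the temporary result_key is safe
        # precisely because this runs against a slave
        pipe.srandmember(result_key)
        pipe.delete(result_key)
        return pipe.execute()[2]

    # conn = redis.Redis()   # the slave running on this box
    # ad = target_ads(conn, ['ads:zip:10001', 'ads:state:NY'], ['ads:kw:games'], 'tmp:req:1234')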

The master Redis was merely a Redis instance populated from the database by a simple script, and kept up-to-date by the various distributed task processes, with changes syndicating off to the slaves automatically.
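
The population script was little more than a loop over the ads table; roughly (with an invented schema and key names):

    # rough index-population sketch; the schema and key names are invented
    import psycopg2
    import redis

    def rebuild_index(dsn='dbname=ads'):
        master = redis.Redis()                    # the master Redis
        cursor = psycopg2.connect(dsn).cursor()
        cursor.execute('SELECT id, zipcode, budget FROM ads WHERE active')
        for ad_id, zipcode, budget in cursor:
            master.sadd('ads:zip:%s' % zipcode, ad_id)
            master.set('budget:%s' % ad_id, int(budget * 100))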

After Responding to a Request

After an ad call was completed, an impression tracking pixel was requested, or a click occurred, we would throw a message into our task queue to report on what happened. No database connection was ever established from the web server, and the web server only ever made requests to local resources. The task server would connect to the database, update rows there, handle logging (initially to the database, then briefly to mongodb, and finally to syslog-ng, which we use today for everything), and update the Redis master as necessary. In the case of database or Redis master failure, the task queue would merely stop processing the tasks, batching them up for later.
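
That "batch them up for later" behavior falls out naturally when the consumer only acknowledges a message after its writes succeed. A hedged sketch of the loop (the queue interface here is invented, standing in for the real broker client):

    # hypothetical consumer loop; the queue interface stands in for the real broker
    import time

    def consume_forever(queue, handle_event):
        while True:
            message = queue.get()            # blocks until a task is available
            try:
                handle_event(message.body)   # update the DB, Redis master, logs
            except Exception:
                queue.requeue(message)       # leave it on the queue for a retry...
                time.sleep(5)                # ...and back off while things recover
            else:
                queue.ack(message)           # only now is it safe to drop it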

The Weak Points

Looking at this setup, any individual VPS except the load balancer could go down, and all other parts would continue to work. Early on we had experimented with Amazon's load balancer, but found that it wouldn't point to an elastic IP (I can't remember why this was important at the time), so we used a VPS running HAProxy instead. Thankfully, the load balancer VPS never went down, and we had a hot spare ready to go with an elastic IP update.
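
The failover itself is just an elastic IP re-association, which with the boto library of that era (and placeholder IDs here) comes down to a couple of lines:

    # sketch of the elastic IP failover using boto; the IDs are placeholders
    import boto

    ec2 = boto.connect_ec2()   # credentials come from the environment/boto config
    # point the public elastic IP at the standby HAProxy instance
    ec2.associate_address(instance_id='i-0123abcd', public_ip='203.0.113.10')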

Any worker box could go down, and it wouldn't affect serving except for a small set of requests being dropped. Our Redis master or master DB could even go down, and the worst case was that some ads might be served when they shouldn't have been (or vice versa). We did lose the Redis master once, due to a network hiccup (which caused replication to one of the slaves to hang, blew up memory use on the master, and ultimately got the master killed by the Linux OOM killer). But this caused zero downtime, zero loss of $, and was useful in testing our emergency preparedness. We had some API workers go down on occasion due to EBS latency issues, but we always had 2-3x the number necessary to serve the load.

Our true weakness was our use of PostgreSQL 8.4 in a single-server setup. Our write load (even with API calls coming in at a high rate) was low enough not to place much of a load on our database, so we never felt pressured to switch to something with more built-in replication options (PostgreSQL 9 came out about 3-4 months after we had started running the system). But really, for this particular data, using Amazon's RDS with multi-AZ would have been the right thing to do.

Where is it now?

After 6 months of development, getting application developers to start using our API, getting reasonable traffic, getting repeat ad spends, etc., we determined that the particular market segment we were going for was not large enough to justify continuing to run the network at the level of hardware and personnel support necessary for it to be on 24/7, so we decided to end-of-life the product.

I'll not lie, it was a little painful at first (coincidentally, a project in which I was the only developer, YouTube Groups, had been end-of-lifed a couple weeks before), but we all learned quite a bit about Amazon as a platform, and my team managed to build a great product.

If you want to see more posts like this, you can buy my book, Redis in Action from Manning Publications today!