A Dash of Technology: July 2011

Thursday, July 14, 2011

A Neat Python Hack? No, Broken Code

A few months ago I saw an anti-pattern that was offered by a much-loved member of the Python community. He offered it as a neat little hack, and if it worked, it would be awesome. Sadly, it does not work. The anti-pattern is available as a recipe on the Python Cookbook here, which I alter to make fully incorrect, later switch back to the original form to show that it also doesn't work, and explain why it can't work in the general case.

# known incorrect lightweight lock / critical section
ci = sys.getcheckinterval()
sys.setcheckinterval(0)
try:
    # do stuff
finally:
    sys.setcheckinterval(ci)

A cursory reading of the interactive Python help below would suggest that the inside the try block, no other Python code should be able to execute, thus offering a lightweight way of making race conditions impossible.

Help on built-in function setcheckinterval in module sys:

setcheckinterval(...)
    setcheckinterval(n)
    
    Tell the Python interpreter to check for asynchronous events every
    n instructions.  This also affects how often thread switches occur.

According to the basic docs, calling sys.setcheckinterval(0) might prevent the GIL from being released, and offer us a lightweight critical section. It doesn't. Running the following code...

import sys
import threading

counter = 0

def foo():
    global counter
    ci = sys.getcheckinterval()
    sys.setcheckinterval(0)
    try:
        for i in xrange(10000):
            counter += 1
    finally:
        sys.setcheckinterval(ci)

threads = [threading.Thread(target=foo) for i in xrange(10)]
for i in threads:
    i.setDaemon(1)
    i.start()

while threads:
    threads.pop().join()

assert counter == 100000, counter

I get AssertionErrors raised with values typically in the range of 30k-60k. Why? In the case of ints and other immutable objects, counter += 1 is really a shorthand form for counter = counter + 1. When threads switch, there is enough time to read the same counter value from multiple threads (40-70% of the time here). But why are there thread swaps in the first place? The full sys.setcheckinterval documentation tells the story:

sys.setcheckinterval(interval)

Set the interpreter’s "check interval". This integer value determines how often the interpreter checks for periodic things such as thread switches and signal handlers. The default is 100, meaning the check is performed every 100 Python virtual instructions. Setting it to a larger value may increase performance for programs using threads. Setting it to a value <= 0 checks every virtual instruction, maximizing responsiveness as well as overhead.

Okay. That doesn't work because our assumptions about sys.setcheckinterval() on boundary conditions were wrong. But maybe we can set the check interval to be high enough so the threads never swap?

import os
import sys
import threading

counter = 0
each = 10000
data = os.urandom(64).encode('zlib')
oci = sys.getcheckinterval()

def foo():
    global counter
    ci = sys.getcheckinterval()
    # using sys.maxint fails on some 64 bit versions
    sys.setcheckinterval(2**31-1)
    try:
        for i in xrange(each):
            counter += (1, data.decode('zlib'))[0]
    finally:
        sys.setcheckinterval(ci)

threads = [threading.Thread(target=foo) for i in xrange(10)]
for i in threads:
    i.setDaemon(1)
    i.start()

while threads:
    threads.pop().join()

assert counter == 10 * each and oci == sys.getcheckinterval(), \
    (counter, 10*each, sys.getcheckinterval())

When running this, I get assertion errors showing the counter typically in the range 10050 to 10060, and showing that the check interval is still 2**31-1. That means that our new counter += (1, data.decode('zlib'))[0] line, which can be expanded to counter = counter + (1, data.decode('zlib'))[0] reads the same counter value almost 90% of the time. There are more thread swaps with the huge check interval than with a check interval that says it will check and swap at every opcode!

Obviously, the magic line is counter += (1, data.encode('zlib'))[0]. The built-in zlib libraries have an interesting feature; whenever it compresses or decompresses data, it releases the GIL. This is useful in the case of threaded programming, as it allows other threads to execute without possibly blocking for a very long time - even if the check interval has not passed (which we were trying to prevent with our sys.setcheckinterval() calls). If you are using zlib (via gzip, etc.), this can allow you to use multiple processors for compression/decompression, but it can also result in race conditions, as I've shown.

In the standard library, there are a handful of C extension modules that do similar "release the GIL to let other threads run", which include (but are not limited to) some of my favorites: bz2, select, socket, and time. Not all of the operations in those libraries will force a thread switch like zlib compression, but you also can't really rely on any module to not swap threads on you.

In Python 3.x, sys.getcheckinterval()/sys.setcheckinterval() have been deprecated in favor of sys.getswitchinterval()/sys.setswitchinterval(), the latter of which takes the desired number of seconds between switch times. Don't let this fool you, it has the exact same issues as the just discussed Python 2.3+ sys.getcheckinterval()/sys.setcheckinterval() calls.

What are some lessons we can learn here?

Always check before using anyone's quick hacks.
If you want to make your system more or less responsive, use sys.setcheckinterval().
If you want to execute code atomically, use a real lock or semaphore available in the thread or threading modules.

If you want to see more posts like this, you can buy my book, Redis in Action from Manning Publications today!

Friday, July 8, 2011

Building Adly's Twitter Analytics

If you are a regular reader of my blog, you will know that since joining Adly, I've been a little busy... Together, we have designed and built a real-time search platform, a content and location targeted ad network, and this morning we announced the public release of our Twitter Audience Analytics Platform, Adly Analytics.

This post will discuss some of the tools and methods that we used to pull and process the data to turn it into what it is today. As discussed in our press release and on TechCrunch, we have 8 tabs (12 views) worth of information that represent some of the ways that Adly influencers can view their audience data: Top Mentions, Also Follows, Top Followers, Follower Growth, Gender, City, State, and Country. In each section below, I will describe some of the techniques we use to gather and process this information.

Basic Overview

Before I dig into each of the specific tabs, I wanted to give a brief overview of the technology we use to make everything happen. There will be more detail to come on this front, and I am in the process of writing some technical whitepapers, but for now here is the big picture.

The main technical tool at our disposal is our custom Twitter Spider. Similar in a sense to GoogleBot that crawls much of the web, our spider crawls different parts of Twitter. On the more technical side of things, our spider communicates with Twitter's servers using their API.

Each different kind of data that we fetch requires a different type of spider, and each type of data is stored in one or more different formats. The underlying technology is actually very straightforward; we use ActiveMQ as a coarse-grained task queue (one of our senior engineers, Eric Van Dewoestine, who is behind Eclim, wrote our custom Python bindings about a year ago for our Ad Network), Redis as our general high-speed data store, and a custom task processor written in Python to spider, process, and store the data back into Redis.

Let's look at a few of the tabs (you can see them all on Adly.com/Analytics):

Top Mentions

The first tab, Top Mentions, is intended as a way to allow you to discover who are the most influential people that are @mentioning or retweeting you. We pull this information direct from Twitter and filter it to only represent those most influential people who are already interacting with you.

Follower Growth

The data that is behind Follower Growth is used as part of many other parts of our system. Generally, any time we receive user information from Twitter (sometimes we get it as part of the call, like in the case for Top Mentions), we check to see if the information is for a user that we have determined to be influential (in the Adly network, is from a set of the most influential Twitter users, etc.). If it is, we update the current count for their number of followers, and place that user in a list of users whose followers we want to pull. Over time, we will fetch the full listing of followers for everyone we have determined to be influential, and combine all of these lists to find new users whose information we do not yet have. Some of these users will then be influential, thus helping us to further develop our listing of influential people, find more followers, etc.

Also Follows

Once we have the full listing of followers for any two influencers, we can calculate how many followers they share. For example, 24% of @JetBlue's followers also follow @50Cent, but only 8% of @50Cent's followers follow @JetBlue. Then again, @50Cent has over 12 times the number of followers, so his followers do tend to follow @JetBlue far more than is typical. This gives both @JetBlue and @50Cent the opportunity to discover brands and influencers that have something in common with themselves.

Top Followers

Like Also Follows, since we have the full listing of followers on hand, and because we also have the public user information for all of those followers, we can easily determine things like Justin Bieber (@justinbieber) and Ashton Kutcher (@aplusk) are @charliesheen's two biggest followers. This is useful to help discover the biggest influencers that are interested in what you say, and who you may want to start interacting with.

Gender, City, State, and Country

Like Top Followers, again we have the full listing of followers on hand, but we've pre-processed the user information to determine the gender and location of Twitter users around the world. We use this information to give both an individual "62% of Kim Kardashian's followers are women" or "6.8% of Fox News' followers are in Texas", as well as "Kim Kardashian has 1.3 times the number of women following her than expected" or "Charlie Sheen has 3.2 times the number of followers in Ireland than expected".

For instance, @SnoopDogg has one of the most diverse and globally representative followings we've ever seen on Twitter. He has fans all around the world, and in many cities and states one might not expect to find a lot of Twitter users.

Beyond interesting, this is useful Business Intelligence for Snoop and his managers. Understanding who his mega fans are, and where they are most conventrated can be helpful in planning tours, personal appearances, PR and more to really bring his Twitter following to life in a meaningful way.

Anyway, I am actually headed out for some much needed R&R, so will come back to this topic later with more technical discussion and some graphics to illustrate how we created the new service.

If you want to check it out for yourself, leave a comment below with your Twitter @name, and I will be sure to get you in the queue.

Tuesday, July 5, 2011

Port Forwarding: Just use SSH

This post is going to deviate a bit from what I normally post about. Normally, I talk about building software and systems. I include code snippets, rationale for why we made a particular design decision, etc. This post is going to be different.

A few nights ago, I decided it was time for to have my own personal server in the cloud. A remotely accessible server for personal projects. For the last couple years, I'd been building personal projects with the AppEngine SDK (using Python, obviously), but constantly running into limitations in terms of data store queries, transactions, types of data to cache, etc. Having a server just allows me to put the software in the cloud that I need. Things like Redis (of which I am fond), a standard SQL server (PostgreSQL is my preferred DB, but MySQL works very well too), or if I feel like dog-fooding and the kinds of queries that a non-relational store offers YogaTable is a great little tool.

This leads me to a 2 hour battle I had the other night, culminating in a face-palm moment. As some of you may or may not know, I use Windows as my desktop operating system. It's a personal choice that is the result of a lot of frustration with both OS X (beachballs of death, multi-minute system hangs, bad external monitor behavior, visually pretty but poor quality software, ...) and with Linux as a desktop (I shouldn't have to spend time troubleshooting a kernel update that disables my wireless card weeks after the kernel update). On the other hand, I love Linux as a server, especially when someone else has already configured system images that work perfectly in the cloud. This leaves me in a bit of a bind.

For my professional work at Ad.ly, I've actually got a physical machine sitting under my desk at our office running Linux (same version as we run in production). To access it, I have a local VM (which was originally my development environment) that I use as my ssh tunneler; either directly to it's IP when I'm at work, or to it's VPN IP when I'm at home (using the convenient aliases "work" and "home", respectively). It forwards an ssh port (for FreeNX typically), a development http port, and a Samba port. A few nights ago when trying to set up a secondary set of tunnels to the new machine, I decided it was time to reboot my local VM (there were a few security updates). After the reboot, it would seem that whatever incantation that made the Samba port forward work (which is on port 139), was no longer working.

Now, instead of needing one more set of port forwards, I now needed the original set again. After mucking about for 2 hours with netcat, inetd (which doesn't allow different servers on different IPs), iptables, and authbind, I realized that my issue (I needed to forward port 139 on a few different IP addresses to different destinations) could be easily handled with a root local forward:

$ sudo ssh -L 192.168.a.b:139:192.168.a.b:1139 \
           -L 192.168.a.c:139:192.168.a.c:1139 \
           ... localhost

Then I just need to forward the proper ports 1139 to the destination boxes, which is where the facepalm comes in. I'd struggled with three of those four methods when I originally set up the box (the new one is authbind, which didn't let me bind port 139 as non-root), and I keep forgetting that for "forward port X on host A to port Y on host B", even with A and B being the same host, SSH is generally the simplest path to success. Not netcat, not iptables, and not inetd. Lesson re-learned.