Wednesday, December 12, 2012

Why "right to work" is bullshit

Today is going to be a different sort of blog post. Normally I talk about technology, but helping to run a startup, finishing a book, and preparing for the imminent birth of my first child have all kept me from posting for a while. Recently, though, I've heard so much obvious bullshit about one issue that it pushed me to write about it.


A little back story before I get to the purpose of this post, and my point. Tonight I attended the Southern California Python Interest Group meetup, where one of the speakers discussed continuing education; more specifically, how online courses, free online resources, and social interaction have grown and become very effective at teaching new programmers. One topic that came up was the relative lack of shared cultural context (history, literature, politics, etc.) that you can miss when relying only on online resources: when you are interested in something, you learn as much as you can about that one topic, but you don't necessarily learn very much about the world of other stuff that is out there. My contribution was to say that if you are an educator or peer to someone who is missing some of that education/context, you should suggest literature, movies, or other media that can fill it in, because context is important in everything.

The "right" to work


Recently, the state of Michigan passed a law that essentially says that if you are working in an otherwise "union workplace", you no longer have to pay union dues in order to gain the benefits of the union bargaining process.
I'm going to ignore a lot of the arguments for and against these so-called "right to work" laws, because almost all of them miss the purpose behind the laws actually being passed, miss the context in which unions were created, and completely forget the reasons why unions are still important today.

Power


In the industries in which most individuals are employed, the employer has the power. Make no mistake: aside from a few highly-educated, highly-trained, or high-demand careers, the employee has very little power. The employer decides wages, can say yes or no to vacation scheduling, can lay off or fire employees for little to no reason, and can require salaried employees to work evenings or weekends (through explicit or implied threats of reduced bonuses, demotions, withheld raises, etc.). Unless your social circle is very limited, you know someone who is worried about whether they will have a job in the future.

Unions exist to balance power. They were created literally from the blood, sweat, and tears of people who suffered from the power inequity inherent in the employer-employee relationship. Unions in the United States sprang from a society where people injured on the job would be fired because they could no longer do the work. Where an employee who questioned their boss could be fired and blacklisted from an entire industry. And in some industries, talking about collective bargaining could get you fired or even killed.

While some of the many issues that caused the spread of unions in the United States have been addressed through laws (anti-blacklisting, workers' compensation, anti-monopoly, ...), workers in the United States (and all over the world) still need to be protected.

Confusing the issue


One of the many methods that proponents of these laws use is to confuse the issue with bullshit. These "right to work" laws are not about letting everyone get a fair shake at getting and keeping a job (in fact, they reduce the protections that employees have from employers). They are about removing the only effective check on rising corporate power in the United States. You should find it damning that the people who say unions shouldn't exist (either explicitly or through right to work laws) are the exact same people who think corporations should be allowed to spend unlimited money on political campaigns, and that unions shouldn't be able to spend union dues on politics at all. Political figures who advocate right to work are trying to take away workers' right to be represented in politics by the organizations actively looking out for them - their unions.

I've also heard "last-hired, first-fired" clauses in education held up as an example of union contracts gone bad, and as a reason why unions need to be stopped. But this is just another argument that confuses the real issue. Those contracts were agreed upon to protect career teachers who wouldn't have other employment opportunities outside the school they'd been teaching at, in some cases for decades, because despite attempts to address the issue, ageism exists in the United States just as surely as racism, sexism, glass ceilings, and homophobia do. More to the point, the issue isn't who should be hired or fired depending on the budget. The issue is that we are not spending enough on education.

Summing up


If there are just two things that you take away from this (in addition to wanting to learn more about the history of unions and union creation in the United States), they should be:
  1. Arguments that use one or two example contracts as reasons why unions don't work are confusing the issue - there are millions of union-negotiated contracts that don't have those clauses
  2. "Right to work" laws are an attempt to reduce employee power, to reduce low/mid-level employee wages, to provide more short-term corporate profits, to ultimately increase pay to executives (union labor workers generally earn 10-30% higher wages compared to non-union labor [reference])

Sunday, April 1, 2012

Python bytecode hacks, gotos revisited

Some 8 years ago today, two April Fools' jokes appeared on python-list and its mirror newsgroup, comp.lang.python. At the time, I was too young or too much of an idiot to appreciate them.
The first was a pronouncement that Guido had decided that, instead of continuing the tab vs. space wars, we would use the unicode character u'\x08' for indentation, and our editors/IDEs could be told how wide to display the character. Even better, almost all editors/IDEs already supported the functionality. Like the wet blanket that I was at the time, I of course pointed out that PEP 8, which defines "the standard" to be 4-space indents, no tabs, hadn't been updated, and that no discussion of the kind had occurred in python-dev at any time over the weeks prior. In a nutshell, I was a jerk. Sadly, I can't find the original thread in Google Groups. I still feel bad about being the guy who rained on that parade. So, whoever wrote it, I'm sorry.

The second bit of awesome that I remember was Richie Hindle's Goto in Python. If you've not seen it, or you haven't tried it out, you are missing something. The miracle of it all is that it worked: the details are very cute, it hooks the system trace function and mangles the next line number to be executed. Now, I don't believe that I said anything negative about it at the time, but I certainly wasn't terribly enthusiastic either. Having recently been reminded of Richie's goto, it dawned on me how brilliant an idea the joke was. I'm sure someone thought gotos were a good idea in the early days of Python, and Guido would rightfully have rejected them. But that Richie came back some 12+ years after Python's creation with a working implementation that didn't alter the underlying syntax? Genius.

Getting to the point

Why am I writing this post? Surely it can't just be to pontificate on the history of Python. Oh no. It's time that I give something back for not being able to appreciate a joke all those years ago, and I'd like to commemorate the 8th anniversary of Richie and the unknown 'tab indent' poster's brilliance. Recently, it came to my attention that Carl Cerecke re-implemented goto in Python by hacking bytecodes (way back in November of 2009; I know, I'm a little late to the party). Why didn't I see it before? Why didn't Richie see it before? It's an evolutionary step in the right direction: start with a non-practical but great joke, and take it to its logical conclusion by making it practical. Now, Carl didn't see it as a joke, and reasonably so. And after seeing that it could be practical, I agree that it's no longer a joke (I will explain why later). But sadly, the recipe has bugs. My contribution is to get rid of the bugs and add experimental features.

Gotos in Python

First, let's look at the syntax based on one of the examples listed as part of Carl's recipe.
@goto
def test1(n):

    s = 0

    label .myLoop

    if n <= 0:
        return s
    s += n
    n -= 1

    goto .myLoop

assert test1(10) == 55
You define a label using 'label.<labelname>', optionally adding a space between 'label' and the '.' to make it easier to read. Similarly, gotos are written as 'goto.<labelname>'. This is perfectly valid Python syntax: each is just an attribute reference. The magic lies in the decorator up front, which translates those attribute references into bytecode, turning a label into a sequence of no-ops and a goto into a jump to its label. But, as I mentioned before, there are bugs.
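Before getting to the bugs, you can convince yourself that these really are plain attribute references with a quick disassembly (output shown is roughly what CPython 3.x produces; exact opcodes and offsets vary by version):
import dis

# 'goto .start' is just an attribute reference followed by POP_TOP
dis.dis(compile("goto .start", "<demo>", "exec"))
#  1  0 LOAD_NAME    0 (goto)
#     2 LOAD_ATTR    1 (start)
#     4 POP_TOP
#     6 LOAD_CONST   0 (None)
#     8 RETURN_VALUE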

The recipe as listed suffers from 3 major bugs, 2 of which have a similar cause. In his decorator, Carl uses a dictionary mapping each goto label to the single bytecode offset where a goto for that label occurs. Because of this, you can't have two gotos leading to the same label: when the translation occurs, it leaves all but the last goto for that label unchanged, so the more useful example below, which uses gotos for error handling, wouldn't work.
@goto
def handler(data):
    if testA(data):
        if testB(data):
            if testC(data):
                ...
                if testD(data):
                    goto .error
            ...
        
        else:
            goto .error
    ...
    
    return "success"

    label .error
    # do something complicated that you don't want
    # to duplicate in every error condition
    return "error"
Only the last goto would be translated into a jump; the other one will sit around and not cause an error until it is hit, at which point you will get an AttributeError. The solution, of course, is to map each label name to a list of bytecode offsets for the gotos leading to it. An easy fix.
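Here is a minimal sketch of the fixed bookkeeping, written against the dis module on modern CPython (the helper name is mine; Carl's recipe works on the raw bytecode string rather than through dis, but the idea is the same):
import dis

def find_labels_and_gotos(func):
    # map each label to the offset where it is defined, and each goto
    # target to *every* offset where a goto appears, not just the last
    labels = {}  # label name -> bytecode offset
    gotos = {}   # label name -> [bytecode offsets]
    ops = list(dis.get_instructions(func))
    for load, attr in zip(ops, ops[1:]):
        if load.opname == 'LOAD_GLOBAL' and attr.opname == 'LOAD_ATTR':
            if load.argval == 'label':
                if attr.argval in labels:
                    # a duplicate label is an error, not a silent
                    # overwrite (this is bug #2, described below)
                    raise SyntaxError('duplicate label: ' + attr.argval)
                labels[attr.argval] = load.offset
            elif load.argval == 'goto':
                gotos.setdefault(attr.argval, []).append(load.offset)
    return labels, gotos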

Remember earlier when I mentioned that I no longer think that gotos in Python are a joke? Well, if you dig through the CPython source code, you will come to find that gotos are used in the C parts of CPython for error handling. In Python, you can implement much of the same control flow with loops and 'break/continue' statements, but it's legitimately less elegant than a goto. I recently found myself writing code for work where using gotos for error handling would have simplified our code significantly. Ultimately, I ended up not using gotos, only because our requirements changed and I could gut the function to remove all of the strange corner cases. But I now recognize how useful gotos can be.

The second bug is that if you happen to include the same label twice, the decorator won't warn you, and will again just use the last label as the destination for jumps, leaving earlier labels as AttributeError potholes. This is also an easy fix: we just check whether the label has already been seen during the bytecode manipulation, and if so, raise an exception (the scanning sketch above does exactly this).

The last bug is that the earlier decorator creates a new function that doesn't preserve the original function's attributes. To fix it, we bypass creating a new function entirely by replacing the code object on the existing function. This is, by the way, one of my favorite ways to monkey-patch libraries and running systems; you don't need to replace every reference to a function or method, you just need to replace its code object.
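Here is a tiny self-contained demonstration of the technique (not the goto decorator itself, just the code-object swap):
def greet():
    return 'hello'

def excited_greet():
    return 'hello!'

reference = greet                        # a reference taken before the patch
greet.__code__ = excited_greet.__code__

# every existing reference picks up the new behavior, because they
# all still point at the same function object
assert reference() == 'hello!'
But I digress, let's add a feature.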

Computed gotos

Richie's library also supported computed gotos, but Carl didn't want to include them because they are not Pythonic. I suspect a different reason: computed gotos are a huge pain in the ass. Python's underlying bytecode doesn't support jumping to an arbitrarily computed offset, and because Python's code objects are immutable, you can't even write self-modifying code without mucking about with the ctypes module. How frustrating ;) . But don't fret, because Richie has the answer: trace functions. Now, since this is a post about bytecode hacking, of course we've got to keep hacking on bytecodes. Also, in C/C++, computed gotos rarely appear bare the way Richie supported them; they usually take the form of switch/case statements. Ah hah! All we need is a Python syntax that gives us enough room for the shenanigans required to implement switch/case. The following (very ugly) code listing shows the syntax we are going to use.
@switch
def test2():
    a = 1
    while switch *a:
        case[0]
        # not taken
        goto .end

        case[1]
        # taken

        case[2]
        # there is fallthrough like C/C++!

    # default code goes here

    label .end
    # this is after the switch/case
Now, don't let that syntax fool you: the while loop is just there to set up a block containing at least the 12 bytecodes we need for the hack. A with statement would have been sufficient, but it has a lot of setup and teardown that is annoying to replace (a total of 24 bytecodes), and the other non-looping block in Python, 'if', only offers 11 bytecodes. You are probably wondering why we didn't just re-use "continue" and "break" to jump back to the switch or out of it. The short reason is that I didn't want to rewrite the entire bytecode of the function to adjust offsets (continue/break are 1 bytecode, while jumps are 3 bytecodes), and the longer reason is that I was being lazy :P

Okay, so we've got a syntax, but how do we jump to arbitrary case statements? We set up a helper function that handles jumping to arbitrary line numbers, then we replace the while loop's bytecode with alternate bytecode that fetches the destination line number from a dictionary we packed away in the locals array (defaulting to a line just after the while loop, so the "default" case is just outside the while block), and we call the helper to jump to the proper line number instead of looping.

The helper that actually jumps to arbitrary line numbers sets up the system trace function just enough to induce the line number change, then lets the trace clean itself up. With this method, switch/case statements execute fairly quickly, aside from the trace function overhead.
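To make the trick concrete, here is a stripped-down sketch of such a helper (my reconstruction of the idea, not the actual code; the same trace-function rules that make this work also forbid jumping into or out of loop blocks, which is part of the sadness below):
import sys

def jump_to_line(target):
    # arrange for the *caller* to resume at absolute line `target` the
    # next time a line event fires in its frame
    caller = sys._getframe(1)

    def tracer(frame, event, arg):
        if event == 'line':
            frame.f_lineno = target   # only legal inside a trace function
            frame.f_trace = None      # then remove the local trace...
            sys.settrace(None)        # ...and the global one
        return None

    caller.f_trace = tracer
    # line events only fire while a global trace function is installed
    sys.settrace(lambda *args: None)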

The sad truth

I know what you are thinking: "Holy crap, switch/case in Python with a bytecode hack? We can totally use that to optimize some types of parsing!" Well, it's not that simple. Because of what we did with the trace function, we can't even put the switch/case inside a loop: doing so segfaults the Python interpreter due to the way trace functions are manipulated. I have not dug too deeply into this, nor have I tried to fix it, primarily because I've been too busy. Heck, I'm writing this blog post on my birthday, because it's the first spare time I've had in weeks.

Even worse, because of the way that Python handles loops and with statements, jumping into and out of them can produce some pretty nasty results, up to and including a segfault. Those issues are still there, though an intrepid hacker could continue down this path, adding checks that detect loop starts/ends and how deep into blocks the gotos and labels sit. If someone doesn't beat me to it, maybe I'll get to it in a couple months, but I'm not going to make promises.

Now, with the sad truth comes a bit of sunlight to brighten your day: there is a way to fix the segfault. Rather than abusing the system trace function, we can set up a sequence of if statements in a tree structure to find the correct case in O(log n) comparisons for n case statements. This actually wouldn't be all that unreasonable, except that there isn't room inside the existing bytecode to make it happen, so we would need to append extra bytecode to the end, or inject it and adjust the offsets in both the bytecode and the line number table. Maybe I'll get around to it some day.

Conclusion

You can see, download, and fork gotos in Python at this Github gist. Keep making cool stuff, and don't let party poopers like me get you down.

Friday, March 16, 2012

Why we didn't use a bloom filter

After being featured on the High Scalability blog, my 6-month-old post about Improving Performance by 1000x has seen renewed interest.

Occasionally I've received questions and comments about that post in other forums, but today I got a couple of comments from two well-intentioned commenters that highlighted some very interesting and very common misunderstandings about how modern computers and compilers work. Unfortunately, a couple of those comments have since been removed (I didn't remove them; I only remove obvious spam), so I will summarize the questions after I briefly describe what I built.

What I built

I wrote a handful of lines of C code that took a sorted sequence of 4-byte integers representing a Twitter user's followers, and intersected that sequence against thousands of similar sequences for other Twitter users. By being careful, we went from taking roughly 7 seconds to intersect that data in Redis, to taking 6 milliseconds with our custom software. On the memory side of things, our method reduced memory requirements by roughly 20x compared to the standard Redis approach.

Let me be clear on this: Redis is a great tool. I still use it every day (even though I've left Adly for ChowNow), and I'm writing Redis in Action for Manning Publications (I should be working on it now, but I wanted to blog on a Friday night). But for some problems, like the one we had, custom software can be a huge win. On with the questions!

Question: Why not use the high-performance C++ STL set intersection code?

Updated Answer: The STL is a pain to work with, and my experience on the C++ side of things (6 months professionally) told me that I would find neither performance nor simplicity by using it.

On top of all of that, we didn't actually need the intersected list, we only needed the count, so even if we had gone through the pain of writing it with the STL, we'd have computed a result we didn't need, and would have had to write our own intersection code in the end anyway.

I originally started with code very similar to the C++ STL set intersection code and was dissatisfied with the results. Observing that the longer sequence will skip more items than the shorter one, I wrote a 2-level loop that skips over items in the longer sequence before performing comparisons with the shorter sequence. That got me roughly a 2x performance advantage over the STL variant.
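The shape of that loop, sketched in Python rather than the original C (the names and structure here are my reading of the approach described above, not the original code, which operated on arrays of 4-byte integers):
def intersect_count(shorter, longer):
    # both inputs are sorted; count common items without building
    # the intersection itself
    count = i = j = 0
    while i < len(shorter) and j < len(longer):
        v = shorter[i]
        # inner loop: skip items in the longer sequence before doing
        # any comparisons against the shorter one
        while j < len(longer) and longer[j] < v:
            j += 1
        if j < len(longer) and longer[j] == v:
            count += 1
            j += 1
        i += 1
    return count

assert intersect_count([1, 3, 5], [1, 2, 3, 4, 6]) == 2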

There have been a few comments about my concerns regarding STL performance, which I wanted to respond to.

My original post mentioned performance issues with vtables in modern C++. My experience from ~4 years ago was with a combination of the STL and other C++ classes, and I may have incorrectly attributed the relatively low performance of some operations to method dispatch, rather than to the quality of the containers we were using. Let me explain.

In 2004-2005, I built a hash table in C as part of a search engine, which was used as a set (no values, only keys). It used 64-bit hashes and 64-bit values on a 32-bit machine. Using this hash set, I was able to add/update/remove items at a rate of 8 million items/second on a mobile 1.2 GHz Intel Core Solo.

In 2007-2008, I used std::hash_set to store a set of 64-bit integers, and I was surprised to find that it could only add/update/remove 3 million items/second, despite running on a 3.0 GHz Intel P4 (enabling or disabling hyperthreading only affected performance by 5-10% on that problem). I shrugged it off and called it a vtable problem*, a naive conclusion that I hadn't revisited until now. Rethinking it, it was more likely a combination of the P4 architecture's inefficiency (specifically its pipeline depth: 11-13 stages on the Core Solo vs. 20+ on the P4) and the specifics of the hash table itself.

* I had been willing to quickly jump to the conclusion that it was the language/compiler because I was used to that on the Python side of things. No really. Why is standard Python slow compared to C/C++/Java? It's simply a matter of time and money not having been spent to research and implement JIT and/or static compilers. Look at Boo (which uses static types), PyPy (which is built on a restricted version of Python), or Unladen Swallow (which uses an LLVM-backed JIT); once time and money are spent on improving Python performance, it happens. Incidentally, that's also the only reason why the JVM is so fast: billions of dollars have been spent making it faster over the years.
Probably mostly wrong portion of the original answer:
Have you ever used the STL? I did for about 6 months professionally, and I'm glad that I could go back to C and Python. More seriously, object orientation in C++ doesn't come for free. Aside from the pain of C++ template syntax (and difficult-to-parse error messages), all of that object orientation hides the complexity of method dispatch in what are known as vtables. If your code has to go through a vtable to dispatch a call, you are wasting time that could be spent doing something else. For most software, that doesn't matter. But when we are talking about nanoseconds (and we will shortly), every bit of work costs you.

Even ignoring the vtable cost, suppose all of the set code were inlined. Now all of your iterator manipulation and boundary checking sits in your core intersection loop, introducing more branches and more branch mispredictions, which ultimately slow down the intersection code.

Question: Why not use a bloom filter?

Answer: Bloom filters are slower and use more memory.

DRAM is actually really slow. When you make a request to read DRAM, it returns a 64 byte cache line from that region in about 15 nanoseconds (in the best case). That's around 4.3 gigs/second random access. If you are accessing memory in a predictable pattern, the second block you request is returned in about 10 nanoseconds, and subsequent blocks can be returned in as fast as 3-4 nanoseconds. So, after the first few "slow" requests, you can read at a peak rate of roughly 21.3 gigs/second for predictable access patterns on high end machines.

Back to our case. We intersected a set of 5 million items against a set of 7.5 million items in 6 milliseconds. Doing the math... 4 bytes * (5m + 7.5m) / 6ms -> 8.3 gigs/second. So, we were within a factor of ~2-2.5x of maxing out the memory bandwidth available to the entire system with our 6ms intersection.

Let's pretend that we had already built a bloom filter, and let's even say the math says we only need 7 probes (the reality would probably be closer to 16 or so). At 15ns per probe (bloom filter probes are unpredictably random accesses), with 7 probes for each of the 5 million users in our input set, my math says the intersection would take 525 milliseconds, which is 87 times slower than our intersection method.
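Here is the arithmetic behind the last two paragraphs as a quick sanity-check script (every figure is one of the assumptions stated above, not a new measurement):
# figures from the discussion above
item_bytes = 4
smaller, larger = 5000000, 7500000   # our two set sizes
intersect_secs = 0.006               # measured intersection time

# sequential read rate implied by the 6ms intersection
print(item_bytes * (smaller + larger) / intersect_secs / 10**9)  # ~8.3 GB/s

probes = 7          # optimistic probes per bloom filter lookup
probe_secs = 15e-9  # best-case uncached DRAM access
print(probes * probe_secs * smaller)  # 0.525 seconds, ~87x slower than 6ms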

But here's the other nasty bit of reality: constructing the bloom filter for that 5 million entry set (because we need a bloom filter for every set) takes at least twice as long as querying it. Why? Because to update 1 bit in the filter, you first need to read the cache line, then update the bit in a register, then write the data back. I know what you are thinking: your write-through cache will help you. But you are wrong. Because you are writing randomly to your bloom filter, you run out of cache very quickly, and the caching layers within your processor will start blocking reads and writes, because they have a backlog of data to flush to memory that is limited to 15ns per random write. So with a (lower-bounded) 525ms to intersect, that's roughly 1.05 seconds to construct the bloom filter in the best case.

Meanwhile, the slow O(n log n) standard C sort() finishes in 1.2 seconds, within spitting distance of the theoretically optimal bloom filter construction we just described. We also only sort once, but intersect thousands of times, so even if the bloom filter did offer a modest construction-time advantage, it wouldn't be worth the performance penalty during intersection.

Another drawback of the bloom filter is memory. I know what you are thinking: "well, with a 7 probe bloom filter, maybe you only need 2-3 bytes per user, instead of the 4 bytes per user". But in order to perform the intersection, you still need your list of users (sorted or not) in addition to the bloom filter, leaving you with 6-7 bytes per user.

Conclusion: bloom filters would be 80-160 times slower to intersect, would use more memory, don't offer a significant construction-time advantage (probably closer to 2x slower for a 16 probe bloom filter), and are only approximately correct.

The lesson you should learn

The reason our custom solution was so much faster than Redis comes down primarily to memory access patterns and memory efficiency. Large sets in Redis have roughly a 60-80 byte per-entry overhead on a 64-bit machine, most of which needs to be read to intersect one item with another; and because Redis builds sets as hash buckets of linked lists, intersection can touch even more than that per item. Ours did so well because we reduced our space to 4 bytes per user (from 60-80), only ever read those 4 bytes, and ensured that all of our memory access patterns were sequential. Given the practical realities of modern processors and memory, there is no faster way to perform single-processor set intersections.

If you want to see more posts like this, you can buy my book, Redis in Action from Manning Publications today!

Thursday, January 12, 2012

Creating a Lock with Redis

This post is a partially rewritten excerpt from Redis in Action, the book I am writing for Manning Publications, which does not yet have a release date. I hope to post other excerpts from the book as time allows.

Redis Transactions

In the world of concurrent and parallel programming, at some point you will write something that requires access to a resource while no other thread or process is accessing it. In Redis this situation is very common: multiple readers and multiple writers all deal with the same data structures. Redis includes five commands for handling optimistic locking: WATCH, UNWATCH, MULTI, EXEC, and DISCARD.

Code that needs to check values in Redis before modifying them typically follows a 4-step process. First you WATCH the data that you want to modify and check. Next you check your data. If it isn't what you need in order to continue, you UNWATCH and return. If it is what you were looking for, you start a MULTI transaction, send your commands, then EXEC them. Below is a simple example that transfers money between two accounts, making sure to prevent overdrafts.

import redis  # redis-py client

def transfer_funds(conn, sender, recipient, amount):
    pipe = conn.pipeline(True)
    while 1:
        try:
            pipe.watch(sender)
            # hget returns a string, so convert it before comparing
            if int(pipe.hget(sender, 'money') or 0) < amount:
                pipe.unwatch()
                return False

            pipe.multi()
            pipe.hincrby(sender, 'money', -amount)
            pipe.hincrby(recipient, 'money', amount)
            pipe.execute()
            return True
        except redis.exceptions.WatchError:
            pass
The only command not used above is DISCARD, which discards any commands passed after MULTI. As you can see, we loop over the money transfer until it either fails due to insufficient funds or succeeds. This is fine, but if an account is updated often, or if some shared data structure is constantly being updated, you can fall into the except clause and have to retry the transaction repeatedly.

Locking

In other software, rather than using WATCH/MULTI/EXEC, there is a primitive called a lock, which grants exclusive access to a resource: you acquire the lock, perform your operation, then release the lock. Given the variety of commands in Redis, you can build a lock using the available tools. Unfortunately, using those tools correctly is not easy. In my research about locks in Redis, I haven't found a single implementation available online that is 100% correct (until now). Some of the problems with the locks I have seen are as follows:
  1. A process acquired a lock, operated on data, but took too long and the lock was automatically released. The process doesn't know that it lost the lock, or may even release the lock that some other process has since acquired.
  2. A process acquired a lock for an operation that takes a long time, and crashed. Other processes that want the lock don't know what process had the lock, so they cannot detect that the process failed, and they waste time waiting for the lock to be released.
  3. One process had a lock that timed out. Multiple other processes then try to acquire the lock simultaneously, and more than one succeeds.
  4. Because of a combination of #1 and #3, many processes now hold the believed exclusive lock, leading to data corruption or other incorrect behavior.
Even if each of these problems had a 1-in-a-million chance of occurring, Redis' high performance (typically 100,000 to 225,000 operations/second) can cause those problems to occur under heavy load surprisingly often (up to a few times per minute), so it is important to get locking right.

Building a mostly correct lock in Redis is pretty easy. Building a completely correct lock in Redis isn't actually much more difficult, but it requires being extra careful about the operations we use to build it. (In the book, we first build a simple lock to show the basics, then add the full functionality; here we will jump straight to the fully-featured lock.)

The first part of making sure that no other code can run is to "acquire" the lock. The natural building block for acquiring a lock is the SETNX command, which sets a value only if the key doesn't already exist. We will set the value to a unique identifier to ensure that no other process can get the lock. If we were able to set the value (we have acquired the lock), we immediately set an expiration on the key, so that if we take too long with our operation, the lock is eventually released. But if our client happens to crash (and the worst place for it to crash is between SETNX and EXPIRE), we still want the lock to eventually time out. To handle that situation, any time a client fails to get the lock, it checks the expiration on the lock, and if none is set, sets it. Because clients check and set timeouts whenever they fail to get the lock, the lock will always have a timeout, and will eventually expire, letting other clients acquire it after it times out.

But what if multiple clients set the expiration time at the same time? That is fine, they will be running essentially at the same time, so the expiration will be set for the same time.

import time

def acquire_lock(conn, lockname, identifier, atime=10, ltime=10):
    end = time.time() + atime
    while end > time.time():
        if conn.setnx(lockname, identifier):
            # we got the lock; make sure it will eventually expire
            conn.expire(lockname, ltime)
            return identifier
        elif not conn.ttl(lockname):
            # no expiration is set (the holder crashed between SETNX
            # and EXPIRE), so set one; redis-py (at the time of this
            # writing) returns None for keys with no TTL
            conn.expire(lockname, ltime)

        time.sleep(.001)

    return False
To release the lock, we have to be at least as careful as when acquiring it. Between the time we acquired the lock and the time we try to release it, someone may have done bad things to it. To release the lock, we WATCH the lock key, then check that the value is still what we set it to before deleting it. This also prevents us from releasing a lock multiple times.
def release_lock(conn, lockname, identifier):
    pipe = conn.pipeline(True)

    while True:
        try:
            pipe.watch(lockname)
            if pipe.get(lockname) == identifier:
                pipe.multi()
                pipe.delete(lockname)
                pipe.execute()
                return True

            pipe.unwatch()
            break

        except redis.exceptions.WatchError:
            pass

    # we lost the lock
    return False
And as a bonus, a Python context manager with the updated transfer_funds() code using it...
import contextlib, uuid

class LockException(Exception):
    pass

@contextlib.contextmanager
def redis_lock(conn, lockname, atime=10, ltime=10):
    identifier = str(uuid.uuid4())
    # locals() here is exactly acquire_lock()'s argument list:
    # conn, lockname, atime, ltime, and identifier
    if acquire_lock(**locals()) != identifier:
        raise LockException("could not acquire lock")
    try:
        yield identifier
    finally:
        if not release_lock(conn, lockname, identifier):
            raise LockException("lock was lost")

def transfer_funds(conn, sender, recipient, amount):
    with redis_lock(conn, 'lock:' + sender):
        # hget returns a string, so convert it before comparing
        if int(conn.hget(sender, 'money') or 0) < amount:
            return False

        pipe = conn.pipeline(True)
        pipe.hincrby(sender, 'money', -amount)
        pipe.hincrby(recipient, 'money', amount)
        pipe.execute()
        return True
If you generated your identifier correctly (a UUID like we used, or IP address + process id + thread id, etc.), this lock is correct. Not almost correct, not mostly correct, not a little bit wrong. Completely correct. Other locks that I've seen for Redis make one of a couple different mistakes, usually either accidentally resetting the timeout when they shouldn't, or deleting the lock when they shouldn't. Those kinds of errors lead to the 4 problems I listed earlier.

Simplifying Locking

After reading this, you may think that this is more work than should be necessary to build a basic lock with timeouts. You are not alone. Two feature requests have been made to help the situation. The first is to allow SETNX to take an optional 3rd argument: an expiration time to apply if the item was set. That reduces acquire_lock() to one Redis command in a loop (the 10 lines above turn into 4). The second is a new command, DELC <key> <value>, which deletes the given key only if its current value matches the passed value. That reduces release_lock() to a single command executed once in the body of the function (the 15 lines above turn into 2). You can read and follow the discussion on the Redis mailing list.
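If both proposals were accepted, the lock might collapse to something like the following sketch. To be clear, this is hypothetical: neither command variant exists as of this writing, so I'm guessing at the client API by pushing raw commands through redis-py's generic execute_command().
import time

def acquire_lock(conn, lockname, identifier, atime=10, ltime=10):
    # hypothetical 3-argument SETNX: set the value and its expiration
    # in one atomic step, closing the crash window entirely
    end = time.time() + atime
    while end > time.time():
        if conn.execute_command('SETNX', lockname, identifier, ltime):
            return identifier
        time.sleep(.001)
    return False

def release_lock(conn, lockname, identifier):
    # hypothetical DELC: delete the key only if it still holds our
    # identifier, so we never release someone else's lock
    return bool(conn.execute_command('DELC', lockname, identifier))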


Thank You

If you liked this article about Redis, the section on locks that will be included in the book expands the discussion and includes more examples. If Redis changes to incorporate the new commands and features to make locking easier, you can look forward to another article revisiting classic Redis locking behavior. If you want to see more posts like this, you can buy my book, Redis in Action from Manning Publications today!