Thursday, January 12, 2012

Creating a Lock with Redis

This post is a partially rewritten excerpt from Redis in Action, the book that I am writing for Manning Publications, which does not currently have a release date. Over time, I hope to include other excerpts from the book as time allows.

Redis Transactions

In the world of concurrent and parallel programming, there will be some point where you will write something that requires access to a resource without any other thread or process accessing it. In Redis this is actually very common; multiple readers, multiple writers, all dealing with the same data structures. Redis includes a combination of five commands for handling optimistic data locking. Those commands are WATCH, UNWATCH, MULTI, EXEC, and DISCARD.

The typical path for code that intends to modify data in Redis that requires checking values before updating will have a 4-step process. First you WATCH data that you want to modify/check on. Next you check your data. If it isn't what you need to continue, you UNWATCH and return. If it was what you were looking for, you start a MULTI transaction, send your commands, then EXEC them. Below is a simple example that transfers money between two accounts, making sure to prevent overdrafts.

def transfer_funds(conn, sender, recipient, amount):
    pipe = conn.pipeline(True)
    while 1:
        try:
            pipe.watch(sender)
            if pipe.hget(sender, 'money') < amount:
                pipe.unwatch()
                return False

            pipe.multi()
            pipe.hincrby(sender, 'money', -amount)
            pipe.hincrby(recipient, 'money', amount)
            pipe.execute()
            return True
        except redis.exceptions.WatchError:
            pass
The only command not used above is the DISCARD command, which will discard any commands passed after MULTI. As you can see, we will attempt to loop through the above money transfer until it either fails due to no money, or succeeds. This is fine, but if you have some account that is updated often, or if you have some shared data structure that is constantly being updated, you can fall into the except clause and have to retry the transaction repeatedly.

Locking

In other software, rather than using WATCH/MULTI/EXEC, there is a primitive called a Lock, which allows us to gain exclusive access to a resource. You acquire the lock, perform your operation, then release the lock. Because of the variety of commands in Redis, you can build a lock using the available tools. Unfortunately, using those tools correctly is not easy. In my research about locks in Redis, I haven't found a single implementation available online that is 100% correct (until now). Some of the problems with locks that I have seen are as follows:
  1. A process acquired a lock, operated on data, but took too long and the lock was automatically released. The process doesn't know that it lost the lock, or may even release the lock that some other process has since acquired.
  2. A process acquired a lock for an operation that takes a long time, and crashed. Other processes that want the lock don't know what process had the lock, so cannot detect that the process failed, and wastes time waiting for the lock to be released.
  3. One process had a lock, but it timed out. Other processes try to acquire the lock simultaneously, and multiple processes are able to get the lock.
  4. Because of a combination of #1 and #3, many processes now hold the believed exclusive lock, leading to data corruption or other incorrect behavior.
Even if each of these problems had a 1 in a million chance of occurring, Redis' high performance (typically 100,000 to 225,000 operations/second) can cause those problems to occur under heavy load surprisingly often (up to a few times per minute), so it is important to get locking right.

Building a mostly correct lock in Redis is pretty easy. Building a completely correct lock in Redis isn't actually much more difficult, but it requires being extra careful about the operations we use to build it (in the book, we first build a simple lock to show the basics, then add in the full functionality, here we will jump to the fully-featured lock).

The first part of making sure that no other code can run is to "acquire" the lock. The natural building block to use for acquiring a lock is the SETNX command, which will only set a value if the value doesn't already exist. We will set the value to be a unique identifier to ensure that no other process can get the lock. If we were able to set the value (we have acquired the lock), then we immediately set the expiration of the key to ensure that if we take too long with our operation, the lock is eventually released. But if our client happens to crash (and the worst place for it to crash for us is between SETNX and EXPIRE), we still want the lock to eventually time out. To handle that situation, any time a client fails to get the lock, the client will check the expiration on the lock, and if it's not set, set it. Because clients are going to be checking and setting timeouts if they fail to get a lock, the lock will always have a timeout, and will eventually expire, letting other clients get a timed out lock.

But what if multiple clients set the expiration time at the same time? That is fine, they will be running essentially at the same time, so the expiration will be set for the same time.

def acquire_lock(conn, lockname, identifier, atime=10, ltime=10):
    end = time.time() + atime
    while end > time.time():
        if conn.setnx(lockname, identifier):
            conn.expire(lockname, ltime)
            return identifier
        elif not conn.ttl(lockname):
            conn.expire(lockname, ltime)

        time.sleep(.001)

    return False
To release the lock, we have to be at least as careful as when acquiring the lock. Between the time when we acquired the lock and when we are trying to release it, someone may have done bad things to the lock. To release the lock, we actually need to WATCH the lock key, then check to make sure that the value is still the same as what we set it to before we delete it. This also prevents us from releasing a lock multiple times.
def release_lock(conn, lockname, identifier):
    pipe = conn.pipeline(True)

    while True:
        try:
            pipe.watch(lockname)
            if pipe.get(lockname) == identifier:
                pipe.multi()
                pipe.delete(lockname)
                pipe.execute()
                return True

            pipe.unwatch()
            break

        except redis.exceptions.WatchError:
            pass

    # we lost the lock
    return False
And as a bonus, a Python context manager with the updated transfer_funds() code using it...
import contextlib, uuid

class LockException(Exception):
    pass

@contextlib.contextmanager
def redis_lock(conn, lockname, atime=10, ltime=10):
    identifier = str(uuid.uuid4())
    if acquire_lock(**locals()) != identifier:
        raise LockException("could not acquire lock")
    try:
        yield identifier
    finally:
        if not release_lock(conn, lockname, identifier):
            raise LockException("lock was lost")

def transfer_funds(conn, sender, recipient, amount):
    with redis_lock(conn, 'lock:' + sender):
        if conn.hget(sender, 'money') < amount:
            return False

        pipe = conn.pipeline(True)
        pipe.hincrby(sender, 'money', -amount)
        pipe.hincrby(recipient, 'money', amount)
        pipe.execute()
        return True
If you generated your identifier correctly (UUID like we did, or IP address + process id + thread id, etc.), this lock is correct. Not almost correct, not mostly correct, not a little bit wrong. Completely correct. Other locks that I've seen for Redis have one of a couple different mistakes, usually either accidentally resetting the timeout when you shouldn't, or deleting the lock when you shouldn't. Those kinds of errors lead to the 4 problems I listed earlier.

Simplifying Locking

After reading this, you may think that this is a little more work than should be necessary to build a basic lock with timeouts. You are not alone. Two requests have been made to try to help the situation. The first is allowing SETNX to take an optional 3rd argument that is the expiration time if the item was set. That reduces the Redis commands in acquire_lock() to one command that is looped over (the 10 lines above turn into 4 lines). The second is a new command DELC <key> <value>, which will only delete the given key if the current value is the same as the passed value. This reduces the commands in release_lock() to one command that is executed once in the body of the function (the 15 lines above turn into 2 lines). You can read and follow the discussion on the Redis mailing list.


Thank You

If you liked this article about Redis, the section on locks that will be included in the book expands the discussion and includes more examples. If Redis changes to incorporate the new commands and features to make locking easier, you can look forward to another article revisiting classic Redis locking behavior. If you want to see more posts like this, you can buy my book, Redis in Action from Manning Publications today!