Wednesday, May 29, 2019

Re: Leaky Python Batteries

After reading about Amber Brown's talk on Python and its included "batteries" (the standard library), I find a lot to agree with, and only a little to disagree with. I know I'm late.

Here's my take...

For background: I'm the reason that asyncore/asynchat/... (callback-defined async sockets in Python before asyncio) were included and updated in Python 2.6, and lived to see Python 3.0 (other folks took on maintenance after 2.6.?/3.0.?). I stepped on a lot of toes to make that happen, but I was still using asyncore/asynchat, and not having them in the standard library would have been bad for me and folks like me. So I "stepped up".

My experience is that moving anything in Python in any direction is hard and frustrating, no matter how "sane" you think your words and opinions are*. I remember being confused that I was even having to argue for static byte literals in Python 3.0, never mind pushing for a syntax like b'hello' instead of bytes([104, 101, 108, 108, 111]) [1], or even that an immutable variant should be available (instead of using a tuple for hash keys).

Thinking about it well after the fact, I have the sense it was because I didn't feel like I had any real level of ownership or control. I felt responsible as a member of the community to voice my opinion about the software I was using, to try to contribute. But I also didn't feel as though I had any real ownership. Like, somehow there was a path through the Python community that led to being able to fix a 2-line bug without 25 people signing off on the change. I saw people walking the path, but I couldn't really find it or manage to walk it myself.

In Contrast

But compare that experience to your own project, or even to working in almost any tech-based organization. I was fortunate: in the 21 months I put in at YouTube / Google from 2008-2010, I was made an "oldtuber", and later "owner" of several files early on, as I hit readability with Python quickly and showed that I could actually fix problems with the platform. Within 3 months of joining in 2008, I could write code, ensure it passed the automated tests, commit without *any* code reviews in a half dozen files, and have code running on the site within a week of writing it (pushes for hotfixes could happen randomly, sometimes 3-4 times a week back then).

Compare that with any library in Python? I can't even remember how many folks needed to sign off on my asyncore / asynchat changes, but it was far more than the 0/1/2 I needed to commit at YouTube. Python has more folks weighing in now than ever before, and I bet that's not making contributing easier.

I don't know how much this feeling pervades other current or past Python contributors, nor how much it is a barrier to creating new contributors. But it was a big pain point for me.

I don't know if Amber's frustration stems from a similar comparison of her experience in Twisted vs. Python core: in Twisted she found it easy to make an impact and help the project, but Python core seems uninterested in, or unwilling to accept, similar contributions. I don't know (a lot of things).

Regardless, I don't know if splitting the Python standard library out makes sense. On the one hand, pulling out the standard library would give some people more control and nearly absolute ownership. But that will mostly just create inconsistencies in code quality, maintenance, and interoperability - like we've all seen in our own dependency hells elsewhere. All the while making Python a language without batteries.

Address the Leaky Batteries

Personally, even if the batteries are leaking, I'd rather have them included. Python compilation issues aside (you can disable building extension modules like tkinter, FWIW), that's as much a "work on Python's Makefiles" problem as it is anything else. Which is a much more approachable and appropriate task than getting the organization to create dozens of sub-projects.

I do agree with Amber that asyncio's inclusion in Python makes things hard for the community, especially given how long Twisted has been around doing much of the same stuff. Heck, I even suggested that if Python were to deprecate and remove modules from the standard library because their functionality was duplicated in Twisted, then the answer was to dump those modules and include Twisted [2] (even though I've actually never used Twisted, and have more recently used asyncore/asynchat, asyncio, pyuv, uvloop, and shrapnel since I read the Twisted source). But it's going to be hard to put that genie back in the bottle; the best-case scenario is that no more "big" packages are included that quickly.

Ultimately, I believe the answer might just be better organization, better management of maintainers, better expectation management for prospective contributors, and better management of contributions. Giving people more ownership of the modules / packages they maintain, while also allowing the community to override a bad maintainer, might help things move forward in each individual module. If everything splits, many organization-level efficiencies are lost, including the simple benefit that being part of a larger project gets you more attention at all levels.

Personally, I'd suggest sticking together and working on trying to manage the project itself better. For some subjective definition of "better".

A Mistake 20 Years in the Making

That said, until visiting [3] while writing this entry, I didn't even know there was a formal process of being involved in the Python Software Foundation and moving through the organization. I know I spent 10-20 hours a week trying to help back when I was active on the mailing lists. No wonder I failed to find a path to success in Python core for the 20 years I've used Python; I didn't know you literally needed to be a part of the club. Hah.

C'est la vie.
[1] -
[2] -
[3] -

Wednesday, January 23, 2019

My last 6 months - making StructD worth the time

Those of you who have been paying attention to this blog might be asking: where is Josiah? Well, I've been busy. Not only did I get to go on a little (much needed) vacation, but I've also been making StructD (my fork of Redis) better. How? Watch inline below, or if that doesn't work:

The short version for those who didn't / can't watch is:
  • Worst-case memory use during background snapshots is reduced by 99.8%+
    • Old: a 100 Gig Redis could use 200 Gigs of memory during a snapshot, i.e. +100 Gigs
    • New: a 100 Gig instance could use 100.2 Gigs worst-case during a snapshot, really +198 Megs, a 99.8% reduction
  • Snapshots load 20-75% faster
  • Snapshots created 25-65% faster
  • Cold replicas start 50-70% faster 
Other things not covered in the video:
  • More Lua environment hardening
  • Better Lua tracebacks
  • Temporarily removed the Lua debugger (transactions mean we can do this another way)
  • Improved Lua transaction performance
  • Lua Aliases:
    • SCRIPT LOAD <script> [<alias name> [NX|REPLACE]]
    • SCRIPT ALIAS <sha1> [<alias name> [NX|REPLACE]]
    • SCRIPT UNALIAS <alias name> ...
    • EVALSHA <alias name> <keycount> <keys ...> <argv ...>
    • EVAL "return CALL.alias_name({key1, ...}, {argv1, ...})" ...
  • Lua and other operations are better behaved with respect to one another (more things can happen during Lua scripts)
  • crc16 - 6x faster than Redis, 33% faster than Mark Adler / Matt Stancliff
  • crc64 - 6-8x faster than Redis, 33-66% faster than Mark Adler / Matt Stancliff
  • TXN.MULTI to go with TXN.EVAL / TXN.ESHA, for rolling back data changes on command execution error
Combined with our earlier SSL/TLS additions, Lua transactions, etc., we've got faster and lower-memory snapshot operations, more convenience in your Lua scripts, and performance comparable to or better than open source Redis with just about any SSL/TLS unwrapping daemon.
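To make the alias workflow above concrete, here is a sketch of how the listed commands might be driven from redis-py's generic execute_command interface. The command syntax is copied from the list above; it is StructD-specific (not open source Redis), and the helper names here are my own invention, so treat this as illustrative only:

```python
# Build argument tuples for the StructD alias commands listed above.
# Pass each tuple to a redis-py connection as conn.execute_command(*cmd).

def script_load_with_alias(script, alias, mode="NX"):
    # SCRIPT LOAD <script> [<alias name> [NX|REPLACE]]
    return ("SCRIPT", "LOAD", script, alias, mode)

def evalsha_by_alias(alias, keys, argv):
    # EVALSHA <alias name> <keycount> <keys ...> <argv ...>
    return ("EVALSHA", alias, len(keys)) + tuple(keys) + tuple(argv)

load_cmd = script_load_with_alias("return redis.call('GET', KEYS[1])", "getter")
run_cmd = evalsha_by_alias("getter", ["some-key"], [])
```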

I've also been shopping around talks surrounding the engineering choices leading to the performance improvements and memory savings currently enjoyed by StructD. From these choices, there are clear incremental steps towards a further 50-80% reduction in snapshot creation / load time, plus a general 50-80% reduction in memory use overall. I will get there, just not tomorrow.

As announced several months ago, I've been working towards starting a hosting company. Still moving in that direction, but until that is quite finished, I've decided the best way to get there in the interim is to open up shop for both unpaid public, and paid private Redis and StructD support.

So, if you'd like to get Redis or StructD support from me, download StructD, get StructD hosting in the future, or just feel like emailing me on a mailing list again, please feel free to drop by to learn more.

     StructD 5.0.0 (8d6715c9/1) 64 bit   Mode: standalone   _______
 _______          Build: 20190104.0019   Port: 6380        |   __  \ _
/  _____)     _    PID: 2670   _     |  |  |  | \_
| /\____\) __| |\_   _ _____   _     _    ______  __| |\_  |  |  |  |:| \_
| \/___   (__   __) | '_____) | |\  | |\ /  ____)(__   __) |  |  |  | |:| \
\_____ \-_ \_| |\_\)| / \___\)| | | | | || /\___\)\_| |\_\)|  |__|  | | |:|
 \____\ | \  | | |  | | /     | | | | | || |/       | | |  |_______/ \| | |
 _____/ | |  | | |  | | |     | \_|_/ | || \|___    | | |   \________/ \| |
(______/ \|  |_| |  |_| |     \_____,_/ |\______)   |_| |     \________/ \|
 \_____\_/    \_\|   \_\|      \_____,\/  \_____\)   \_\|       \________/

Tuesday, May 1, 2018

Forking Redis Followup

This past Thursday, I announced the forking of Redis for SSL + Transactions + performance improvements, as well as the release of an updated redis-benchmark client to benchmark your Redis + SSL system. You can read all of it here.

What I didn't know is that 7 days earlier an engineer at AWS named Madelyn Olson had submitted a PR to provide SSL support to Redis. I missed that PR, and the earlier RCP, in my searches. Later Thursday evening, Salvatore stated that he'd be accepting the PR. Having since read the PR, I like her code better than mine, and she solves some SSL buffering and diskless-replication problems better than the solutions I had in mind. Benchmarking her changes shows 5-10% better performance than mine in the few tests I've run, and reading the code tells me that she recognized better integration points and built a better SSL/TLS integration with fewer memory copies.

So I'm definitely going to be using Madelyn's SSL changes instead of mine, even if I did spend 9 weeks on my changes. Better code is better code. Kudos Madelyn. :)

Salvatore doesn't seem interested in my benchmark changes for more/better metrics and plotting, based on his most recent email. You can always compile from my sources, and I'll try to make it an easy patch to cherry-pick if you just want that one thing, after Madelyn's changes have been merged.

I've also been asked to rename the project, which is not entirely surprising. That will happen in the next few weeks, with a replacement README that points you to the correct place at the original repository location.

For transactions and faster startup: given how many people read my original release notes (12k+), and how much money was donated / pledged ($0), I'm going to go out on a limb and guess that folks aren't all that interested in donating money, and are instead looking either for something completely free and open source, or for a commercial release to outsource their risk.

As such, instead of spending the next 3-6 weeks finishing and tuning the last of my SSL/TLS changes for that open source release (which no longer seem necessary, given Madelyn's PR), I'm going to spend a bit more time cleaning up my transactions work, get some benchmarks in, provide datasets and timing for my faster startup time work, and do all of the necessary project renames. I don't know when I'll be releasing more than just benchmark changes and/or renames, but I'm working on it 5-6 days a week.

I might do some short videos on the hilarity of someone else beating you to release software, which has happened 3 times in the last 6 months. We will see.

Again, if you're interested in Lua and other transactions with rollbacks, I've got you covered, just let me know. More is in the works for my fork than what I've announced, but I don't like talking unless I've got something done enough to demo. I hope you are solving your problems, and that my fork can help solve more.

Thursday, April 26, 2018

Yes, I'm Forking Redis - SSL/TLS + Transactions

Update2: The followup is here, for later.

If you haven't seen the video announcement on YouTube, and would like to watch/listen to the 5 minute version instead of reading the text version, here you go.


Done? Great! This is the text version of that notice. I am releasing a fork [1] of Redis and tools surrounding Redis to support my needs. This first release is mostly just tools, specifically adding SSL/TLS support to redis-benchmark and hiredis.

In the next couple weeks, I'll also be releasing a version of Redis server (the real fork) that speaks SSL/TLS on standard listening ports, with Cluster gossip, and the MIGRATE call (used both by Redis Cluster and users). I'm releasing the benchmark tools now because they've been baking for a few weeks, and only required a few Makefile cleanups to make me feel good about releasing them.

The only thing really delaying the full Redis Server release is that I haven't gotten around to redis-cli (because I had clients that already spoke Redis + TLS) or redis-sentinel, and both of those are necessary before I can release the full package (yes, I've already added support for SSL/TLS to the unit tests).

What really isn't delaying the release is that due to some algorithm tuning, and finding better implementations of a few core algorithms, I am seeing 3-5x faster startup times on nearly every dataset I've tested. From huge numbers of short keys, to object models and indexes. All are seeing 3-5x faster startup times to load the full dataset into memory. I will be posting benchmarks of this at the full server release.

In addition to SSL/TLS and the above mentioned performance improvements, this fork will also contain support for my resurrected Transactions in Redis Lua scripts, with expanded support for WATCH/MULTI/EXEC/ROLLBACK style transactions (you send your list of keys with WATCH, and if any operation is on any key not in the list, or if you get an error, your changes are all rolled back).
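The described semantics, snapshot the watched keys and roll everything back if the script errors or touches a key outside the watched list, can be modeled in a few lines of plain Python. This is a toy client-side illustration of the behavior, not how the fork implements it (the fork does this inside the server):

```python
# Toy model of WATCH-list transactions with rollback: writes to unwatched
# keys, or any raised error, restore the pre-transaction values.
class TxnStore:
    def __init__(self):
        self.data = {}

    def run(self, watched, script):
        # snapshot the watched keys before running the script
        snapshot = {k: self.data.get(k) for k in watched}
        allowed = set(watched)

        def setter(key, value):
            if key not in allowed:
                raise KeyError("write to unwatched key: " + key)
            self.data[key] = value

        try:
            script(setter)
        except Exception:
            # roll back: restore every watched key to its prior value
            for k, v in snapshot.items():
                if v is None:
                    self.data.pop(k, None)
                else:
                    self.data[k] = v
            return False
        return True

store = TxnStore()
ok = store.run(["a"], lambda set_: set_("a", 1))                   # commits
bad = store.run(["a"], lambda set_: (set_("a", 2), set_("b", 3)))  # rolled back
```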

For those of you who are living in Redis Cluster land, and need transactions on keys that don't live in the same shard/server, I previewed multi-shard Lua script transactions back in February. If you are a commercial user and would like to use any of this in your environment, I offer reasonable support and integration contract rates (and those cluster modifications may require separate licensing).

So, the big news is:

  • SSL/TLS native in Redis (benchmarks below)
    • redis-benchmark SSL/TLS support now
    • hiredis SSL/TLS support now
    • redis-cli soon (not started)
    • redis-sentinel soon (not started)
    • redis-server (server, gossip, migrate, done)
      • unit tests (done)
  • Transactions with rollbacks in Redis (also available in Redis Cluster)
    • Lua scripts rollback on error
    • Rollback on explicit rollback
  • 3-5x faster startup times
    • sample data and more benchmarks later


While not the first feature implemented, encryption over the wire (especially between master/slave replicas, cluster gossip, and during MIGRATE calls) is a basic need. And as a basic need, relying on third party tools for SSL/TLS termination or a transparent VPN solution is a great first step from running without encryption, but it can leave speed on the table. And part of the reason why we use Redis is for speed, right?

Redis itself spends much of its time waiting on network system calls, or to be interrupted to read data from a connection. Quoted benchmarks at conferences since 2015 have claimed that 97% of Redis' time is spent waiting on network-related system calls and interrupts. With 3rd party SSL/TLS termination, that can only get worse. How much worse?

An SSL/TLS terminator needs to read the request (wait, read), then do the decrypt operation, then send the data to Redis (write). Redis does its operations (wait, read, write), then the terminator gets to read from Redis (wait, read), encrypt, then send the response to the client (write). Notice how we went from 3 somewhat basic operations to 9? Those of you running SSL/TLS terminators for your Redis have probably already experienced latency or throughput hits without realizing it. Can we benchmark to see how much native SSL/TLS termination inside Redis buys us? Of course we can.

For this first set of benchmarks, we use 2 computers with the following specifications:

Name: o790
Model: Dell Optiplex 790 workstation
CPU: Core i3-2120 @ 3.30 GHz (2 cores, 2 threads per core, 4 total threads)
RAM: 16GB total (4x 4096 MB DDR3-1333 Synchronous DIMMs)
Media: 1 TB SSD
OS: Ubuntu 14.04
Kernel: 4.4.0-119-generic
SSL/TLS via: OpenSSL 1.1.0h
Compiler: GCC 4.8.4

Name: t7600
Model: Dell Precision T7600 workstation
CPU: 2x Xeon E5-2670 @ 2.60/3.30 GHz (2 CPUs, 8 cores per CPU, 2 threads per core, 32 total threads)
RAM: 128GB total (16x 8192 MB DDR3-1333 Registered DIMMs)
Media: 1 TB SSD
OS: Ubuntu 16.04
Kernel: 4.13.0-38-generic
SSL/TLS via: OpenSSL 1.1.0h
Compiler: GCC 5.4.0

The computers are connected to a Trendnet TEG-S82g 1 gigabit 8 port desktop switch, along with several other devices. Ifconfig reports 1500 byte frames. No attempt at isolating traffic on the network was made. Occasional blips in performance were observed due to logging in and out of the workstations to check progress, which is why each command is run 10 times, with averages and standard deviations in response rates plotted.

We consider the following testing variations:

Client test variations:
  • redis-benchmark no TLS at any layer  (Redis-4.0.9)
  • redis-benchmark with included TLS  (Redis-4.0.9, patches)
  • redis-benchmark with external Stunnel TLS (Redis-4.0.9, Stunnel-5.44)
  • redis-benchmark with external Hitch TLS (Redis-4.0.9, Hitch-1.4.8)
Redis server test variations:
  • redis-server with no TLS at any layer (Redis-4.0.9)
  • redis-server with included TLS (Redis-4.0.9, patches)
  • redis-server with external Stunnel TLS (Redis-4.0.9, Stunnel-5.44)
  • redis-server with external Hitch TLS (Redis-4.0.9, Hitch-1.4.8)
Benchmark command-line test variations:
  • -t SET,GET -n 5000000 -r 4750000
    • 5 million SET/GET operations over a 4.75m entry keyspace
  • -t SET,GET -n 25000000 -r 4750000 -P 5
    • 25 million SET/GET operations over a 4.75m entry keyspace, 5 operations at a time
We run each benchmark in each chosen variation 10 times, with averages and standard deviations plotted (the top of the range limited to the actual throughput seen). Some combinations of endpoints were not tested due to timing as of the publishing of this article [2], and others due to non-obvious runtime errors [3]. Data from each kept run is combined [4] and plotted using Matplotlib.

The x-axis on all plots is "seconds since the start of redis-benchmark". Operations-per-second numbers are the exact number of operations performed in the last 1000 milliseconds, sampled every 250 milliseconds (see the redis-benchmark source code for details). Colors are consistent across plots: red for 'null-null' represents no SSL/TLS termination on either end; blue for 'native-native' represents redis-benchmark with '--ssl' connecting directly to our patched Redis; orange for 'native-stunnel' means redis-benchmark with '--ssl' connecting to Stunnel SSL/TLS termination, feeding into an unpatched Redis 4.0.9 installation. Other lines have similar testing/configuration implications in their naming.
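For reference, the sampling scheme just described (the exact count of operations completed in the trailing 1000 milliseconds, sampled every 250 milliseconds) can be reproduced from raw per-operation completion timestamps. A sketch, not redis-benchmark's actual code:

```python
import bisect

def trailing_ops_per_sec(completion_times, sample_every=0.25, window=1.0):
    # completion_times: sorted seconds-since-start of each finished operation.
    # Returns (sample_time, ops completed in the preceding window) pairs.
    samples = []
    if not completion_times:
        return samples
    t = sample_every
    while t <= completion_times[-1]:
        lo = bisect.bisect_left(completion_times, t - window)
        hi = bisect.bisect_right(completion_times, t)
        samples.append((t, hi - lo))
        t += sample_every
    return samples

# 2000 evenly spaced ops over 2 seconds: ~1000 ops land in any 1-second window
times = [i / 1000.0 for i in range(1, 2001)]
samples = trailing_ops_per_sec(times)
```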


On our lower-powered i3 (2 cores, 4 threads), we are heavily CPU bound, so the least-overhead SSL/TLS termination option (native-native in blue; more information later) is fastest among our encrypted options.

Oddly enough, Stunnel into native here doesn't work well at all, which suggests some mismatch between buffer sizes, request sizes, and/or timing of network operations, as none of the other configurations suffer as badly. We do have latency profiles of all benchmarks performed, so we can look at those in the future for direction on better tuning if this is a common platform in the wild.

That said, once we are less constrained by CPU, it turns out that Stunnel on both ends is fastest here on a total operations/second performance metric. But at what cost? Here we restart the daemons between each run, then get the total CPU time used by each of Redis, Hitch, and Stunnel in each of our 4 variations of recipient on the t7600.

For the t7600 native client into a Hitch-terminated Redis, I saw Hitch use 1h 11m 30s of CPU time, with Redis using 45m 11s. Going into Stunnel-terminated Redis, I saw Stunnel use 4h 12m 44s, Redis using 49m 52s. When we use our native SSL/TLS patches for the same set of tests, Redis uses 49m 53s of CPU. Disabling all SSL/TLS, gets us 41m 11s used by Redis. So relatively speaking, we've got ~1h 57m Redis+Hitch vs. ~5h 3m Redis+Stunnel vs. 50m Redis+native vs 41m for no SSL/TLS termination. Or 185%, 639%, and 22% relative CPU overhead compared to no termination.
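Those relative-overhead numbers are just (total CPU with termination / total CPU without) - 1. Checking the arithmetic from the quoted times (the results land within a couple of points of the rounded 185% / 639% / 22% figures above):

```python
def seconds(h=0, m=0, s=0):
    return h * 3600 + m * 60 + s

baseline = seconds(m=41, s=11)                            # no SSL/TLS at all
hitch = seconds(h=1, m=11, s=30) + seconds(m=45, s=11)    # Hitch + Redis
stunnel = seconds(h=4, m=12, s=44) + seconds(m=49, s=52)  # Stunnel + Redis
native = seconds(m=49, s=53)                              # patched Redis only

def overhead_pct(total):
    # percent CPU overhead relative to running with no termination
    return 100.0 * (total - baseline) / baseline
```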

With our slow machine again as the server, native termination again comes out fastest in individual operations, likely for the performance reasons described before. We checked the CPU time again, this time on the o790 side of things.

For the t7600 native client into a Hitch-terminated Redis, I saw Hitch use 1h 33m 24s of CPU time, with Redis using 32m 11s. Going into Stunnel-terminated Redis, I saw Stunnel use 2h 23m 27s, Redis using 51m 45s. When we use our native SSL/TLS patches for the same set of tests, Redis uses 52m 32s of CPU. Disabling all SSL/TLS, gets us 33m 40s. So relatively speaking, we've got 2h 5m Redis+Hitch vs. 3h 18m Redis+Stunnel vs. 53m Redis+native vs 34m for no SSL/TLS termination. Or 268%, 482%, or 56% CPU overhead for SSL/TLS termination relative to no termination.

Higher overhead (relatively speaking) for Hitch and native isn't surprising here, and is likely due to the higher actual cost of the AES encrypt/decrypt performed during SSL/TLS operations on the o790 vs. the t7600. We can partly verify that by checking the speed of the operations on each machine (we can explicitly verify later with timing around encrypt/decrypt operations inside Redis):

o790 $ openssl speed -evp aes-256-gcm
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-gcm      64949.38k    71785.37k   261766.40k   286591.66k   293898.92k

t7600 $ openssl speed -evp aes-256-gcm
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-gcm     251445.05k   684496.60k  1000988.42k  1110611.29k  1141981.18k
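From the 8192-byte columns above, the t7600's AES-256-GCM throughput advantage over the o790 works out to roughly 3.9x, which is consistent with encryption being a proportionally heavier burden on the slower machine:

```python
# KB/s for aes-256-gcm at 8192-byte blocks, from the openssl speed runs above
o790_8k = 293898.92
t7600_8k = 1141981.18

ratio = t7600_8k / o790_8k  # roughly 3.9x
```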

And again, native termination has the highest throughput.


So, looking at clear wins here, if you've got an under-powered server, going with native SSL/TLS termination inside Redis is likely to be faster for you, and substantially so, depending on your existing SSL/TLS termination method. If you're looking to save CPU time, native SSL/TLS termination is also a win, as it used less CPU time compared to other methods tested, as expected.

I would have liked to test Nginx as a general SSL/TLS unwrapper, but general unwrapping requires a commercial license that I don't have yet. Perhaps someone with a license can run a couple comparisons on their hardware and report back.

How to help

If you'd like to support this work and continuing work, you can do a few things.



[1] This open source fork will be updated and maintained, likely tied to the most recent stable release for the short term, expanding as needs dictate.
[2] We did not try connecting Stunnel <-> Hitch, nor have we tried cross-machine Hitch to Hitch. We may update the graphs and this note, along with any other applicable notes, as this article and benchmarks are updated.
[3] As of this writing, I was seeing a runtime error for --client configured Hitch servers on the t7600, so could not use the t7600 to start a Hitch tunnel.
[4] Benchmark and analysis tools used are available in our forked repository, where you can see the numeric details of our sample averaging, as well as the raw data we produced in our runs.
[5] Repository:

Sunday, May 15, 2016

rom Indexes and Search

So... some updates on rom.

At the end of January, I posted a teaser picture about "Interleaved Indexes" in rom/Redis via tweet. If you haven't seen the picture, it's here:

Interleaved index?

I ended up building an interleaved index in Redis using a bit of Lua scripting and Python. No ZRANGEBYLEX, surprisingly. What is an interleaved index? It's my name for an indexing mode that offers the same ordering and filtering options as several multi-dimensional indexes stored as tree structures; some KD-trees, Redshift interleaved sort keys, BSP trees, a particular implementation of crit-bit trees, and several others offer the functionality I was after.


The simple reason I was implementing an interleaved index was that some intersection operations on data in Redis seemed potentially less efficient than a proper multi-dimensional index would or could be. Long story short: it worked, but not without issues. I mentioned some of these issues in a series of tweets (1, 2, and 3), semi-abandoned the code in late February, and am now ultimately not releasing it. Why? Because it doesn't actually add anything. It was complex, borrowed about 750 lines of code I wrote 5 1/2 years ago, and ... no.

A better option

There were a few very simple wins that I knew could be made in the query optimizer, including a fix on my side for a bug when calling TYPE from within a Lua script (which returns a table instead of a string). The ultimate result of that work is that some queries on numeric ranges can, in theory, be hundreds or thousands of times faster on large indexes. That is partly due to starting with the correct set/sorted set, but also due to implementing a direct scan of an index instead of an intersect/union followed by deleting entries outside the requested ranges.
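Abstractly, the difference is between filtering a fully materialized result and seeking straight to the range boundaries. A plain-Python illustration over a sorted list standing in for a Redis sorted set (hypothetical data; this is not rom's actual code):

```python
import bisect

scores = list(range(100000))  # sorted "index" of numeric scores

def filter_after_materialize(members, lo, hi):
    # old style: walk everything, delete entries outside the range
    return [m for m in members if lo <= m <= hi]

def direct_scan(members, lo, hi):
    # new style: binary-search the range boundaries, touch only the slice
    i = bisect.bisect_left(members, lo)
    j = bisect.bisect_right(members, hi)
    return members[i:j]
```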

Sometimes being more specific for optimizations is worth it. Definitely is in this case. For one of my use-cases involving search, I'm seeing 10-20x faster queries in practice, and 150x faster in a simple canned test.

I also removed the non-Lua writing-mode code. Sorry for those of you living in pre-2.6 days, but you'll have to upgrade. Hell, even with Lua scripting turned off, the query optimizer still used Lua, so if this worked on Redis 2.4 recently, I'd be surprised.

So that's what's going on right now.

Rebuild your indexes

Yeah. And rebuild your indexes. I'm sorry. Whenever I'm using rom as a cache or index of some kind, I re-cache and re-index daily, so things like this eventually resolve themselves, especially immediately after a library upgrade. It's not a new idea: Google did it with Bigtable for cleanup, and Postgres does auto-vacuum. Call this a manual vacuum via re-indexing.

Once/day or however often, maybe during off hours:

# import all of your models first
# then...
from rom import columns, util
for model in columns.MODELS.values():
    # refresh_indices() is rom's util helper for re-indexing a model in blocks
    util.refresh_indices(model)

That will rebuild all of the indexes on all of your models.

Almost P.S. - Loadable modules in Redis?

Redisconf 2016 happened last week and loadable modules were announced. I think that for people who host their own Redis, it could be awesome. Think of it like an answer to Postgres plugins. Hope I can pay someone to run my loadable modules, if I ever get around to building any :P

Wednesday, July 29, 2015

Transactions in Redis

Over the last few months, I've been thinking about and implementing transactions for Lua scripting in Redis. Not everyone understands why I'm doing this, so let me explain with a bit of history.

MySQL and Postgres

In 1998-2003 if you wanted to start a serious database driven web site/service and didn't have money to pay Microsoft or Oracle for their databases, you picked either MySQL or Postgres. A lot of people picked MySQL because it was faster, and much of that was due to the MyISAM storage engine that traded performance for a lack of transaction capability - speed is speed. Some people went with Postgres because despite its measurably slower performance on the same hardware, you could rely on Postgres to not lose your data (to be fair, the data loss with MySQL was relatively rare, but data loss is never fun).

A lot of time has passed since then; MySQL moved on from MyISAM as the default storage engine to InnoDB (which has been available for a long time now), gained full transaction support in the storage engine, and more. At the same time, Postgres got faster, and added a continually expanding list of features to distinguish itself in the marketplace. And now the choice of whether to use MySQL or Postgres usually boils down to experience and preference, though occasionally business or regulatory needs dictate other choices.

TL;DR: data integrity

In a lot of ways, Redis up to now is a lot like MySQL was back before InnoDB was an option. There is already a reasonable best-effort to ensure data integrity (replication, AOF, etc.), and the introduction of Lua scripting in Redis 2.6 has helped Redis grow up considerably in its capabilities and the overall simplification of writing software that uses Redis.

Comparatively, Lua scripting operates very much like stored procedures in other databases, but script execution itself has a few caveats. The most important caveat for this post is that once a Lua script has written to the database, it will execute until any one of the following occurs:
  1. The script exits naturally after finishing its work, all writes have been applied
  2. The script hits an error and exits in the middle, all writes that were done up to the error have occurred, but no more writes will be done from the script
  3. Redis is shut down without saving via SHUTDOWN NOSAVE
  4. You attach a debugger and "fix" your script to get it to do #1 or #2 (or some other heroic deed that allows you to not lose data)
To anyone who is writing software against a database, I would expect that you agree that only case #1 in that list is desirable. Cases #2, #3, and #4 are situations where you can end up with data corruption (cases #2 and #4) and/or data loss (cases #3 and #4). If you care about your data, you should be doing just about anything possible to prevent data corruption and loss. This is not philosophy, this is doing your job. Unfortunately, current Redis doesn't offer a lot of help here. I want to change that.

Transactions in Lua

I am seeking to eliminate cases #2, #3, and #4 above, replacing the entire list with:
  1. The script exits naturally after finishing its work, all writes have been applied
  2. The script exits with an error, no changes have been made (all writes were rolled back)
No data loss. Either everything is written, or nothing is written. This should be the expectation of any database, and I intend to add it to the expectations that we all have about Redis.

The current pull request is a proof of concept. It does what it says it does, removing the need to lose data as long as you either a) explicitly run your scripts using the transactional variants, or b) force all Lua script calls to have transactional semantics with a configuration option.

There are many ways the current patch can be made substantially better, and I hope for help from Salvatore (the author of Redis) and the rest of the community.

Wednesday, November 26, 2014

Introduction to rate limiting with Redis [Part 2]

This article first appeared on November 3, 2014 over on Binpress at this link. I am reposting it here so my readers can find it easily.

In Introduction to rate limiting with Redis [Part 1], I described some motivations for rate limiting, as well as provided some Python and Lua code for offering basic and intermediate rate limiting functionality. If you haven’t already read it, you should, because I’m going to discuss several points from the article. In this post, I will talk about and address some problems with the previous methods, while also introducing sliding window functionality and variable-cost requests.

Problems with previous methods

The last rate limiting function that we wrote was over_limit_multi_lua(), which used server-side Lua scripting in Redis to do the heavy lifting of actually performing the rate limiting calculations. It is included below with the Python wrapper as a reference.

import json
import time

def over_limit_multi_lua(conn, limits=[(1, 10), (60, 120), (3600, 240)]):
    if not hasattr(conn, 'over_limit_lua'):
        conn.over_limit_lua = conn.register_script(over_limit_multi_lua_)

    # get_identifiers() is defined in part 1
    return conn.over_limit_lua(
        keys=get_identifiers(), args=[json.dumps(limits), time.time()])

over_limit_multi_lua_ = '''
local limits = cjson.decode(ARGV[1])
local now = tonumber(ARGV[2])
for i, limit in ipairs(limits) do
    local duration = limit[1]

    local bucket = ':' .. duration .. ':' .. math.floor(now / duration)
    for j, id in ipairs(KEYS) do
        local key = id .. bucket

        local count ='INCR', key)'EXPIRE', key, duration)
        if tonumber(count) > limit[2] then
            return 1
        end
    end
end
return 0
'''

Hidden inside this code are several problems that can limit its usefulness and correctness when used for its intended purpose. These problems and their solutions are listed below.

Generating keys in the script

One of the first problems you might notice was mentioned by a commenter named Tobias on the previous post: we are constructing keys inside the Lua script. If you’ve read the Redis documentation about Lua scripting, you know that all keys to be used in a script are supposed to be passed in from outside when calling it.

The requirement to pass keys into the script is how Redis attempts to future-proof Lua scripts, as Redis Cluster (currently in beta) distributes keys across multiple servers. When the keys are known in advance, the client can calculate which Redis Cluster server the script should run on, and Redis can detect when the keys live on multiple Cluster servers, in which case the script can’t run properly.

Our first problem is that generating keys inside the script can make the script violate Redis Cluster assumptions, which makes it incompatible with Redis Cluster, and generally makes it incompatible with most key-based sharding techniques for Redis.

To address this issue for Redis Cluster and other client-sharded Redis setups, we must use a method that handles rate limiting with a single key. Unfortunately, this can prevent atomic execution for multiple identifiers for Redis Cluster, but you can either rely on a single identifier (user id OR IP address, instead of both), or stick with non-clustered and non-sharded Redis in those cases.

What we count matters

Looking at our function definition, we can see that our default limits were 10 requests per second, 120 requests per minute, and 240 requests per hour. If you remember from the “Counting correctly” section, in order for our rate limiter to complete successfully, we needed to only increment one counter at a time, and we needed to stop counting if that counter went over the limit.

But if we were to reverse the order that the limits were defined, resulting in us checking our per-hour, then per-minute, then per-second limits (instead of per-second, minute, then hour), we would have our original counting problem all over again. Unfortunately, due to details too involved to explain here, just sorting by bucket size (smallest to largest) doesn’t actually solve the problem, and even the original order could result in requests failing that should have succeeded. Ultimately our problem is that we are counting all requests, both successful and unsuccessful (those that were prevented due to being over the limit).

To address the issue with what we count, we must perform two passes while rate limiting. Our first pass checks to see if the request would succeed (cleaning out old data as necessary), and the second pass increments the counters. In previous rate limiters, we were basically counting requests (successful and unsuccessful). With this new version, we are going to only count successful requests.
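To see why this matters, here is a small Python simulation (not Redis code; the function and counter names are mine, for illustration) of ten requests arriving within the same second against a pair of limits, 2 per second and 5 per hour:

```python
def attempt(counters, limits, count_failures):
    """One request checked against multiple (name, limit) counters."""
    if count_failures:
        # naive: increment every counter first, then check
        over = False
        for name, limit in limits:
            counters[name] = counters.get(name, 0) + 1
            if counters[name] > limit:
                over = True
        return not over
    # two passes: check everything first, increment only on success
    for name, limit in limits:
        if counters.get(name, 0) + 1 > limit:
            return False
    for name, _ in limits:
        counters[name] = counters.get(name, 0) + 1
    return True

limits = [('second', 2), ('hour', 5)]

naive = {}
for _ in range(10):        # 10 attempts within the same second
    attempt(naive, limits, count_failures=True)
# naive['hour'] is now 10: the hourly budget of 5 was eaten by failed requests

careful = {}
for _ in range(10):
    attempt(careful, limits, count_failures=False)
# careful['hour'] is 2: only the 2 successful requests were counted
```

With the naive version, a burst of rejected requests can lock the client out of the larger window entirely; the two-pass version only charges the client for requests that actually went through.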

Stampeding elephants

One of the most consistent behaviors that can be seen among APIs or services that have been built with rate limiting in mind is that usually request counts get reset at the beginning of the rate limiter’s largest (and sometimes only) time slice. In our example, at every hour on the hour, every counter that had been incremented is reset.

One common result for APIs with these types of limits and limit resets is what’s sometimes referred to as the “stampeding elephants” problem. Because every user has their counts reset at the same time, when an API offers access to in-demand data, many requests will occur almost immediately after limits are reset. Similarly, if the user knows that they have outstanding requests that they can make near the end of a time slice, they will make those requests in order to “use up” their request credit that they would otherwise lose.

We partially addressed this issue by introducing multiple bucket sizes for our counters, specifically our per-second and per-minute buckets. But to fully address the issue, we need to implement a sliding-window rate limiter, where the counts for requests that come in at 6:01PM and 6:59PM aren’t reset until roughly an hour later at 7:01PM and 7:59PM, respectively, not at 7:00PM. We’ll go into further detail about sliding windows a little later.

Bonus feature: variable-cost requests

Because we are checking our limits before incrementing our counts, we can actually allow for variable-cost requests. The change to our algorithm will be minor, adding an increment for a variable weight instead of 1.

Sliding Windows

The biggest change is converting our rate limiter from individual buckets into sliding windows. One way of understanding sliding window rate limiting is that each user is given a number of tokens that can be used over a period of time. When you run out of tokens, you don't get to make any more requests. And when a token is used, that token is restored (and can be used again) after the time period has elapsed.

As an example, if you have 240 tokens that can be used in an hour, and you used 20 tokens at 6:05PM, you would only be able to make up to another 220 requests until 7:04PM. At 7:05PM, you would get those 20 tokens back (and if you made any other requests between 6:06PM and 7:05PM, those tokens would be restored later).

With our earlier rate limiting, we basically incremented counters, set an expiration time, and compared our counters to our limits. With sliding window rate limiting, incrementing a counter isn’t enough; we must also keep history about requests that came in so that we can properly restore request tokens.

One way of keeping a history, which is the method that we will use, is to imagine the whole window as being one large bucket with a single count (the window has a ‘duration’), similar to what we had before, with a bunch of smaller buckets inside it, each of which has their own individual counts. As an example, if we have a 1-hour window, we could use smaller buckets of 1 minute, 5 minutes, or even 15 minutes, depending on how precise we wanted to be, and how much memory and time we wanted to dedicate (more smaller buckets = more memory + more cleanup work). We will call the sizes of the smaller buckets their “precision.” You should notice that when duration is the same as precision, we have regular rate limits. You can see a picture of various precision buckets in a 1 hour window below.

As before, we can consider the smaller buckets to be labeled with individual times, say 6:00PM, 6:01PM, 6:02PM, etc. But as the current time becomes 7:00PM, what we want to do is to reset the count on the 6:00PM bucket to 0, adjust the whole window’s count, and re-label the bucket to 7:00PM. We would do the same thing to the 6:01PM bucket at 7:01PM, etc.
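The bucket labeling and trimming can be sketched with the same arithmetic the Lua script later in this post will use; this little helper is for illustration only, not part of the actual script:

```python
import math

def window_blocks(now, duration, precision):
    """Which sub-bucket does `now` fall into, and which bucket labels
    are old enough to have fallen outside the window?"""
    precision = min(precision, duration)      # precision can't exceed duration
    blocks = math.ceil(duration / precision)  # sub-buckets per window
    block_id = math.floor(now / precision)    # label of the current bucket
    trim_before = block_id - blocks + 1       # older labels should be trimmed
    return blocks, block_id, trim_before

# A 1-hour window with 1-minute precision, at now = 7200 seconds:
blocks, block_id, trim_before = window_blocks(7200, 3600, 60)
# blocks == 60, block_id == 120, trim_before == 61:
# buckets labeled 60 and earlier fall entirely outside the current window
```

When a request comes in, every bucket labeled below `trim_before` gets its count removed from the window total and is deleted, which is exactly the "reset and re-label" behavior described above.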

Data representation

We’ve now gotten to the point where we need to start talking about data representation. We didn’t really worry about representation before simply because we were storing a handful of counters per identifier. But now, we are no longer just storing 1 count for a 1 hour time slice, we could store 60 counts for a 1 hour time slice (or more if you wanted more precision), plus a timestamp that represents our oldest mini-bucket label.

For a simpler version of sliding windows, I had previously used a Redis LIST to represent the whole window, with each item in the LIST including both a time label, as well as the count for the smaller buckets. This can work for limited sliding windows, but restricts our flexibility when we want to use multiple rate limits (Redis LISTs have slow random access speeds).

Instead, we will use a Redis HASH as a miniature keyspace, which will store all count information related to rate limits for an identifier in a single HASH. Generally, for a sliding window of a specified duration and precision for an identifier, we will have the HASH stored at the key named by the identifier, with contents of the form:

<duration>:<precision>:o --> <timestamp of oldest entry>
<duration>:<precision>: --> <count of successful requests in this window>
<duration>:<precision>:<ts> --> <count of successful requests in this bucket>

For sliding windows where more than one sub-bucket has had successful requests, there can be multiple <duration>:<precision>:<ts> entries that would each represent one of the smaller buckets. For regular rate limits (not sliding window), the in-Redis schema is the same, though there will be at most one <duration>:<precision>:<ts> key, and duration is equal to precision for regular rate limits (as we mentioned before).

Because of the way we named the keys in our HASH, a single HASH can contain an arbitrary number of rate limits, both regular and windowed, without colliding with one another.
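As a sketch of the naming scheme (this helper is illustrative, written by me for this explanation; it is not part of the actual script):

```python
def field_names(duration, precision, block_id):
    """Build the HASH field names used by the schema above
    for one (duration, precision) rate limit."""
    precision = min(precision, duration)
    count_key = '%s:%s:' % (duration, precision)  # shared field-name prefix
    return {
        'oldest_ts': count_key + 'o',         # <duration>:<precision>:o
        'window': count_key,                  # <duration>:<precision>:
        'bucket': count_key + str(block_id),  # <duration>:<precision>:<ts>
    }

# A 1-hour / 1-minute sliding window, current bucket label 120:
fields = field_names(3600, 60, 120)
# {'oldest_ts': '3600:60:o', 'window': '3600:60:', 'bucket': '3600:60:120'}
```

Because each limit gets its own `<duration>:<precision>:` prefix, a 10/second limit and a 240/hour limit can live side by side in the same HASH without ever colliding.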

Putting it all together

And finally, we are at the fun part; actually putting all of these ideas together. First off, we are going to use a specification for our rate limits to simultaneously support regular and sliding window rate limits, which looks a lot like our old specification.

One limit is: [duration, limit, precision], with precision being optional. If you omit the precision option, you get regular rate limits (same reset semantics as before). If you include the precision option, then you get sliding window rate limits. To pass one or more rate limits to the Lua script, we just wrap the series of individual limits in a list: [[duration 1, limit 1], [duration 2, limit 2, precision 2], ...], then encode it as JSON and pass it to the script.
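Concretely, encoding the default limits we'll use below might look like this (a small sketch; the variable names are mine):

```python
import json

# Mixed regular and sliding-window limits, per the specification above:
# 10/second (regular), 120/minute (regular),
# 240/hour as a sliding window with 1-minute precision.
limits = [[1, 10], [60, 120], [3600, 240, 60]]
encoded = json.dumps(limits)
# encoded == '[[1, 10], [60, 120], [3600, 240, 60]]'
# ... which is passed as ARGV[1]; inside the script,
# cjson.decode(ARGV[1]) turns it back into a Lua table of tables
```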

Inside the script we need to make two passes over our limits and data. Our first pass cleans up old data while checking whether this request would put the user over their limit, the second pass increments all of the bucket counters to represent that the request was allowed.

To explain the implementation details, I will be including blocks of Lua that can be logically considered together, describing generally what each section does after. Our first block of Lua script will include argument decoding, and cleaning up regular rate limits:

local limits = cjson.decode(ARGV[1])
local now = tonumber(ARGV[2])
local weight = tonumber(ARGV[3] or '1')
local longest_duration = limits[1][1] or 0
local saved_keys = {}
-- handle cleanup and limit checks
for i, limit in ipairs(limits) do

    local duration = limit[1]
    longest_duration = math.max(longest_duration, duration)
    local precision = limit[3] or duration
    precision = math.min(precision, duration)
    local blocks = math.ceil(duration / precision)
    local saved = {}
    table.insert(saved_keys, saved)
    saved.block_id = math.floor(now / precision)
    saved.trim_before = saved.block_id - blocks + 1
    saved.count_key = duration .. ':' .. precision .. ':'
    saved.ts_key = saved.count_key .. 'o'
    for j, key in ipairs(KEYS) do

        local old_ts ='HGET', key, saved.ts_key)
        old_ts = old_ts and tonumber(old_ts) or saved.trim_before
        if old_ts > now then
            -- don't write in the past
            return 1
        end

        -- discover what needs to be cleaned up
        local decr = 0
        local dele = {}
        local trim = math.min(saved.trim_before, old_ts + blocks)
        for old_block = old_ts, trim - 1 do
            local bkey = saved.count_key .. old_block
            local bcount ='HGET', key, bkey)
            if bcount then
                decr = decr + tonumber(bcount)
                table.insert(dele, bkey)
            end
        end

        -- handle cleanup
        local cur
        if #dele > 0 then
  'HDEL', key, unpack(dele))
            cur ='HINCRBY', key, saved.count_key, -decr)
        else
            cur ='HGET', key, saved.count_key)
        end

        -- check our limits
        if tonumber(cur or '0') + weight > limit[2] then
            return 1
        end
    end
end
Going section by section through the code visually, where a blank line distinguishes individual sections, we can see 6 sections in the above code:
  1. Argument decoding, and starting the for loop that iterates over all rate limits
  2. Prepare our local variables, prepare and save our hash keys, then start iterating over the provided user identifiers (yes, we still support multiple identifiers for non-clustered cases, but you should only pass one identifier for Redis Cluster)
  3. Make sure that we aren’t writing data in the past
  4. Find those sub-buckets that need to be cleaned up
  5. Handle sub-bucket cleanup and window count updating
  6. Finally check the limit, returning 1 if the limit would have been exceeded
Our second and last block of Lua operates under the precondition that the request should succeed correctly, so we only need to increment a few counters and set a few timestamps:

-- there are enough resources, update the counts
for i, limit in ipairs(limits) do
    local saved = saved_keys[i]

    for j, key in ipairs(KEYS) do
        -- update the current timestamp, count, and bucket count'HSET', key, saved.ts_key, saved.trim_before)'HINCRBY', key, saved.count_key, weight)'HINCRBY', key, saved.count_key .. saved.block_id, weight)
    end
end

-- We calculated the longest-duration limit so we can EXPIRE
-- the whole HASH for quick and easy idle-time cleanup :)
if longest_duration > 0 then
    for _, key in ipairs(KEYS) do'EXPIRE', key, longest_duration)
    end
end

return 0

Going section by section one last time gets us:
  1. Start iterating over the limits and grab our saved hash keys
  2. Set the oldest data timestamp, and update both the window and bucket counts for all identifiers passed
  3. To ensure that our data is automatically cleaned up if requests stop coming in, set an EXPIRE time on the keys where our hash(es) are stored
  4. Return 0, signifying that the user is not over the limit

Optional fix: use Redis time

As part of our process for checking limits, we fetch the current unix timestamp in seconds. We use this timestamp to determine the sliding window start and end times, as well as which sub-bucket to update. If clients are running on servers with reasonably correct clocks (within 1 second of each other at least, within 1 second of the true time optimally), then there isn’t much to worry about. But if your clients are running on servers with drastically different system clocks, or on systems where you can’t necessarily fix the system clock, you need to use a more consistent clock.

While we can’t always be certain that the system clock on our Redis server is necessarily correct (just like we can’t for our other clients), if every client uses the time returned by the TIME command from the same Redis server, then we can be reasonably assured that clients will have fairly consistent behavior, limited to the latency of a Redis round trip with command execution.

As part of our function definition, we will offer the option to use the result of the TIME command instead of system time. This will result in one additional round trip between the client and Redis to fetch the time before passing it to the Lua script.
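For reference, the TIME command replies with two integers, seconds and microseconds, and combining them is simple if you ever want sub-second precision (the values below are samples for illustration, not a live call):

```python
# Sample reply values, e.g. what conn.time() might return as a tuple
seconds, microseconds = 1417000000, 250000
now = seconds + microseconds / 1e6   # combine into a float timestamp
# now == 1417000000.25
```

The wrapper below keeps only the whole-second part, which is all the bucket arithmetic in the script needs.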

Add in our Python wrapper, which handles the optional Redis time and request weight parameters, and we are done:

def over_limit_sliding_window(conn, weight=1, limits=[(1, 10), (60, 120), (3600, 240, 60)], redis_time=False):
    if not hasattr(conn, 'over_limit_sliding_window_lua'):
        conn.over_limit_sliding_window_lua = conn.register_script(over_limit_sliding_window_lua_)

    now = conn.time()[0] if redis_time else time.time()
    return conn.over_limit_sliding_window_lua(
        keys=get_identifiers(), args=[json.dumps(limits), now, weight])

If you would like to see all of the rate limit functions and code in one place, including the over_limit_sliding_window() Lua script with wrapper, you can visit this GitHub gist.

Wrap up and conclusion

Congratulations on getting this far! I know, it was a slog through problems and solutions, followed by a lot of code. Now that you've seen all of it, let me tell you what I hope you've learned.

Obviously, the first thing you should get out of this article is an implementation of sliding window rate limiting in Python, which is trivially ported to other languages -- all you need to do is handle the wrapper. Just be careful when sending timestamps, durations, and precision values to the script, as the EXPIRE call at the end expects all timestamp values to be in seconds, but some languages natively return timestamps as milliseconds instead of seconds.

You should also have learned that performing rate limiting with Redis can range from trivial (see our first example in part 1) to surprisingly complex, depending on the features required, and how technically correct you want your rate limiting to be. It also turns out that the problems that were outlined at the beginning of this article aren’t necessarily deal-breakers for many users, and I have seen many implementations similar to the over_limit_multi_lua() method from part 1 that are perfectly fine for even heavy users*. Really it just means that you have a choice about how you want to rate limit.

And finally, you may also have learned that you can use Redis hashes as miniature keyspaces to collect data together. This can be used for rate limiting as we just did, as well as a DB row work-alike (the hash keys are like named columns, with values the row content), unique (but unsorted) indexes (i.e. email to user id lookup table, id to encoded data lookup table, ...), sharded data holders, and more.

For more from me on Redis and Python, you can check out the rest of my blog at

* When Twitter first released their API, they had a per-hour rate limit that was reset at the beginning of every hour, just like our most basic rate limiter from part 1. The current Twitter API has a per-15 minute rate limit, reset at the beginning of every 15 minute interval (on the hour, then 15, 30, and 45 minutes after the hour) for many of their APIs. (I have no information on whether Twitter may or may not be using Redis for rate limiting, but they have admitted to using Redis in some capacity by virtue of their release of Twemproxy/Nutcracker).