Tuesday, May 1, 2018

Forking Redis Followup

This past Thursday, I announced the forking of Redis for SSL + Transactions + performance improvements, as well as the release of an updated redis-benchmark client to benchmark your Redis + SSL system. You can read all of it here.

What I didn't know is that 7 days earlier, an engineer at AWS named Madelyn Olson had submitted a PR to provide SSL support to Redis. I missed that PR, and the earlier RCP, in my searches. Later Thursday evening, Salvatore stated that he'd be accepting the PR. Having since read the PR, I like her code better than mine, and she solves some SSL buffering and diskless replication problems better than the solutions I had in mind. Benchmarking her changes shows 5-10% better performance than mine in the few tests I've run, and reading the code tells me that she recognized better integration points and built a better SSL/TLS integration with fewer memory copies.

So I'm definitely going to be using Madelyn's SSL changes instead of mine, even if I did spend 9 weeks on my changes. Better code is better code. Kudos Madelyn. :)

Salvatore doesn't seem interested in my benchmark changes for more/better metrics and plotting, based on his most recent email. You can always compile from my sources, and I'll try to make it an easy patch to cherry-pick if you just want that one thing, after Madelyn's changes have been merged.

I've also been asked to rename the project, which is not entirely surprising. That will happen in the next few weeks, with a replacement README that points you to the correct place at the original repository location.

As for transactions and faster startup: given how many people read my original release notes (12k+), and how much money was donated or pledged ($0), I'm going to go out on a limb and guess that folks aren't all that interested in donating money, and are instead looking either for something completely free and open source, or for a commercial release to outsource their risk.

As such, instead of spending the next 3-6 weeks finishing and tuning the last of my SSL/TLS changes for that open source release (which no longer seem necessary, given Madelyn's PR), I'm going to spend a bit more time cleaning up my transactions work, get some benchmarks in, provide datasets and timing for my faster startup time work, and do all of the necessary project renames. I don't know when I'll be releasing more than just benchmark changes and/or renames, but I'm working on it 5-6 days a week.

I might do some short videos on the hilarity of someone else beating you to release software, which has happened 3 times in the last 6 months. We will see.

Again, if you're interested in Lua and other transactions with rollbacks, I've got you covered; just let me know. More is in the works for my fork than what I've announced, but I don't like talking unless I've got something done enough to demo. I hope you are solving your problems, and that my fork can help solve more.

Thursday, April 26, 2018

Yes, I'm Forking Redis - SSL/TLS + Transactions

Update2: The followup is here, for later.

If you haven't seen the video announcement on YouTube, and would like to watch/listen to the 5 minute version instead of reading the text version, here you go.


Announcement

Done? Great! This is the text version of that notice. I am releasing a fork [1] of Redis and tools surrounding Redis to support my needs. This first release is mostly just tools, specifically adding SSL/TLS support to redis-benchmark and hiredis.

In the next couple weeks, I'll also be releasing a version of Redis server (the real fork) that speaks SSL/TLS on standard listening ports, with Cluster gossip, and the MIGRATE call (used both by Redis Cluster and users). I'm releasing the benchmark tools now because they've been baking for a few weeks, and only required a few Makefile cleanups to make me feel good about releasing them.

The only thing really delaying the full Redis Server release is that I haven't gotten around to redis-cli (because I had clients that already spoke Redis + TLS) or redis-sentinel, and both of those are necessary before I can release the full package (yes, I've already added support for SSL/TLS to the unit tests).

What really isn't delaying the release: thanks to some algorithm tuning and better implementations of a few core algorithms, I am seeing 3-5x faster startup times on nearly every dataset I've tested, from huge numbers of short keys to object models and indexes. All of them load the full dataset into memory 3-5x faster. I will be posting benchmarks of this at the full server release.

In addition to SSL/TLS and the above-mentioned performance improvements, this fork will also contain support for my resurrected Transactions in Redis Lua scripts, with expanded support for WATCH/MULTI/EXEC/ROLLBACK style transactions (you send your list of keys with WATCH, and if any operation is on any key not in the list, or if you get an error, your changes are all rolled back).
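
For reference, the stock WATCH/MULTI/EXEC pattern that this builds on looks roughly like the sketch below using redis-py; the ROLLBACK verb and the automatic rollback-on-error behavior are what the fork adds on top, so they are not shown here (the key names are hypothetical).

import redis

conn = redis.Redis()

# Stock Redis optimistic transaction: WATCH the keys you intend to touch,
# queue writes with MULTI, and EXEC aborts if any watched key changed.
with conn.pipeline() as pipe:
    while True:
        try:
            pipe.watch('account:1', 'account:2')
            balance = int(pipe.get('account:1') or 0)  # immediate-mode read
            if balance < 10:
                pipe.unwatch()
                break
            pipe.multi()
            pipe.decrby('account:1', 10)
            pipe.incrby('account:2', 10)
            pipe.execute()
            break
        except redis.WatchError:
            continue  # a watched key changed underneath us; retry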

For those of you who are living in Redis Cluster land, and need transactions on keys that don't live in the same shard/server, I previewed multi-shard Lua script transactions back in February. If you are a commercial user and would like to use any of this in your environment, I offer reasonable support and integration contract rates (and those cluster modifications may require separate licensing).

So, the big news is:

  • SSL/TLS native in Redis (benchmarks below)
    • redis-benchmark SSL/TLS support now
    • hiredis SSL/TLS support now
    • redis-cli soon (not started)
    • redis-sentinel soon (not started)
    • redis-server (server, gossip, migrate, done)
      • unit tests (done)
  • Transactions with rollbacks in Redis (also available in Redis Cluster)
    • Lua scripts rollback on error
    • Rollback on explicit rollback
    • WATCH/MULTI/EXEC/ROLLBACK
  • 3-5x faster startup times
    • sample data and more benchmarks later

Why SSL/TLS?

While not the first feature implemented, encryption over the wire (especially between master/slave replicas, cluster gossip, and during MIGRATE calls) is a basic need. And as a basic need, relying on third party tools for SSL/TLS termination or a transparent VPN solution is a great first step up from running without encryption, but it can leave speed on the table. And part of the reason why we use Redis is for speed, right?

Redis itself spends much of its time waiting on network system calls, or waiting to be interrupted so it can read data from a connection. Quoted benchmarks at conferences since 2015 have claimed that 97% of Redis' time is spent waiting on network-related system calls and interrupts. With 3rd party SSL/TLS termination, that can only get worse. How much worse?

An SSL/TLS terminator needs to read the request (wait, read), then do the decrypt operation, then send the data to Redis (write). Redis does its operations (wait, read, write), then the terminator gets to read from Redis (wait, read), encrypt, then send the response to the client (write). Notice how we went from 3 somewhat basic operations to 9? Those of you running SSL/TLS terminators for your Redis have probably already experienced latency or throughput hits without realizing it. Can we benchmark to see how much native SSL/TLS termination inside Redis buys us? Of course we can.

For this first set of benchmarks, we use 2 computers with the following specifications:

Name: o790
Model: Dell Optiplex 790 workstation
CPU: Core i3-2120 @ 3.30 GHz (2 cores, 2 threads per core, 4 total threads)
RAM: 16GB total (4x 4096 MB DDR3-1333 Synchronous DIMMs)
Media: 1 TB SSD
OS: Ubuntu 14.04
Kernel: 4.4.0-119-generic
SSL/TLS via: OpenSSL 1.1.0h
Compiler: GCC 4.8.4

Name: t7600
Model: Dell Precision T7600 workstation
CPU: 2x Xeon E5-2670 @ 2.60/3.30 GHz (2 CPUs, 8 cores per CPU, 2 threads per core, 32 total threads)
RAM: 128GB total (16x 8192 MB DDR3-1333 Registered DIMMs)
Media: 1 TB SSD
OS: Ubuntu 16.04
Kernel: 4.13.0-38-generic
SSL/TLS via: OpenSSL 1.1.0h
Compiler: GCC 5.4.0

The computers are connected to a Trendnet TEG-S82g 1 gigabit 8 port desktop switch, along with several other devices. Ifconfig reports 1500 byte frames. No attempt was made to isolate traffic on the network. Occasional blips in performance were observed due to logging in and out of the workstations to check progress, which is why each command is run 10 times, with averages and standard deviations in response rates plotted.

We consider the following testing variations:

Client test variations:
  • redis-benchmark no TLS at any layer  (Redis-4.0.9)
  • redis-benchmark with included TLS  (Redis-4.0.9, patches)
  • redis-benchmark with external Stunnel TLS (Redis-4.0.9, Stunnel-5.44)
  • redis-benchmark with external Hitch TLS (Redis-4.0.9, Hitch-1.4.8)
Redis server test variations:
  • redis-server with no TLS at any layer (Redis-4.0.9)
  • redis-server with included TLS (Redis-4.0.9, patches)
  • redis-server with external Stunnel TLS (Redis-4.0.9, Stunnel-5.44)
  • redis-server with external Hitch TLS (Redis-4.0.9, Hitch-1.4.8)
Benchmark command-line test variations:
  • -t SET,GET -n 5000000 -r 4750000
    • 5 million SET/GET operations over a 4.75m entry keyspace
  • -t SET,GET -n 25000000 -r 4750000 -P 5
    • 25 million SET/GET operations over a 4.75m entry keyspace, 5 operations at a time
We run each benchmark in each chosen variation 10 times, with average and standard deviation plotted (top of the range limited to actual throughput seen). Some combinations of endpoints were not tested due to timing as of the publishing of this article [2], and others due to non-obvious runtime errors [3]. Data from each kept run is combined [4] and plotted using Matplotlib. The x-axis on all plots is "seconds since the start of redis-benchmark". Operations per second numbers are the exact number of operations performed in the last 1000 milliseconds, sampled every 250 milliseconds (see redis-benchmark source code for details).

Colors are consistent across plots. Red for 'null-null' represents no SSL/TLS termination on either end. Blue for 'native-native' represents redis-benchmark with '--ssl' connecting directly to our patched Redis. Orange for 'native-stunnel' means redis-benchmark with '--ssl' connecting to Stunnel SSL/TLS termination, feeding into an unpatched Redis 4.0.9 installation. Other lines have similar testing/configuration implications in their naming.
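
If you want to reproduce this kind of plot from your own runs, a minimal Matplotlib sketch of the mean plus/minus one standard deviation plotting described above might look like the following. This is illustrative only, not the exact tooling from the repository, and the run data and variable names are hypothetical.

import numpy as np
import matplotlib.pyplot as plt

def plot_config(runs, label, color):
    # 'runs' is a (num_runs, num_samples) array of operations-per-second
    # samples for one configuration, one sample every 250 milliseconds.
    runs = np.asarray(runs, dtype=float)
    t = np.arange(runs.shape[1]) * 0.25          # seconds since benchmark start
    mean, std = runs.mean(axis=0), runs.std(axis=0)
    plt.plot(t, mean, color=color, label=label)
    plt.fill_between(t, mean - std, mean + std, color=color, alpha=0.25)

# Usage sketch (hypothetical variable names):
#   plot_config(null_null_runs, 'null-null', 'red')
#   plot_config(native_native_runs, 'native-native', 'blue')
#   plt.xlabel('seconds since the start of redis-benchmark')
#   plt.ylabel('operations per second')
#   plt.legend(); plt.show()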

Benchmarks




On our lower-powered i3 (2 cores, 4 threads), we are heavily CPU bound, so the least-overhead SSL/TLS termination option (native-native in blue, more information later), is fastest among our encrypted options.




Oddly enough, Stunnel into native here doesn't work well at all, which suggests some mismatch between buffer sizes, request sizes, and/or timing of network operations, as none of the other configurations suffer as badly. We do have latency profiles of all benchmarks performed, so we can look at those in the future for direction on better tuning if this is a common platform in the wild.

That said, once we are less constrained by CPU, it turns out that Stunnel on both ends is fastest here on a total operations/second performance metric. But at what cost? Here we restart the daemons between each run, then get the total CPU time used by each of Redis, Hitch, and Stunnel in each of our 4 variations of recipient on the t7600.

For the t7600 native client into a Hitch-terminated Redis, I saw Hitch use 1h 11m 30s of CPU time, with Redis using 45m 11s. Going into Stunnel-terminated Redis, I saw Stunnel use 4h 12m 44s, with Redis using 49m 52s. When we use our native SSL/TLS patches for the same set of tests, Redis uses 49m 53s of CPU. Disabling all SSL/TLS gets us 41m 11s used by Redis. So relatively speaking, we've got ~1h 57m Redis+Hitch vs. ~5h 3m Redis+Stunnel vs. 50m Redis+native vs. 41m for no SSL/TLS termination. Or 185%, 639%, and 22% relative CPU overhead compared to no termination.
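
For anyone who wants to check the arithmetic, the overhead percentages above are just (combined CPU time - baseline) / baseline. A quick sketch of that calculation is below; small differences from the quoted figures come down to rounding in the hand-recorded times.

def seconds(h=0, m=0, s=0):
    return 3600 * h + 60 * m + s

baseline = seconds(m=41, s=11)                                   # Redis, no SSL/TLS
configs = {
    'hitch':   seconds(h=1, m=11, s=30) + seconds(m=45, s=11),   # Hitch + Redis
    'stunnel': seconds(h=4, m=12, s=44) + seconds(m=49, s=52),   # Stunnel + Redis
    'native':  seconds(m=49, s=53),                              # patched Redis only
}
for name, total in configs.items():
    overhead = 100.0 * (total - baseline) / baseline
    print('%s: ~%.0f%% CPU overhead vs. no termination' % (name, overhead))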





With our slower machine again as the server, native termination again comes out fastest in individual operations, likely for the performance reasons described before. We checked CPU time again, this time on the o790 side of things.

For the t7600 native client into a Hitch-terminated Redis, I saw Hitch use 1h 33m 24s of CPU time, with Redis using 32m 11s. Going into Stunnel-terminated Redis, I saw Stunnel use 2h 23m 27s, with Redis using 51m 45s. When we use our native SSL/TLS patches for the same set of tests, Redis uses 52m 32s of CPU. Disabling all SSL/TLS gets us 33m 40s used by Redis. So relatively speaking, we've got 2h 5m Redis+Hitch vs. 3h 18m Redis+Stunnel vs. 53m Redis+native vs. 34m for no SSL/TLS termination. Or 268%, 482%, and 56% CPU overhead for SSL/TLS termination relative to no termination.

Higher overhead (relatively speaking) for Hitch and native isn't surprising here, and is likely due to the higher actual cost of the AES encrypt/decrypt performed during SSL/TLS operations on the o790 vs. the t7600. We can partly verify that by checking the speed of the operations on each machine (we can explicitly verify later with timing around encrypt/decrypt operations inside Redis):

o790 $ openssl speed -evp aes-256-gcm
[snip]
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-gcm      64949.38k    71785.37k   261766.40k   286591.66k   293898.92k

t7600 $ openssl speed -evp aes-256-gcm
[snip]
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-gcm     251445.05k   684496.60k  1000988.42k  1110611.29k  1141981.18k





And again, native termination has the highest throughput.

Conclusion

So, looking at clear wins here, if you've got an under-powered server, going with native SSL/TLS termination inside Redis is likely to be faster for you, and substantially so, depending on your existing SSL/TLS termination method. If you're looking to save CPU time, native SSL/TLS termination is also a win, as it used less CPU time compared to other methods tested, as expected.

I would have liked to test Nginx as a general SSL/TLS unwrapper, but general unwrapping requires a commercial license that I don't have yet. Perhaps someone with a license can run a couple comparisons on their hardware and report back.

How to help

If you'd like to support this work and continuing work, you can do a few things.

Update:



References

[1] This open source fork will be updated and maintained, likely tied to the most recent stable release for the short term, expanding as needs dictate.
[2] We did not try connecting Stunnel <-> Hitch, nor have we tried cross-machine Hitch to Hitch. We may update the graphs, and this note, along with any other applicable notes, as this article and benchmarks are updated.
[3] As of this writing, I was seeing a runtime error for --client configured Hitch servers on the t7600, so could not use the t7600 to start a Hitch tunnel.
[4] Benchmark and analysis tools used are available in our forked repository, where you can see the numeric details of our sample averaging, as well as the raw data we produced in our runs.
[5] Repository: https://github.com/josiahcarlson/redis-tls/commits/benchmark-ssl

Sunday, May 15, 2016

rom Indexes and Search

So... some updates on rom.

At the end of January, I posted a teaser picture about "Interleaved Indexes" in rom/Redis via tweet. If you haven't seen the picture, it's here:


Interleaved index?

I ended up building an interleaved index in Redis using a bit of Lua scripting and Python. No ZRANGEBYLEX, surprisingly. What is an interleaved index? It's what I was calling an indexing mode that offers the same ordering and filtering options as a few different multi-dimensional indexes stored as tree structures. Specifically, some KD-trees, Redshift interleaved sort keys, BSP trees, a specific implementation of crit-bit trees, and several others offer the functionality I was after.

Why

The simple reason I was implementing an interleaved index is that I see some intersection operations on data in Redis as potentially less efficient than a proper multi-dimensional index would or could be. Long story short, it worked, but not without issues. I mentioned some of these issues in a series of tweets 1, 2, and 3, semi-abandoned the code in late February, and am now ultimately not releasing it. Why? Because it doesn't actually add anything. It was complex, borrowed about 750 lines of code I wrote 5 1/2 years ago, and ... no.

A better option

There were a few very simple wins that I knew could be made with the query optimizer, including a fix on my side for a bug when calling TYPE from within a Lua script (which returns a table instead of a string). The ultimate result of that work is that some queries on numeric ranges can, in theory, be hundreds or thousands of times faster on large indexes. That is partly due to starting with the correct set/sorted set, but also due to implementing a direct scan of an index instead of an intersect/union followed by deleting entries outside the requested ranges.

Sometimes being more specific for optimizations is worth it. Definitely is in this case. For one of my use-cases involving search, I'm seeing 10-20x faster queries in practice, and 150x faster in a simple canned test.

I also removed non-Lua writing mode code. Sorry for those of you living in pre-2.6 days, but you'll have to upgrade. Hell, even with Lua scripting turned off, the query optimizer still used Lua, so if this worked in Redis 2.4 recently, I'd be surprised.

So that's what's going on right now.

Rebuild your indexes

Yeah. And rebuild your indexes. I'm sorry. Whenever I'm using rom as a cache or index of some kind, I re-cache and re-index daily so things like this always eventually resolve themselves, especially immediately after a library upgrade. Not a new idea; Google did it with their bigtables for cleanup, Postgres does auto-vacuum. Call this a manual vacuum via re-indexing.

Once/day or however often, maybe during off hours:

# import all of your models first
# then...
from rom import columns, util
for model in columns.MODELS.values():
    util.show_progress(util.refresh_indices(model))

That will rebuild all of the indexes on all of your models.

Almost P.S. - Loadable modules in Redis?

Redisconf 2016 happened last week and loadable modules were announced. I think that for people who host their own Redis, it could be awesome. Think of it like an answer to Postgres plugins. Hope I can pay someone to run my loadable modules, if I ever get around to building any :P

Wednesday, July 29, 2015

Transactions in Redis

Over the last few months, I've been thinking about and implementing transactions for Lua scripting in Redis. Not everyone understands why I'm doing this, so let me explain with a bit of history.

MySQL and Postgres

In 1998-2003, if you wanted to start a serious database-driven web site/service and didn't have money to pay Microsoft or Oracle for their databases, you picked either MySQL or Postgres. A lot of people picked MySQL because it was faster, and much of that was due to the MyISAM storage engine, which traded away transaction capability for performance - speed is speed. Some people went with Postgres because despite its measurably slower performance on the same hardware, you could rely on Postgres to not lose your data (to be fair, the data loss with MySQL was relatively rare, but data loss is never fun).

A lot of time has passed since then; MySQL moved on from MyISAM as the default storage engine to InnoDB (which has been available for a long time now), gained full transaction support in the storage engine, and more. At the same time, Postgres got faster, and added a continually expanding list of features to distinguish itself in the marketplace. And now the choice of whether to use MySQL or Postgres usually boils down to experience and preference, though occasionally business or regulatory needs dictate other choices.

TL;DR: data integrity

In a lot of ways, Redis up to now is a lot like MySQL was back before InnoDB was an option. There is already a reasonable best-effort to ensure data integrity (replication, AOF, etc.), and the introduction of Lua scripting in Redis 2.6 has helped Redis grow up considerably in its capabilities and the overall simplification of writing software that uses Redis.

Comparatively, Lua scripting operates very much like stored procedures in other databases, but script execution itself has a few caveats. The most important caveat for this post is that once a Lua script has written to the database, it will execute until any one of the following occurs:
  1. The script exits naturally after finishing its work, all writes have been applied
  2. The script hits an error and exits in the middle; all writes that were done up to the error have been applied, but no more writes will be done from the script
  3. Redis is shut down without saving via SHUTDOWN NOSAVE
  4. You attach a debugger and "fix" your script to get it to do #1 or #2 (or some other heroic deed that allows you to not lose data)
To anyone who is writing software against a database: I would expect you to agree that only case #1 in that list is desirable. Cases #2, #3, and #4 are situations where you can end up with data corruption (cases #2 and #4) and/or data loss (cases #3 and #4). If you care about your data, you should be doing just about anything possible to prevent data corruption and loss. This is not philosophy, this is doing your job. Unfortunately, current Redis doesn't offer a lot of help here. I want to change that.
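
To make case #2 concrete, here is a minimal sketch (stock Redis and the stock redis-py client, nothing from my patches; the key name is made up) of a script that writes and then errors. The script as a whole fails, but the write it already made sticks around.

import redis

conn = redis.Redis()

# A script that performs a write and then hits an error. In stock Redis the
# SET below is applied even though the script as a whole fails (case #2).
partial_write = conn.register_script('''
redis.call('SET', KEYS[1], 'written-before-the-error')
return redis.call('INCR', KEYS[1])  -- errors: the value is not an integer
''')

try:
    partial_write(keys=['tx:demo'])
except redis.exceptions.ResponseError:
    pass

print(conn.get('tx:demo'))  # b'written-before-the-error' -- the write survived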

Transactions in Lua

I am seeking to eliminate cases #2, #3, and #4 above, replacing the entire list with:
  1. The script exits naturally after finishing its work, all writes have been applied
  2. The script exits with an error, no changes have been made (all writes were rolled back)
No data loss. Either everything is written, or nothing is written. This should be the expectation of any database, and I intend to add it to the expectations that we all have about Redis.

The current pull request is a proof of concept. It does what it says it does, removing the risk of losing data as long as you either a) explicitly run your scripts using the transactional variants, or b) force all Lua script calls to have transactional semantics with a configuration option.

There are many ways the current patch can be made substantially better, and I hope for help from Salvatore (the author of Redis) and the rest of the community.

Wednesday, November 26, 2014

Introduction to rate limiting with Redis [Part 2]

This article first appeared on November 3, 2014 over on Binpress at this link. I am reposting it here so my readers can find it easily.

In Introduction to rate limiting with Redis [Part 1], I described some motivations for rate limiting, as well as provided some Python and Lua code for offering basic and intermediate rate limiting functionality. If you haven’t already read it, you should, because I’m going to discuss several points from that article. In this post, I will talk about and address some problems with the previous methods, while also introducing sliding window functionality and variable-cost requests.

Problems with previous methods

The last rate limiting function that we wrote was over_limit_multi_lua(), which used server-side Lua scripting in Redis to do the heavy lifting of actually performing the rate limiting calculations. It is included below with the Python wrapper as a reference.

import json
import time

# get_identifiers() is the identifier helper defined in part 1.

def over_limit_multi_lua(conn, limits=[(1, 10), (60, 120), (3600, 240)]):
    if not hasattr(conn, 'over_limit_lua'):
        conn.over_limit_lua = conn.register_script(over_limit_multi_lua_)

    return conn.over_limit_lua(
        keys=get_identifiers(), args=[json.dumps(limits), time.time()])

over_limit_multi_lua_ = '''
local limits = cjson.decode(ARGV[1])
local now = tonumber(ARGV[2])
for i, limit in ipairs(limits) do
    local duration = limit[1]

    local bucket = ':' .. duration .. ':' .. math.floor(now / duration)
    for j, id in ipairs(KEYS) do
        local key = id .. bucket

        local count = redis.call('INCR', key)
        redis.call('EXPIRE', key, duration)
        if tonumber(count) > limit[2] then
            return 1
        end
    end
end
return 0
'''

Hidden inside this code are several problems that can limit its usefulness and correctness when used for its intended purpose. These problems and their solutions are listed below.

Generating keys in the script

One of the first problems you might notice was mentioned in a comment by Tobias on the previous post, which is that we are constructing keys inside the Lua script. If you’ve read the Redis documentation about Lua scripting, you should know that we are supposed to be passing all keys to be used in the script from outside when calling it.

The requirement to pass keys into the script is how Redis attempts to future-proof Lua scripts that are being written, as Redis Cluster (currently in beta) distributes keys across multiple servers. By having your keys known in advance, you can calculate which Redis Cluster server the script should run on, and, if the keys are on multiple Cluster servers, determine that the script can’t run properly.

Our first problem is that generating keys inside the script can make the script violate Redis Cluster assumptions, which makes it incompatible with Redis Cluster, and generally makes it incompatible with most key-based sharding techniques for Redis.

To address this issue for Redis Cluster and other client-sharded Redis setups, we must use a method that handles rate limiting with a single key. Unfortunately, this can prevent atomic execution for multiple identifiers for Redis Cluster, but you can either rely on a single identifier (user id OR IP address, instead of both), or stick with non-clustered and non-sharded Redis in those cases.

What we count matters

Looking at our function definition, we can see that our default limits were 10 requests per second, 120 requests per minute, and 240 requests per hour. If you remember from the “Counting correctly” section, in order for our rate limiter to complete successfully, we needed to only increment one counter at a time, and we needed to stop counting if that counter went over the limit.

But if we were to reverse the order that the limits were defined, resulting in us checking our per-hour, then per-minute, then per-second limits (instead of per-second, minute, then hour), we would have our original counting problem all over again. Unfortunately, due to details too involved to explain here, just sorting by bucket size (smallest to largest) doesn’t actually solve the problem, and even the original order could result in requests failing that should have succeeded. Ultimately our problem is that we are counting all requests, both successful and unsuccessful (those that were prevented due to being over the limit).

To address the issue with what we count, we must perform two passes while rate limiting. Our first pass checks to see if the request would succeed (cleaning out old data as necessary), and the second pass increments the counters. In previous rate limiters, we were basically counting requests (successful and unsuccessful). With this new version, we are going to only count successful requests.

Stampeding elephants

One of the most consistent behaviors that can be seen among APIs or services that have been built with rate limiting in mind is that usually request counts get reset at the beginning of the rate limiter’s largest (and sometimes only) time slice. In our example, at every hour on the hour, every counter that had been incremented is reset.

One common result for APIs with these types of limits and limit resets is what’s sometimes referred to as the “stampeding elephants” problem. Because every user has their counts reset at the same time, when an API offers access to in-demand data, many requests will occur almost immediately after limits are reset. Similarly, if the user knows that they have outstanding requests that they can make near the end of a time slice, they will make those requests in order to “use up” their request credit that they would otherwise lose.

We partially addressed this issue by introducing multiple bucket sizes for our counters, specifically our per-second and per-minute buckets. But to fully address the issue, we need to implement a sliding-window rate limiter, where the count for requests that come in at 6:01PM and 6:59PM aren’t reset until roughly an hour later at 7:01PM and 7:59PM, respectively, not at 7:00PM. Further details about sliding windows are a little later.

Bonus feature: variable-cost requests

Because we are checking our limits before incrementing our counts, we can actually allow for variable-cost requests. The change to our algorithm is minor: incrementing by a variable weight instead of by 1.

Sliding Windows

The biggest change to our rate limiting is actually the process of changing our rate limiting from individual buckets into sliding windows. One way of understanding sliding window rate limiting is that each user is given a number of tokens that can be used over a period of time. When you run out of tokens, you don't get to make any more requests. And when a token is used, that token is restored (and can be used again) after the time period has elapsed.

As an example, if you have 240 tokens that can be used in an hour, and you used 20 tokens at 6:05PM, you would only be able to make up to another 220 requests until 7:04PM. At 7:05PM, you would get those 20 tokens back (and if you made any other requests between 6:06PM and 7:05PM, those tokens would be restored later).

With our earlier rate limiting, we basically incremented counters, set an expiration time, and compared our counters to our limits. With sliding window rate limiting, incrementing a counter isn’t enough; we must also keep history about requests that came in so that we can properly restore request tokens.

One way of keeping a history, which is the method that we will use, is to imagine the whole window as being one large bucket with a single count (the window has a ‘duration’), similar to what we had before, with a bunch of smaller buckets inside it, each of which has its own individual count. As an example, if we have a 1-hour window, we could use smaller buckets of 1 minute, 5 minutes, or even 15 minutes, depending on how precise we wanted to be, and how much memory and time we wanted to dedicate (more smaller buckets = more memory + more cleanup work). We will call the sizes of the smaller buckets their “precision.” You should notice that when duration is the same as precision, we have regular rate limits. You can see a picture of various precision buckets in a 1 hour window below.


As before, we can consider the smaller buckets to be labeled with individual times, say 6:00PM, 6:01PM, 6:02PM, etc. But as the current time becomes 7:00PM, what we want to do is to reset the count on the 6:00PM bucket to 0, adjust the whole window’s count, and re-label the bucket to 7:00PM. We would do the same thing to the 6:01PM bucket at 7:01PM, etc.

Data representation

We’ve now gotten to the point where we need to start talking about data representation. We didn’t really worry about representation before simply because we were storing a handful of counters per identifier. But now we are no longer storing just 1 count for a 1-hour time slice; we could be storing 60 counts for a 1-hour time slice (or more, if you wanted more precision), plus a timestamp that represents our oldest mini-bucket label.

For a simpler version of sliding windows, I had previously used a Redis LIST to represent the whole window, with each item in the LIST including both a time label, as well as the count for the smaller buckets. This can work for limited sliding windows, but restricts our flexibility when we want to use multiple rate limits (Redis LISTs have slow random access speeds).

Instead, we will use a Redis HASH as a miniature keyspace, which will store all count information related to rate limits for an identifier in a single HASH. Generally, for a sliding window of a specified duration and precision for an identifier, we will have the HASH stored at the key named by the identifier, with contents of the form:

<duration>:<precision>:o --> <timestamp of oldest entry>
<duration>:<precision>: --> <count of successful requests in this window>
<duration>:<precision>:<ts> --> <count of successful requests in this bucket>

For sliding windows where more than one sub-bucket has had successful requests, there can be multiple <duration>:<precision>:<ts> entries that would each represent one of the smaller buckets. For regular rate limits (not sliding window), the in-Redis schema is the same, though there will be at most one <duration>:<precision>:<ts> key, and duration is equal to precision for regular rate limits (as we mentioned before).

Because of the way we named the keys in our HASH, a single HASH can contain an arbitrary number of rate limits, both regular and windowed, without colliding with one another.
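
As a concrete sketch of that naming scheme (the timestamp is made up, and this simply mirrors the key construction performed by the Lua script below), the HASH fields for a 240-requests-per-hour limit with 1-minute precision would be derived like so:

# Hypothetical walk-through of the field names for one limit; the Lua script
# below builds count_key, ts_key, and the per-bucket keys exactly this way.
duration, limit, precision = 3600, 240, 60
now = 1414968300                                  # made-up unix timestamp
count_key = '%s:%s:' % (duration, precision)      # '3600:60:'  -> whole-window count
ts_key = count_key + 'o'                          # '3600:60:o' -> oldest bucket label
block_id = int(now // precision)                  # 23582805, the current sub-bucket
bucket_key = count_key + str(block_id)            # '3600:60:23582805' -> bucket count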

Putting it all together

And finally, we are at the fun part; actually putting all of these ideas together. First off, we are going to use a specification for our rate limits to simultaneously support regular and sliding window rate limits, which looks a lot like our old specification.

One limit is: [duration, limit, precision], with precision being optional. If you omit the precision option, you get regular rate limits (same reset semantics as before). If you include the precision option, then you get sliding window rate limits. To pass one or more rate limits to the Lua script, we just wrap the series of individual limits in a list: [[duration 1, limit 1], [duration 2, limit 2, precision 2], ...], then encode it as JSON and pass it to the script.
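
For example (using the same numbers as the defaults later in this post), two regular limits plus one sliding window limit would be specified and encoded like this:

import json

# 10/second and 120/minute as regular limits, plus 240/hour as a sliding
# window with 1-minute precision; the JSON string is passed as ARGV[1].
limits = [[1, 10], [60, 120], [3600, 240, 60]]
encoded = json.dumps(limits)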

Inside the script we need to make two passes over our limits and data. Our first pass cleans up old data while checking whether this request would put the user over their limit, the second pass increments all of the bucket counters to represent that the request was allowed.

To explain the implementation details, I will be including blocks of Lua that can be logically considered together, describing generally what each section does after. Our first block of Lua script will include argument decoding, and cleaning up regular rate limits:

local limits = cjson.decode(ARGV[1])
local now = tonumber(ARGV[2])
local weight = tonumber(ARGV[3] or '1')
local longest_duration = limits[1][1] or 0
local saved_keys = {}
-- handle cleanup and limit checks
for i, limit in ipairs(limits) do

    local duration = limit[1]
    longest_duration = math.max(longest_duration, duration)
    local precision = limit[3] or duration
    precision = math.min(precision, duration)
    local blocks = math.ceil(duration / precision)
    local saved = {}
    table.insert(saved_keys, saved)
    saved.block_id = math.floor(now / precision)
    saved.trim_before = saved.block_id - blocks + 1
    saved.count_key = duration .. ':' .. precision .. ':'
    saved.ts_key = saved.count_key .. 'o'
    for j, key in ipairs(KEYS) do

        local old_ts = redis.call('HGET', key, saved.ts_key)
        old_ts = old_ts and tonumber(old_ts) or saved.trim_before
        if old_ts > now then
            -- don't write in the past
            return 1
        end

        -- discover what needs to be cleaned up
        local decr = 0
        local dele = {}
        local trim = math.min(saved.trim_before, old_ts + blocks)
        for old_block = old_ts, trim - 1 do
            local bkey = saved.count_key .. old_block
            local bcount = redis.call('HGET', key, bkey)
            if bcount then
                decr = decr + tonumber(bcount)
                table.insert(dele, bkey)
            end
        end

        -- handle cleanup
        local cur
        if #dele > 0 then
            redis.call('HDEL', key, unpack(dele))
            cur = redis.call('HINCRBY', key, saved.count_key, -decr)
        else
            cur = redis.call('HGET', key, saved.count_key)
        end

        -- check our limits
        if tonumber(cur or '0') + weight > limit[2] then
            return 1
        end
    end
end

Going section by section through the code visually, where a blank line distinguishes individual sections, we can see 6 sections in the above code:
  1. Argument decoding, and starting the for loop that iterates over all rate limits
  2. Prepare our local variables, prepare and save our hash keys, then start iterating over the provided user identifiers (yes, we still support multiple identifiers for non-clustered cases, but you should only pass one identifier for Redis Cluster)
  3. Make sure that we aren’t writing data in the past
  4. Find those sub-buckets that need to be cleaned up
  5. Handle sub-bucket cleanup and window count updating
  6. Finally check the limit, returning 1 if the limit would have been exceeded
Our second and last block of Lua operates under the precondition that the request should succeed correctly, so we only need to increment a few counters and set a few timestamps:

-- there is enough resources, update the counts
for i, limit in ipairs(limits) do
    local saved = saved_keys[i]

    for j, key in ipairs(KEYS) do
        -- update the current timestamp, count, and bucket count
        redis.call('HSET', key, saved.ts_key, saved.trim_before)
        redis.call('HINCRBY', key, saved.count_key, weight)
        redis.call('HINCRBY', key, saved.count_key .. saved.block_id, weight)
    end
end

-- We calculated the longest-duration limit so we can EXPIRE
-- the whole HASH for quick and easy idle-time cleanup :)
if longest_duration > 0 then
    for _, key in ipairs(KEYS) do
        redis.call('EXPIRE', key, longest_duration)
    end
end

return 0

Going section by section one last time gets us:
  1. Start iterating over the limits and grab our saved hash keys
  2. Set the oldest data timestamp, and update both the window and buckets counts for all identifiers passed
  3. To ensure that our data is automatically cleaned up if requests stop coming in, set an EXPIRE time on the keys where our hash(es) are stored
  4. Return 0, signifying that the user is not over the limit

Optional fix: use Redis time

As part of our process for checking limits, we fetch the current unix timestamp in seconds. We use this timestamp as part of the sliding window start and end times, and to decide which sub-bucket to update. If clients are running on servers with reasonably correct clocks (within 1 second of each other at least, within 1 second of the true time optimally), then there isn’t much to worry about. But if your clients are running on servers with drastically different system clocks, or on systems where you can’t necessarily fix the system clock, we need to use a more consistent clock.

While we can’t always be certain that the system clock on our Redis server is necessarily correct (just like we can’t for our other clients), if every client uses the time returned by the TIME command from the same Redis server, then we can be reasonably assured that clients will have fairly consistent behavior, limited to the latency of a Redis round trip with command execution.

As part of our function definition, we will offer the option to use the result of the TIME command instead of system time. This will result in one additional round trip between the client and Redis to fetch the time before passing it to the Lua script.

Add in our Python wrapper, which handles the optional Redis time and request weight parameters, and we are done:

def over_limit_sliding_window(conn, weight=1, limits=[(1, 10), (60, 120), (3600, 240, 60)], redis_time=False):
    if not hasattr(conn, 'over_limit_sliding_window_lua'):
        conn.over_limit_sliding_window_lua = conn.register_script(over_limit_sliding_window_lua_)

    now = conn.time()[0] if redis_time else time.time()
    return conn.over_limit_sliding_window_lua(
        keys=get_identifiers(), args=[json.dumps(limits), now, weight])
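
As a usage sketch, gating an API endpoint with this function might look like the following. The Flask app, conn, and handle_request() are hypothetical stand-ins for your own application, along the lines of the setup in part 1.

from flask import abort

# 'app', 'conn', and handle_request() are placeholders for your application.
@app.route('/api/resource')
def api_resource():
    # Reject the request before doing any work if the caller is over any limit.
    if over_limit_sliding_window(conn, weight=1, redis_time=True):
        abort(429)   # Too Many Requests
    return handle_request()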

If you would like to see all of the rate limit functions and code in one place, including the over_limit_sliding_window() Lua script with wrapper, you can visit this Github gist.

Wrap up and conclusion

Congratulations on getting this far! I know, it was a slog through problems and solutions, followed by a lot of code, and now after seeing all of it I get to tell you what you should learn after reading through all of this.

Obviously, the first thing you should get out of this article is an implementation of sliding window rate limiting in Python, which is trivially ported to other languages -- all you need to do is handle the wrapper. Just be careful when sending timestamps, durations, and precision values to the script, as the EXPIRE call at the end expects all timestamp values to be in seconds, but some languages natively return timestamps as milliseconds instead of seconds.

You should also have learned that performing rate limiting with Redis can range from trivial (see our first example in part 1) to surprisingly complex, depending on the features required, and how technically correct you want your rate limiting to be. It also turns out that the problems that were outlined at the beginning of this article aren’t necessarily deal-breakers for many users, and I have seen many implementations similar to the over_limit_multi_lua() method from part 1 that are perfectly fine for even heavy users*. Really it just means that you have a choice about how you want to rate limit.

And finally, you may also have learned that you can use Redis hashes as miniature keyspaces to collect data together. This can be used for rate limiting as we just did, as well as a DB row work-alike (the hash keys are like named columns, with values the row content), unique (but unsorted) indexes (i.e. email to user id lookup table, id to encoded data lookup table, ...), sharded data holders, and more.
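
As a small example of that last idea (the key and values here are made up), an email-to-user-id lookup table stored in a single HASH is just:

import redis

conn = redis.Redis()

# One HASH acting as an unsorted unique index: field = email, value = user id.
conn.hset('email:to:id', 'user@example.com', 7184)
user_id = conn.hget('email:to:id', 'user@example.com')   # b'7184'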

For more from me on Redis and Python, you can check out the rest of my blog at dr-josiah.com.

* When Twitter first released their API, they had a per-hour rate limit that was reset at the beginning of every hour, just like our most basic rate limiter from part 1. The current Twitter API has a per-15 minute rate limit, reset at the beginning of every 15 minute interval (on the hour, then 15, 30, and 45 minutes after the hour) for many of their APIs. (I have no information on whether Twitter may or may not be using Redis for rate limiting, but they have admitted to using Redis in some capacity by virtue of their release of Twemproxy/Nutcracker).

Monday, November 3, 2014

Introduction to rate limiting with Redis [Part 1]

This article first appeared on October 9, 2014 over on Binpress at this link. I am reposting it here so my readers can find it easily.

Over the years, I've written several different rate limiting methods using Redis for both commercial and personal projects. This two-part tutorial intends to cover two different but related methods of performing rate limiting in Redis using standard Redis commands and Lua scripting. Each method expands the number of use-cases for rate limiting, and cleans up some of the rougher edges of previous rate limiters.

This post assumes some experience with Python and Redis, and to a lesser extent Lua, but new users willing to read the docs should be okay.

Why rate limit?


Most uses of rate limiting on the web today are generally intended to limit the effect that someone can have on a given platform. Whether it is API limits at Twitter, or posting limits at Reddit or StackOverflow, some limits are about resource utilization, and others are about limiting the effect a spammer account can have. Whatever the reason, let's start with saying that we need to count actions as they happen, and we need to prevent an action from happening if the user has reached or gone over their limit. Let's start with the plan of building a rate limiter for an API where we need to restrict users to 240 requests per hour per user.

We know that we need to count and limit a user, so let's get some utility code out of the way. First, we need to have a function that gives us one or more identifiers for the user performing an action. Sometimes that is just a user id, other times it's the remote IP address; I usually use both when available, and at least IP address if the user hasn't logged in yet. Below is a function that gets the IP address and user id (when available) using Flask with the Flask-Login plugin.

from flask import g, request

def get_identifiers():
    ret = ['ip:' + request.remote_addr]
    if g.user.is_authenticated():
        ret.append('user:' + g.user.get_id())
    return ret

Just use a counter

Now that we have a function that returns a list of identifiers for an action, let's start counting and limiting. One of the simplest rate limiting methods available in Redis starts by taking the times of the actions as they happen, and buckets actions into ranges of times, counting them as they occur. If the number of actions in a bucket exceeds the limit, we don't allow the action. Below is a function that performs the rate limiting using an automatically-expiring counter that uses 1 hour buckets.

import time

def over_limit(conn, duration=3600, limit=240):
    bucket = ':%i:%i'%(duration, time.time() // duration)
    for id in get_identifiers():
        key = id + bucket

        count = conn.incr(key)
        conn.expire(key, duration)
        if count > limit:
            return True

    return False

This function shouldn't be too hard to understand; for each identifier we increment the appropriate key in Redis, set the key to expire in an hour, and if the count is more than the limit, we return True, signifying that we are over the limit. Otherwise we return False.

And that's it. Well, sort of. This gets us past our initial goal of having a basic rate limiter to limit each user to 240 requests per hour. But reality has a tendency to catch us when we aren't looking, and clients using the API have noticed that their limit is reset at the top of every hour. Now users have started making all 240 requests in the first few seconds they can, so all of our work limiting requests is wasted, right?

Multiple bucket sizes

Our initial rate limiting on a per-hour basis was successful in that it limited users on an hourly basis, but users started using all of their API requests as soon as they could (at the beginning of the hour). Looking at the problem, it seems almost obvious that in addition to a per-hour rate limit, we should probably also have a per-second and/or per-minute rate limit to smooth out peak request rates.

Let's say that we determined that 10 requests per second, 120 requests per minute, and 240 requests per hour were fair enough to our users, and let us better distribute requests over time. We could simply re-use our earlier over_limit() function to offer this functionality.

def over_limit_multi(conn, limits=[(1, 10), (60, 120), (3600, 240)]):
    for duration, limit in limits:
        if over_limit(conn, duration, limit):
            return True
    return False

This will work for our intended use, but with 3 rate limit calls, each of which can result in two counter updates and two expire calls (one pair each for the IP and user keys), we may need to perform 12 total round trips to Redis just to say whether someone is over their limit. One common method of minimizing the number of round trips to Redis is to use what is called 'pipelining'. Pipelining in the Redis context will send multiple commands to Redis in a single round trip, which can reduce overall latency.

Coincidentally, our over_limit() function is written in such a way that we could easily replace our INCR and EXPIRE calls with a single pipelined request to increment the count and update the key expiration. The updated function can be seen below, and cuts our number of round trips from 12 to 6 when combined with over_limit_multi().

def over_limit(conn, duration=3600, limit=240):
    # Replaces the earlier over_limit() function and reduces round trips with
    # pipelining.
    pipe = conn.pipeline(transaction=True)
    bucket = ':%i:%i'%(duration, time.time() // duration)
    for id in get_identifiers():
        key = id + bucket

        pipe.incr(key)
        pipe.expire(key, duration)
        if pipe.execute()[0] > limit:
            return True

    return False

Halving the number of round trips to Redis is great, but we are still performing 6 round trips just to say whether a user can make an API call. We could write a replacement over_limit_multi() that makes all increment and expire operations at once, checking the limits after, but the obvious implementation actually has a counting bug that can prevent users from being able to make 240 successful requests in an hour (in the worst case, a client may experience 10 successful requests in an hour, despite making over 100 requests per second for the entire hour). This counting bug can be fixed with a second round trip to Redis, but let's instead shift our logic into Redis.
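
To make that counting bug concrete, here is a sketch of what that obvious-but-broken fully pipelined implementation might look like. Don't use this; it is only here to show the problem.

import time

def over_limit_multi_broken(conn, limits=[(1, 10), (60, 120), (3600, 240)]):
    # Broken sketch: send every INCR/EXPIRE in one round trip, check after.
    pipe = conn.pipeline(transaction=True)
    checks = []
    for duration, limit in limits:
        bucket = ':%i:%i' % (duration, time.time() // duration)
        for id in get_identifiers():
            key = id + bucket
            checks.append(limit)
            pipe.incr(key)
            pipe.expire(key, duration)

    counts = pipe.execute()[::2]   # every other reply is an INCR result
    # The bug: by the time we check, every bucket has already been incremented,
    # so rejected requests still consume quota in the larger buckets and can
    # starve out requests that should have succeeded.
    return any(count > limit for limit, count in zip(checks, counts))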

Counting correctly

Instead of trying to fix a fully pipelined version, we can use the ability to execute Lua scripts inside Redis to perform the same operation while also keeping to one round trip. The specific operations we are going to perform in Lua are almost the exact same operations as we were originally performing in Python. We are going to iterate over the limits themselves, and for each identifier, we are going to increment a counter, update the expiration time of the updated counter, then check to see if we are over the limit. We will also use a small Python wrapper around our Lua to handle argument conversion and to hide the details of script loading.

import json

def over_limit_multi_lua(conn, limits=[(1, 10), (60, 120), (3600, 240)]):
    if not hasattr(conn, 'over_limit_lua'):
        conn.over_limit_lua = conn.register_script(over_limit_multi_lua_)

    return conn.over_limit_lua(
        keys=get_identifiers(), args=[json.dumps(limits), time.time()])

over_limit_multi_lua_ = '''
local limits = cjson.decode(ARGV[1])
local now = tonumber(ARGV[2])
for i, limit in ipairs(limits) do
    local duration = limit[1]

    local bucket = ':' .. duration .. ':' .. math.floor(now / duration)
    for j, id in ipairs(KEYS) do
        local key = id .. bucket

        local count = redis.call('INCR', key)
        redis.call('EXPIRE', key, duration)
        if tonumber(count) > limit[2] then
            return 1
        end
    end
end
return 0
'''

With the section of code starting with 'local bucket', you will notice that our Lua looks very much like and performs the same operations as our original over_limit() function, with the remaining code handling argument unpacking and iterating over the individual limits.

Conclusion

At this point we have built a rate limiting method that handles multiple levels of timing granularity, can handle multiple identifiers for a single user, and can be performed in a single round trip between the client and Redis. We went from a single-bucket rate limiter to a rate limiter that can evaluate multiple limits simultaneously.

Any of the rate limiting functions discussed in this post are usable for many different applications. In part two, I'll cover a different way of approaching rate limiting, which rounds out the remaining rough edges in our rate limiter. Read it over on Binpress.

More detailed information on Lua scripting can be found in the help for the EVAL command at Redis.io.

Friday, June 20, 2014

Why thorium reactors are important

This post is the first of a series of posts that I've been wanting to write for a long time, but I haven't been able to pick from among the collection of topics that I wanted to write about. After taking a quick vote on what people actually want me to write about, this one got the most votes, so you get to read about why I think that we (as a society) should take herculean efforts to get liquid fluoride thorium reactors available as quickly as possible.

Disclaimer: before you invest $1 on what I say below, think, read, and verify.

Doom and gloom

Climate change is happening: 97% of climate scientists agree. The single best thing that we can do to address climate change is to substantially reduce our production of greenhouse gasses, which at present is primarily in the form of exhaust from the burning of fossil fuels. The problem is that fossil fuels are used as an energy source in our transport and electric grid, the combination of which allows for the efficient transport of people and goods, and allows for the supporting of a modern energy-consuming society (lights, heat, computers, communication, ...).

In order to truly replace fossil fuels, we must find a power generation technology that can be put in locations where both small to large power generation facilities already exist, and where we can replace the most polluting of vehicular transport (large transport ships). Cars like the Tesla, and those evolved from Tesla's patents will be available in the coming years, and we can put highways underground, keep people out of them, and use automated vehicular transport. Think automated Uber. Then you get 4-6x the vehicle density per lane, and adding lanes doesn't make traffic worse. :D

This requires power generation in the 10 megawatt - 10 gigawatt range, if we are looking to cover up to the current largest fossil fuel and current technology nuclear reactors. I'll explain why replacing current nuclear technology is important later.

Now enters a challenger

Liquid fluoride thorium reactor technology was first developed in the 1950s by a group at General Electric that was tasked with coming up with a nuclear generator that could be placed in an airplane, which was followed up by mostly the same group of people working at the Oak Ridge National Lab in the mid/late 60s. They ended up with a thorium-based technology that uses molten fluoride salts to carry the nuclear material (addressing partial burns, structural breakdown of fuel, ...), which allows for 96%+ burning of the fuel - compared to the 1-5% efficiency of the best traditional nuclear reactors available today (quoted numbers vary from 1-2% to 3-5% for current tech, so I'm using the full range). One other feature is that you can introduce other spent nuclear fuels that are currently stored at hundreds of sites all over the world, and burn them too. Yep, thorium plants can recycle nuclear waste.

One convenient win is that with proper work, there is no reason why liquid fluoride thorium reactor technology can't be scaled from smaller 10-20 megawatt reactors that can be sealed and buried for zero maintenance for a decade or more for small municipalities (though this wouldn't happen for a while, the economics need to make sense), up to replace the horribly polluting 90 megawatt-equivalent diesel engines in large ships (we'll need more nuclear engineers), to arrays of reactors for 10+ gigawatt power generation stations near our mega cities.

Why replace current reactor technology?

While "modern" reactors are generally safe (though there are several unsafe reactors currently being used around the world), the ugly part of current reactor technology is a combination of fuel efficiency and waste. There is immense amounts of energy stored in refined nuclear fuels, and with only 1-5% of that fuel being burned with current nuclear technology, you are left with a massive amount of waste that will be dangerously radioactive for the next 10,000 years or more. It is a scary problem, and thorium is a solution.

This nuclear waste storage problem is not going away. The only way we can really address the issue is to stop using current nuclear technology and find another solution. I claim that liquid fluoride thorium reactor technology is the best option we have right now, as not only can it actually generate the power we need economically (we need roughly 400 tons of thorium ore to power all of the energy needs of the US for one year, and there are 160,000 tons of ore that is economically accessible at current thorium prices), but it can burn the wastes that we need to get rid of in the long term anyway.

And the vast majority of the waste is generally considered safe in 300 years, compared to 10,000+ for typical reactors.

Why not solar/wind/hydro?

The technologies behind solar, wind, and hydro generation are all great, and wonderful, and they work. And it even turns out that the largest power generation stations in the world are hydro-electric plants. The major problem with these technologies is that they are difficult to scale. Specifically, we've more or less tapped the majority of the reasonably usable hydro power available, and the environmental fallout of flooding land is not always worth the power generated.

Solar is great (I was just looking into the cost viability of installing solar panels on my house), but it requires a large amount of land to be viable, and the cost of production is still high. Solar also has a nasty cousin: typically a coal or natural gas power plant that supports the grid when the sun isn't shining. Thermal solar plants (as opposed to photovoltaics) give you the option of storing heat for release at night, but they are still rare, and my research suggests that only one solar plant in the US has heat storage (in molten salt). Finally, the land and financial costs make solar difficult to scale up to utility-level requirements (plants currently max out at roughly 500 megawatts), and make it essentially unusable as a mobile power generation method for commercial transport ships (see The Guardian's article on shipping pollution for why this matters).

Wind is also great, but there are limited locations where it makes sense, and there are political challenges with locating windmills near occupied land. It's also not a mobile technology, so it isn't viable for addressing the container ship problem. Wind power also has all of the same consistency problems as solar, without the relatively simple heat storage option of thermal solar.

Ultimately, what we don't need is more intermittent ("bursty") power sources that require secondary generation technologies and partially compromise the overall environmental benefit of using renewable sources like wind and solar.

What about fusion?

There have been recent reports that hydrogen-fusion based nuclear plants are now at the point where there are obvious paths of research and development leading directly to commercially viable power generation in the next 10-20 years. This is wonderful, but it is still all theoretical; the research isn't done, so the opportunity doesn't yet exist. These designs also only make sense at enormous scale, which makes building the plants a difficult investment and prevents the reactors from being mobile.

That's not to say that we couldn't end up in a future world where fusion power is available as a standard on-land reactor technology, while huge ships (shipping, navies, etc.) run thorium plants. But given the current state of technology, and how little money (relatively speaking) is being spent on thorium generator technology, there are some early and obvious wins that can be had by investing in thorium today, while the long-term benefits of fusion power are still attainable and can be developed on a parallel path.

But cars!

Cars will go electric with batteries. Think an entry-level Tesla. Also think battery swap technology for sub 5-minute recharges. We just need to make the technology cheap enough to be available to everyone. With Elon Musk releasing Tesla's patents, this is possible!

Long term destination

One of the reasons why I really like the idea of liquid fluoride thorium reactor technology is that it is the first obvious step towards reduction/elimination of fossil fuel use. That solves our short/mid-term global warming problem. But it also solves our mid-long term power problem when we finally get around to putting people on other planets in our solar system.

I know, I'm getting a few decades ahead of things here, but if we have any chance in hell of having a viable settlement on *any* other planet in this solar system (never mind in another solar system), we need a compact and efficient power generation technology whose fuel can be found on other planets. Thorium is available on the Moon, Mars, and Mercury, which are the three most likely locations for an off-world human settlement with current or soon-available technology. No, solar isn't enough (especially on Mars, which gets significantly less sunlight than we get here on Earth).

To make our lives even easier, we can actually detect whether a planet has available thorium, how much, and where it is located. And we can do this from space. Yes, we can scan for thorium from space. With current technology. Could thorium get any better?

Why isn't it done yet?

The simple answer is that insufficient time and money have been spent to make it commercially viable. There are various historical, political, and strictly human reasons for this: from it not being usable for generating materials for nuclear weapons, to it not being preferred by the man who decided to use uranium-based reactors in the US Navy's largest ships, to it being a different technology from the currently understood standard nuclear technology. These are all valid excuses, but that doesn't mean we can't move past them and make it happen.

The ultimate destination of a clean and readily available power source is within our grasp! Not in 50 years, but easily within 5 years for a test reactor, and 10 years for a serious commercial reactor in the gigawatt+ range. The challenges for the liquid fluoride thorium reactor design are strictly of the materials science, chemistry, regulatory, and investment kind. And guess what? The materials science and chemistry parts have been worked on for the last half-dozen+ years and are effectively solved. Whether regulators and investors are willing to make it happen is the real question.

Some detractors will claim it's still 40-70 years out, but to them I would just point out that the original nuclear reactors didn't take 40-70 years to go from conception through design, test reactors, and final construction. From the commissioning of the USS Nautilus design in December 1949 until the boat sailed under nuclear power in January 1955, barely more than 5 years of concentrated effort separated the order to design it from a submarine that traveled farther and faster underwater than any before it.

Several world governments are also working on thorium-based liquid salt reactor technology, including the USA, China, Australia, the Czech Republic, Russia, India, and the UK. It's going to happen, and it can't happen soon enough. What is $10 billion over the next 5-10 years if it means that everyone would have access to an inexpensive, clean, life-altering electricity supply? With electricity, you have water. With water, you have food, trees, and the ability to bring more rain for more water. It is an amazing virtuous cycle, and it can start reversing the CO2 levels in our atmosphere *now*.

Which, incidentally, might very well be the most important thing that we can do if we want to avoid a global drought. It is scary stuff, and we need to reforest the planet yesterday. This gets us there globally, and will teach us what we need for when we try to terraform Mars in the next 30-40 years. If we make an effort, we could seriously be living on Mars in our lifetimes.

Where can I get more information?

First off, watch this hour-long video: Dr. Joe Bonometti explains Thorium. That should give a pretty broad overview of the history, economics, etc., of the reactor technology, and it includes a link to Energy From Thorium. Generally, I like most of the information provided on the site, but several articles I've read there are a long way from being professional enough from an advocacy perspective. Also, there seems to be more interest in raising awareness than in actually doing things (like getting a reactor built and used).

A group of MIT graduates at Transatomic Power have been working on liquid fluoride thorium reactor technology for a few years now and have released a whitepaper describing their design in detail. I don't know their funding situation, or what sorts of regulatory challenges they are facing, but they do have a collection of experienced advisors helping them.

Another group, Flibe Energy, has been working on and advocating for the design for several years now, and is targeting the military base market, which seems to offer a potential bypass of both the investment and regulatory issues that may be blocking other efforts. They've also got a collection of great advisors and top-notch founding team members. They haven't released their design, but I think it would be foolish to believe that they haven't at least looked at Transatomic Power's design for ideas and possible improvements.

There are also links and information across a dozen different Wikipedia pages: individual thorium reactors, existing molten salt thorium reactors, companies working in the space, and so on. Start from the page on Liquid fluoride thorium reactor and follow the links.

What can you do?

Unfortunately, I don't have a list of things that you can do to help advance liquid fluoride thorium reactor technology. Heck, in writing this article I'm hard-pressed to find something for *me* to do beyond writing a general advocacy piece and quick overview of why I think thorium is the future. But as a start, we can become educated about what is going on, and attempt to understand and spread the fact that there is no other technology that could offer such a change in energy generation in such a short timespan. Nothing.

Update:

Okay, thanks to Eric Snow in the comments, I have just learned about a potentially viable fusion reactor technology. This could get us there too (thorium is still viable for helping to burn up old nuclear waste, so keep looking and reading). First, read this article on Gizmag, and then go here to donate. I donated last night. And then I just read this article, written May 31, 2014: record-setting temperatures of roughly 1.0-1.8 billion kelvin. This could happen, people.