Thursday, October 24, 2013

Multi-column (SQL-like) sorting in Redis

Recently, I received an email from a wayward Redis user asking about using Redis Sets and Sorted Sets to sort multiple columns of data, with as close to the same semantics as a traditional SQL-style "order by" clause. Well it is possible, with limitations, keep reading to find out how.

What is Redis?


For those people who don't quite know what Redis is already, the TLDR version is: an in-memory data structure server that maps from string keys to one of 5 different data structures, providing high-speed remote access to shared data, and optional on-disk persistence. In a lot of ways, you can think of Redis like a version of Memcached where your data doesn't disappear if your machine restarts, and which supports a wider array of commands to store, retrieve, and manipulate data in different ways.

The setup


With that out of the way, our intrepid Redis user had come to me with a pretty reasonable problem to have; he needed to build an application to display a listing of businesses, sorted by several criteria. In his case, he had "price", "distance[1]", and "rating". In many cases that we have all seen in recent years with individual retailer searches, never mind restaurant searches on Yelp and similar applications, when searching for something in the physical world, there are a few things you care about primarily. These usually break down preferentially as lowest distance, lowest price, highest rating. In a relational database/SQL world, these fields would all be columns in a table (or spread out over several tables or calculated in real-time), so we are going to be referring to them as "sort columns" from here on.

Now, depending on preferences, you can sometimes get column preference and ascending/descending changes, which is why we need to build a system that can support reordering columns *and* switching the order of each individual column. Say that we really want the highest rating, lowest distance, lowest price? We need to support that too, and we can.

The concept


Because we are dealing with sort orders, we have two options. We can either use the Redis SORT command, or we can use sorted sets. There are ways of building this using the SORT command, but it is much more complicated and requires quite a bit of precomputation, so we'll instead use sorted sets.

We will start by making sure that every business has an entry in each of 3 different sorted sets representing price, distance, and rating. If a business has an "id" of 5, has a price of 20, distance of 4, and a rating of 8, then at some point the commands "ZADD price 20 5", "ZADD distance 4 5", and "ZADD rating 8 5" will have been called.

Once all of our data is in Redis, we then need to determine the maximum value of each of our sort columns. If you have ranges that you know are fixed, like say that you know that all of your prices and distance will all be 0 to 100, and your rating will always be 0 to 10, then you can save yourself a round-trip. We'll go ahead and build this assuming that you don't know your ranges in advance.

We are trying to gather our data range information in advance in order to carve up the 53 bits of precision [2] available in the floating-point doubles that are available in the sorted set scores. If we know our data ranges, then we know how many "bits" to dedicate to each column, and we know whether we can actually sort our data exactly, without losing precision.

If you remember our price, distance, and range information, you can imagine that (borrowing our earlier data) if we have price=20, distance=4, rating=8, and we want to sort by distance, price, -rating, we want to construct a "score" that will sort the same as the "tuple" comparison (20, 4, -8). By gathering range information, we could (for example) translate that tuple into a score of "20042", which you can see is basically the concatenation of "20", "04", and 10-8 (we subtract from 10 here because the rating column is reversed, and it helps to understand how we got the values).

Note: because of our construction, scores that are not whole numbers may not produce completely correct sorts.

The code


Stepping away from the abstract and into actual code, we are going to perform computationally what I just did above with some string manipulation.  We are going to numerically shift our data into columns, accounting for the magnitude of the data, as well as negative values in the columns (which won't affect our results). As a bonus, this method will even tell you if it believes that you could have a lower-quality sort because your data range is too wide[3].

import math
import warnings

def sort_zset_cols(conn, result_key, sort=('dist', 'price', '-score')):
    current_multiplier = 1
    keys = {result_key: 0}
    sort = list(reversed(sort))

    # Gets the max/min values in a sort column
    pipe = conn.pipeline(True)
    for sort_col in sort:
        pipe.zrange(sort_col, 0, 0, withscores=True)
        pipe.zrange(sort_col, -1, -1, withscores=True)
    ranges = pipe.execute()

    for i, sort_col in enumerate(sort):
        # Auto-scaling for negative values
        low, high = ranges[i*2][1], ranges[i*2+1][1]
        maxv = int(math.ceil(max(abs(low), abs(high))))

        # Adjusts the weights based on the magnitude and sort order of the
        # column
        old_multiplier = current_multiplier
        desc = sort_col.startswith('-')
        sort_col = sort_col.lstrip('-')
        current_multiplier *= maxv

        # Assign the sort key a weight based on all of the lower-priority
        # sort columns
        keys[sort_col] = -old_multiplier if desc else old_multiplier

    if current_multiplier >= 2**53:
        warnings.warn("The total range of your values is outside the "
            "available score precision, and your sort may not be precise")

    # The sort results are available in the passed result_key
    return conn.zinterstore(result_key, keys)

If you prefer to check the code out at Github, here is the gist. Two notes about this code:
  • If the maximum or minimum values in any of the indexed columns becomes more extreme between the data range check and the actual query execution, some entries may have incorrect ordering (this can be fixed by translating the above to Lua and use Redis 2.6 and later support for Lua scripting)
  • If any of your data is missing in any of the indexes, then that entry will not appear in the results
Within the next few weeks, I'll be adding this functionality to rom, my Python Redis object mapper.

Interested in more tips and tricks with Redis? My book, Redis in Action (Amazon link), has dozens of other examples for new and seasoned users alike.


[1] For most applications, the distance criteria is something that would need to be computed on a per-query basis, and our questioning developer already built that part, so we'll assume that is available already.
[2] Except for certain types of extreme-valued doubles, you get 52 bits of actual precision, and 1 bit of implied precision. We'll operate under the assumption that we'll always be within the standard range, so we'll always get the full 53 bits.
[3] There are ways of adjusting the precision of certain columns of the data (by scaling values), but that can (and very likely would) result in scores with fractional components, which may break our sort order (as mentioned in the notes).

Edit on 2015-08-09: Changed the "maxv =" assignment above to be correct in all cases, and made sure the revered sort (for calculating ranges) is a list for repeated-iteration.

6 comments:

  1. This comment has been removed by a blog administrator.

    ReplyDelete
  2. Can you please explain why should it work? Lets say I have users:

    id name age weight
    1 Nick 26 70
    2 John 27 75
    3 Dough 26 80
    4 Kate 27 50

    Ages set is: 26 26 27 27
    Weights set is: 50 70 75 80

    And I want to sort by age and weight, so I do (based on the code):
    ZINTERSCORE res 2 ages weights 1 27

    That will result in:
    Kate 1377
    Nick 1916
    John 2052
    Dough 2186

    While expected is:
    Nick
    Dough
    Kate
    John

    ReplyDelete
    Replies
    1. When ordering your sort columns, you must pick the *highest* priority sort column first. The order you are getting is saying that *weight* is more important than *age*, so you are saying sort=('weight', 'age'). If you pass sort=('age', 'weight'), then you will get:
      ZINTERSTORE res 2 weights ages 1 80

      Which will result in:
      Nick 2150
      Dough 2160
      Kate 2210
      John 2235

      Which is the order you are asking for.

      As for what it's doing... Let's pretend for a moment that the underlying "score" inside a ZSET was a 64 bit integer (it's actually an IEEE 754 floating point double, but what happens is easier to explain with an integer). We pack the integer based on segments. The lowest-bits are packed with scores from the least-important data, and the highest-bits are packed with scores from the most-important data:
      [high bits ... low bits] -> [important scores ... less important scores]

      This gives us the equivalent of a tuple inside an integer (or double), which is how SQL sorts its data when saying "ORDER BY age, weight".

      Note: I just made 2 small fixes to the code in the gist and post, which fixes a couple issues in practice.

      Delete
    2. It makes it clear. Thank you.

      Delete
  3. Hi, awesome article.
    But what if one of the columns were a string? How would the approach change then?

    ReplyDelete
    Replies
    1. Strings are not usable as scores in sorted sets, generally, so that isn't a concern.

      If you used something like this: https://github.com/josiahcarlson/rom/blob/master/rom/util.py#L310 ... which generates scores from strings, then you are likely limited to a single-column sort, not a multi-column tuple sort like most relational databases offer.

      Delete