Thursday, April 6, 2023

CRC64 in Redis is getting even faster

On Friday last week, I submitted this PR to Redis: https://github.com/redis/redis/pull/11988.

In the PR, I make single-threaded CRC64 53-73% faster on a CPU that is 13 years old, and ~67% faster on a CPU that is 5 years old. I'll get into the details of how and why below, but modern CPUs could see 2x or more, and an update to the algorithm should allow it to take advantage of more CPU hardware as that hardware scales. The approach is also generally applicable to all CRC polynomials, and brings software CRC within 2x of hardware CRC performance.

Oh, then I also made CRC concatenation >10,000x faster, but I suspect that was previously known, and just not reported for what it was.

I know, this is going to feel like "learn this one simple trick to make all your programs faster" clickbait, but for some algorithms, 2x faster is just the start (and the same techniques extrapolate to other CRCs).

Computer Architecture Background

For folks who already know about the history of computers, and how / why CPUs do what they do, jump ahead to "Making it faster".

Decades before I was born, Alan Turing and John von Neumann developed competing ideas about how to conceptualize and build computers. Ultimately, the von Neumann architecture won out. It is best characterized by a central processing unit that handles control flow and arithmetic, along with interfaces to memory, input devices, and output devices. To go a little further for our specific example, processors have been built to do a few things really well: read instructions, decode those instructions into sub-instructions, wait for data to arrive, then finally execute on whatever data is available.

Fetching things can sometimes take a while, and execution isn't instantaneous either. Even on your 3 or 4 gigahertz PC, most CPU instructions, especially complex mathematical operations like square root, division, and modulo, take multiple steps. So if you are performing a division, you may only get a few bits of the result each cycle, and you may need to wait tens of cycles for the full result. Normally this is all hidden, so you only ever see final results, not incomplete ones.

In the specification manuals for most CPUs, you can look up information about how many instructions of what types can be dispatched in a cycle, and how many cycles each operation will take. Depending on the operation, and other things going on at the time, sometimes you can start multiple operations of the same type in adjacent "pipelines", like having multiple lines in a factory. Sometimes you can add items to the lines every cycle, and sometimes each step takes longer than a cycle and items to be added to the pipeline have to wait. Sometimes, there are groups of 2, 4, or 8 items that need to be processed at once, and we have instructions specifically designed to process these groups in what is called vectorization.

Making it faster

Back in 2013, Mark Adler released performance improvements to the CRC class of algorithms. I didn't learn of them until several years later, through Matt Stancliff's Redis work and optimizations around 2017 or 2018. Huge thank you to Mark for writing the code, Matt for making it work with Redis, and Madelyn Olson for finally getting it merged in 2020.

A friend and former coworker sent me a reference to the "Hacker's Delight CRC", which was posted in 2014. It appears to use the same techniques that Mark Adler released in 2013.

I don't know who originated the work, but I'm going to guess it was Mark Adler or one of his colleagues, seeing as he is the creator of the Adler-32 checksum and a co-author of zlib and gzip, among other things.

What was done? Mark transformed the typical bit-wise or byte-wise operations of CRC into operations performed up to 8 bytes at a time. CPUs are built to do many such operations concurrently, so this naturally sped CRC up from around 400-408 MB/second to ~1420 MB/second, at least on one of my daily-workhorse Intel Xeon E5-2670s. That's a ~3.5x speedup from switching from 1 byte at a time to 8 bytes at a time. I would have a hard time believing it if I hadn't compiled, run, and compared the outputs myself.
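
To show the shape of the technique, here is a minimal sketch of the 8-bytes-at-a-time idea. This is a hedged illustration, not the actual Redis or crcany code, and it assumes a hypothetical precomputed table crc64_table[8][256] for a reflected CRC-64 polynomial:

    /* Sketch of 8-bytes-at-a-time ("slice-by-8") CRC-64; the real Redis /
     * crcany code differs in table generation, alignment, and unrolling. */
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    extern uint64_t crc64_table[8][256];    /* assumed precomputed elsewhere */

    uint64_t crc64_slice8(uint64_t crc, const unsigned char *buf, size_t len) {
        while (len >= 8) {
            uint64_t word;
            memcpy(&word, buf, 8);          /* assumes a little-endian machine */
            crc ^= word;
            crc = crc64_table[7][ crc        & 0xff] ^
                  crc64_table[6][(crc >> 8)  & 0xff] ^
                  crc64_table[5][(crc >> 16) & 0xff] ^
                  crc64_table[4][(crc >> 24) & 0xff] ^
                  crc64_table[3][(crc >> 32) & 0xff] ^
                  crc64_table[2][(crc >> 40) & 0xff] ^
                  crc64_table[1][(crc >> 48) & 0xff] ^
                  crc64_table[0][ crc >> 56        ];
            buf += 8;
            len -= 8;
        }
        while (len--)                       /* finish any tail one byte at a time */
            crc = crc64_table[0][(crc ^ *buf++) & 0xff] ^ (crc >> 8);
        return crc;
    }

The win comes from the eight independent table lookups per 8-byte word, which the CPU can overlap, instead of a strictly serial chain of one lookup per byte.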

I also noticed that Mark had provided a method to do CRC combining, which was unused and rarely mentioned. Normally, if you have some data, one function has to run from start to finish over that whole block of data. You can pause, but you have to do the first part first. This limitation is very common among hashing algorithms.

Again, I don't know where the idea originated, but Mark determined a way to merge the CRCs of conceptually adjacent segments. So you can cut one segment into two, compute over the segments in parallel, combine the results, and get the same value as if you had computed it serially.
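
To make the idea concrete, here is a hedged sketch of the GF(2) matrix approach behind CRC combining, the same shape as zlib's crc32_combine widened to 64 bits. The names and the polynomial constant are illustrative, not the exact Redis / crcany API:

    /* Sketch of CRC64 combining via GF(2) matrices. */
    #include <stdint.h>

    #define POLY64_REFLECTED 0x95ac9329ac4bc9b5ULL  /* reflected Jones polynomial (assumed) */

    /* multiply the 64x64 GF(2) matrix mat by the bit-vector vec */
    static uint64_t gf2_matrix_times(const uint64_t *mat, uint64_t vec) {
        uint64_t sum = 0;
        while (vec) {
            if (vec & 1) sum ^= *mat;
            vec >>= 1;
            mat++;
        }
        return sum;
    }

    /* square = mat * mat */
    static void gf2_matrix_square(uint64_t *square, const uint64_t *mat) {
        for (int n = 0; n < 64; n++)
            square[n] = gf2_matrix_times(mat, mat[n]);
    }

    /* crc64(A || B) == crc64_combine(crc64(A), crc64(B), length of B in bytes) */
    uint64_t crc64_combine(uint64_t crc1, uint64_t crc2, uint64_t len2) {
        uint64_t even[64], odd[64], row = 1;

        if (len2 == 0) return crc1;

        odd[0] = POLY64_REFLECTED;              /* operator for appending one zero bit */
        for (int n = 1; n < 64; n++) { odd[n] = row; row <<= 1; }

        gf2_matrix_square(even, odd);           /* operator for two zero bits */
        gf2_matrix_square(odd, even);           /* operator for four zero bits */

        /* append len2 zero bytes to crc1, squaring the operator for each bit of len2 */
        do {
            gf2_matrix_square(even, odd);
            if (len2 & 1) crc1 = gf2_matrix_times(even, crc1);
            len2 >>= 1;
            if (len2 == 0) break;
            gf2_matrix_square(odd, even);
            if (len2 & 1) crc1 = gf2_matrix_times(odd, crc1);
            len2 >>= 1;
        } while (len2);

        return crc1 ^ crc2;                     /* fold in the CRC of the second segment */
    }

Most of the time in this version goes into the repeated matrix squarings, which is what the vectorization, and later the caching, described below attack.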

Normally, most folks would add some threads, build a thread scheduler, and work out how to make this faster with threads. I did that, but only because I already had a threaded snapshot engine for Redis (not a part of this PR), so my CRCs were naturally faster. But I wasn't just interested in threads, and Mark's improvements didn't come from thread shuffling on one CPU; they came from putting more operations into the CPU from one thread.

Armed with the knowledge that I could extend a CRC, I pulled the 8-bytes-at-a-time CRC operation into a group of macros, then had a single thread run over 2 or 3 segments at once, producing 2 or 3 concurrent 8-byte-at-a-time segment CRCs over the data.
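
The shape of that change looks roughly like the following sketch, reusing the hypothetical crc64_slice8() and crc64_combine() from the sketches above. The actual Redis change inlines the 8-byte step via macros rather than function calls, so the three independent dependency chains can overlap inside the CPU:

    #include <stdint.h>
    #include <stddef.h>

    extern uint64_t crc64_slice8(uint64_t crc, const unsigned char *buf, size_t len);
    extern uint64_t crc64_combine(uint64_t crc1, uint64_t crc2, uint64_t len2);

    uint64_t crc64_3way(uint64_t crc, const unsigned char *buf, size_t len) {
        size_t seg = (len / 3) & ~(size_t)7;    /* per-segment length, multiple of 8 */
        if (seg < 8)                            /* too small to be worth splitting */
            return crc64_slice8(crc, buf, len);

        const unsigned char *p0 = buf, *p1 = buf + seg, *p2 = buf + 2 * seg;
        uint64_t c0 = crc, c1 = 0, c2 = 0;

        for (size_t i = 0; i < seg; i += 8) {
            /* three independent 8-byte steps per iteration */
            c0 = crc64_slice8(c0, p0 + i, 8);
            c1 = crc64_slice8(c1, p1 + i, 8);
            c2 = crc64_slice8(c2, p2 + i, 8);
        }

        /* staple the three partial CRCs together, then finish the tail */
        crc = crc64_combine(c0, c1, seg);
        crc = crc64_combine(crc, c2, seg);
        return crc64_slice8(crc, buf + 3 * seg, len - 3 * seg);
    }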

I had difficulty making CRC64 faster for data smaller than about 2-3 megabytes until I noticed something important. The original CRC combine method was taking ~270 microseconds to staple ~1 meg CRCs together. After some vectorization, manual and automatic, I settled on a compiler-vectorized method that got me down to ~29 microseconds for the CRC combine operation back in spring 2018.

On Wednesday, March 29, after looking at the algorithm further while preparing the patch for what ultimately became this pull request, I noticed that if you were to cache a certain set of data, much like we had done with the CRC computation in the first place, you could eliminate much of that remaining ~29 microseconds. So I did.

After a quick cache initialization that takes around 140 microseconds, or just more than half the time of the original staple over 1 meg, a 32 kilobyte cache is populated for your polynomial that allows CRC64 stapling. Due to an apparent cycle in the CRC after 64 bits of merge, we can actually merge unlimited length segments. After initialization, calls to combine about a megabyte now take roughly 25 nanoseconds. Yes, nanoseconds. From 270 microseconds without vectorization and caching, to 29 microseconds with vectorization, then all the way to 25 nanoseconds with caching.
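
My reading of how the arithmetic lands on a 32 kilobyte cache, sketched with the helpers from the combine example above (illustrative, not the exact Redis code): precompute the operator that appends 2^k zero bytes for k = 0..63 once, and a combine becomes at most 64 matrix-vector products, one per set bit of the length. 64 matrices of 64 rows of 8 bytes each is exactly 32 KB:

    /* Sketch of a cached combine; assumes gf2_matrix_times(), gf2_matrix_square(),
     * and POLY64_REFLECTED from the combine sketch above are in the same file. */
    #include <stdint.h>

    static uint64_t combine_cache[64][64];      /* combine_cache[k] appends 2^k zero bytes */

    void crc64_combine_cache_init(void) {
        uint64_t even[64], odd[64], row = 1;

        odd[0] = POLY64_REFLECTED;              /* operator for one zero bit */
        for (int n = 1; n < 64; n++) { odd[n] = row; row <<= 1; }

        gf2_matrix_square(even, odd);           /* 2 zero bits */
        gf2_matrix_square(odd, even);           /* 4 zero bits */
        gf2_matrix_square(combine_cache[0], odd);   /* 8 zero bits = 1 zero byte */

        for (int k = 1; k < 64; k++)            /* 2^k zero bytes */
            gf2_matrix_square(combine_cache[k], combine_cache[k - 1]);
    }

    uint64_t crc64_combine_cached(uint64_t crc1, uint64_t crc2, uint64_t len2) {
        for (int k = 0; len2; len2 >>= 1, k++)
            if (len2 & 1)
                crc1 = gf2_matrix_times(combine_cache[k], crc1);
        return crc1 ^ crc2;
    }

With the cache in place, a combine is just a few short loops of XORs, which is where the nanosecond-scale timing comes from, and the 64 rows of operators cover any 64-bit length.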

So at least in Redis-land, CRC64 combining is about 10,800 times faster overall with the use of a 32 kilobyte cache and some auto-vectorization.

Wow.

I am pretty sure that the caching portion is a well-known optimization, at least to Mark Adler, as I see a similar set of CRC combine optimizations in the most recent version of his codebase; that makes me think I should do a literature check before I write software. That said, Mark does not provide the additional vector operations, and without them, the cache speedup (at least in my version) only gets us to 8 microseconds, rather than the typical worst case of 50 nanoseconds.

So overall, the Redis CRC combine is about 160x faster than crcany's combine, and nets a >10,000x speedup over what I was originally working with, compared to around a 60x speedup with caching alone, or only 10x with vectorization alone (the combination is super effective).

Here are some benchmark outputs in a GitHub gist.

Overall, that gets us from 400-408 MB/second for 1 byte at a time, to ~1420 MB/second for 8 bytes at a time, up to ~2325 MB/second with the new code. That's roughly 63% faster than the 8-byte-at-a-time version on an Intel Xeon E5-2670 v0 @ 2.60GHz with 20MB of cache.

That is an old CPU, originally marketed in Q1 2012, so it has a relatively small number of load, store, and arithmetic units compared to a more modern processor. I also happen to have a slightly newer Intel Core i3-8130U @ 2.20GHz, first listed for sale in Q1 2018, which I got in the summer of 2019 inside the laptop in which it currently runs.

Again, we see 400-408 MB/second for 1-byte-at-a-time processing. The 8-byte-at-a-time version gets us up to ~1400 MB/second, or 3.5x faster, with the 24-byte-at-a-time version seeing ~2350 MB/second on larger data. That is an approximately 68% speedup at 24 bytes over the 8-byte-at-a-time method.

In some benchmarks on ~10k of data (fitting inside L1 cache), I saw CRC64 achieve upwards of 5670 MB/second on the 8130U.

Beating 1 byte/cycle

On my 5 year old mobile CPU, we see the 2.2GHz processor handling 2350 MB/second, which is 1.068 bytes/cycle. If the data happens to already be loaded into cache after being created by another step, as is the case in many Redis snapshot operations, we see 2.6 bytes/cycle processed.

This speed puts us on par with the performance of "properly engineered" hashes that don't use any tables, and within a factor of 2 of the hardware CRC64 available in some more modern CPUs. For those about to ask: no, we can't use that CRC in Redis or in general, as it is not the right polynomial; if you run the two functions on the same data, you get different outputs.

I do not know if you can convert the output of CRC from one polynomial to another, but I suspect the answer is no. Might be an interesting question to ask a good mathematician.

Similar 16-byte and 24-byte (or more) optimizations can be applied to CRC16 in Redis, or to any other CRC anywhere, as the 8-bytes-at-a-time processing has already been implemented in crcany. We skip this 16/24-byte optimization in Redis because CRC16 is typically used there for cluster slot identification, which is performed on keys that are generally tens to maybe hundreds of bytes in size, and there isn't a benefit to the optimization until about 128 bytes for 16-way, and ~2k for 24-way. On the other hand, CRC64 is typically performed over snapshots of keys and data, which can reach tens of gigabytes in size.

I'll try to get the optimizations upstream to crcany soon.

Summary

That's a lot of information, but the takeaway is that CRC64 in Redis is getting about 60-70% faster on old machines, and likely 2x or more on newer hardware. For CRC combining, we achieve a ~50 ns worst-case time.

References

Pull request: https://github.com/redis/redis/pull/11988

Benchmark outputs: https://gist.github.com/josiahcarlson/9a147d39cc492951faffb723149f7926

Mark Adler's crcany project: https://github.com/madler/crcany/

Thursday, January 26, 2023

Getting an early birthday present from YouTube

Back in 2008-2010, I was working as one of the software engineers tasked with developing features and functionality on YouTube for Google. I had joined as the Santa Monica team was growing, right before a summer-long hiring freeze, about 3 months after officially getting my doctorate in 2008. Among other things, I worked on auto-redirection based on demographics (age-gating alcohol-related YouTube pages, which was nearly Turing-complete; using user profiles as nodes, and cookies as the tape), home-page customization, and user-page customization (whee, CSS sanitizers!), but the biggest thing I did at YouTube was rewrite the built-in social Groups platform.

Humble beginnings


I can't recall the author of the original YouTube Groups platform, but the software had several bothersome limitations and design constraints. The first was that no more than (if I remember correctly) 5000 videos could be posted to a group, and that after a video was posted, there could be up to 1000 unthreaded comments related to that video. To find a conversation, you looked through the reverse-chronological listing by posting date. What happened when you hit 5000? Sorry, you couldn't post any more videos. Oops. What about text posts? Well, text posts only existed as chunks of text like on any standard bulletin board at the time, and were mostly just a repeating wall of author, text, horizontal line.

While some of these could have been fixed with some minor edits, nearly everyone I spoke with internally wanted to delete Groups. Not for any reason except that it was pulling less than 1 million total views weekly (this was shortly before YouTube publicly announced 1 billion daily watch page views), and occasionally a change to some Python-generated HTML widget would cause a maintenance headache as someone needed to figure out what broke Groups. It also didn't help that the design was a bit out of date, owing to zero real updates in the several years since it had been written.

A first step


By the time I started paying attention in late 2008, I had hacked together a mini Twitter / Reddit / Digg / whatever forum thingy for posting daily standup statuses for our group, and it had more interactive functionality than the existing YouTube Groups. I had built it using AppEngine as an "I should use AppEngine for something so I can learn it" project, and after working up enough "I am embarrassed by YouTube Groups" intellectual inertia, I started considering building a new YouTube Groups in AppEngine.

After designing the database schema for the non-relational AppEngine datastore, there was a request from our Chief Architect Mike Solomon to also provide a MySQL schema as an option. At the time, Groups would have been the only YouTube property not using either the sharded or non-sharded MySQL databases that stored everything except pictures and video. While comments would eventually move to one of the BigTable successors in a separate process / migration, Groups ultimately ended up using the non-sharded MySQL database, as we never expected the thing to grow very big. We were wrong. :)

Once the schema was approved, I started hacking on a horribly designed proof-of-concept 20% project. A month or two later, after it was far enough along to understand what I was going for (think threaded LiveJournal / Reddit with video posts and comments you could upvote / downvote), my manager James Phillips convinced our PM to get one of the YouTube designers on it. Fortunately for me, I ended up getting the illustrious Gunthar Hartwig to help integrate the developing YouTube profile / channel styles into YouTube Groups. This integration ended up going pretty far, and by the old-groups migration and new-groups release in the summer of '09, Groups was using nearly all of the same widget building, widget positioning, data packing, and unpacking code as YouTube Channels (I also attempted to get the same code pushed to the homepage for customization there, but the homepage team went with something different, and looking at LinkedIn profiles after the fact, apparently earned a few patents for their efforts).

Pitfalls


No project is complete without hitting some bumps, or in my case, pretty deep pitfalls. Along the way in rebuilding Groups, I made several mistakes.

The first obvious mistake was that I built Groups against what we called "mono", a single large database meant to hold things that didn't make sense to shard, or couldn't be sharded, across our sharded DB instances. Clearly Groups could have been sharded on a per-group basis, with the only real costs being a lookup of non-cached group name -> id (we sharded on ids), and a need to hit all shards to get a listing of groups.

This was compounded by a design choice / mistake where I both avoided joins and tried to satisfy the 4 most common queries with covering indexes. Which queries?
  • Show me the top k videos for this group
  • Show me the most recent k videos for this group
  • Show me the top k videos for this topic
  • Show me the most recent k videos for this topic
Four queries, four covering indexes. I didn't worry about people posting; that wasn't a big deal, just a main table write plus 4 index writes (all of which were spam-checked, throttled, etc.). But the problem was deletion. Because there were no joins, removing a thread (and its associated videos) meant removing all of its video posts from all 4 indexes. This was (relatively) rare, but occurred often enough, and for long enough threads, that I had added 'order by ... limit 1000' to keep the database from stuttering, with a secondary batch process that ran on a daily basis to clean up any threads longer than 1000 posts. Had I sharded the data by group in the first place, used a join to store the thread's delete bit in one place (indexed), and considered just using caches instead of covering indexes, I could have avoided most of the database load that I had induced. To the point: occasionally, those thread-delete operations would show up in our weekly database query reports as one of the top-100 latency queries. Funny how just a little join on the queries and some caching could have solved it.

Foolishly, I also wrote my own JavaScript Ajax mini-library, separate from what most of the rest of YouTube was using. I wrote it because I couldn't find the documentation for any of the internal JavaScript libraries that YouTube had been using, and literally everyone else at my site just magically seemed to know how to use them. This included the recently graduated undergrad who started the same day I did! I didn't find out until a bit later, but the recent grad had done an internship with YouTube the summer before, so they already knew the tools, people, process, ... I probably could have saved myself a lot of time if I'd asked, "where are the docs?" Alternatively, maybe I did and I don't remember, as there were several other projects I was working on at the time, and I was overworked.

If you want me to be honest, internally much of the code I wrote for Groups was crap. Some of the HTML templates weren't too bad, but the inline onclick bindings nowadays might get me fired in some shops, even if the CSS selectors didn't. There was also an if / elif block in the Python-side API that dispatched 95% of the interactive element processing based on a variety of context (upvotes, downvotes, posting, replies, content pagination, ...), and it got worse to maintain with every addition.

My only real excuse for any of this is that working on YouTube was the first web application I touched at any level, and I got to work on: subscription email libraries, memcache libraries, sharded + non-sharded sql schemas and indexes, homepage html, css, javascript, user profile pages, all of the (groups) templates, all of the (groups) css, all of the (groups) javascript, all of the (groups) Python, (groups) data migration, daily (groups) batch processes, (groups) API design, ensuring proper layout for RTL languages for user channels and groups, ... along with several other side projects and helping others as necessary.

Trouble in paradise


Having just come out of a much smaller organization before Google, I had been used to being recognized and promoted for my abilities and achievements. That doesn't really / actually happen at Google. It's a bit of a fight, and the best way you can fight your way to the top is to get folks with a higher position than you to give you recommendations for what amounts to a 'promotion packet' (conversations suggested that 3-5 was the minimum for promotion, but more is better). I didn't put those numbers together early enough (I didn't know the game), and by the time I did, whatever career I might have had at Google had been sandbagged by not doing the right work for the right people, while also stepping on toes trying to create something new / better.

Once I realized that my time at YouTube / Google was coming to a close, I started looking at what was causing the most issues with Groups stability. And when I ran the numbers, nearly 99% of all issues with Groups were caused by periodic updates and incompatibility with the Channels carousel / player widget and backend. I'm not going to get into details, but keeping that API stable was tough, breaking some part of the YouTube world roughly 1 out of every 3 deploys.

Several times over the 9 months after migration to what we called "New Groups", I needed to write fixes during Wednesday deploys, which should have been prevented / nearly impossible, except for the following series of events that seemed to occur with startling regularity...

1. Code that breaks the channel carousel would be committed just prior to code freeze on a Friday/Monday (maybe or maybe not breaking Groups).
2. Monday/Tuesday we find out that the carousel is broken.
2a. Because it was very difficult to get an LGTM from me during code reviews, the breaker commits new fixes *after* I leave for the day, with someone else's approval, on Monday/Tuesday at 5-6PM.
3. Newly broken Groups would then go out in the Wednesday deploy, late morning / midday, even when flags were raised, because "your shit should have been fixed".
4. This inevitably causes production errors, as new Groups is 10x-100x bigger than old Groups almost overnight.
5. Write fixes, hopefully to get re-deployed that day, in the Wednesday (to sometimes Thursday afternoon) "oh shit" deploy shuffle / kerfuffle.
6. In the worst case, the Groups carousel was broken for a week (usually only 2 days, thanks to other hotfixes going out on Friday).

Oops.

For the future!


As part of a going-away birthday present to myself (I left Google in March 2010, and my birthday is in March), and as a way to future-proof against breakage, I wrote a new player widget specifically for Groups that bypassed the carousel. Instead of calling the shared API endpoint, I called my own endpoint, which did only the 4 things I needed it to do. Okay, exception solved, but what about the carousel itself? Can't rely on the templates either...

So I wrote a new player; inside a conversation thread, you could watch a big video at the top while interacting with the comment stream below, or with the video in a thumbnail in the upper-right. On all pages, a scrollable and paginated list of top / recent videos for that thread / group was available for browsing and viewing. As you scrolled down and interacted with posts and topics, you could watch videos, cue up videos to play, vote on videos, start a video without leaving the page, and more. This worked great on every browser, including IE6, as I was mostly returning to the original viewer interface I'd written before YouTube deprecated IE6 [1] and before I was "encouraged" to use the carousel for a consistent experience. Ultimately it was very similar to the carousel (in terms of total functionality), but with a completely different backend and widget, so any later changes to the channel carousel wouldn't affect it (hard to break a code path you don't interact with, at least not without trying).

The new player / interface wasn't supposed to turn on until my birthday, a few weeks after I had left. Sadly, the world doesn't quite work the way we intend. On my last day, I was still committing code for the Groups revamp, and it turns out that resurrecting code while being plied with whiskey shots at a rate of 1.5 / hour is tough. I was up to 11 shots by the time I wrote "gvn commit" and had my exit interview at 5PM, which was probably not all that good for the quality of the code I was committing. Also: I was clearly rambling drunkenly at my exit interviewer.

Thinking that was the end of that, I got a call just one week later, as my former coworker had broken the carousel for Groups. I had let a few folks in on the surprise, so when the carousel was inevitably broken, they called to ask which line they should change to make the new player live. After updating the line, then running a test on their local machine, I got the question... "um, did you forget to commit a template?" Indeed, 11 shots is too many. After giving them the password to my old workstation, they got the template, checked again, committed, and Groups limped along for another 9 months.

That any of this was possible was fortunate: thanks to our development group's hoarding tendencies[2], one of my coworkers had kept my old workstation aside as a "backup workstation" away from IT re-imaging, "just in case".

During the migration I performed from Old Groups to New Groups in August 2009, Groups had been seeing ~1 million weekly video views. By December 2009, that had grown to ~10 million weekly video views, or ~80% growth month-over-month. When I left in March 2010, hundreds of new groups were being created every day, people were posting and voting on content, ... and it was pretty incredible. By the time Groups was turned down in December 2010, there were groups with hundreds of thousands of members and videos posted, along with hundreds of millions of votes for videos and posts. All of that is gone now, the 8195 lines of Python, HTML, JavaScript, CSS, and SQL archived in an old SVN (or now git) commit, a database backup, and whatever is scattered in the Internet Archive. I won't miss banning another Sub4Sub group, but I will miss running into the anime and game fandom music video groups.

I suspect YouTube ultimately turned Groups off because it was unnecessary database load, no one wanted to maintain it, and it was suffering from 9+ months of bit rot. I'd rewritten old groups because it wasn't great at what it should have been, but I suspect that Groups was also a partial victim of its own success. Good enough for users to use it / abuse it, but not good enough for Google to want to dedicate even a single engineer to the task of maintenance. This was tough for me, because ad revenue for Groups could have paid for several full-time engineers by December 2009, and even a team by the time they turned it off.

My updated Groups was a perfect example of someone building up a revenue stream for Google on an existing service, allowing Google to control the media as well as the message boards discussing and disseminating the media. Can you imagine? YouTube was hosting hundreds of individual forums of 200k+ individuals each, with 100k+ posts in each of hundreds of groups, after ~15 months of use. That dwarfs many current "big name" social media sites, and that was in 2010.

Alternatively, because groups self-organized, they also helped filter (bad) content, as bad content would gather in Groups postings for mass-flagging and deletion. Groups are the perfect opportunity for honeypots: groups intended to gather bad content and users to feed into the automated filters for blocking, banning, and/or legal action. Some people like to think that algorithms are good at filtering data, and I agree. But people are really good at gathering around their interests and collecting videos related to them.

But Groups is gone, my birthday present relegated to the corners of the Internet Archive.

Looking back at a different path


As I get older, I wonder if there was something else that I could / should have done, or could have known earlier, that would have provided me with success at Google. About the only thing I can come up with is that just a few months prior to my leaving, my manager left to join early SpaceX. Now I'm not saying I should have followed him to SpaceX, but maybe I should have tried to step up into his position at Google.

At the time, I had quite a few ideas about engineering, and about applying what I knew to solving problems for startups. I thought that if I could find a startup with the right mix of business model, technology, people, and funding, I could beat the 90-95% 2-year failure rate for startups (I did, 4 times in a row, FWIW). Ultimately, I ended up leaving and starting down a path of startups and entrepreneurship.

I sometimes wonder if trying to become a manager at Google for YouTube would have been a good path, or if I had so many ideas about engineering and software design that I wouldn't have been able to give it up completely. Hard to say, considering that I still write software today, some 13 years later.


[1] This is a great story from one of the SBO team's efforts to kill IE6, which talks about what OldTuber is and what it meant: http://blog.chriszacharias.com/a-conspiracy-to-kill-ie6

[2] My particular tendencies were extreme; I had a Mac desktop workstation (originally allocated), a Linux desktop workstation (acquired from a leaving coworker), a Windows laptop workstation (swapped for my original PowerBook, used as a travel / SSH + Synergy host), and 2x 30" monitors. The standard allocation was 1 desktop, 1 laptop, and your choice of 2x 24" monitors or 1x 30", but I'd acquired or traded up enough to have an extra Linux desktop, an extra 30" monitor, and, when my cube-mate wasn't around (3-4 days a week, due to travel and WFH), a 3rd 30" monitor. I had more monitor area than our site director and original gnuplot author Thomas Williams by at least 50%. It was glorious.