Wednesday, May 29, 2019

Re: Leaky Python Batteries

After reading about Amber Brown's talk on Python and its included "batteries" (the standard library), I find a lot to agree with, and only a little to disagree with. I know I'm late.

Here's my take...

For background: I'm the reason that asyncore/asynchat/... (callback-defined async sockets in Python before asyncio) were included and updated in Python 2.6, and lived to see Python 3.0 (other folks took on maintenance after 2.6.?/3.0.?). I stepped on a lot of toes to make that happen, but I was still using asyncore/asynchat, and not having them in the standard library would have been bad for me and folks like me. So I "stepped up".

My experience is that moving anything in Python in any direction is hard and frustrating, no matter how "sane" you think your words and opinions are*. I remember being confused that I was even having to argue for static byte literals in Python 3.0, never mind pushing for a syntax like b'hello' instead of bytes([104, 101, 108, 108, 111]) [1], or even that an immutable variant should be available (instead of using a tuple for hash keys).

Thinking about it well after the fact, I have the sense it was because I didn't feel like I had any real level of ownership or control. I felt responsible as a member of the community to voice my opinion about the software I was using, to try to contribute. But I also didn't feel as though I had any real ownership. Like, somehow there was a path through the Python community that led to being able to fix a 2-line bug without 25 people signing off on the change. I saw people walking the path, but I couldn't really find it or manage to walk it myself.

In Contrast

But compare that experience to your own project, or even to working in almost any tech-based organization. I was fortunate: in the 21 months I put in at YouTube / Google from 2008-2010, I was made an "oldtuber", and later "owner" of several files early on, as I hit readability with Python quickly and showed that I could actually fix problems with the platform. Within 3 months of joining in 2008, I could write code, ensure it passed the automated tests, commit without *any* code reviews in a half dozen files, and have code running on the site within a week of writing it (pushes for hotfixes could happen randomly, sometimes 3-4 times a week back then).

Compare that with any library in Python? I can't even remember how many folks needed to sign off on my asyncore / asynchat changes, but it was far more than the 0/1/2 I needed to commit at YouTube. Python has more folks weighing in now than ever before, and I bet that's not making contributing easier.

I don't know how much this feeling pervades other current or past Python contributors, nor how much it is a barrier to creating new contributors. But it was a big pain point for me.

I don't know if Amber's frustration stems from a similar comparison of her experience in Twisted vs. Python core: in Twisted she found it easy to make an impact and help the project, but Python core seems uninterested in, or unwilling to accept, similar contributions. I don't know (a lot of things).

Regardless, I don't know if splitting the Python standard library out makes sense. On the one hand, pulling out the standard library would give some people more control and nearly absolute ownership. But that will mostly just create inconsistencies in code quality, maintenance, and interoperability - like we've all seen in our own dependency hells elsewhere. All the while making Python a language without batteries.

Address the Leaky Batteries

Personally, even if the batteries are leaking, I'd rather have them included. Python compilation issues aside (you can disable building extension modules like tkinter, FWIW), that's as much a "work on Python's Makefiles" problem as it is anything else. Which is a much more approachable and appropriate task than getting the organization to create dozens of sub-projects.

I do agree with Amber that asyncio's inclusion in Python makes things hard for the community, especially given how long Twisted has been around doing much of the same stuff. Heck, I even suggested that if Python were to deprecate and remove modules from the standard library because their functionality was duplicated in Twisted, then the answer was to dump those modules and include Twisted [2] (even though I've actually never used Twisted, and have more recently used asyncore/asynchat, asyncio, pyuv, uvloop, and shrapnel since I read the Twisted source). But it's going to be hard to put that genie back in the bottle; the best-case scenario is that no more "big" packages are included that quickly.

Ultimately, I believe the answer might just be better organization, better management of maintainers, better expectation management for prospective contributors, and better management of contributions. Giving people more ownership of the modules / packages they maintain, while also allowing the community to override a bad maintainer, might help things move forward in each individual module. If everything splits, many organization-level efficiencies are lost, including the simple benefit that being part of a larger project gets you more attention at all levels.

Personally, I'd suggest sticking together and working on trying to manage the project itself better. For some subjective definition of "better".

A Mistake 20 Years in the Making

That said, until visiting [3] while writing this entry, I didn't even know there was a formal process of being involved in the Python Software Foundation and moving through the organization. I know I spent 10-20 hours a week trying to help back when I was active on the mailing lists. No wonder I failed to find a path to success in Python core for the 20 years I've used Python; I didn't know you literally needed to be a part of the club. Hah.

C'est la vie.
[1] -
[2] -
[3] -

Wednesday, January 23, 2019

My last 6 months - making StructD worth the time

Those of you who have been paying attention to this blog might be asking: where is Josiah? Well, I've been busy. Not only did I get to go on a little (much needed) vacation, but I've also been making StructD (my fork of Redis) better. How? Watch inline below, or if that doesn't work:

The short version for those who didn't / can't watch is:
  • Worst-case memory use during background snapshots is reduced by 99.8%+
    • Old: a 100 Gig Redis could use 200 Gigs of memory during a snapshot, i.e. +100 Gigs
    • New: a 100 Gig instance could use 100.2 Gigs worst-case during a snapshot, really +198 Megs, a 99.8% reduction
  • Snapshots load 20-75% faster
  • Snapshots created 25-65% faster
  • Cold replicas start 50-70% faster 
Other things not covered in the video:
  • More Lua environment hardening
  • Better Lua tracebacks
  • Temporarily removed the Lua debugger (transactions mean we can do this another way)
  • Improved Lua transaction performance
  • Lua Aliases:
    • SCRIPT LOAD <script> [<alias name> [NX|REPLACE]]
    • SCRIPT ALIAS <sha1> [<alias name> [NX|REPLACE]]
    • SCRIPT UNALIAS <alias name> ...
    • EVALSHA <alias name> <keycount> <keys ...> <argv ...>
    • EVAL "return CALL.alias_name({key1, ...}, {argv1, ...})" ...
  • Lua and other operations are better behaved with respect to one another (more things can happen during Lua scripts)
  • crc16 - 6x faster than Redis, 33% faster than Mark Adler / Matt Stancliff
  • crc64 - 6-8x faster than Redis, 33-66% faster than Mark Adler / Matt Stancliff
  • TXN.MULTI to go with TXN.EVAL / TXN.ESHA, for rolling back data changes on command execution error
Combined with our earlier SSL/TLS additions, Lua transactions, etc., we've got faster and lower-memory snapshot operations, more convenience in your Lua scripts, and performance comparable to or better than open source Redis with just about any SSL/TLS unwrapping daemon.
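To make the alias workflow above concrete, here is a sketch of how the listed commands might be driven from redis-py's generic execute_command interface. The command syntax is copied from the list above; it is StructD-specific (not open source Redis), and the helper names here are my own invention, so treat this as illustrative only:

```python
# Build argument tuples for the StructD alias commands listed above.
# Pass each tuple to a redis-py connection as conn.execute_command(*cmd).

def script_load_with_alias(script, alias, mode="NX"):
    # SCRIPT LOAD <script> [<alias name> [NX|REPLACE]]
    return ("SCRIPT", "LOAD", script, alias, mode)

def evalsha_by_alias(alias, keys, argv):
    # EVALSHA <alias name> <keycount> <keys ...> <argv ...>
    return ("EVALSHA", alias, len(keys)) + tuple(keys) + tuple(argv)

load_cmd = script_load_with_alias("return redis.call('GET', KEYS[1])", "getter")
run_cmd = evalsha_by_alias("getter", ["some-key"], [])
```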

I've also been shopping around talks surrounding the engineering choices leading to the performance improvements and memory savings currently enjoyed by StructD. From these choices, there are clear incremental steps towards a further 50-80% reduction in snapshot creation / load time, plus a general 50-80% reduction in memory use overall. I will get there, just not tomorrow.

As announced several months ago, I've been working towards starting a hosting company. Still moving in that direction, but until that is quite finished, I've decided the best way to get there in the interim is to open up shop for both unpaid public, and paid private Redis and StructD support.

So, if you'd like to get Redis or StructD support from me, download StructD, get StructD hosting in the future, or just feel like emailing me on a mailing list again, please feel free to drop by to learn more.

     StructD 5.0.0 (8d6715c9/1) 64 bit   Mode: standalone   _______
 _______          Build: 20190104.0019   Port: 6380        |   __  \ _
/  _____)     _    PID: 2670   _     |  |  |  | \_
| /\____\) __| |\_   _ _____   _     _    ______  __| |\_  |  |  |  |:| \_
| \/___   (__   __) | '_____) | |\  | |\ /  ____)(__   __) |  |  |  | |:| \
\_____ \-_ \_| |\_\)| / \___\)| | | | | || /\___\)\_| |\_\)|  |__|  | | |:|
 \____\ | \  | | |  | | /     | | | | | || |/       | | |  |_______/ \| | |
 _____/ | |  | | |  | | |     | \_|_/ | || \|___    | | |   \________/ \| |
(______/ \|  |_| |  |_| |     \_____,_/ |\______)   |_| |     \________/ \|
 \_____\_/    \_\|   \_\|      \_____,\/  \_____\)   \_\|       \________/

Tuesday, May 1, 2018

Forking Redis Followup

This past Thursday, I announced the forking of Redis for SSL + Transactions + performance improvements, as well as the release of an updated redis-benchmark client to benchmark your Redis + SSL system. You can read all of it here.

What I didn't know is that 7 days earlier an engineer at AWS named Madelyn Olson had submitted a PR to provide SSL support to Redis. I missed that PR, and the earlier RCP, in my searches. Later Thursday evening, Salvatore stated that he'd be accepting the PR. Having since read the PR, I like her code better than mine, and she solves some SSL buffering and diskless-replication problems better than the solutions I had in mind. Benchmarking her changes shows 5-10% better performance than mine in the few tests I've run, and reading the code tells me that she recognized better integration points and built a better SSL/TLS integration with fewer memory copies.

So I'm definitely going to be using Madelyn's SSL changes instead of mine, even if I did spend 9 weeks on my changes. Better code is better code. Kudos Madelyn. :)

Salvatore doesn't seem interested in my benchmark changes for more/better metrics and plotting, based on his most recent email. You can always compile from my sources, and I'll try to make it an easy patch to cherry-pick if you just want that one thing, after Madelyn's changes have been merged.

I've also been asked to rename the project, which is not entirely surprising. That will happen in the next few weeks, with a replacement README that points you to the correct place at the original repository location.

For transactions and faster startup: given how many people read my original release notes (12k+), and how much money was donated / pledged ($0), I'm going to go out on a limb and guess that folks aren't all that interested in donating money, and are instead looking either for something completely free and open source, or for a commercial release to outsource their risk.

As such, instead of spending the next 3-6 weeks finishing and tuning the last of my SSL/TLS changes for that open source release (which no longer seem necessary, given Madelyn's PR), I'm going to spend a bit more time cleaning up my transactions work, get some benchmarks in, provide datasets and timing for my faster startup time work, and do all of the necessary project renames. I don't know when I'll be releasing more than just benchmark changes and/or renames, but I'm working on it 5-6 days a week.

I might do some short videos on the hilarity of someone else beating you to release software, which has happened 3 times in the last 6 months. We will see.

Again, if you're interested in Lua and other transactions with rollbacks, I've got you covered, just let me know. More is in the works for my fork than what I've announced, but I don't like talking unless I've got something done enough to demo. I hope you are solving your problems, and that my fork can help solve more.

Thursday, April 26, 2018

Yes, I'm Forking Redis - SSL/TLS + Transactions

Update2: The followup is here, for later.

If you haven't seen the video announcement on YouTube, and would like to watch/listen to the 5 minute version instead of reading the text version, here you go.


Done? Great! This is the text version of that notice. I am releasing a fork [1] of Redis and tools surrounding Redis to support my needs. This first release is mostly just tools, specifically adding SSL/TLS support to redis-benchmark and hiredis.

In the next couple weeks, I'll also be releasing a version of Redis server (the real fork) that speaks SSL/TLS on standard listening ports, with Cluster gossip, and the MIGRATE call (used both by Redis Cluster and users). I'm releasing the benchmark tools now because they've been baking for a few weeks, and only required a few Makefile cleanups to make me feel good about releasing them.

The only thing really delaying the full Redis Server release is that I haven't gotten around to redis-cli (because I had clients that already spoke Redis + TLS) or redis-sentinel, and both of those are necessary before I can release the full package (yes, I've already added support for SSL/TLS to the unit tests).

What really isn't delaying the release is that due to some algorithm tuning, and finding better implementations of a few core algorithms, I am seeing 3-5x faster startup times on nearly every dataset I've tested. From huge numbers of short keys, to object models and indexes. All are seeing 3-5x faster startup times to load the full dataset into memory. I will be posting benchmarks of this at the full server release.

In addition to SSL/TLS and the above mentioned performance improvements, this fork will also contain support for my resurrected Transactions in Redis Lua scripts, with expanded support for WATCH/MULTI/EXEC/ROLLBACK style transactions (you send your list of keys with WATCH, and if any operation is on any key not in the list, or if you get an error, your changes are all rolled back).
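The described semantics, snapshot the watched keys and roll everything back if the script errors or touches a key outside the watched list, can be modeled in a few lines of plain Python. This is a toy client-side illustration of the behavior, not how the fork implements it (the fork does this inside the server):

```python
# Toy model of WATCH-list transactions with rollback: writes to unwatched
# keys, or any raised error, restore the pre-transaction values.
class TxnStore:
    def __init__(self):
        self.data = {}

    def run(self, watched, script):
        # snapshot the watched keys before running the script
        snapshot = {k: self.data.get(k) for k in watched}
        allowed = set(watched)

        def setter(key, value):
            if key not in allowed:
                raise KeyError("write to unwatched key: " + key)
            self.data[key] = value

        try:
            script(setter)
        except Exception:
            # roll back: restore every watched key to its prior value
            for k, v in snapshot.items():
                if v is None:
                    self.data.pop(k, None)
                else:
                    self.data[k] = v
            return False
        return True

store = TxnStore()
ok = store.run(["a"], lambda set_: set_("a", 1))                   # commits
bad = store.run(["a"], lambda set_: (set_("a", 2), set_("b", 3)))  # rolled back
```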

For those of you who are living in Redis Cluster land, and need transactions on keys that don't live in the same shard/server, I previewed multi-shard Lua script transactions back in February. If you are a commercial user and would like to use any of this in your environment, I offer reasonable support and integration contract rates (and those cluster modifications may require separate licensing).

So, the big news is:

  • SSL/TLS native in Redis (benchmarks below)
    • redis-benchmark SSL/TLS support now
    • hiredis SSL/TLS support now
    • redis-cli soon (not started)
    • redis-sentinel soon (not started)
    • redis-server (server, gossip, migrate, done)
      • unit tests (done)
  • Transactions with rollbacks in Redis (also available in Redis Cluster)
    • Lua scripts rollback on error
    • Rollback on explicit rollback
  • 3-5x faster startup times
    • sample data and more benchmarks later


While not the first feature implemented, encryption over the wire (especially between master/slave replicas, cluster gossip, and during MIGRATE calls) is a basic need. And as a basic need, relying on third party tools for SSL/TLS termination or a transparent VPN solution is a great first step from running without encryption, but it can leave speed on the table. And part of the reason why we use Redis is for speed, right?

Redis itself spends much of its time waiting on network system calls, or to be interrupted to read data from a connection. Quoted benchmarks at conferences since 2015 have claimed that 97% of Redis' time is spent waiting on network-related system calls and interrupts. With 3rd party SSL/TLS termination, that can only get worse. How much worse?

An SSL/TLS terminator needs to read the request (wait, read), then do the decrypt operation, then send the data to Redis (write). Redis does its operations (wait, read, write), then the terminator gets to read from Redis (wait, read), encrypt, then send the response to the client (write). Notice how we went from 3 somewhat basic operations to 9? Those of you running SSL/TLS terminators for your Redis have probably already experienced latency or throughput hits without realizing it. Can we benchmark to see how much native SSL/TLS termination inside Redis buys us? Of course we can.

For this first set of benchmarks, we use 2 computers with the following specifications:

Name: o790
Model: Dell Optiplex 790 workstation
CPU: Core i3-2120 @ 3.30 GHz (2 cores, 2 threads per core, 4 total threads)
RAM: 16GB total (4x 4096 MB DDR3-1333 Synchronous DIMMs)
Media: 1 TB SSD
OS: Ubuntu 14.04
Kernel: 4.4.0-119-generic
SSL/TLS via: OpenSSL 1.1.0h
Compiler: GCC 4.8.4

Name: t7600
Model: Dell Precision T7600 workstation
CPU: 2x Xeon E5-2670 @ 2.60/3.30 GHz (2 CPUs, 8 cores per CPU, 2 threads per core, 32 total threads)
RAM: 128GB total (16x 8192 MB DDR3-1333 Registered DIMMs)
Media: 1 TB SSD
OS: Ubuntu 16.04
Kernel: 4.13.0-38-generic
SSL/TLS via: OpenSSL 1.1.0h
Compiler: GCC 5.4.0

The computers are connected to a Trendnet TEG-S82g 1 gigabit 8 port desktop switch, along with several other devices. Ifconfig reports 1500 byte frames. No attempt at isolating traffic on the network was made. Occasional blips in performance were observed due to logging in and out of the workstations to check progress, which is why each command is run 10 times, with averages and standard deviations in response rates plotted.

We consider the following testing variations:

Client test variations:
  • redis-benchmark no TLS at any layer  (Redis-4.0.9)
  • redis-benchmark with included TLS  (Redis-4.0.9, patches)
  • redis-benchmark with external Stunnel TLS (Redis-4.0.9, Stunnel-5.44)
  • redis-benchmark with external Hitch TLS (Redis-4.0.9, Hitch-1.4.8)
Redis server test variations:
  • redis-server with no TLS at any layer (Redis-4.0.9)
  • redis-server with included TLS (Redis-4.0.9, patches)
  • redis-server with external Stunnel TLS (Redis-4.0.9, Stunnel-5.44)
  • redis-server with external Hitch TLS (Redis-4.0.9, Hitch-1.4.8)
Benchmark command-line test variations:
  • -t SET,GET -n 5000000 -r 4750000
    • 5 million SET/GET operations over a 4.75m entry keyspace
  • -t SET,GET -n 25000000 -r 4750000 -P 5
    • 25 million SET/GET operations over a 4.75m entry keyspace, 5 operations at a time
We run each benchmark in each chosen variation 10 times, with averages and standard deviations plotted (the top of the range limited to the actual throughput seen). Some combinations of endpoints were not tested due to timing as of the publishing of this article [2], and others due to non-obvious runtime errors [3]. Data from each kept run is combined [4] and plotted using Matplotlib.

The x-axis on all plots is "seconds since the start of redis-benchmark". Operations-per-second numbers are the exact number of operations performed in the last 1000 milliseconds, sampled every 250 milliseconds (see the redis-benchmark source code for details). Colors are consistent across plots: red for 'null-null' represents no SSL/TLS termination on either end; blue for 'native-native' represents redis-benchmark with '--ssl' connecting directly to our patched Redis; orange for 'native-stunnel' means redis-benchmark with '--ssl' connecting to Stunnel SSL/TLS termination, feeding into an unpatched Redis 4.0.9 installation. Other lines have similar testing/configuration implications in their naming.
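For reference, the sampling scheme just described (the exact count of operations completed in the trailing 1000 milliseconds, sampled every 250 milliseconds) can be reproduced from raw per-operation completion timestamps. A sketch, not redis-benchmark's actual code:

```python
import bisect

def trailing_ops_per_sec(completion_times, sample_every=0.25, window=1.0):
    # completion_times: sorted seconds-since-start of each finished operation.
    # Returns (sample_time, ops completed in the preceding window) pairs.
    samples = []
    if not completion_times:
        return samples
    t = sample_every
    while t <= completion_times[-1]:
        lo = bisect.bisect_left(completion_times, t - window)
        hi = bisect.bisect_right(completion_times, t)
        samples.append((t, hi - lo))
        t += sample_every
    return samples

# 2000 evenly spaced ops over 2 seconds: ~1000 ops land in any 1-second window
times = [i / 1000.0 for i in range(1, 2001)]
samples = trailing_ops_per_sec(times)
```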


On our lower-powered i3 (2 cores, 4 threads), we are heavily CPU bound, so the least-overhead SSL/TLS termination option (native-native in blue; more information later) is fastest among our encrypted options.

Oddly enough, Stunnel into native here doesn't work well at all, which suggests some mismatch between buffer sizes, request sizes, and/or timing of network operations, as none of the other configurations suffer as badly. We do have latency profiles of all benchmarks performed, so we can look at those in the future for direction on better tuning if this is a common platform in the wild.

That said, once we are less constrained by CPU, it turns out that Stunnel on both ends is fastest here on a total operations/second performance metric. But at what cost? Here we restart the daemons between each run, then get the total CPU time used by each of Redis, Hitch, and Stunnel in each of our 4 variations of recipient on the t7600.

For the t7600 native client into a Hitch-terminated Redis, I saw Hitch use 1h 11m 30s of CPU time, with Redis using 45m 11s. Going into Stunnel-terminated Redis, I saw Stunnel use 4h 12m 44s, Redis using 49m 52s. When we use our native SSL/TLS patches for the same set of tests, Redis uses 49m 53s of CPU. Disabling all SSL/TLS, gets us 41m 11s used by Redis. So relatively speaking, we've got ~1h 57m Redis+Hitch vs. ~5h 3m Redis+Stunnel vs. 50m Redis+native vs 41m for no SSL/TLS termination. Or 185%, 639%, and 22% relative CPU overhead compared to no termination.
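Those relative-overhead numbers are just (total CPU with termination / total CPU without) - 1. Checking the arithmetic from the quoted times (the results land within a couple of points of the rounded 185% / 639% / 22% figures above):

```python
def seconds(h=0, m=0, s=0):
    return h * 3600 + m * 60 + s

baseline = seconds(m=41, s=11)                            # no SSL/TLS at all
hitch = seconds(h=1, m=11, s=30) + seconds(m=45, s=11)    # Hitch + Redis
stunnel = seconds(h=4, m=12, s=44) + seconds(m=49, s=52)  # Stunnel + Redis
native = seconds(m=49, s=53)                              # patched Redis only

def overhead_pct(total):
    # percent CPU overhead relative to running with no termination
    return 100.0 * (total - baseline) / baseline
```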

With our slow machine again as the server, native termination again comes out fastest in individual operations, likely for the performance reasons described before. We checked the CPU time again, this time on the o790 side of things.

For the t7600 native client into a Hitch-terminated Redis, I saw Hitch use 1h 33m 24s of CPU time, with Redis using 32m 11s. Going into Stunnel-terminated Redis, I saw Stunnel use 2h 23m 27s, Redis using 51m 45s. When we use our native SSL/TLS patches for the same set of tests, Redis uses 52m 32s of CPU. Disabling all SSL/TLS, gets us 33m 40s. So relatively speaking, we've got 2h 5m Redis+Hitch vs. 3h 18m Redis+Stunnel vs. 53m Redis+native vs 34m for no SSL/TLS termination. Or 268%, 482%, or 56% CPU overhead for SSL/TLS termination relative to no termination.

Higher overhead (relatively speaking) for Hitch and native isn't surprising here, and is likely due to the higher actual cost of the AES encrypt/decrypt performed during SSL/TLS operations on the o790 vs. the t7600. We can partly verify that by checking the speed of the operations on each machine (we can explicitly verify later with timing around encrypt/decrypt operations inside Redis):

o790 $ openssl speed -evp aes-256-gcm
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-gcm      64949.38k    71785.37k   261766.40k   286591.66k   293898.92k

t7600 $ openssl speed -evp aes-256-gcm
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256-gcm     251445.05k   684496.60k  1000988.42k  1110611.29k  1141981.18k
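From the 8192-byte columns above, the t7600's AES-256-GCM throughput advantage over the o790 works out to roughly 3.9x, which is consistent with encryption being a proportionally heavier burden on the slower machine:

```python
# KB/s for aes-256-gcm at 8192-byte blocks, from the openssl speed runs above
o790_8k = 293898.92
t7600_8k = 1141981.18

ratio = t7600_8k / o790_8k  # roughly 3.9x
```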

And again, native termination has the highest throughput.


So, looking at clear wins here, if you've got an under-powered server, going with native SSL/TLS termination inside Redis is likely to be faster for you, and substantially so, depending on your existing SSL/TLS termination method. If you're looking to save CPU time, native SSL/TLS termination is also a win, as it used less CPU time compared to other methods tested, as expected.

I would have liked to test Nginx as a general SSL/TLS unwrapper, but general unwrapping requires a commercial license that I don't have yet. Perhaps someone with a license can run a couple comparisons on their hardware and report back.

How to help

If you'd like to support this work and continuing work, you can do a few things.



[1] This open source fork will be updated and maintained, likely tied to the most recent stable release for the short term, expanding as needs dictate.
[2] We did not try connecting Stunnel <-> Hitch, nor have we tried cross-machine Hitch to Hitch. We may update the graphs and this note, along with any other applicable notes, as this article and benchmarks are updated.
[3] As of this writing, I was seeing a runtime error for --client configured Hitch servers on the t7600, so could not use the t7600 to start a Hitch tunnel.
[4] Benchmark and analysis tools used are available in our forked repository, where you can see the numeric details of our sample averaging, as well as the raw data we produced in our runs.
[5] Repository:

Sunday, May 15, 2016

rom Indexes and Search

So... some updates on rom.

At the end of January, I posted a teaser picture about "Interleaved Indexes" in rom/Redis via tweet. If you haven't seen the picture, it's here:

Interleaved index?

I ended up building an interleaved index in Redis using a bit of Lua scripting and Python. No ZRANGEBYLEX, surprisingly. What is an interleaved index? It's my name for an indexing mode that offers the same ordering and filtering options as several multi-dimensional indexes stored as tree structures; some KD-trees, Redshift interleaved sort keys, BSP trees, a particular implementation of crit-bit trees, and several others offer the functionality I was after.


The simple reason I was implementing an interleaved index was that some intersection operations on data in Redis seemed potentially less efficient than a proper multi-dimensional index would or could be. Long story short: it worked, but not without issues. I mentioned some of these issues in a series of tweets (1, 2, and 3), semi-abandoned the code in late February, and am now ultimately not releasing it. Why? Because it doesn't actually add anything. It was complex, borrowed about 750 lines of code I wrote 5 1/2 years ago, and ... no.

A better option

There were a few very simple wins that I knew could be made in the query optimizer, including a fix on my side for a bug when calling TYPE from within a Lua script (which returns a table instead of a string). The ultimate result of that work is that some queries on numeric ranges can, in theory, be hundreds or thousands of times faster on large indexes. That is partly due to starting with the correct set/sorted set, but also due to implementing a direct scan of an index instead of an intersect/union followed by deleting entries outside the requested ranges.
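Abstractly, the difference is between filtering a fully materialized result and seeking straight to the range boundaries. A plain-Python illustration over a sorted list standing in for a Redis sorted set (hypothetical data; this is not rom's actual code):

```python
import bisect

scores = list(range(100000))  # sorted "index" of numeric scores

def filter_after_materialize(members, lo, hi):
    # old style: walk everything, delete entries outside the range
    return [m for m in members if lo <= m <= hi]

def direct_scan(members, lo, hi):
    # new style: binary-search the range boundaries, touch only the slice
    i = bisect.bisect_left(members, lo)
    j = bisect.bisect_right(members, hi)
    return members[i:j]
```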

Sometimes being more specific for optimizations is worth it. Definitely is in this case. For one of my use-cases involving search, I'm seeing 10-20x faster queries in practice, and 150x faster in a simple canned test.

I also removed the non-Lua writing-mode code. Sorry for those of you living in pre-2.6 days, but you'll have to upgrade. Hell, even with Lua scripting turned off, the query optimizer still used Lua, so if this worked on Redis 2.4 recently, I'd be surprised.

So that's what's going on right now.

Rebuild your indexes

Yeah. And rebuild your indexes. I'm sorry. Whenever I'm using rom as a cache or index of some kind, I re-cache and re-index daily, so things like this eventually resolve themselves, especially immediately after a library upgrade. It's not a new idea: Google did it with Bigtable for cleanup, and Postgres does auto-vacuum. Call this a manual vacuum via re-indexing.

Once/day or however often, maybe during off hours:

# import all of your models first
# then...
from rom import columns, util
for model in columns.MODELS.values():
    # refresh_indices() is rom's util helper for re-indexing a model in blocks
    util.refresh_indices(model)

That will rebuild all of the indexes on all of your models.

Almost P.S. - Loadable modules in Redis?

Redisconf 2016 happened last week and loadable modules were announced. I think that for people who host their own Redis, it could be awesome. Think of it like an answer to Postgres plugins. Hope I can pay someone to run my loadable modules, if I ever get around to building any :P

Wednesday, July 29, 2015

Transactions in Redis

Over the last few months, I've been thinking about and implementing transactions for Lua scripting in Redis. Not everyone understands why I'm doing this, so let me explain with a bit of history.

MySQL and Postgres

In 1998-2003 if you wanted to start a serious database driven web site/service and didn't have money to pay Microsoft or Oracle for their databases, you picked either MySQL or Postgres. A lot of people picked MySQL because it was faster, and much of that was due to the MyISAM storage engine that traded performance for a lack of transaction capability - speed is speed. Some people went with Postgres because despite its measurably slower performance on the same hardware, you could rely on Postgres to not lose your data (to be fair, the data loss with MySQL was relatively rare, but data loss is never fun).

A lot of time has passed since then; MySQL moved on from MyISAM as the default storage engine to InnoDB (which has been available for a long time now), gained full transaction support in the storage engine, and more. At the same time, Postgres got faster, and added a continually expanding list of features to distinguish itself in the marketplace. And now the choice of whether to use MySQL or Postgres usually boils down to experience and preference, though occasionally business or regulatory needs dictate other choices.

TL;DR: data integrity

In a lot of ways, Redis up to now is a lot like MySQL was back before InnoDB was an option. There is already a reasonable best-effort to ensure data integrity (replication, AOF, etc.), and the introduction of Lua scripting in Redis 2.6 has helped Redis grow up considerably in its capabilities and the overall simplification of writing software that uses Redis.

Comparatively, Lua scripting operates very much like stored procedures in other databases, but script execution itself has a few caveats. The most important caveat for this post is that once a Lua script has written to the database, it will execute until any one of the following occurs:
  1. The script exits naturally after finishing its work, all writes have been applied
  2. The script hits an error and exits in the middle, all writes that were done up to the error have occurred, but no more writes will be done from the script
  3. Redis is shut down without saving via SHUTDOWN NOSAVE
  4. You attach a debugger and "fix" your script to get it to do #1 or #2 (or some other heroic deed that allows you to not lose data)
To anyone who is writing software against a database, I would expect that you agree that only case #1 in that list is desirable. Cases #2, #3, and #4 are situations where you can end up with data corruption (cases #2 and #4) and/or data loss (cases #3 and #4). If you care about your data, you should be doing just about anything possible to prevent data corruption and loss. This is not philosophy, this is doing your job. Unfortunately, current Redis doesn't offer a lot of help here. I want to change that.

Transactions in Lua

I am seeking to eliminate cases #2, #3, and #4 above, replacing the entire list with:
  1. The script exits naturally after finishing its work, all writes have been applied
  2. The script exits with an error, no changes have been made (all writes were rolled back)
No data loss. Either everything is written, or nothing is written. This should be the expectation of any database, and I intend to add it to the expectations that we all have about Redis.

The current pull request is a proof of concept. It does what it says it does, removing the need to lose data as long as you either a) explicitly run your scripts using the transactional variants, or b) force all Lua script calls to have transactional semantics with a configuration option.

There are many ways the current patch can be made substantially better, and I hope for help from Salvatore (the author of Redis) and the rest of the community.

Wednesday, November 26, 2014

Introduction to rate limiting with Redis [Part 2]

This article first appeared on November 3, 2014 over on Binpress at this link. I am reposting it here so my readers can find it easily.

In Introduction to rate limiting with Redis [Part 1], I described some motivations for rate limiting, as well as provided some Python and Lua code for offering basic and intermediate rate limiting functionality. If you haven’t already read it, you should, because I’m going to discuss several points from the article. In this post, I will talk about and address some problems with the previous methods, while also introducing sliding window functionality and variable-cost requests.

Problems with previous methods

The last rate limiting function that we wrote was over_limit_multi_lua(), which used server-side Lua scripting in Redis to do the heavy lifting of actually performing the rate limiting calculations. It is included below with the Python wrapper as a reference.

import json
import time

def over_limit_multi_lua(conn, limits=[(1, 10), (60, 120), (3600, 240)]):
    if not hasattr(conn, 'over_limit_lua'):
        conn.over_limit_lua = conn.register_script(over_limit_multi_lua_)

    # get_identifiers() is defined in part 1
    return conn.over_limit_lua(
        keys=get_identifiers(), args=[json.dumps(limits), time.time()])

over_limit_multi_lua_ = '''
local limits = cjson.decode(ARGV[1])
local now = tonumber(ARGV[2])
for i, limit in ipairs(limits) do
    local duration = limit[1]

    local bucket = ':' .. duration .. ':' .. math.floor(now / duration)
    for j, id in ipairs(KEYS) do
        local key = id .. bucket

        local count ='INCR', key)'EXPIRE', key, duration)
        if tonumber(count) > limit[2] then
            return 1
        end
    end
end
return 0
'''

Hidden inside this code are several problems that can limit its usefulness and correctness when used for its intended purpose. These problems and their solutions are listed below.

Generating keys in the script

One of the first problems you might notice was mentioned by a commenter named Tobias on the previous post: we are constructing keys inside the Lua script. If you’ve read the Redis documentation about Lua scripting, you know that all keys to be used in a script are supposed to be passed in from outside when calling it.

The requirement to pass keys into the script is how Redis attempts to future-proof Lua scripts, as Redis Cluster (currently in beta) distributes keys across multiple servers. When the keys are known in advance, the client can calculate which Redis Cluster server the script should run on, and Redis can detect when the keys live on multiple Cluster servers, in which case the script can’t run properly.

Our first problem is that generating keys inside the script can make the script violate Redis Cluster assumptions, which makes it incompatible with Redis Cluster, and generally makes it incompatible with most key-based sharding techniques for Redis.

To address this issue for Redis Cluster and other client-sharded Redis setups, we must use a method that handles rate limiting with a single key. Unfortunately, this can prevent atomic execution for multiple identifiers for Redis Cluster, but you can either rely on a single identifier (user id OR IP address, instead of both), or stick with non-clustered and non-sharded Redis in those cases.

What we count matters

Looking at our function definition, we can see that our default limits were 10 requests per second, 120 requests per minute, and 240 requests per hour. If you remember from the “Counting correctly” section, in order for our rate limiter to complete successfully, we needed to only increment one counter at a time, and we needed to stop counting if that counter went over the limit.

But if we were to reverse the order that the limits were defined, resulting in us checking our per-hour, then per-minute, then per-second limits (instead of per-second, minute, then hour), we would have our original counting problem all over again. Unfortunately, due to details too involved to explain here, just sorting by bucket size (smallest to largest) doesn’t actually solve the problem, and even the original order could result in requests failing that should have succeeded. Ultimately our problem is that we are counting all requests, both successful and unsuccessful (those that were prevented due to being over the limit).

To address the issue with what we count, we must perform two passes while rate limiting. Our first pass checks to see if the request would succeed (cleaning out old data as necessary), and the second pass increments the counters. In previous rate limiters, we were basically counting requests (successful and unsuccessful). With this new version, we are going to only count successful requests.
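To see why this matters, here is a small Python simulation (not Redis code; the function and counter names are mine, for illustration) of ten requests arriving within the same second against a pair of limits, 2 per second and 5 per hour:

```python
def attempt(counters, limits, count_failures):
    """One request checked against multiple (name, limit) counters."""
    if count_failures:
        # naive: increment every counter first, then check
        over = False
        for name, limit in limits:
            counters[name] = counters.get(name, 0) + 1
            if counters[name] > limit:
                over = True
        return not over
    # two passes: check everything first, increment only on success
    for name, limit in limits:
        if counters.get(name, 0) + 1 > limit:
            return False
    for name, _ in limits:
        counters[name] = counters.get(name, 0) + 1
    return True

limits = [('second', 2), ('hour', 5)]

naive = {}
for _ in range(10):        # 10 attempts within the same second
    attempt(naive, limits, count_failures=True)
# naive['hour'] is now 10: the hourly budget of 5 was eaten by failed requests

careful = {}
for _ in range(10):
    attempt(careful, limits, count_failures=False)
# careful['hour'] is 2: only the 2 successful requests were counted
```

With the naive version, a burst of rejected requests can lock the client out of the larger window entirely; the two-pass version only charges the client for requests that actually went through.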

Stampeding elephants

One of the most consistent behaviors that can be seen among APIs or services that have been built with rate limiting in mind is that usually request counts get reset at the beginning of the rate limiter’s largest (and sometimes only) time slice. In our example, at every hour on the hour, every counter that had been incremented is reset.

One common result for APIs with these types of limits and limit resets is what’s sometimes referred to as the “stampeding elephants” problem. Because every user has their counts reset at the same time, when an API offers access to in-demand data, many requests will occur almost immediately after limits are reset. Similarly, if the user knows that they have outstanding requests that they can make near the end of a time slice, they will make those requests in order to “use up” their request credit that they would otherwise lose.

We partially addressed this issue by introducing multiple bucket sizes for our counters, specifically our per-second and per-minute buckets. But to fully address the issue, we need to implement a sliding-window rate limiter, where the counts for requests that come in at 6:01PM and 6:59PM aren’t reset until roughly an hour later at 7:01PM and 7:59PM, respectively, not at 7:00PM. We’ll go into further detail about sliding windows a little later.

Bonus feature: variable-cost requests

Because we are checking our limits before incrementing our counts, we can actually allow for variable-cost requests. The change to our algorithm will be minor, adding an increment for a variable weight instead of 1.

Sliding Windows

The biggest change is converting our rate limiter from individual buckets into sliding windows. One way of understanding sliding window rate limiting is that each user is given a number of tokens that can be used over a period of time. When you run out of tokens, you don't get to make any more requests. And when a token is used, that token is restored (and can be used again) after the time period has elapsed.

As an example, if you have 240 tokens that can be used in an hour, and you used 20 tokens at 6:05PM, you would only be able to make up to another 220 requests until 7:04PM. At 7:05PM, you would get those 20 tokens back (and if you made any other requests between 6:06PM and 7:05PM, those tokens would be restored later).

With our earlier rate limiting, we basically incremented counters, set an expiration time, and compared our counters to our limits. With sliding window rate limiting, incrementing a counter isn’t enough; we must also keep history about requests that came in so that we can properly restore request tokens.

One way of keeping a history, which is the method that we will use, is to imagine the whole window as being one large bucket with a single count (the window has a ‘duration’), similar to what we had before, with a bunch of smaller buckets inside it, each of which has their own individual counts. As an example, if we have a 1-hour window, we could use smaller buckets of 1 minute, 5 minutes, or even 15 minutes, depending on how precise we wanted to be, and how much memory and time we wanted to dedicate (more smaller buckets = more memory + more cleanup work). We will call the sizes of the smaller buckets their “precision.” You should notice that when duration is the same as precision, we have regular rate limits. You can see a picture of various precision buckets in a 1 hour window below.

As before, we can consider the smaller buckets to be labeled with individual times, say 6:00PM, 6:01PM, 6:02PM, etc. But as the current time becomes 7:00PM, what we want to do is to reset the count on the 6:00PM bucket to 0, adjust the whole window’s count, and re-label the bucket to 7:00PM. We would do the same thing to the 6:01PM bucket at 7:01PM, etc.
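The bucket labeling and trimming can be sketched with the same arithmetic the Lua script later in this post will use; this little helper is for illustration only, not part of the actual script:

```python
import math

def window_blocks(now, duration, precision):
    """Which sub-bucket does `now` fall into, and which bucket labels
    are old enough to have fallen outside the window?"""
    precision = min(precision, duration)      # precision can't exceed duration
    blocks = math.ceil(duration / precision)  # sub-buckets per window
    block_id = math.floor(now / precision)    # label of the current bucket
    trim_before = block_id - blocks + 1       # older labels should be trimmed
    return blocks, block_id, trim_before

# A 1-hour window with 1-minute precision, at now = 7200 seconds:
blocks, block_id, trim_before = window_blocks(7200, 3600, 60)
# blocks == 60, block_id == 120, trim_before == 61:
# buckets labeled 60 and earlier fall entirely outside the current window
```

When a request comes in, every bucket labeled below `trim_before` gets its count removed from the window total and is deleted, which is exactly the "reset and re-label" behavior described above.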

Data representation

We’ve now gotten to the point where we need to start talking about data representation. We didn’t really worry about representation before simply because we were storing a handful of counters per identifier. But now, we are no longer just storing 1 count for a 1 hour time slice, we could store 60 counts for a 1 hour time slice (or more if you wanted more precision), plus a timestamp that represents our oldest mini-bucket label.

For a simpler version of sliding windows, I had previously used a Redis LIST to represent the whole window, with each item in the LIST including both a time label, as well as the count for the smaller buckets. This can work for limited sliding windows, but restricts our flexibility when we want to use multiple rate limits (Redis LISTs have slow random access speeds).

Instead, we will use a Redis HASH as a miniature keyspace, which will store all count information related to rate limits for an identifier in a single HASH. Generally, for a sliding window of a specified duration and precision for an identifier, we will have the HASH stored at the key named by the identifier, with contents of the form:

<duration>:<precision>:o --> <timestamp of oldest entry>
<duration>:<precision>: --> <count of successful requests in this window>
<duration>:<precision>:<ts> --> <count of successful requests in this bucket>

For sliding windows where more than one sub-bucket has had successful requests, there can be multiple <duration>:<precision>:<ts> entries that would each represent one of the smaller buckets. For regular rate limits (not sliding window), the in-Redis schema is the same, though there will be at most one <duration>:<precision>:<ts> key, and duration is equal to precision for regular rate limits (as we mentioned before).

Because of the way we named the keys in our HASH, a single HASH can contain an arbitrary number of rate limits, both regular and windowed, without colliding with one another.
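As a sketch of the naming scheme (this helper is illustrative, written by me for this explanation; it is not part of the actual script):

```python
def field_names(duration, precision, block_id):
    """Build the HASH field names used by the schema above
    for one (duration, precision) rate limit."""
    precision = min(precision, duration)
    count_key = '%s:%s:' % (duration, precision)  # shared field-name prefix
    return {
        'oldest_ts': count_key + 'o',         # <duration>:<precision>:o
        'window': count_key,                  # <duration>:<precision>:
        'bucket': count_key + str(block_id),  # <duration>:<precision>:<ts>
    }

# A 1-hour / 1-minute sliding window, current bucket label 120:
fields = field_names(3600, 60, 120)
# {'oldest_ts': '3600:60:o', 'window': '3600:60:', 'bucket': '3600:60:120'}
```

Because each limit gets its own `<duration>:<precision>:` prefix, a 10/second limit and a 240/hour limit can live side by side in the same HASH without ever colliding.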

Putting it all together

And finally, we are at the fun part; actually putting all of these ideas together. First off, we are going to use a specification for our rate limits to simultaneously support regular and sliding window rate limits, which looks a lot like our old specification.

One limit is: [duration, limit, precision], with precision being optional. If you omit the precision option, you get regular rate limits (same reset semantics as before). If you include the precision option, then you get sliding window rate limits. To pass one or more rate limits to the Lua script, we just wrap the series of individual limits in a list: [[duration 1, limit 1], [duration 2, limit 2, precision 2], ...], then encode it as JSON and pass it to the script.
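Concretely, encoding the default limits we'll use below might look like this (a small sketch; the variable names are mine):

```python
import json

# Mixed regular and sliding-window limits, per the specification above:
# 10/second (regular), 120/minute (regular),
# 240/hour as a sliding window with 1-minute precision.
limits = [[1, 10], [60, 120], [3600, 240, 60]]
encoded = json.dumps(limits)
# encoded == '[[1, 10], [60, 120], [3600, 240, 60]]'
# ... which is passed as ARGV[1]; inside the script,
# cjson.decode(ARGV[1]) turns it back into a Lua table of tables
```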

Inside the script we need to make two passes over our limits and data. Our first pass cleans up old data while checking whether this request would put the user over their limit, the second pass increments all of the bucket counters to represent that the request was allowed.

To explain the implementation details, I will be including blocks of Lua that can be logically considered together, describing generally what each section does after. Our first block of Lua script will include argument decoding, and cleaning up regular rate limits:

local limits = cjson.decode(ARGV[1])
local now = tonumber(ARGV[2])
local weight = tonumber(ARGV[3] or '1')
local longest_duration = limits[1][1] or 0
local saved_keys = {}
-- handle cleanup and limit checks
for i, limit in ipairs(limits) do

    local duration = limit[1]
    longest_duration = math.max(longest_duration, duration)
    local precision = limit[3] or duration
    precision = math.min(precision, duration)
    local blocks = math.ceil(duration / precision)
    local saved = {}
    table.insert(saved_keys, saved)
    saved.block_id = math.floor(now / precision)
    saved.trim_before = saved.block_id - blocks + 1
    saved.count_key = duration .. ':' .. precision .. ':'
    saved.ts_key = saved.count_key .. 'o'
    for j, key in ipairs(KEYS) do

        local old_ts ='HGET', key, saved.ts_key)
        old_ts = old_ts and tonumber(old_ts) or saved.trim_before
        if old_ts > now then
            -- don't write in the past
            return 1
        end

        -- discover what needs to be cleaned up
        local decr = 0
        local dele = {}
        local trim = math.min(saved.trim_before, old_ts + blocks)
        for old_block = old_ts, trim - 1 do
            local bkey = saved.count_key .. old_block
            local bcount ='HGET', key, bkey)
            if bcount then
                decr = decr + tonumber(bcount)
                table.insert(dele, bkey)
            end
        end

        -- handle cleanup
        local cur
        if #dele > 0 then
  'HDEL', key, unpack(dele))
            cur ='HINCRBY', key, saved.count_key, -decr)
        else
            cur ='HGET', key, saved.count_key)
        end

        -- check our limits
        if tonumber(cur or '0') + weight > limit[2] then
            return 1
        end
    end
end
Going section by section through the code visually, where a blank line distinguishes individual sections, we can see 6 sections in the above code:
  1. Argument decoding, and starting the for loop that iterates over all rate limits
  2. Prepare our local variables, prepare and save our hash keys, then start iterating over the provided user identifiers (yes, we still support multiple identifiers for non-clustered cases, but you should only pass one identifier for Redis Cluster)
  3. Make sure that we aren’t writing data in the past
  4. Find those sub-buckets that need to be cleaned up
  5. Handle sub-bucket cleanup and window count updating
  6. Finally check the limit, returning 1 if the limit would have been exceeded
Our second and last block of Lua operates under the precondition that the request should succeed correctly, so we only need to increment a few counters and set a few timestamps:

-- there are enough resources, update the counts
for i, limit in ipairs(limits) do
    local saved = saved_keys[i]

    for j, key in ipairs(KEYS) do
        -- update the current timestamp, count, and bucket count'HSET', key, saved.ts_key, saved.trim_before)'HINCRBY', key, saved.count_key, weight)'HINCRBY', key, saved.count_key .. saved.block_id, weight)
    end
end

-- We calculated the longest-duration limit so we can EXPIRE
-- the whole HASH for quick and easy idle-time cleanup :)
if longest_duration > 0 then
    for _, key in ipairs(KEYS) do'EXPIRE', key, longest_duration)
    end
end

return 0

Going section by section one last time gets us:
  1. Start iterating over the limits and grab our saved hash keys
  2. Set the oldest data timestamp, and update both the window and bucket counts for all identifiers passed
  3. To ensure that our data is automatically cleaned up if requests stop coming in, set an EXPIRE time on the keys where our hash(es) are stored
  4. Return 0, signifying that the user is not over the limit

Optional fix: use Redis time

As part of our process for checking limits, we fetch the current unix timestamp in seconds. We use this timestamp to determine the sliding window start and end times, as well as which sub-bucket to update. If clients are running on servers with reasonably correct clocks (within 1 second of each other at least, within 1 second of the true time optimally), then there isn’t much to worry about. But if your clients are running on servers with drastically different system clocks, or on systems where you can’t necessarily fix the system clock, you need to use a more consistent clock.

While we can’t always be certain that the system clock on our Redis server is necessarily correct (just like we can’t for our other clients), if every client uses the time returned by the TIME command from the same Redis server, then we can be reasonably assured that clients will have fairly consistent behavior, limited to the latency of a Redis round trip with command execution.

As part of our function definition, we will offer the option to use the result of the TIME command instead of system time. This will result in one additional round trip between the client and Redis to fetch the time before passing it to the Lua script.
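For reference, the TIME command replies with two integers, seconds and microseconds, and combining them is simple if you ever want sub-second precision (the values below are samples for illustration, not a live call):

```python
# Sample reply values, e.g. what conn.time() might return as a tuple
seconds, microseconds = 1417000000, 250000
now = seconds + microseconds / 1e6   # combine into a float timestamp
# now == 1417000000.25
```

The wrapper below keeps only the whole-second part, which is all the bucket arithmetic in the script needs.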

Add in our Python wrapper, which handles the optional Redis time and request weight parameters, and we are done:

def over_limit_sliding_window(conn, weight=1, limits=[(1, 10), (60, 120), (3600, 240, 60)], redis_time=False):
    if not hasattr(conn, 'over_limit_sliding_window_lua'):
        conn.over_limit_sliding_window_lua = conn.register_script(over_limit_sliding_window_lua_)

    now = conn.time()[0] if redis_time else time.time()
    return conn.over_limit_sliding_window_lua(
        keys=get_identifiers(), args=[json.dumps(limits), now, weight])

If you would like to see all of the rate limit functions and code in one place, including the over_limit_sliding_window() Lua script with wrapper, you can visit this GitHub gist.

Wrap up and conclusion

Congratulations on getting this far! I know, it was a slog through problems and solutions, followed by a lot of code. Now that you've seen all of it, let me tell you what I hope you've learned.

Obviously, the first thing you should get out of this article is an implementation of sliding window rate limiting in Python, which is trivially ported to other languages -- all you need to do is handle the wrapper. Just be careful when sending timestamps, durations, and precision values to the script, as the EXPIRE call at the end expects all timestamp values to be in seconds, but some languages natively return timestamps as milliseconds instead of seconds.

You should also have learned that performing rate limiting with Redis can range from trivial (see our first example in part 1) to surprisingly complex, depending on the features required, and how technically correct you want your rate limiting to be. It also turns out that the problems that were outlined at the beginning of this article aren’t necessarily deal-breakers for many users, and I have seen many implementations similar to the over_limit_multi_lua() method from part 1 that are perfectly fine for even heavy users*. Really it just means that you have a choice about how you want to rate limit.

And finally, you may also have learned that you can use Redis hashes as miniature keyspaces to collect data together. This can be used for rate limiting as we just did, as well as a DB row work-alike (the hash keys are like named columns, with values the row content), unique (but unsorted) indexes (i.e. email to user id lookup table, id to encoded data lookup table, ...), sharded data holders, and more.

For more from me on Redis and Python, you can check out the rest of my blog at

* When Twitter first released their API, they had a per-hour rate limit that was reset at the beginning of every hour, just like our most basic rate limiter from part 1. The current Twitter API has a per-15 minute rate limit, reset at the beginning of every 15 minute interval (on the hour, then 15, 30, and 45 minutes after the hour) for many of their APIs. (I have no information on whether Twitter may or may not be using Redis for rate limiting, but they have admitted to using Redis in some capacity by virtue of their release of Twemproxy/Nutcracker).