Thursday, January 26, 2023

Getting an early birthday present from YouTube

Back in 2008-2010, I was working as one of the software engineers tasked with developing features and functionality on YouTube for Google. I had joined as the Santa Monica team was growing, right before a summer-long hiring freeze, about 3 months after officially getting my doctorate in 2008. Among other things, I worked on auto-redirection based on demographics (age-gating alcohol-related Youtube pages, but it was nearly Turing-Complete; using user profiles as nodes, and cookies as the tape), home-page customization, user-page customization (whee, CSS sanitizers!), but the biggest thing I did at YouTube was rewrite the built-in social Groups platform.

Humble beginnings


I can't recall the original author of the original YouTube Groups platform, but the software had several bothersome limitations and design constraints. The first part was that no more than (if I remember correctly) 5000 videos could be posted to the group, and that after a video was posted, there could be up to 1000 unthreaded comments related to that video. To find the conversation, you looked through the reverse-chronological listing by posting date. What happened when you hit 5000? Sorry, you couldn't post any more videos. Oops. What about text posts? Well, text posts only existed as chunks of text like any standard bulletin board at the time, and was mostly just a repeating wall of author, text, horizontal line.

While some of these could be fixed with some minor edits, nearly everyone that I spoke with internally wanted to delete Groups. Not for any reason except that it was pulling less than 1 million total views weekly (this was shortly before YouTube publicly announced 1 billion daily watch page views), and occasionally changes to some Python-generated html widget would cause a maintenance headache as someone needs to figure out what broke Groups. It also didn't help that the design was a bit out of date, owing to zero real updates in the several years since it had been written.

A first step


By the time I started paying attention in late 2008, I had hacked together a mini Twitter / Reddit / Digg / whatever forum thingy for posting daily standup statuses for our group, and it had more interaction functionality than the existing YouTube Groups functionality. I had built it using AppEngine as an, "I should use AppEngine for something so I can learn it" project, and after working up enough "I am embarrassed by YouTube Groups" intellectual inertia, I started considering building a new YouTube Groups in AppEngine.

After designing the database schema for the non-relational AppEngine datastore, there was a request from our Chief Architect Mike Solomon to also provide a MySQL schema as an option. At the time, Groups would be the only YouTube property not using either the sharded or non-sharded MySQL databases that stored everything not pictures or video. While comments would eventually move to one of the BigTable successors in a separate process / migration, Groups ultimately ended up using the non-sharded MySQL database, as we never expected the thing to grow very big. We were wrong. :)

Once the schema was approved, I started hacking on a horribly designed proof-of-concept 20% project. A month or two later, after it was far enough along to understand what I was going for (think threaded LiveJournal / Reddit with video posts and comments you could upvote / downvote), my manager James Phillips convinced our PM to get one of the YouTube designers on it. Fortunately for me, I ended up getting the illustrious Gunthar Hartwig to help integrate the developing YouTube profile / channel styles into YouTube Groups. This integration ended up going pretty far, and by the summer of '09, old groups migration and new groups release, Groups was using nearly all of the same widget building, widget positioning, data packing, and unpacking code as YouTube Channels (I also attempted to get the same code pushed to the homepage for customization there, but the homepage team went with something different, and looking at LinkedIn profiles after the fact, apparently earned a few patents for their efforts).

Pitfalls


No project is complete without hitting some bumps, or in my case, pretty deep pitfalls. Along the way in rebuilding Groups, I made several mistakes.

The first obvious mistake was that I built groups against what we called "mono", which was a single large database that was meant to hold things that didn't make sense to, or couldn't be sharded across our sharded DB instances. Clearly groups could have been sharded on a per-group basis, with the only real costs being a lookup of non-cached group name -> id (we sharded on ids), and a need to hit all shards to get a listing of groups.

This was compounded with a particular design choice / mistake made where I both avoided joins, and tried to satisfy the 4 most common queries with covering indexes. Which queries?
  • Show me the top k videos for this group
  • Show me the most recent k videos for this group
  • Show me the top k videos for this topic
  • Show me the most recent k videos for this topic
Four queries, four covering indexes. I didn't worry about people posting, that wasn't a big deal; just a main table + 4 index writes (all of whom were spam-checked, throttled, etc.). But the problem was deletion. Because there were no joins, to remove a thread (and its associated videos) from the 4 indexes, all video posts needed to be removed from the index. This was (relatively) rare, but occurred often enough and for long enough threads that I had added 'order by ... limit 1000' to keep the database from stuttering, with a secondary batch process that ran on a daily basis to clean up any threads longer than 1000 posts. Had I sharded the data by group in the first place, used a join to store the thread's delete bit in 1 place (indexed), and considered just using caches instead of covering indexes, I could have avoided most of the database load that I had induced. To the point: occasionally, those delete thread operations would be reported in our weekly database query reports as one of the top-100 latency queries. Funny how just a little join on the queries and some caching could have solved it.

Foolishly, I also wrote my own javascript Ajax mini-library separate from what most of the rest of YouTube was using. I wrote it because I couldn't find the documentation for any of the internal libraries that YouTube had been using for Javascript, and literally everyone else at my site just magically seemed to know how to use it. This included the recently graduated undergrad who started the same day I did! I didn't find out until a bit later, but the recent grad had done an internship with YouTube the summer before, so already knew the tools, people, process, ... I probably could have saved myself a lot of time if I'd asked, "where are the docs?" Alternatively, maybe I did and I don't remember, as there were several other projects I was working on at the time, and I was overworked.

If you want me to be honest, internally much of the code I wrote for Groups was crap. Some of the HTML templates weren't too bad, but the inline onclick bindings nowadays might get me fired in some shops, even if the css selectors didn't. There was also an if / elif block in the Python-side API that dispatched 95% of the interactive elements processing based on a variety of context (upvotes, downvotes, posting, replies, content pagination, ...), and it got worse to maintain with every addition.

My only real excuse for any of this is that working on YouTube was the first web application I touched at any level, and I got to work on: subscription email libraries, memcache libraries, sharded + non-sharded sql schemas and indexes, homepage html, css, javascript, user profile pages, all of the (groups) templates, all of the (groups) css, all of the (groups) javascript, all of the (groups) Python, (groups) data migration, daily (groups) batch processes, (groups) API design, ensuring proper layout for RTL languages for user channels and groups, ... along with several other side projects and helping others as necessary.

Trouble in paradise


Having just come out of a much smaller organization before Google, I had been used to being recognized and promoted for my abilities and achievements. That doesn't really / actually happen at Google. It's a bit of a fight, and the best way you can fight your way to the top is to get folks with a higher position than you to give you a recommendation for what amounts to a 'promotion packet' (conversations suggested that 3-5 was the minimum for promotion, but more is better). I didn't put those numbers together early enough (I didn't know the game), and by the time I did, whatever career I might have had at Google had been sand-bagged by not doing the right work for the right people, while also stepping on toes trying to create something new / better.

Once I realized that my time at YouTube / Google was coming to a close, I started looking at what was causing the most issues with Groups stability. And when I ran the numbers, nearly 99% of all issues with Groups were caused by periodic updates and incompatibility with the Channels carousel / player widget and backend. I'm not going to get into details, but keeping that API stable was tough, breaking some part of the YouTube world roughly 1 out of every 3 deploys.

Several times over the 9 months after migration to what we called "New Groups", I needed to write fixes during Wednesday deploys, which should have been prevented / nearly impossible, except for the following series of events that seemed to occur with startling regularity...

1. Code that breaks channel carousel would be committed just prior to code freeze on a Friday/Monday (maybe or maybe not breaking Groups).
2. Monday/Tuesday we find out that the carousel is broken.
2a. Because I was very difficult to get LGTM from me during code reviews, the breaker commits new fixes *after* I leave for the day with someone else's approval on Monday/Tuesday at 5-6PM.
3. Newly broken groups would then go out on deploy Wednesday late morning / midday, even when flags were raised because, "your shit should have been fixed".
4. This inevitably causes production errors, as new groups is 10x-100x bigger than old groups almost overnight.
5. Write fixes, hopefully to get re-deployed that day, in the Wednesday (to sometimes Thursday afternoon) "oh shit" deploy shuffle / kerfluffle.
6. In the worst-case, groups carousel was broken for a week (usually only 2 days, thanks to other hotfixes going out on Friday).

Oops.

For the future!


As part of a going-away birthday present to myself (I left Google in March 2010, my birthday is in March), and as a way to future-proof against breakage, I wrote a new player widget specifically for Groups that bypassed the carousel. Instead of calling the shared API endpoint, I called my own endpoint, that only did the 4 things I needed it to do. Okay, exception solved, but what about the carousel itself? Can't rely on the templates either...

So I wrote a new player; inside a conversation thread, you could watch a big video at the top, while interacting with the comment stream below, or with the video in a thumbnail in the upper-right. On all pages, a scrollable and paginated list of top / recent videos for that thread / group was available for browsing and viewing. As you scrolled down and interacted with posts and topics, you could watch videos, cue up videos to play, vote on videos, start a video without leaving the page, and more. This worked great on every browser - including IE6, as I was mostly returning to the original viewer interface I'd written before YouTube deprecated IE6 [1] and I was "encouraged" to use the carousel for a consistent experience. Ultimately it was very similar to the carousel (in terms of total functionality), but with a completely different backend and widget, so any later changes to the channel carousel wouldn't affect it (hard to break a code path you don't interact with, at least not without trying).

The new player / interface wasn't supposed to turn on until my birthday, a few weeks after I had left. Sadly, the world doesn't quite work the way we intend. The last day I was working, I was still committing code for the groups revamp, and it turns out that resurrecting code while being plied with whiskey shots at a rate of 1.5 / hour is tough. I got up to 11 shots by the time I wrote "gvn commit" and had my exit interview at 5PM, which was probably not all that good for the quality of the code I was committing. Also; clearly I was drunken rambling to my exit interviewer.

Thinking that was the end of that, I got another call just one week later, as my former coworker broke the carousel for Groups. I had let a few folks in on the surprise, so when the carousel was inevitably broken, they called to ask which line they should change to make it live. After updating the line, then running a test on their local machine, I get the question... "um, did you forget to commit a template?" Indeed, 11 shots is too many. After giving them the password to my old workstation, they got the template, checked again, committed, and groups limped along for another 9 months.

That any of this was possible was fortunate, as pursuant to our development group's hoarding tendencies[2], one of my coworkers had kept my old workstation aside as a "backup workstation" away from IT re-imaging, "just in case".

During the migration I performed for Old Groups to New Groups in August 2009, Groups had been seeing ~1 million weekly video views. By December 2009, that had grown to ~10 million weekly video views, or ~80% growth month-over-month. When I left in March 2010, hundreds of new groups were being created every day, people were posting and voting on content, ... and it was pretty incredible. By the time Groups was turned down in December 2010, there were groups with hundreds of thousands of members and videos posted, along with hundreds of millions of votes for videos and posts. All of that is gone now, the 8195 lines of Python, HTML, Javascript, CSS, and SQL archived in an old SVN (or now git) commit, a database backup, and whatever is scattered in the Internet Archive. I won't miss banning another Sub4Sub group, but I will miss running into the anime and game fandom music video groups.

I suspect YouTube ultimately turned Groups off because it was unnecessary database load, no one wanted to maintain it, and it was suffering from 9+ months of bit rot. I'd rewritten old groups because it wasn't great at what it should have been, but I suspect that Groups was also a partial victim of its own success. Good enough for users to use it / abuse it, but not good enough for Google to want to dedicate even a single engineer to the task of maintenance. This was tough for me, because ad revenue for Groups could have paid for several full-time engineers by December 2009, and even a team by the time they turned it off.

My updated groups was a perfect example of someone building up a revenue stream for Google on an existing service, allowing for Google to control the media, as well as the message boards discussing and disseminating the media. Can you imagine YouTube was hosting hundreds of individual forums of 200k+ individuals each, with 100k+ posts in each of hundreds of groups, after ~ 15 months of use? That dwarfs many current "big name" social media sites, and that was in 2010.

Alternatively, because groups self-organized, it also helped filter (bad) content, as bad content would gather in Groups postings for mass-flagging and deletion. Groups are the perfect opportunity for honeypots; groups intended to gather bad content and users to feed into the automated filters for blocking, banning, and/or legal action. Some people like to think that algorithms are good at filtering data, and I agree. But people are really good at gathering around their interests and collecting videos related to it.

But Groups is gone, my birthday present relegated to the corners of the Internet Archive.

Looking back at a different path


As I get older, I wonder if there was something else that I could / should have done, or could have known earlier that would have provided me with success at Google. About the only thing I can come up with was that just a few months prior to my leaving, my manager left to join early SpaceX. Now I'm not saying I should have followed him to SpaceX, but maybe I should have tried to step up into his position at Google.

At the time, I had quite a few ideas about engineering, and about applying what I knew to solving problems for startups. I thought that if I could find a startup with the right mix of business model, technology, people, and funding, that I could beat the 90-95% 2-year failure rate for startups (I did 4 times in a row, FWIW). Ultimately, I ended up leaving and starting down a path of startups and entrepreneurship.

I sometimes wonder if trying to become a manager at Google for YouTube would have been a good path, or if I had so many ideas about engineering and software design that I wouldn't have been able to give it up completely. Hard to say, considering that I still write software today, some 13 years later.


[1] This is a great story from one of the SBO team's efforts to kill IE6, which talks about what OldTuber is and what it meant: http://blog.chriszacharias.com/a-conspiracy-to-kill-ie6

[2] My particular tendencies were extreme; I had a Mac desktop workstation (originally allocated), Linux desktop workstation (acquired from leaving coworker), Windows laptop workstation (swapped with my original Powerbook, used as travel / SSH + Synergy host), and 2x 30" monitors. Normally 1 desktop, 1 laptop, and your choice of 2x 24" monitors or 1x 30" was standard, but I'd acquired or traded up enough to have an extra linux desktop, an extra 30" monitor, and when my cube-mate wasn't around (3-4 days a week; due to travel and WFH), a 3rd 30" monitor. I had more monitor area than our site director and original GnuPlot author Thomas Williams by at least 50%. It was glorious.

No comments:

Post a Comment