  • A Prelude to the Inevitable Long Post About MIT Mystery Hunt 2023

    The first time I ever wrote for a puzzlehunt was Mystery Hunt 2013. I had joined Manic Sages in 2012 out of the Canada/USA Mathcamp pipeline. They won that year, and I figured, hey, I’ll help write, why not. Writing sounds fun! I helped out on two puzzles. One was good and the other was bad.

    I did about zero work outside of those puzzles, aside from testsolving Infinite Cryptogram. It took me 9 hours over the course of a week. I remember leaving feedback that it seemed long, but I had fun, and figured it would take a “real programmer” a lot less time. I had just started college and wasn’t even 18 yet. Surely the real hunters would do better!

    During MLK weekend, I didn’t fly into HQ (convincing my parents to let me do that didn’t seem possible), but I helped man answer callbacks from my dorm room. As I called back with more “your answer is incorrect” calls, I could tell things were not going as planned.

    People didn’t really realize what social media was in 2013, including me. I remember refreshing the #MysteryHunt hashtag a lot. In those days, the norm was not to avoid social media during a puzzlehunt. There were many live tweets during Hunt, almost never spoilery, but getting grumpier over time. Of course, some people were nice, and most acknowledged that Manic Sages did not intend to create a Hunt that went into Monday, but, well, you get one guess about which tweets I spent the most time reading.

    Suffice it to say my first puzzle writing experience was not a fun one.

    By the time I did my last answer callback (around 30 minutes before wrap-up), I was apologizing to the team that was calling, for decisions that I had little responsibility in making, but which I felt responsible for anyways. The person I called was a bit taken aback, but assured me that their team simply liked to call in guesses with low odds of success, and they were sure they’d feel dumb when the solution came out.

    Ten years later, teammate wrote another Mystery Hunt that went into Monday, with a similarly large number of free answers as MH 2013. (I don’t remember exactly how many MH 2013 offered, but an article from the time suggests at least 24.)

    I don’t think there was any single reason that Mystery Hunt was so hard this year, but there was definitely a systematic underestimation of difficulty and length. Drawing a comparison to the 2018 Mystery Hunt, the original theme document proposed a Museum of easy-ish puzzles like the Emotions rounds, a simultaneous Factory of medium puzzles, and then AI-gimmicked rounds with intricate structures that could go to eleven. This would have been fewer puzzles than last year, with a larger fraction of feeders at the level of The Investigation / The Ministry.

    We ended up with a Museum of medium puzzles, a Factory of medium-to-hard puzzles, and AI-gimmicked rounds that went to thirteen. So, yeah.

    (Well, at least next year’s job of writing a shorter Hunt will be easier!)

    I’m better at managing myself than I was 10 years ago, so I’ll be fine. I restarted writing puzzles three years ago, and since then I’ve written enough stinkers and highlights to know what I can do as a puzzle constructor. My body of work is long enough that any individual puzzle isn’t as big a deal.

    However, there are some first-time constructors on teammate this year, whose Hunt puzzles are their first puzzles for the public. Just, try not to be too mean? I’ve been pleasantly surprised that the response to Hunt so far is not as vitriolic as it was in 2013. Maybe the Internet is nicer these days. It’s a bit more obvious that online discourse has real-life impact and vice versa.

    It’s very important that people submit feedback about Hunt, and explain their lived experience. It’s a key part of how we keep improving Hunt each year. At the same time, avoid making too many assumptions about teammate’s lived experience.

    As for my play-by-play construction story, that’ll come later. I signed on aiming to do around 10 hrs/week of work for Hunt, and ended up averaging 20 hrs/week, including some 100 hour weeks towards the end. I’m pretty sure I’ve spent more time on Hunt this year than I spent in all my past puzzle writing combined. There’ll be a lot to talk about.

  • Generative Modelling is Still Accelerating

    DALL·E 2 was announced in April of this year. As a rule, by the time the paper is public, the model probably existed a month ago, often more.

    It’s been swiftly followed by Midjourney, Stable Diffusion, and Imagen, all developed simultaneously. If you are new to ML, this is pretty common. Although you personally will miss some good ideas, people working on the relevant problems will notice and develop the same good ideas within a few months of each other. My understanding is that diffusion was building up steam, then classifier-free guidance was developed, and that was the key piece that unlocked the improved generations we’re seeing today. I highly recommend Sander Dieleman’s post if you want to learn more of the math.
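    For a rough idea of what classifier-free guidance does at sampling time, here is an illustrative sketch (not code from any of the models above). The model predicts noise twice, once with the text prompt and once without, and the sampler extrapolates from the unconditional prediction toward the conditional one. The function name and the `guidance_scale` parameter are assumptions for illustration, and plain Python lists stand in for the noise tensors.

```python
def classifier_free_guidance(eps_cond, eps_uncond, guidance_scale):
    """Blend conditional and unconditional noise predictions.

    guidance_scale = 0 gives the unconditional prediction, 1 gives the
    conditional one, and values above 1 push further toward the prompt.
    """
    # Elementwise: start at the unconditional prediction and move along
    # the direction (conditional - unconditional), scaled by guidance_scale.
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]
```

    Larger guidance scales generally trade sample diversity for closer adherence to the prompt.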

    In the months since, image generation has gone from a thing some people talked about, to something everyone was talking about. I’ve seen people say “this is a DALL-E ass looking movie poster”, in a bad way. I’ve seen artists preach AI asceticism, meaning “do not use image generation tools, opt out of letting your art get used in these tools if they give you the choice”. I read a post from someone who discussed AI asceticism, and then acknowledged that they could not do it, the image generation was too fun to play with. They were part of a Last Week Tonight episode. Art sites have had to decide whether they would allow computer generated images or not (most have decided “not”, either because they wanted to be a place for human artists, or because people were generating too much content to moderate.) A person won a state fair art contest with Midjourney, in a way almost perfectly designed to be controversial.

    This still feels crazy to me? People have normalized that it is possible to get high quality language-guided image generation really, really quickly. In another world, perhaps it would end there. A lot of wild stuff has been happening. But if I had to estimate where we were on the technology development curve, I’d say we’re about here:

    Image of technology S-curve, with a red dot about 1/3rd of the way up the curve, before the inflection point

    I believe this for two reasons.

    Generative modeling in the past few years was primarily ruled by GANs. The developments in image generation are based not on a better GAN, but on diffusion methods, an entirely different paradigm for viewing ML problems. Anytime you have a new paradigm, you should expect a lot of people to try it on their problem, and then watch some of those people succeed and break through on problems that used to be hard.

    More importantly, diffusion is a very general idea. The current AI news has been powered by images, but nothing about diffusion is image-centric. It’s just a generic method for learning a model to match one probability distribution to another one. The machine learning field is very practiced at turning life into math, and there’s more to life than just images.

    When AlphaGo first beat Lee Sedol, I said that it might be the end of turn-based perfect information games - all of them. Go was the mountaintop, and although AIs wouldn’t exist for other games, no one would doubt that it was possible if someone put a team on solving it.

    Something similar is starting to feel true for problem domains where there is enough human data on the Internet. Before people yell at me: I said starting! I think there are only a few domains where we actually have enough human data at the moment. If pure data quantity were the only factor that mattered, RL agents playing Atari games should have taken over the world by now.

    What matters right now is how much human output you can get for your problem domain. When I read through the paper for Whisper, a speech recognition system, I found this section especially interesting.

    Many transcripts on the internet are not actually human-generated but the output of existing ASR systems. Recent research has shown that training on datasets of mixed human and machine-generated data can significantly impair the performance of translation systems (Ghorbani et al., 2021). In order to avoid learning “transcript-ese”, we developed many heuristics to detect and remove machine-generated transcripts from the training dataset. Many existing ASR systems output only a limited subset of written language which removes or normalizes away aspects that are difficult to predict from only audio signals such as complex punctuation (exclamation points, commas, and question marks), formatting whitespace such as paragraphs, or stylistic aspects such as capitalization. An all-uppercase or all-lowercase transcript is very unlikely to be human generated. While many ASR systems include some level of inverse text normalization, it is often simple or rule-based and still detectable from other unhandled aspects such as never including commas.

    As data-hungry as the Whisper model is, it is still better to exclude certain kinds of data from its training set. It is not enough just to have a massive pool of data; you still need some management to make sure it is of the right form. Kind of like how swimming pools are not just piles of water, they get chlorinated and processed to remove germs.
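    To make the quoted heuristics concrete, here is a minimal sketch of what such a filter might look like. This is not Whisper's actual filtering code. The function name, the 30-word threshold, and the exact combination of rules are assumptions for illustration, built only from the signals the paper excerpt mentions (all-uppercase text, all-lowercase text, never including commas).

```python
def looks_machine_generated(transcript: str, min_words_for_comma_check: int = 30) -> bool:
    """Heuristically flag a transcript that looks like raw ASR output."""
    # Only consider characters that actually have a case, so scripts
    # without upper/lowercase (e.g. Japanese) are not flagged by the case rules.
    cased = [ch for ch in transcript if ch.isupper() or ch.islower()]
    all_upper = bool(cased) and all(ch.isupper() for ch in cased)
    all_lower = bool(cased) and all(ch.islower() for ch in cased)

    # Hypothetical threshold: a long transcript with no commas at all
    # is suspicious, while a short one proves nothing either way.
    long_enough = len(transcript.split()) >= min_words_for_comma_check
    no_commas = long_enough and "," not in transcript

    return all_upper or all_lower or no_commas


# Keep only the transcripts that pass the filter.
candidates = ["HELLO AND WELCOME BACK", "Hello, and welcome back!"]
kept = [t for t in candidates if not looks_machine_generated(t)]
```

    A real pipeline would presumably combine many more signals than this, but the shape is the same: cheap rules that throw away data that looks too much like another model's output.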

    For text, we wrote a bunch of stuff on the Internet, so it was all there. For images, we took a bunch of photos and drew a bunch of art, so it was all there. What else? Well, there’s a bunch of audio files on Bandcamp and Soundcloud and Spotify, I assume people are trying stuff there. There are a gajillion videos on YouTube, and by now I’ve heard 3 different ML research labs talk about “learning from YouTube”. It’s not a secret, it’s just hard.

    Aside from those, I actually don’t know of much else that fits my mental model of “literally billions of people have put this content online for free”. There are lots of effort-heavy datasets for different problem domains (protein folding, theorem proving, GoPro datasets, etc.). These were created with purpose and intention, and that intention limits how big the datasets can be. Some of those will lead to cool things! I don’t think they’ll lead to fundamental floor raising of what we believe ML models are capable of. The problem the authors of Whisper needed to avoid was that you don’t want to learn “transcript-ese” instead of Japanese, and I’m not yet convinced that current models are good enough to cross the transcript-ese barrier and learn from their own outputs. Doing so could be AI-complete.

    Even so, if you assume the improvements stop there, and you just have better image generation, audio generation, and video generation, that still covers…like, a really heavy fraction of the human experience? The submission deadline for ICLR just passed, meaning all the ML researchers are putting their conference submissions on arXiv, and there is some wild, wild stuff branching off from diffusion models as-is.

    There’s an audio diffusion paper, for text-to-audio.

    There is a 3D asset generator based on NeRF + using Imagen as a loss function, letting you bootstrap from 2D text-to-image into 3D data.

    There is a video generator that’s also bootstrapping from a 2D image model to build up to video generation, since this seems to work better than doing video diffusion straight.

    Animation of asteroids

    This may be early, but diffusion looks like it’s going to go down as a ResNet-level idea in how it generally impacts the development of machine learning models. The jump seems pretty discontinuous to me! I know this stuff is hard, and there are caveats to what works and what doesn’t, and you just see the successes, but I still think there isn’t a reasonable way you can say, “ah, yeah, but this is all a dead-end, it’s going to hit a wall soon”. Right now the runway of “make the obvious next improvement” seems at least 2 years long to me, and that’s a lot of time for better models. As a comparison, 2 years ago is about when GPT-3 came out, and language models haven’t stopped yet.

    It’s research level now, but this stuff is going to hit the wider public consciousness in about a year or two, and people are not going to be ready. They just aren’t. Maybe it will be just as easy to normalize as text-to-image was, but I expect a lot of domains to get disrupted, and for every five problems where it’s harder than expected, there’s going to be one where it’s easier.

    If I see anyone complain about having to write a Broader Impacts section for NeurIPS, I’m going to be pretty upset. ML is crossing some key quality thresholds, and writing that section is not that hard! If you do find it hard, I’d take it as a sign that you should have started thinking about it earlier.

  • Seven Years Later

    Sorta Insightful turns seven years old today!

    Writing is not a muscle I have stretched very much recently. This is mostly because I have been busy stretching puzzle construction muscles instead.

    Last year, I mentioned that I had been writing less because I was spending more time writing puzzles for puzzlehunts. This January, the team I was on won MIT Mystery Hunt, the biggest puzzlehunt of the year. One of the rules of MIT Mystery Hunt is that if your team wins, then your team writes it next year, and it’s this tradition that keeps the hunt going. There’s always a team going for the win to get the chance to write a Mystery Hunt.

    It’s also true that after doing so, they usually don’t try to win again anytime soon. See, people don’t quite understand how long it takes to write Mystery Hunt. When I told my parents we won and had a year to write hunt, they said, “oh, a year, there’s no need to rush.” Meanwhile, last year’s constructing team started by saying “Congrats on winning Mystery Hunt! You are already behind.” I deliberately did not sign up for any leadership roles but I’m still spending about 10-20 hrs/week on Hunt stuff.

    When you work in research, they say you carve out time for your hobbies, or else your research will take over everything. But what do you do when your hobbies take over time from your other hobbies? Boy do I not have a great answer to that right now. I don’t expect winning + writing MIT Mystery Hunt to be a regular activity for me, so I’ve been treating this year as a write-off for blogging. There should be more afterwards. At minimum, I’ll write a post about Mystery Hunt.


    I finally got that post about My Little Pony done! Listen, that post was a struggle. I’m pretty happy that it came together, and am at peace with where I am with respect to the fandom. (Following very little, but still following.)

    Also, some papers I worked on came out, most notably PaLM-SayCan. I’ve been thinking more about AI trends recently, probably because I went to EA Global. It’s interesting to see-saw between people who think AI safety doesn’t matter because transformative AI is too far away, and people who think it’s 20% likely to happen in 10 years and the default outcomes will be profoundly bad. My feelings on this are complicated and I’ll try to write more about it at some point, but I would sum them up as, “sympathetic to people with short AI timelines, not sure I’m on board with what they want to do about it.”


    Word Count

    Normally, I include word count of previous posts, but I’m deciding I’m no longer going to track that data. I feel doing so is promoting the wrong impulses. I would rather write concise posts that take a while to edit than longer posts that ramble more than they need to. I try to make every post concise, and in the past I think word count was a reasonably correlated measure of my writing output, because I had similar standards for all posts. But more recently, I’ve given myself more wiggle room on topics, in the name of “done is better than perfect”, so word count no longer matches quite as well.

    Instead, I will just track post count. I wrote 7 posts this year, two fewer than last year.

    View Counts

    These are the view counts from August 18, 2021 to today.

    286 2021-08-18-six-years.markdown  
    261 2021-10-29-invent-everything.markdown  
    342 2021-12-31-why-mlp.markdown  
    414 2022-01-22-mh-2022.markdown  
    400 2022-04-15-do-what-i-mean.markdown  
    363 2022-05-02-r-place.markdown  
    114 2022-07-14-twitter.markdown  

    I’m a bit surprised the ML-related post has fewer views than the Mystery Hunt post. I guess most people just read Twitter threads nowadays.

    Time Spent Writing

    I spent 99 hours, 30 minutes writing for my blog this year, about 40 minutes less than last year.

    Now for context, 74 hours of that were in the 5 months before MIT Mystery Hunt, and 25 hours were in the 7 months after, so it’s pretty clear where the time went.

    Posts in Limbo

    Here’s a review of all the posts I said I might write last year.

    Post about measurement: dead

    I said I’d remove it from the list if I didn’t write it this year. I didn’t write it this year! So it’s gone. I’m guessing shades of what this post would have been will appear in other posts I write later.

    Part of the reason this post never happened is that I wanted to touch on the concept of legibility. To do so, I figured I should read Seeing Like a State directly, since it was the book that popularized the idea, and it would be better to go to the source rather than trust everyone’s 2-sentence summary. Then I never finished Seeing Like a State because it’s a bit long and not written to be an easy read.

    This is why people just read Twitter summaries of papers and skip the paper itself. How else would anybody get work done?

    Post about Gunnerkrigg Court: dead

    Shooooooooot. Look, Gunnerkrigg Court is sick. It is still one of my favorite comics of all time. But, I’m not quite as hype about it as I was when I first archive binged it in 2015, and I’ve forgotten the things I planned to say about it. The right time to write this post was around 2017, but I didn’t, and now I’m not in the right headspace for it.

    Post about My Little Pony: done


    Post about Dominion Online:

    Odds of writing this year: 5%
    Odds of writing eventually: 40%

    I’ve come to realize that I like writing about things that I expect other people not to write about. I like novelty, I like feeling like I am contributing new ideas to the conversation. It is my way around the “my cake sucks” problem.

    Two cakes

    What you’re supposed to do is tell yourself that the audience is not nearly as judgmental as you are. They just want cake! Instead of telling myself that, I try to create super wild novelty cakes that are harder to compare to anything else.

    This is a bad solution and I should get myself to be more of a two cakes person. But! Who else is going to write about Dominion or Dustforce’s r/place struggles? Sometimes a story has to be told. There is room for the occasional novelty cake.

    Post about Dustforce:

    Odds of writing this year: 5%
    Odds of writing eventually: 50%

    Speaking of Dustforce, this is a new one I’m adding to the queue. Dustforce is a hard-as-nails platformer, and has a special place in my heart. Most of my ride-or-die games are from my childhood and nostalgia blinds me to their flaws. Dustforce got its hooks in me in college, in a way few games have. It is both really cool and really dense - I entirely understand why people bounce off this game, but there’s a fan community that has been playing the game and organizing Dustforce events for 10 years. There’s a reason why.

    Post about puzzlehunts:

    Odds of writing this year: 20%
    Odds of writing eventually: 99%

    I mean, yeah, this is happening, one way or another.