DALL·E 2 was announced in April of this year. As a rule, by the time the paper is public, the model probably existed a month ago, often more.
It’s been swiftly followed by Midjourney, Stable Diffusion, and Imagen, all developed simultaneously. If you are new to ML, this is pretty common. Although you personally will miss some good ideas, people working on the relevant problems will notice and develop the same good ideas within a few months of each other. My understanding is that diffusion was building up steam, then classifier-free guidance was developed, and that was the key piece that unlocked the improved generations we’re seeing today. I highly recommend Sander Dieleman’s post if you want to learn more of the math.
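For the curious, the classifier-free guidance trick itself is tiny. A minimal sketch, assuming a noise-prediction diffusion model (the function and variable names here are mine, not from any particular codebase):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    # Classifier-free guidance: extrapolate from the unconditional
    # noise prediction toward the conditional one.
    # guidance_scale = 1.0 recovers the plain conditional model;
    # larger values push samples harder toward the prompt.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy example with made-up noise predictions:
eps_u = np.array([0.1, -0.2])
eps_c = np.array([0.3, 0.0])
print(cfg_combine(eps_u, eps_c, 2.0))
```

That one-line combination, applied at every denoising step with a model trained to handle both the conditional and unconditional cases, is the piece that made prompts actually steer the image.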
In the months since, image generation has gone from a thing some people talked about to something everyone was talking about. I’ve seen people say “this is a DALL-E ass looking movie poster”, in a bad way. I’ve seen artists preach AI asceticism, meaning “do not use image generation tools, and opt out of letting your art get used in these tools if they give you the choice”. I read a post from someone who discussed AI asceticism, then acknowledged that they could not do it - image generation was too fun to play with. It was even part of a Last Week Tonight episode. Art sites have had to decide whether they would allow computer-generated images or not (most have decided “not”, either because they wanted to be a place for human artists, or because people were generating too much content to moderate). A person won a state fair art contest with Midjourney, in a way almost perfectly designed to be controversial.
This still feels crazy to me? People have normalized that it is possible to get high quality language-guided image generation really, really quickly. In another world, perhaps it would end there. A lot of wild stuff has been happening. But if I had to estimate where we were on the technology development curve, I’d say we’re about here:
I believe this for two reasons.
Generative modeling in the past few years was primarily ruled by GANs. The recent developments in image generation are based not on a better GAN, but on diffusion methods, an entirely different paradigm for viewing ML problems. Anytime you have a new paradigm, you should expect a lot of people to try it on their problems, and then watch some of those people succeed and break through on problems that used to be hard.
More importantly, diffusion is a very general idea. The current AI news has been powered by images, but nothing about diffusion is image centric. It’s just a generic method for learning a model to match one probability distribution to another one. The machine learning field is very practiced at turning life into math, and there’s more to life than just images.
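To make that concreteness claim concrete: the core training objective really is modality-agnostic. Here is a sketch of the standard denoising objective, with the schedule and names being my own simplification rather than any specific paper’s code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear noise schedule: alpha_bar[t] is the fraction of signal kept at step t.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def diffusion_training_example(x0, eps_model):
    """One step of the standard denoising objective.

    x0 can be any vector - a flattened image, an audio clip, a 3D
    shape embedding. Nothing below is image-specific.
    """
    t = rng.integers(0, T)
    eps = rng.standard_normal(x0.shape)
    # Forward process: blend clean data with Gaussian noise.
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    # The model's only job: look at the noisy x_t (and t) and predict eps.
    eps_hat = eps_model(x_t, t)
    loss = np.mean((eps - eps_hat) ** 2)
    return loss
```

Swap in a different `x0` and you have a diffusion model for a different domain. That interchangeability is why I expect the idea to spread well past images.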
When AlphaGo first beat Lee Sedol, I said that it might be the end of turn-based perfect information games - all of them. Go was the mountaintop, and although AIs for other games wouldn’t exist right away, no one would doubt it was possible if someone put a team on solving it.
Something similar is starting to feel true for problem domains where there is enough human data on the Internet. Before people yell at me: I said starting! I think there are only a few domains where we actually have enough human data at the moment. If pure data quantity were the only factor that mattered, RL agents playing Atari games would have taken over the world by now.
What matters right now is how much human output you can get for your problem domain. When I read through the paper for Whisper, a speech recognition system, I found this section especially interesting.
Many transcripts on the internet are not actually human-generated but the output of existing ASR systems. Recent research has shown that training on datasets of mixed human and machine-generated data can significantly impair the performance of translation systems (Ghorbani et al., 2021). In order to avoid learning “transcript-ese”, we developed many heuristics to detect and remove machine-generated transcripts from the training dataset. Many existing ASR systems output only a limited subset of written language which removes or normalizes away aspects that are difficult to predict from only audio signals such as complex punctuation (exclamation points, commas, and question marks), formatting whitespace such as paragraphs, or stylistic aspects such as capitalization. An all-uppercase or all-lowercase transcript is very unlikely to be human generated. While many ASR systems include some level of inverse text normalization, it is often simple or rule-based and still detectable from other unhandled aspects such as never including commas.
As data-hungry as the Whisper model is, it is still better to exclude certain kinds of data from its training set. Having a massive pool of data is not enough; you still need some management to make sure it is of the right form. Kind of like how swimming pools are not just piles of water - they get chlorinated and processed to remove germs.
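The quoted passage describes the tells without giving the actual filters, but the flavor of check is easy to imagine. A sketch of what such a heuristic could look like - the specific thresholds here are my own guesses, not OpenAI’s:

```python
def looks_machine_generated(transcript: str) -> bool:
    """Flag transcripts matching the ASR tells described in the Whisper
    paper's quoted passage. Thresholds are illustrative guesses, not
    the actual heuristics used for Whisper's training set."""
    stripped = transcript.strip()
    if not stripped:
        return True
    letters = [c for c in stripped if c.isalpha()]
    if letters:
        # All-uppercase or all-lowercase text is a strong ASR tell.
        if all(c.isupper() for c in letters) or all(c.islower() for c in letters):
            return True
    # Long stretches with no commas suggest rule-based inverse text
    # normalization rather than a human transcriber.
    if len(stripped) > 500 and "," not in stripped:
        return True
    return False
```

None of these checks are individually decisive, which is presumably why the paper says “many heuristics”: you stack enough weak signals until machine output is unlikely to slip through.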
For text, we wrote a bunch of stuff on the Internet, so it was all there. For images, we took a bunch of photos and drew a bunch of art, so it was all there. What else? Well, there’s a bunch of audio files on Bandcamp and Soundcloud and Spotify, so I assume people are trying stuff there. There are a gajillion videos on YouTube, and by now I’ve heard 3 different ML research labs talk about “learning from YouTube”. It’s not a secret, it’s just hard.
Aside from those, I actually don’t know of much else that fits my mental model of “literally billions of people have put this content online for free”. There are lots of effort-heavy datasets for different problem domains (protein folding, theorem proving, GoPro datasets, etc.). These were created with purpose and intention, and that intention limits how big the datasets can be. Some of those will lead to cool things! I just don’t think they’ll fundamentally raise the floor of what we believe ML models are capable of. The problem the authors of Whisper had to avoid was learning “transcript-ese” instead of, say, Japanese, and I’m not yet convinced that current models are good enough to cross the transcript-ese barrier and learn from their own outputs. Doing so could be AI-complete.
Even so, if you assume the improvements stop there, and you just have better image generation, audio generation, and video generation, that still covers…like, a really heavy fraction of the human experience? The submission deadline for ICLR just passed, meaning all the ML researchers are putting their conference submissions on arXiv, and there is some wild, wild stuff branching off from diffusion models as-is.
There’s an audio diffusion paper, for text-to-audio.
There is a 3D asset generator based on NeRF + using Imagen as a loss function, letting you bootstrap from 2D text-to-image into 3D data.
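The 2D-to-3D bootstrap is worth pausing on, because it shows how composable the diffusion machinery is. My understanding of the trick, sketched below with my own simplified names and weighting (the real method is more careful about timestep weighting and the renderer’s gradients):

```python
import numpy as np

def sds_gradient(rendered, eps_model, alpha_bar, rng, weight=1.0):
    """Score-distillation-style gradient for a rendered view.

    rendered: an image produced by a differentiable 3D renderer (e.g. a NeRF).
    eps_model: a frozen text-to-image diffusion model's noise predictor.

    The idea: noise the render, ask the 2D model what noise it sees, and
    push the 3D scene so its renders look more like something the 2D
    model would generate for the prompt.
    """
    t = rng.integers(0, len(alpha_bar))
    eps = rng.standard_normal(rendered.shape)
    x_t = np.sqrt(alpha_bar[t]) * rendered + np.sqrt(1.0 - alpha_bar[t]) * eps
    # Gradient w.r.t. the render: (predicted noise - actual noise),
    # skipping the diffusion model's own Jacobian.
    return weight * (eps_model(x_t, t) - eps)
```

You never need 3D training data; the frozen 2D model acts as the loss function, and the 3D representation absorbs the consistency across viewpoints.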
This may be early, but diffusion looks like it’s going to go down as a ResNet-level idea in how broadly it impacts the development of machine learning models. The jump seems pretty discontinuous to me! I know this stuff is hard, that there are caveats to what works and what doesn’t, and that you only see the successes, but I still don’t think there’s a reasonable way to say, “ah, yeah, but this is all a dead end, it’s going to hit a wall soon”. Right now the runway of “make the obvious next improvement” seems at least 2 years long to me, and that’s a lot of time for better models. As a comparison, GPT-3 came out about 2 years ago, and language models haven’t stopped yet.
It’s research level now, but this stuff is going to hit the wider public consciousness in about a year or two, and people are not going to be ready. They just aren’t. Maybe it will be just as easy to normalize as text-to-image was, but I expect a lot of domains to get disrupted, and for every five problems where it’s harder than expected, there’s going to be one where it’s easier.
If I see anyone complain about having to write a Broader Impacts section for NeurIPS, I’m going to be pretty upset. ML is crossing some key quality thresholds, and writing that section is not that hard! If you do find it hard, I’d take it as a sign that you should have started thinking about it earlier.
Sorta Insightful turns seven years old today!
Writing is not a muscle I have stretched very much recently. This is mostly because I have been busy stretching puzzle construction muscles instead.
Last year, I mentioned that I had been writing less because I was spending more time writing puzzles for puzzlehunts. This January, the team I was on won MIT Mystery Hunt, the biggest puzzlehunt of the year. One of the rules of MIT Mystery Hunt is that if your team wins, then your team writes it next year, and it’s this tradition that keeps the hunt going. There’s always a team going for the win to get the chance to write a Mystery Hunt.
It’s also true that after doing so, they usually don’t try to win again anytime soon. See, people don’t quite understand how long it takes to write Mystery Hunt. When I told my parents we won and had a year to write hunt, they said, “oh, a year, there’s no need to rush.” Meanwhile, last year’s constructing team started by saying “Congrats on winning Mystery Hunt! You are already behind.” I deliberately did not sign up for any leadership roles but I’m still spending about 10-20 hrs/week on Hunt stuff.
When you work in research, they say you carve out time for your hobbies, or else your research will take over everything. But what do you do when your hobbies take over time from your other hobbies? Boy do I not have a great answer to that right now. I don’t expect winning + writing MIT Mystery Hunt to be a regular activity for me, so I’ve been treating this year as a write-off for blogging. There should be more afterwards. At minimum, I’ll write a post about Mystery Hunt.
I finally got that post about My Little Pony done! Listen, that post was a struggle. I’m pretty happy that it came together, and am at peace with where I am with respect to the fandom. (Following very little, but still following.)
Also, some papers I worked on came out, most notably PaLM-SayCan. I’ve been thinking more about AI trends recently, probably because I went to EA Global. It’s interesting to see-saw between people who think AI safety doesn’t matter because transformative AI is too far away, and people who think it’s 20% likely to happen in 10 years and the default outcomes will be profoundly bad. My feelings on this are complicated and I’ll try to write more about it at some point, but I would sum them up as, “sympathetic to people with short AI timelines, not sure I’m on board with what they want to do about it.”
Normally, I include the word count of previous posts, but I’ve decided I’m no longer going to track that data. I feel doing so promotes the wrong impulses. I would rather write concise posts that take a while to edit than longer posts that ramble more than they need to. I try to make every post concise, and in the past word count was a reasonably correlated measure of my writing output, because I had similar standards for all posts. But more recently, I’ve given myself more wiggle room on topics, in the name of “done is better than perfect”, so word count no longer matches as well.
Instead, I will just track post count. I wrote 7 posts this year, two fewer than last year.
These are the view counts from August 18, 2021 to today.
I’m a bit surprised the ML-related post has fewer views than the Mystery Hunt post. I guess most people just read Twitter threads nowadays.
Time Spent Writing
I spent 99 hours, 30 minutes writing for my blog this year, about 40 minutes less than last year.
Now for context, 74 hours of that were in the 5 months before MIT Mystery Hunt, and 25 hours were in the 7 months after, so it’s pretty clear where the time went.
Posts in Limbo
Here’s a review of all the posts I said I might write last year.
Post about measurement: dead
I said I’d remove it from the list if I didn’t write it this year. I didn’t write it this year! So it’s gone. I’m guessing shades of what this post would have been will appear in other posts I write later.
Part of the reason this post never happened is that I wanted to touch on the concept of legibility. To do so, I figured I should read Seeing Like a State directly, since it was the book that popularized the idea, and it would be better to go to the source rather than trust everyone’s 2-sentence summary. Then I never finished Seeing Like a State because it’s a bit long and not written to be an easy read.
This is why people just read Twitter summaries of papers and skip the paper itself. How else would anybody get work done?
Post about Gunnerkrigg Court: dead
Shooooooooot. Look, Gunnerkrigg Court is sick. It is still one of my favorite comics of all time. But I’m not quite as hyped about it as I was when I first archive-binged it in 2015, and I’ve forgotten the things I planned to say about it. The right time to write this post was around 2017, but I didn’t, and now I’m not in the right headspace for it.
Post about My Little Pony: done
Post about Dominion Online:
Odds of writing this year: 5%
Odds of writing eventually: 40%
I’ve come to realize that I like writing about things that I expect other people not to write about. I like novelty, I like feeling like I am contributing new ideas to the conversation. It is my way around the “my cake sucks” problem.
What you’re supposed to do is tell yourself that the audience is not nearly as judgmental as you are. They just want cake! Instead of telling myself that, I try to create super wild novelty cakes that are harder to compare to anything else.
This is a bad solution and I should get myself to be more of a two cakes person. But! Who else is going to write about Dominion or Dustforce’s r/place struggles? Sometimes a story has to be told. There is room for the occasional novelty cake.
Post about Dustforce:
Odds of writing this year: 5%
Odds of writing eventually: 50%
Speaking of Dustforce, this is a new one I’m adding to the queue. Dustforce is a hard-as-nails platformer, and has a special place in my heart. Most of my ride-or-die games are from my childhood and nostalgia blinds me to their flaws. Dustforce got its hooks in me in college, in a way few games have. It is both really cool and really dense - I entirely understand why people bounce off this game, but there’s a fan community that has been playing the game and organizing Dustforce events for 10 years. There’s a reason why.
Post about puzzlehunts:
Odds of writing this year: 20%
Odds of writing eventually: 99%
I mean, yeah, this is happening, one way or another.
My Twitter profile is not set up to pull people in. If anything, it is deliberately adversarial.
I’m bad at Twitter. I know I’m bad at Twitter. I don’t know if I want to be good at Twitter.
Every group seems to gravitate towards Twitter over time. There’s a machine learning Twitter, a philosophy Twitter, a history Twitter, a My Little Pony Twitter, a Smash Bros Twitter. Those communities all have their subreddits and Facebook groups, but I get the sense those are stagnating. Being on Facebook is a deliberate decision.
All those groups agree that Twitter is awful for having nuanced conversation, but people post there anyways. When I try to probe why, the common reply is that Twitter forces people to get to the point. I can see the logic, I’m certainly guilty of going on and on for no good reason. (I try not to! It’s hard!)
People tell me ML Twitter is worth it. Parts of it do seem good! It’s just, I have trouble trusting social media in general. I don’t have a TikTok. I know that if I set up TikTok, eventually I’ll be spending an hour a day genuinely having fun watching random videos, with a small voice asking if I could be doing something else instead. It’s not that I wouldn’t get joy out of it, it’s that I’d get joy that aligned me towards the kind of person TikTok would want me to be. Facebook and Reddit already did that to me. There is only so much time for dumb stuff.
The issue, then, is that there’s real benefit to hanging around ML Twitter. It is not just dumb stuff. The medium makes it easier to find the hot takes where someone deliberately challenges accepted wisdom, which is where interesting intellectual thought happens. It’s easier to promote a paper on Twitter than it is to promote it at a conference - if anything, the two go hand-in-hand. The memes are specific enough to be excellent.
It’s quite likely that I’m losing out on both ML knowledge and career equity by not being more active on Twitter. But do I want to become more like the person Twitter wants me to be? I’m not sure people understand how good recommendation systems have gotten and how much work goes into improving them.
“Try it for a bit, you can always change your mind later.” And yet I feel like if I try it enough to give it a fair chance, then it might be too late for me.
For now, I am okay with floating outside Twitter. Dipping in now and then, but not browsing idly. That could change in the future, but if it does, then I’ll at least have this post to refer to. I’ll at least have to explain why I changed my mind.