• I'm Bad at Twitter

    Twitter profile

    My Twitter profile is not set up to pull people in. If anything, it is deliberately adversarial.

    I’m bad at Twitter. I know I’m bad at Twitter. I don’t know if I want to be good at Twitter.

    Every group seems to gravitate towards Twitter over time. There’s a machine learning Twitter, a philosophy Twitter, a history Twitter, a My Little Pony Twitter, a Smash Bros Twitter. Those communities all have their subreddits and Facebook groups, but I get the sense those are stagnating. Being on Facebook is a deliberate decision.

    All those groups agree that Twitter is awful for having nuanced conversations, but people post there anyways. When I try to probe why, the common reply is that Twitter forces people to get to the point. I can see the logic; I’m certainly guilty of going on and on for no good reason. (I try not to! It’s hard!)

    People tell me ML Twitter is worth it. Parts of it do seem good! It’s just, I have trouble trusting social media in general. I don’t have a TikTok. I know that if I set up TikTok, eventually I’ll be spending an hour a day genuinely having fun watching random videos, with a small voice asking if I could be doing something else instead. It’s not that I wouldn’t get joy out of it, it’s that I’d get joy that aligned me towards the kind of person TikTok would want me to be. Facebook and Reddit already did that to me. There is only so much time for dumb stuff.

    The issue, then, is that there’s real benefit to hanging around ML Twitter. It is not just dumb stuff. The medium makes it easier to find the hot takes where someone deliberately challenges accepted wisdom, which is where interesting intellectual thought happens. It’s easier to promote a paper on Twitter than it is to promote it at a conference - if anything, the two go hand-in-hand. The memes are specific enough to be excellent.

    It’s quite likely that I’m losing out on both ML knowledge and career equity by not being more active on Twitter. But do I want to become more like the person Twitter wants me to be? I’m not sure people understand how good recommendation systems have gotten and how much work goes into improving them.

    “Try it for a bit, you can always change your mind later.” And yet I feel like if I try it enough to give it a fair chance, then it might be too late for me.

    For now, I am okay with floating outside Twitter. Dipping in now and then, but not browsing idly. That could change in the future, but if it does, then I’ll at least have this post to refer to. I’ll at least have to explain why I changed my mind.

  • My 2022 r/place Adventure

    Every April Fool’s, Reddit runs a social experiment. I’ve always had fun checking them out, as part of my journey from “people are too hard, I’m just going to do math and CS”, to “people are hard, but like, in a really interesting way?” The shift was realizing that culture is a decentralized distributed system of human interaction, and Reddit’s April Fool’s experiments are a smaller, easier to understand microcosm of that.

    Some have been pretty bad. Second was trash. I know people who liked The Button a lot, but r/place is easily their most popular one. Each user can place one pixel every 5 minutes on a canvas shared by the entire Internet. Alone, it’s hard to do anything. But working together, you can create all kinds of pretty pixel art…that can then get griefed by anyone who wants to. It’s all anarchy.

    I didn’t contribute to r/place in 2017, but for the 2022 run I figured I would chip in a few pixels to the SSBM and My Little Pony projects, then mostly spectate. And yeah, that’s how it started on the first day! The MLP subreddit put out a Rainbow Dash template on March 31, and when I checked in, I saw their location was right in the path of the ever-expanding Ukrainian flag.


    ...and After

    The irony of people fighting for r/place land while a real-world land grab was going on was not lost on me. I deliberately skipped joining the coordination Discord, because I had no interest in getting involved more than the surface level, but from what I heard, the MLP r/place Discord decided to play the long game. They believed that although the Ukraine flag was taking over space now, and although most subreddits supported Ukraine in the Russia-Ukraine conflict, the flag owners would eventually face too much pressure and would have to make concessions to allow some art within its borders. The MLP Discord wanted to maintain their provisional claim formed by Rainbow Dash’s torso, to be expanded later. They were correct that the Ukraine flag would make concessions. They were wrong that My Little Pony would get to keep it. There’s a whole saga of alliance and betrayal there, which eventually forced MLP to change locations, but I did not follow it and I’m sure someone else will tell that story. The one I want to tell is much smaller.

    For about 7 years, I’ve been a fan of Dustforce, an indie platformer. When people ask what video games I’ve played recently, I always tell them I’ve been playing Dustforce, and they’ve never heard of it. It’s one of my favorite games of all time. Maybe I’ll explain why in another post, but the short version is that Dustforce is the SSBM of platformers. Super deep movement system, practically infinite tech skill ceiling, levels that are hard but satisfying to finish, and a great replay system that always gives you tools to get better. It’s not very big, but it had representation in 2017’s r/place.


    Lots of big communities have little interest in r/place, and lots of little communities have outsized presence in r/place. You don’t need to be big, you just need a subset with enough r/place engagement and organizational will. Dustforce is tiny - speedrun livestreams get at most 30 viewers - but we made it 5 years ago. I believed we could totally make it in 2022. The Dustforce Discord talked about doing something for r/place, but hadn’t done anything, so I made a pixel art template in hopes it would get the ball rolling.


    Dustkid, a character in Dustforce. We’ll be seeing her a lot.

    Pixel art is not my strong suit. To make this, I downloaded the favicon for, then translated the pixels to the r/place color palette. After scanning existing r/place pixel art, I realized our target image was somewhat big for our community size, so I prepared a smaller version instead. Any representation is better than none.

    Dustkid, smaller

    I wasn’t interested in organizing r/place for Dustforce long term, but I’m a big believer that you get movements going by decreasing the initial effort required. Having a template gives a way for uninterested people to contribute their 1 pixel. I’m happy to report this plan succeeded! Although, it was a long journey to get there. To be honest, this is primarily the work of other community members, who took what I started and ran with it.

    * * *

    We didn’t want to take pixels away from established territory. I don’t think we could have even if we wanted to. Our best odds were to find a small pocket of space that didn’t have art yet. After some scouting, I proposed taking (1218, 135) to (1230, 148), a region right underneath the Yume Nikki character that looked uncontested. I also argued that even though we would mostly coordinate over Discord, it was important to make a post on the r/dustforce subreddit. Since r/place was a Reddit event, we needed a land claim on Reddit for discoverability. With that claim made, we started placing pixels.

    The starts of Dustkid

    The beginnings of the Dustkid hat

    As we continued, we noticed a problem pretty quickly. We had picked the same spot as r/Taiwan.

    Uh oh

    Someone tracked down the r/Taiwan Reddit post. Their aim was to add a flag of the World Taiwanese Congress. For Dustforce, this was baaaaaad news. Flags historically have a lot of power in r/place. Their simplicity makes it easy for randoms to contribute pixels (it’s easy to fix errors in solid blocks of color), and patriotism is a good mobilizer to get more people to chip in. We were right in the middle of their flag, and it would take up all the available free space in that section. Maybe we could have negotiated living on their flag, but I highly doubt r/Taiwan would have agreed, and no one from r/dustforce even bothered asking.

    r/Taiwan flag

    Alright, back to square one. Other people in the community proposed alternate spots, without much luck. Any spot that opened up was quickly taken by other communities or bots. By and large, people on r/place will respect existing artwork, so once we got something down there was a good chance we could protect it. The problem was that in the time it took us to make something recognizable, other groups or bots would place pixels faster. By the end of Day 2, we had nothing. All our previous efforts were run over. We were simply outmatched, and I didn’t think we’d make it.

    * * *

    On the last day, Reddit doubled the size of the canvas once more, saying it would be the final expansion. With a new block of free space, that expansion represented our best chance of getting something into r/place. It’s now or never.

    First, the expansion had increased the color palette, so we adjusted the template to be more game-accurate.

    New template

    By this point I was content to leave organizing to others. The organizers first reached out to the Celeste community, asking if we could fit Dustkid into their banner, seeing as how both Dustforce and Celeste are momentum-based precision platformers. After some discussion, the Celeste side felt it would clash too much. This wasn’t entirely fruitless, however - during that discussion, they invited Dustforce to the r/place Indie Alliance. It was exactly what it sounded like: an r/place alliance between indie game communities.

    I joined the Indie Alliance Discord to stay in the loop, but quickly found it was too fast for me. Lots of shitposts, lots of @everyone pings, and lots of panicking whenever a big Twitch streamer went live and tried to force their will onto the r/place canvas. There were even accusations of spies and saboteurs trying to join the alliance. I never quite understood how having a spy in r/place would help things. In a real fight, there’s fog of war, it takes time to mobilize forces, and intelligence on troop movements or new weapons can be a decisive edge. But in r/place, there’s no fog of war because the entire canvas is public, people can “attack” (place pixels) anywhere they want with no travel time, and everyone knows how to find r/place bots if they want to. There’s no real benefit to knowing an r/place attack is coming, nor is there much benefit in knowing a place will be defended - by default, everything is defended.

    Our eventual target was a space nearby the Celeste banner, currently occupied by AmongUs imposters.

    Target location

    They laid out the argument: AmongUs crewmates were scattered all throughout r/place, usually to fill up space without compromising the overall artwork. Since there were so many crewmates, the odds any specific patch was fiercely defended were quite low. That meant AmongUs space looked more defended than it really was. If we blitzed the space fast enough, we likely wouldn’t face retribution, since the AmongUs people probably wouldn’t care. We went for it.

    Dustkid, round 2

    We did run slightly afoul of r/avali, a subreddit for a furry species. Our art template and their art template overlapped by 1 pixel, and we both really wanted that pixel. In a dumb parody of the Israel-Palestine conflict, they wanted both an orange and indigo border around the Avali, but our pixel threatened the indigo border, and we really didn’t like the aesthetic of having an indigo border everywhere besides that pixel.


    Indie game-Furry mascot conflict, April 2022

    Now, if r/avali had decided to fight, we would have lost. Luckily, we figured out a solution before it came to that. If we shifted the Dustkid head diagonally down-right 1 pixel, it would resolve the dispute. Plus, we’d sit more symmetrically between the Rocket League bot logo to our north and the Mona Lisa to our south. We let r/avali know, and did the migration.

    Conflict resolution

    With that, we made it! We even had time to adjust our template and fill in more space with Dustforce pixel art, adding the S+ icon we had last time r/place happened.

    Final image

    It was surprisingly low on drama. Everyone nearby was friendly. The Go subreddit r/baduk to our right could have conflicted, but they recognized our land claim and adjusted their template so that it wouldn’t clash. We eventually made a heart connecting the two. A Minecraft wolf invaded the space where we planned our pixel art expansion, but we found it was all created by one user (!), so we offered to adopt and relocate the wolf to a separate corner, which they were happy with. RLbot to our north even offered to give us more space, since they had abandoned their logo a while back. We never took them up on that offer, since by the time we were done tuning things in our corner, r/place had ended.

    We really lucked out on our location, and were never the target of a community big enough to trample over us. The theory was that we were close to big art pieces like the Mona Lisa and One Piece, and although those art pieces weren’t looking to expand, no one wanted to challenge them and this gave us protection by proxy. Perhaps you could call Dustforce a vassal state, but I’m not sure they even care about us.

    With Dustkid settled, I went back to helping clean up the My Little Pony art, which had been griefed enough that they had added a counter for “# of times we’ve rebuilt” to their template.

    My Little Pony

    (It’s the 22 on the middle of the right border, just above Rarity)

    I also helped a bit when someone from the Indie Alliance got raided. But that was small fry, compared to the work beforehand. I was happy to be done with r/place. It really took up more of my attention and time than I expected it to.

    There’s this old quote from The Sandman. “Everybody has a secret world inside of them.” Nowhere is that more true than r/place. The adventure to put down a 15x15 Dustkid head was just one piece of the overall 2000x2000 canvas. There’s all sorts of complexity,

    Zoom 1

    that gets lost,

    Zoom 2

    as you take in the bigger picture.

    Zoom 3

  • The Dawn of Do What I Mean

    Boy, last week was busy for deep learning. Let’s start with the paper I worked on.

    SayCan is a robot learning system that we’ve been developing for about the past year. The paper is here, and it builds on a lot of past work we’ve done in conditional imitation learning and reinforcement learning.

    Suppose you have a robot that can do some small tasks specified in natural language, like “pick up the apple” or “go to the trash can”. If you have these low-level tasks, you can chain them together into more complex tasks. If I want the robot to throw away the apple, I could say, “pick up the apple, then go to the trash can, then place it in the trash can”, and assuming those three low-level tasks are learned well, the robot will complete the full task.

    Now, you wouldn’t want to actually say “pick up the apple, then go to the trash can, then place it in the trash can”. That’s a lengthy command to give. Instead, we’d like to just say “throw away the apple” and have the rest be done automatically. Well, in the past few years, large language models (LLMs) have shown they can do well at many problems, as long as you can describe the input and output with just language. And this problem of mapping “throw away the apple” to “pick / go to trash / place” fits that description exactly! With the right prompt, the language model can generate the sequence of low-level tasks to perform.

    Diagram of the SayCan model

    This, by itself, is not enough. Since the LLM is not aware of the robot’s surroundings or capabilities, using it naively may generate sentences the robot isn’t capable of performing. This is handled with a two-pronged approach.

    1. The language generation is constrained to the skills the robot can (currently) perform.
    2. Each generated instruction is scored based on a learned value function, which maps the image + language to the estimated probability the robot can complete the task.

    You can view this as the LLM estimating the best-case probability an instruction helps the high-level goal, and the value function acting as a correction to that probability. They combine to pick a low-level task the robot can do that’s useful towards the high-level goal. We then repeat the process until the task is solved.
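    The selection loop can be sketched in a few lines of Python. Everything here is a stand-in I made up for illustration - `llm_score` and `value_fn` are hypothetical toy functions, not the paper’s actual language model or learned value function:

```python
def llm_score(instruction, skill, history):
    """Toy stand-in for the LLM: probability that `skill` is a useful
    next step toward `instruction`, given the skills executed so far."""
    plans = {"throw away the apple": ["pick up the apple",
                                      "go to the trash can",
                                      "place it in the trash can"]}
    plan = plans[instruction]
    if len(history) < len(plan) and skill == plan[len(history)]:
        return 1.0
    return 0.01

def value_fn(observation, skill):
    """Toy stand-in for the learned value function: estimated probability
    the robot can complete `skill` from the current observation."""
    return 1.0  # pretend every skill is currently feasible

def saycan_step(instruction, skills, observation, history):
    # Combined score: best-case usefulness (LLM) corrected by
    # feasibility (value function). Pick the argmax over known skills,
    # so generation is constrained to what the robot can actually do.
    return max(skills,
               key=lambda s: llm_score(instruction, s, history)
                             * value_fn(observation, s))

skills = ["pick up the apple", "go to the trash can",
          "place it in the trash can", "find a sponge"]
history = []
for _ in range(3):
    history.append(saycan_step("throw away the apple", skills, None, history))
print(history)  # the three-step plan, chosen one skill at a time
```

    In the real system the value function depends on the camera image, so infeasible skills (say, the apple isn’t visible) get downweighted even when the LLM likes them.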

    This glosses over a lot of work on how to learn the value function, how to learn the policy for the primitive tasks, prompt engineering for the large language model, and more. If you want more details, feel free to read the paper! My main takeaway is that LLMs are pretty good. The language generation is the easy part, while the value function + policy are the hard parts. Even assuming that LLMs don’t get better, there is a lot of slack left for robot capabilities to get better and move towards robots that do what you mean.

    * * *

    LLMs are not the bottleneck in SayCan, but they’re still improving (which should be a surprise to no one). As explained in the GPT-3 paper, scaling trendlines showed room for at least 1 order of magnitude, and recent work suggests there may be more.

    DeepMind put out a paper for their Chinchilla model. Through more careful investigation, they found that training corpus size had not increased relative to parameter count as much as it should have. By using about 4x more training data (300 billion tokens → 1.4 trillion tokens), they reduced model size by 4x (280B parameters → 70B parameters) while achieving better performance.

    Chinchilla extrapolation curve

    Estimated compute-optimal scaling, using larger datasets and fewer parameters than previous scaling laws predicted.

    Meanwhile, Google Brain announced their PaLM language model, trained with 540B parameters on 780 billion tokens. That paper shows something similar to the GPT-2 → GPT-3 shift. Performance increases on many tasks that were already handled well, but on some tasks, there are discontinuous improvements, where the increase in scale leads to a larger increase in performance than predicted from small scale experiments.

    PaLM result curves

    Above is Figure 5 of the PaLM paper. Each plot shows model performance on a set of tasks where PaLM’s performance vs model size is log-linear (left), “discontinuous” (middle), or relatively flat (right). I’m not sure the flat examples are even that flat - they look slightly under log-linear at worst. Again, we can say that loss will go down as model size goes up, but the way that loss manifests in downstream tasks doesn’t necessarily follow the same relationship.

    The emoji to movie and joke explanation results are especially interesting to me. They feel qualitatively better in a way that’s hard to describe, combining concepts with a higher level of complexity than I expect.

    Emoji movie explanation

    Neither of these works has taken the full 1 order of magnitude suggested by prior work, and neither indicates we’ve hit a ceiling on model scaling. As far as I know, no one is willing to predict what qualitatively new capabilities we’ll see from the next large language model, or whether we’ll see any at all. This is worth emphasizing - people genuinely don’t know. Before seeing the results of the PaLM paper, I think you could argue that language models would have more trouble learning math-based tasks, and the results corroborate this (both navigate and mathematical_induction from the figure above are math-based). You could also have predicted that at least one benchmark would get qualitatively better. I don’t see how you could have predicted that english_proverbs and logical_sequence in particular would improve faster than their power law curves.

    The blog post for the Chinchilla model notes that given the PaLM compute budget, they expect you could match it with 140B params if you used a dataset of 3 trillion tokens of language. In other words, there’s room for improvement without changing the model architecture, as long as you crawl more training data. I don’t know how hard that is, but it has far less research uncertainty than anything from the ML side.
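    These numbers can be sanity-checked with the standard back-of-the-envelope approximation for training compute, C ≈ 6·N·D FLOPs for N parameters and D tokens. That formula is a common rule of thumb from the scaling-law literature, not something from the papers’ exact accounting:

```python
# Rough training-compute estimates under the C ≈ 6 * N * D rule of thumb.
# All parameter/token counts are the headline numbers quoted above;
# the "140B @ 3T" row is the hypothetical compute-matched model.

def train_flops(params, tokens):
    """Approximate training compute in FLOPs."""
    return 6 * params * tokens

gopher     = train_flops(280e9, 300e9)   # 280B params, 300B tokens
chinchilla = train_flops(70e9, 1.4e12)   # 70B params, 1.4T tokens
palm       = train_flops(540e9, 780e9)   # 540B params, 780B tokens
matched    = train_flops(140e9, 3e12)    # hypothetical 140B / 3T model

print(f"Gopher:     {gopher:.2e}")
print(f"Chinchilla: {chinchilla:.2e}")
print(f"PaLM:       {palm:.2e}")
print(f"140B @ 3T:  {matched:.2e}")
```

    By this estimate, 140B parameters on 3 trillion tokens lands within 1% of PaLM’s compute budget, and Chinchilla’s budget comes out within about 20% of Gopher’s - consistent with the “same compute, smaller model, more data” framing.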

    Let’s just say it’s not a good look for anyone claiming deep learning models are plateauing.

    * * *

    That takes us to DALL·E 2.

    DALL-E 2 generations

    On one hand, image generation is something that naturally captures the imagination. You don’t have to explain why it’s cool, it’s just obviously cool. Similar to language generation, progress here might overstate the state of the field, because it’s improving things we naturally find interesting. And yet, I find it hard to say this doesn’t portend something.

    From a purely research standpoint, I was a bit out of the loop on what was state-of-the-art in image generation, and I didn’t realize diffusion-based image synthesis was outperforming autoregressive image synthesis. Very crudely, the difference between the two is that diffusion gradually updates the entire image towards a desired target, while autoregressive generation draws each image patch in sequence. Empirically, diffusion has been working better, and some colleagues told me that it’s because diffusion better handles the high-dimensional space of image generation. That seems reasonable to me, but, look, we’re in the land of deep learning. Everything is high-dimensional. Are we going to claim that language is not a high-D problem? If diffusion models are inherently better in that regime, then diffusion models should be taking over more of the research landscape.
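    To make the crude distinction concrete, here’s a toy 1-D sketch of the two control flows. Both “models” (`predict_next_patch` and `denoise_step`) are made-up stand-ins that just converge to a fixed ramp - nothing here resembles a real learned sampler, which would follow a trained denoiser and a noise schedule:

```python
import random

random.seed(0)
PATCHES = 16  # a tiny 1-D "image" of 16 patch values

# Autoregressive: draw each patch in sequence, conditioned on what has
# been drawn so far. `predict_next_patch` is a made-up stand-in for a
# learned model.
def predict_next_patch(drawn):
    return len(drawn) / PATCHES  # toy model: a simple ramp

def autoregressive_sample():
    image = []
    for _ in range(PATCHES):
        image.append(predict_next_patch(image))
    return image

# Diffusion: start from pure noise and repeatedly nudge the *entire*
# image at once toward the data. `denoise_step` is a made-up stand-in.
TARGET = [i / PATCHES for i in range(PATCHES)]  # toy "data" to recover

def denoise_step(image):
    return [x + 0.5 * (t - x) for x, t in zip(image, TARGET)]

def diffusion_sample(steps=20):
    image = [random.gauss(0, 1) for _ in range(PATCHES)]  # pure noise
    for _ in range(steps):
        image = denoise_step(image)
    return image
```

    The structural point is the loop shape: autoregressive generation commits to each patch once and never revisits it, while diffusion revises the whole image many times, which is one intuition for why it copes well with globally coherent high-dimensional outputs.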

    Well, maybe they are. I’ve been messing around with Codex a bit, and would describe it as “occasionally amazing, often frustrating”. It’s great when it’s correct and annoying when it’s not. Almost-correct language is amusing. Almost-correct code is just wrong, and I found it annoying to continually delete bad completions when trying to coax the model to generate better ones. There was a recent announcement of improving Codex to edit and insert text, instead of just completing it. It’s better UX for sure, and in hindsight, it’s likely using the same core technology DALL-E uses for image editing.

    Edit examples

    We’re taking an image and dropping a sofa in it, or we’re taking some text and changing the sentence structure. It’s the same high level problem, and maybe it’s doing diffusion-based generation under the hood.

    * * *

    Where does this leave us?

    In general, there is a lot of hype and excitement about models with a natural language API. There is a building consensus that text is a rich enough input space to describe our intentions towards ML models. It may not be the only input space, but it’s hard to see anything ignoring it. If you believe the thesis that language unlocked humanity’s ability to share complex ideas in short amounts of time, then computers learning what to do based on language should be viewed as a similar sea change in how we interact with ML models.

    It feels like we are heading for a future where more computer systems are “do what I mean”, where we hand more agency to models that we believe have earned the right to that agency. And we’ll do so as long as we can convince ourselves that we understand how these systems work.

    I don’t think anyone actually understands how these systems work. All the model disclosure analysis I’ve read feels like it’s poking the outside of the model and cataloging how the black box responds, without any generalizable lesson aside from “consider things carefully”. Sure, that’s fine for now, but that approach gets harder when your model is capable of more things. I hope people are paying attention.