Ostensibly, I’m on vacation. However, it’s raining, I have some inspiration, and I haven’t written a post in a while, so buckle up, here come some more machine learning opinions. I read some discussion about the size of NeurIPS, mostly around Andrey Kurenkov’s post at The Gradient, and wanted to weigh in.
I’ve been to three NeurIPS: 2016, 2017, and 2019. So, no, I haven’t really been around that long. NeurIPS 2016 was my first academic conference ever, so I didn’t really know what to expect. By NeurIPS 2017, I’d been to a few and could confidently say that NeurIPS felt too big. By NeurIPS 2019, I was no longer sure NeurIPS was too big, even though it had over 60% more attendees than 2017.
Before my first conference, I got some advice from senior researchers: if you aren’t skipping talks, you’re doing it wrong. I promptly ignored this advice and attended every talk I could, but now I get what they meant.
Early on in your research career, it makes sense to go to talks. You know less about the field and you know fewer people. As you become more senior, it makes less sense to go to talks. It’s more likely you know a bit about the topic, and you know more people, so the value of talks go down compared to the research conversations you could have instead. Conference organizers know this. Ever wonder why there are so many coffee breaks, and why they’re all much longer than they’d need to be if people were just getting coffee? Important, valuable meetings are happening during those coffee breaks.
In the limit, people attend conferences to meet up with the people they only see at conferences. As someone from the Bay Area, the running joke is that we travel halfway across the world to talk to people who live an hour’s drive away. It’s not that we don’t want to talk to each other, it’s that the conference environment provides a much lower activation energy to scheduling meetups, and it’s easier to have serendipitous run-ins with old friends if we’re all in the same venue.
In this model of a research conference, all the posters, accepted papers, talks, and so on are background noise. They exist as the default option for people who don’t have plans, or who want a break from socializing. That default option is critically important to keeping everything going, but they’re not the point of the conference. The point of the conference is for everyone in the research community to gather at the same place at the same time. If you’ve been to fan conventions, it’s a very similar dynamic.
If you take this model as true, then NeurIPS’s unofficial status as the biggest ML conference is incredibly important. If you could only go to one conference each year, you’d go to NeurIPS, because everyone else is going to go to NeurIPS.
And if NeurIPS is the place to be, shouldn’t NeurIPS be as big as necessary?
* * *
Well, maybe. NeurIPS attendance is growing, but the growth is coming from different places.
Year over year, NeurIPS has been growing way faster than any of the PhD programs that could be feeding into it. I would guess it’s growing faster than the undergrads and master’s students as well. If the growth isn’t coming from universities, it has to be coming from industry and the broader data science community - a community that is much larger and of a different makeup than the traditional ML research crowd.
I said NeurIPS is about networking, but the question is, networking between who? It started as networking between researchers, because the makeup of attendees started as researchers. It’s been shifting ever since deep learning hype took off. It is increasingly likely that if you talk to a random attendee, they’ll be an ML enthusiast or someone working in an ML-related role at a big company, rather than someone in a PhD program.
And I should be really, really clear here: that’s not necessarily a bad thing! But people in a PhD program have different priorities from people working at a big company, and that’s causing a culture clash.
The size debate is just a proxy for the real debate about what NeurIPS should be. We’re in the middle of an Eternal September moment.
Eternal September is a term I really wish more people knew about, so here’s the short version. There used to be this thing called Usenet, with its own etiquette and social norms. Every September, new students from colleges and universities would get access to Usenet, and they’d stir a fuss, but the influx was small enough for existing Usenet culture to absorb them without much change. Then, AOL opened Usenet access to anyone who wanted it. Usenet culture couldn’t integrate the firehose of interest, and it became known as the Eternal September. The original Usenet culture disappeared, in favor of whatever culture made sense for the new users.
The parallels to NeurIPS are uncanny. A simple find-replace exactly describes what’s happening now, from the people saying NeurIPS is turning into a spectacle, to the people complaining they can’t buy tickets to a conference they really want to attend.
Despite their foreboding name, Eternal Septembers are not inherently bad. They are what they are. But generally, they’re good for people trying to join, and bad for people that are already there and like what they have.
So the real question is, who is NeurIPS for? Is it for the established researchers to talk shop? The newer researchers trying to present their work and build a career? The data scientist looking for new applications of ML research? Right now, it’s for all of them, and the organizers are doing their best to balance everyone’s interests correctly, which is an incredibly difficult job I wouldn’t wish on anyone. The one thing that seems clear to me is that a pure, academic-only NeurIPS untethered from industry is never going to happen. Machine learning is currently too economically viable for industry to stop caring about it. You don’t stop Eternal September. Eternal September is something that happens to you. The best you can do is nudge the final outcome the best you can.
It’s a crazy solution, and I don’t know if it even makes sense, but maybe NeurIPS needs to be split in two. Have one act as the submission venue, where people submit and present their research, with heavier restrictions on who’s allowed to attend, and have the other act as the open-to-everyone conference, with the two co-located to encourage some crossover. If NeurIPS’s growing pains are caused by it trying to be something for everyone, then maybe we need to split NeurIPS’s responsibilities. Except, I don’t actually know what that means.
I do believe that it’s something people should be thinking more about. So, consider this as a call to action. September approaches, and thinkpieces or blog posts aren’t going to change what happens when it does.
Around the end of October, the Nature paper for AlphaStar was released. The Nature version has more significant restrictions on APM and camera movement, is able to play as all three races instead of just Protoss, can play the same maps that humans play on the competitive ladder, and reached Grandmaster level. However, it hasn’t done the “go 50-0 against pros” that AlphaZero did. It won some games against pros, but lost games as well. The rough consensus I’ve seen is that AlphaStar still has clear gaps in game knowledge, but is able to win anyways through very good mechanics.
I haven’t read the Nature paper yet, and I’m not going to do a detailed post like I did last time, since I’ve been a bit busy. This is just to follow-up on the predictions I made. I believe that if you make a public prediction, it’s very important to revisit it, whether you were right or wrong.
I made two sets of predictions. The first set came from February 2019, shortly after the TLO and MaNa showmatches.
If no restrictions are added besides no global camera, I think within a year AlphaStar will be able to beat any pro, in any matchup, on any set of maps in a best-of-five series.
If [APM restrictions] are added, it’s less likely AlphaStar will be that good in a year, but it’s still at least over 50%.
Since restrictions were added, the first prediction isn’t relevant anymore. As for the second prediction, there are still a few months before Feburary 2020, but I’m a bit less confident. Maybe before I was 55% and now I’m more like 45%, and this is assuming DeepMind keeps pushing on AlphaStar. I’ve honestly lost track of whether that’s happening or not.
In July 2019, I made this comment:
In AI for Games news, Blizzard announced that AlphaStar will be playing on the live SC2 competitive ladder, to evaluate agent performance in blind trials. AlphaStar can now play as all three races, has more restrictive APM limits than their previous version, and has been trained to only observe information from units within the current camera view. Should be fun - I expect it to beat everyone in blind games by this point.
This prediction was rather decidedly wrong. AlphaStar was very strong, but was not stomping every opponent it played against. Here, my mistake was overcorrecting for announcement bias. Historically, DeepMind only announces something publicly when they’re very confident in its result. So, when they disclosed that AlphaStar would be playing games on the EU ladder, I assumed that meant it was a done deal. I’m now thinking that the disclosure was partly driven by legal reasons, and that although they were confident it was worth doing human evaluation, that didn’t necessarily mean they were confident it would beat all pro players. It only meant it was worth testing if it could.
Two concluding remarks. First, I personally found it interesting that I had no trouble believing AlphaStar was going to be ridiculously superhuman just 6 months after the original showmatch. For what it’s worth, I still think that was reasonable to believe, which is a bit strange given how short 6 months is.
Second, we got spoiled by Chess, Go, and other turn-based perfect information games. In those games, superhuman game AIs always ended up teaching us some new strategy or new way to view the game, and that was because the only way they could be superhuman was by figuring out better moves than humans could. Starcraft is different. It’s part finding the right move, and part executing it, and that opens up new strategies in the search space. The fact that AlphaStar can win with subhuman moves executed well is less a problem with AlphaStar, and more a problem with the strategy space of APM-based games. If it’s an available and viable option, it shouldn’t be surprising that AlphaStar ends up picking a strategy that works. It’s disappointing if you expected something different, but most things are disappointing when they don’t match your expectations. So it goes.
Recently, OpenAI announced they had gotten their dexterous manipulation system to solve a Rubik’s Cube. I thought I wouldn’t have much to say, until I started writing this.
What Did OpenAI Do?
Using reinforcement learning, they learned a controller for a Shadow Hand that lets them solve a Rubik’s Cube reasonably often. They report a success rate of 60% for average scrambles, and 20% for the hardest possible scrambles that require 26 quarter-face turns.
I say doing RL on the Shadow Hand platform, but really, I mean they do learning on a simulated version of the Shadow Hand, then try to get that to transfer to the real Shadow Hand with no real data. It’s a neat dexterous manipulation result.
Why is Dexterous Manipulation Hard?
As a general rule, robot hardware is terrible to work with, and simulators suck unless you spend a bunch of time improving them. This is especially true for robot hands, because they have way more degrees of freedom and complexity than simpler grippers.
OpenAI says they’ve been working on solving a Rubik’s Cube since May 2017. It took them 2 months (May 2017 - July 2017) to solve it in simulation with a simulated hand. It took them 1 more year (July 2017 - July 2018) to get a real hand to manipulate a solid wooden block. Then, another 14 months (July 2018 - October 2019) to get to the Rubik’s Cube. In other words, the “runs on a real robot” part is the entire reason this work is interesting.
Wasn’t This Robot Hand in the News Before?
It was! It showed up in their Learning Dexterity post from July 2018.
What Isn’t New?
A lot of it isn’t new. Everything about this paper looks like a “moonshot achieved through roofshots” project, where there’s a clear line of steady, compounding improvements from prior work.
The model is trained with distributed PPO using OpenAI’s Rapid framework, which was used for both OpenAI Five and the Learning Dexterity paper. The model architecture is heavily inspired by the DotA 2 architecture - each input feature is embedded into a 512-dimensional embedding space, and these embeddings are summed and passed through a large LSTM.
Like the Learning Dexterity work, instead of learning a policy directly on pixels through RL, they instead predict the pose of the Rubik’s Cube from three camera viewpoints, then feed those predicted poses to the RL agent.
Again, like the Learning Dexterity work, they use the same asymmetric actor-critic trick, where the critic gets all ground truth state information, and the policy only gets the features visible from real-world data, which is fine for zero-shot transfer because you only need the policy at inference time.
It’s the same ideas, likely even the same codebase, executed in a different context.
What Is New?
One is the automatic domain randomization. In domain randomization, you learn a model in several randomly sampled simulated environments, learning a final model that’s more robust and more likely to transfer to reality.
Applying domain randomization requires some tuning, both for deciding what to randomize, and deciding the ranges from which to sample parameters. Too little randomization, and the model won’t be robust to real-world noise. Too much, and the learning problem will become too difficult to learn.
Taking inspiration from automatic curriculum learning, they maintain a distribution over simulator parameters. This distribution starts as a single point. If the policy’s recent performance is above a threshold, the distribution is expanded to be wider. This lets us start from a simple problem, then expand its difficulty as the policy gets better at the task. (The distribution is never narrowed, and I assume you have to tune the performance threshold and how much you widen each dimension.)
Another detail which I only found after a close read was adversarial domain randomization. An adversary applies perturbations to the force, torque, and actions, in a way that hurts performance to mine hard examples. Or rather, that’s the theory, but in practice they found best results with a random adversary, which performed better than any learned adversary. This seems weird to me. I can believe the result, and at the same time it feels like a learned adversary should be better (but perhaps it’s tricky to tune it properly.)
Finally, although it’s not directly related, there is a heavier lean on policy distillation and model surgery. This is a lesson they’ve carried over from DotA 2. Over the course of a long project, you will naturally want to add new input features or try new neural net architectures. If the model takes a long time to train, it’s worth designing methods that let you do this without training from scratch. The reason they add embeddings together instead of concatenating them is because you can easily add a new feature without changing the shape of any existing weight matrices. This lets you avoid training entirely from scratch. (It does hurt reproducibility, but if you need to train models over several months, then you may have to make that concession.)
For architecture changes that do change layer sizes (like changing the LSTM size), the current model can be distilled into the new model architecture, which still gives some benefits, because model distillation is faster than training from scratch.
What Are the Pros Of This Work?
I mean, it works. That’s always worth celebrating. Based on their demos, the result is pretty robust. They’ve also done a lot of work on interpreting the model. It’s cool that by applying interpretability tools on the LSTM hidden state, they’re able to identify semantically meaningful clusters for cube manipulation. I know people have complaints over how they got the policy to work (more details on that in the next section), but I don’t think OpenAI has gotten enough credit for their analysis of the learned policy, and what emergent behaviors may appear from sufficiently big neural nets.
In general, I’ve found that people without robot learning experience are poorly calibrated on how much bullshit there is in getting a real robot learning setup to work. It’s always good to see something get there.
Finally, I know this is a weird thing to appreciate, but I actually like the policy distillation and model surgery aspects the most. Yes, the automatic domain randomization is nice, but of course automatic curriculum learning should perform better than sampling tasks uniformly at random.
The policy distillation and model surgery aren’t central to the project, but they are indicative of the correct research culture: a focus on design decisions that encourage long-term research success.
Based on what someone told me 2 years ago (I know, I know, it’s a bad source), OpenAI felt the academic community undervalued research infrastructure. So, they released Gym, and then they released Baselines. Both were places where OpenAI was in a position to provide value to the RL community.
Okay, yes, the obvious cynical point here is that Gym and Baselines are also great branding tools for OpenAI’s recruiting. This would have been true for any group that released something that got the adoption Gym or Baselines did, so I don’t think it’s a valid criticism. Besides, shouldn’t you want industry companies to release better deep learning libraries?
Research code is usually terrible, because you’re trying to move fast on an idea that may not even make sense. Designing everything properly the first time slows down your experimenting too much. However, never cleaning up anything is its own sin. People really underestimate the impact of good research infra, in my experience. I’m not saying it’s easy to build good tools. It’s absurdly difficult to build good tools. But if done properly, it pays off long-term, because they’re reusable in future projects. An RL diagnostics library can be re-used for every RL project, An interpretability library can be re-used for any project that wants interpretability. OpenAI has built both.
The observation that policy distillation is a tool that lets you warm-start any future model indicates that some people at OpenAI get it, and are thinking about it at multiple levels of a research project - both at the code level and the model architecture level. It’s cool and I wish people thought more about this.
What Are the Cons of This Work?
Skynet Today’s article has a good summary of some controversial points, along with their own take on things. It’s worth reading. Here are a few cons I want to point out.
Use of a Solver
The final robot controller is not learned entirely end-to-end from pixels to actions. There are two intermediate steps. Pose is estimated by sim2real transfer of a supervised learning problem, and the sequence of subgoals the policy should reach is outsourced to an existing solver (Kociemba’s algorithm).
These are fine. You can get reinforcement learning to learn to solve a Rubik’s Cube (see McAleer et al, 2018), but the most important part of this work is the sim2real transfer of a dexterous manipulation task. None of the manipulation problems are made easier if you use a solver. I don’t think the pose estimation is a problem either, since it’s learned from vision anyways.
What I’m less fine with is that the video OpenAI released never mentions this multistage approach. To quote directly from the narration,
“We’re trying to build robots that learn a little bit like humans do, by trial and error. What we’ve done is trained an algorithm to solve the Rubik’s Cube one-handed, with a robotic hand, which is actually pretty hard even for a human to do. We don’t tell it how the hand needs to move the cube in order to get there. The particular friction that’s on the fingers. How easy it is to turn the faces on the cube. What the gravity, what the weight of the cube is. All of these things it needs to learn by itself.”
To me, this reads as someone saying things that are consistent with the truth, but which leaves open interpretations that are stronger than the truth. The phrasing of “We don’t tell it how the hand needs to move the cube in order to [solve it]” certainly doesn’t imply any decomposition of the problem, and on my first listen, I had 3 reactions.
- They’re almost certainly using a solver because not doing so would be really silly.
- People who just watch the video will definitely be confused by this.
- That confusion may have been intentional.
It seems like people’s opinion on #3 is almost entirely defined by whether they believe OpenAI’s PR strategy is done in good faith. For what it’s worth, I believe they are acting in good faith, but they simplified things so much that they lost some important nuance. This happens everywhere. How many times have you read a paper because of a good abstract, only to be disappointed once you actually read it?
I understand why people are fixated on this - it’s a flashpoint for the discussion on PR that people actually want to talk about. It’s worth talking about, but in the context of the work, from a robotics perspective, it really, really doesn’t matter. Other parts deserve more focus.
In the results reporting a 60% average solve rate, the Rubik’s Cube used is a modified version of a Xiaomi Giiker cube, which comes with Bluetooth sensors that report the rotation angles of each face. The sensors in the original Giiker cube report face angles at resolution, so they modify it by replacing some components to get to resolution.
I didn’t care about the solver. I do care about this, because I couldn’t find anywhere in the blog post where it was clarified that the 60% solve rate required these Bluetooth sensors. My assumption before reading the paper was that vision and domain randomization were used to predict both the pose of the cube and the angles of faces on that cube. They do have some pure vision results, and from purely vision the solve rate is 20% on average solves and 0% on the hardest solves.
I don’t have any particular problem with the added sensors, but I am let down, because it plays into a bigger theme: for all the claims of sim2real transfer, there’s a lot of simulator details they had to get right in the paper.
I remember when I first heard about domain randomization. I thought it was going to fix everything. Then I tried it myself, and talked to people who tried it for their projects, and got more realistic.
The common explanation of domain randomization is that by adding lots of randomness to your simulator, you can be much looser in simulator design, system identification, and calibration. So far, I’d say this is only sort of true.
Consider the problem of contact forces. Now, I have very little experience with making physics simulators, but when I talk to colleagues with sim experience, contact forces make them break out in cold sweats. It’s simply very hard to exactly model the forces between objects that are touching each other. Somehow, there are always more interactions that are not properly modelled by the simulator.
The domain randomization viewpoint is that if you randomize parameters for interactions that are modeled (like friction, for example), then your model should generalize to real world dynamics without issue. And sometimes, this works. More commonly, the complexity that isn’t correctly modeled by your simulator is enough of a problem that you aren’t able to recover, no matter how much you randomize the simulator at train time.
Think of it this way. Suppose we were trying to model the movements of two magnets in simulation, but our simulator doesn’t model electromagnetic forces. It doesn’t matter how much you randomize friction or mass, you’re never going to predict the movements of those magnets to any reasonable degree of accuracy.
Obviously, actual simulators will model these forces and other ones if there’s reason to believe they’re relevant to the task. I’m just bringing it up as an example of a known unknown within the simulator. But what about the unknown unknowns?
If you look through sim2real papers, it’s not a coincidence that many of the best sim2real results are about sim2real transfer of vision, for tasks where dynamics either don’t matter or are simulated pretty well. When transfer learning is bottlenecked on vision, domain randomization is great! Convolutional neural nets are absurdly good, random textures and lighting is something almost all simulators support, and it seems like they do generalize pretty well.
Unfortunately, practically all interesting robot manipulation problems are bottlenecked on dynamics.
Bringing it back to the Rubik’s Cube result, there are a lot of simulator details mentioned in the paper. I missed this last year, but the same was true in the Learning Dexterity paper. There was some calibration to get the simulated hand to be reasonably close to the real one. From Appendix C.3 of the Learning Dexterity paper:
The MuJoCo XML model of the hand requires many parameters, which are then used as the mean of the randomized distribution of each parameter for the environment. Even though substantial randomization is required to achieve good performance on the physical robot, we have found that it is important for the mean of the randomized distributions to correspond to reasonable physical values. […] For each joint, we optimize the damping, equilibrium position, static friction loss, stiffness, margin,and the minimum and maximum of the joint range. For each actuator, we optimize the proportional gain, the force range, and the magnitude of backlash in each direction. Collectively, this corresponds to 264 parameter values.
The Rubik’s Cube paper further improves on the simulated hand model by adding some tendons and pulleys to the simulated hand model, to better match how the real Shadow Hand works. Appendix D.1 of the Rubik’s Cube paper helpfully includes the performance before and after this change.
An increase from 4.8 face rotations to 14.30 face rotations seems like a pivotal jump to me. For the cube itself, they mention needing to model the bevelled edges that appear on the real Rubik’s Cube, because otherwise the model is too unforgiving.
My understanding is that OpenAI treats zero-shot sim2real transfer as a non-negotiable part of their projects. This is consistent with their research direction: throw a bunch of compute at something that easily scales with compute, and see what you can do in that regime. If you want to do this in robotics, you have to mostly use simulation, because real robots don’t scale with compute.
So really, once you strip everything old away, remove the solver controversy, and exclude the setup-specific simulator design, what are we left with? We’re left with the story that automatic domain randomization is better than plain domain randomization, and domain randomization can work for solving a Rubik’s Cube, if you calibrate your simulator to be sort-of accurate, and instrument enough randomization parameters. Like cube size, and action delay, and action latency, and simulator timestep length, and frictions, and mass, and action noise, but remember that action noise from a random network works best, and, well, you get the picture. There’s a lot. It’s an undeniably cool results, and I’m impressed by the effort that went into setting up the simulator and randomization, but I’m not impressed by the improvements from that randomization.
Domain randomization isn’t a tool. Domain randomization is a paradigm, and a very useful one at that. But at a high level, it doesn’t fully remove simulator design. It just trades some software engineer design time for GPU training time, and the conversion rate depends on whether it’s easy to model a reasonable version of your task in simulation, and whether you have any systematically bad unmodeled effects. From this view, it doesn’t seem like the domain randomization reaches much further than it did last year. Instead, it’s mostly driven by learning how to set up a Rubik’s Cube simulator.
I’m glad that someone is looking at zero-shot sim2real transfer, but I’m unconvinced that it’s the right trade-off. Even a small number of real samples can be really useful for learning algorithms, and it seems like a waste to ignore them at learning time. Arguably, OpenAI is using these samples, when calibrating their simulator, but surely there should be a way to use those real samples in other parts of the learning process as well, in a way that lets you actually see some of the unknown unknowns that you can’t observe in a zero-shot setting.
Then again, I could be wrong here. Maybe compute grows fast enough to make domain randomization good enough for everything. I see two endgames here. In one, robot learning reduces to building rich simulators that are well-instrumented for randomization, then using ludicrous amounts of compute across those simulators. In the other, randomization is never good enough to be more than a bootstrapping step before real robot data, no matter what the compute situation looks like. Both seem plausible to me, and we’ll see how things shake out.