Recently, OpenAI announced they had gotten their dexterous manipulation system to solve a Rubik’s Cube. I thought I wouldn’t have much to say, until I started writing this.
What Did OpenAI Do?
Using reinforcement learning, they learned a controller for a Shadow Hand that lets them solve a Rubik’s Cube reasonably often. They report a success rate of 60% for average scrambles, and 20% for the hardest possible scrambles that require 26 quarter-face turns.
I say doing RL on the Shadow Hand platform, but really, I mean they do learning on a simulated version of the Shadow Hand, then try to get that to transfer to the real Shadow Hand with no real data. It’s a neat dexterous manipulation result.
Why is Dexterous Manipulation Hard?
As a general rule, robot hardware is terrible to work with, and simulators suck unless you spend a bunch of time improving them. This is especially true for robot hands, because they have way more degrees of freedom and complexity than simpler grippers.
OpenAI says they’ve been working on solving a Rubik’s Cube since May 2017. It took them 2 months (May 2017 - July 2017) to solve it in simulation with a simulated hand. It took them 1 more year (July 2017 - July 2018) to get a real hand to manipulate a solid wooden block. Then, another 14 months (July 2018 - October 2019) to get to the Rubik’s Cube. In other words, the “runs on a real robot” part is the entire reason this work is interesting.
Wasn’t This Robot Hand in the News Before?
It was! It showed up in their Learning Dexterity post from July 2018.
What Isn’t New?
A lot of it isn’t new. Everything about this paper looks like a “moonshot achieved through roofshots” project, where there’s a clear line of steady, compounding improvements from prior work.
The model is trained with distributed PPO using OpenAI’s Rapid framework, which was used for both OpenAI Five and the Learning Dexterity paper. The model architecture is heavily inspired by the DotA 2 architecture - each input feature is embedded into a 512-dimensional embedding space, and these embeddings are summed and passed through a large LSTM.
Like the Learning Dexterity work, instead of learning a policy directly on pixels through RL, they instead predict the pose of the Rubik’s Cube from three camera viewpoints, then feed those predicted poses to the RL agent.
Again, like the Learning Dexterity work, they use the same asymmetric actor-critic trick, where the critic gets all ground truth state information, and the policy only gets the features visible from real-world data, which is fine for zero-shot transfer because you only need the policy at inference time.
It’s the same ideas, likely even the same codebase, executed in a different context.
What Is New?
One is the automatic domain randomization. In domain randomization, you learn a model in several randomly sampled simulated environments, learning a final model that’s more robust and more likely to transfer to reality.
Applying domain randomization requires some tuning, both for deciding what to randomize, and deciding the ranges from which to sample parameters. Too little randomization, and the model won’t be robust to real-world noise. Too much, and the learning problem will become too difficult to learn.
Taking inspiration from automatic curriculum learning, they maintain a distribution over simulator parameters. This distribution starts as a single point. If the policy’s recent performance is above a threshold, the distribution is expanded to be wider. This lets us start from a simple problem, then expand its difficulty as the policy gets better at the task. (The distribution is never narrowed, and I assume you have to tune the performance threshold and how much you widen each dimension.)
Another detail which I only found after a close read was adversarial domain randomization. An adversary applies perturbations to the force, torque, and actions, in a way that hurts performance to mine hard examples. Or rather, that’s the theory, but in practice they found best results with a random adversary, which performed better than any learned adversary. This seems weird to me. I can believe the result, and at the same time it feels like a learned adversary should be better (but perhaps it’s tricky to tune it properly.)
Finally, although it’s not directly related, there is a heavier lean on policy distillation and model surgery. This is a lesson they’ve carried over from DotA 2. Over the course of a long project, you will naturally want to add new input features or try new neural net architectures. If the model takes a long time to train, it’s worth designing methods that let you do this without training from scratch. The reason they add embeddings together instead of concatenating them is because you can easily add a new feature without changing the shape of any existing weight matrices. This lets you avoid training entirely from scratch. (It does hurt reproducibility, but if you need to train models over several months, then you may have to make that concession.)
For architecture changes that do change layer sizes (like changing the LSTM size), the current model can be distilled into the new model architecture, which still gives some benefits, because model distillation is faster than training from scratch.
What Are the Pros Of This Work?
I mean, it works. That’s always worth celebrating. Based on their demos, the result is pretty robust. They’ve also done a lot of work on interpreting the model. It’s cool that by applying interpretability tools on the LSTM hidden state, they’re able to identify semantically meaningful clusters for cube manipulation. I know people have complaints over how they got the policy to work (more details on that in the next section), but I don’t think OpenAI has gotten enough credit for their analysis of the learned policy, and what emergent behaviors may appear from sufficiently big neural nets.
In general, I’ve found that people without robot learning experience are poorly calibrated on how much bullshit there is in getting a real robot learning setup to work. It’s always good to see something get there.
Finally, I know this is a weird thing to appreciate, but I actually like the policy distillation and model surgery aspects the most. Yes, the automatic domain randomization is nice, but of course automatic curriculum learning should perform better than sampling tasks uniformly at random.
The policy distillation and model surgery aren’t central to the project, but they are indicative of the correct research culture: a focus on design decisions that encourage long-term research success.
Based on what someone told me 2 years ago (I know, I know, it’s a bad source), OpenAI felt the academic community undervalued research infrastructure. So, they released Gym, and then they released Baselines. Both were places where OpenAI was in a position to provide value to the RL community.
Okay, yes, the obvious cynical point here is that Gym and Baselines are also great branding tools for OpenAI’s recruiting. This would have been true for any group that released something that got the adoption Gym or Baselines did, so I don’t think it’s a valid criticism. Besides, shouldn’t you want industry companies to release better deep learning libraries?
Research code is usually terrible, because you’re trying to move fast on an idea that may not even make sense. Designing everything properly the first time slows down your experimenting too much. However, never cleaning up anything is its own sin. People really underestimate the impact of good research infra, in my experience. I’m not saying it’s easy to build good tools. It’s absurdly difficult to build good tools. But if done properly, it pays off long-term, because they’re reusable in future projects. An RL diagnostics library can be re-used for every RL project, An interpretability library can be re-used for any project that wants interpretability. OpenAI has built both.
The observation that policy distillation is a tool that lets you warm-start any future model indicates that some people at OpenAI get it, and are thinking about it at multiple levels of a research project - both at the code level and the model architecture level. It’s cool and I wish people thought more about this.
What Are the Cons of This Work?
Skynet Today’s article has a good summary of some controversial points, along with their own take on things. It’s worth reading. Here are a few cons I want to point out.
Use of a Solver
The final robot controller is not learned entirely end-to-end from pixels to actions. There are two intermediate steps. Pose is estimated by sim2real transfer of a supervised learning problem, and the sequence of subgoals the policy should reach is outsourced to an existing solver (Kociemba’s algorithm).
These are fine. You can get reinforcement learning to learn to solve a Rubik’s Cube (see McAleer et al, 2018), but the most important part of this work is the sim2real transfer of a dexterous manipulation task. None of the manipulation problems are made easier if you use a solver. I don’t think the pose estimation is a problem either, since it’s learned from vision anyways.
What I’m less fine with is that the video OpenAI released never mentions this multistage approach. To quote directly from the narration,
“We’re trying to build robots that learn a little bit like humans do, by trial and error. What we’ve done is trained an algorithm to solve the Rubik’s Cube one-handed, with a robotic hand, which is actually pretty hard even for a human to do. We don’t tell it how the hand needs to move the cube in order to get there. The particular friction that’s on the fingers. How easy it is to turn the faces on the cube. What the gravity, what the weight of the cube is. All of these things it needs to learn by itself.”
To me, this reads as someone saying things that are consistent with the truth, but which leaves open interpretations that are stronger than the truth. The phrasing of “We don’t tell it how the hand needs to move the cube in order to [solve it]” certainly doesn’t imply any decomposition of the problem, and on my first listen, I had 3 reactions.
- They’re almost certainly using a solver because not doing so would be really silly.
- People who just watch the video will definitely be confused by this.
- That confusion may have been intentional.
It seems like people’s opinion on #3 is almost entirely defined by whether they believe OpenAI’s PR strategy is done in good faith. For what it’s worth, I believe they are acting in good faith, but they simplified things so much that they lost some important nuance. This happens everywhere. How many times have you read a paper because of a good abstract, only to be disappointed once you actually read it?
I understand why people are fixated on this - it’s a flashpoint for the discussion on PR that people actually want to talk about. It’s worth talking about, but in the context of the work, from a robotics perspective, it really, really doesn’t matter. Other parts deserve more focus.
In the results reporting a 60% average solve rate, the Rubik’s Cube used is a modified version of a Xiaomi Giiker cube, which comes with Bluetooth sensors that report the rotation angles of each face. The sensors in the original Giiker cube report face angles at resolution, so they modify it by replacing some components to get to resolution.
I didn’t care about the solver. I do care about this, because I couldn’t find anywhere in the blog post where it was clarified that the 60% solve rate required these Bluetooth sensors. My assumption before reading the paper was that vision and domain randomization were used to predict both the pose of the cube and the angles of faces on that cube. They do have some pure vision results, and from purely vision the solve rate is 20% on average solves and 0% on the hardest solves.
I don’t have any particular problem with the added sensors, but I am let down, because it plays into a bigger theme: for all the claims of sim2real transfer, there’s a lot of simulator details they had to get right in the paper.
I remember when I first heard about domain randomization. I thought it was going to fix everything. Then I tried it myself, and talked to people who tried it for their projects, and got more realistic.
The common explanation of domain randomization is that by adding lots of randomness to your simulator, you can be much looser in simulator design, system identification, and calibration. So far, I’d say this is only sort of true.
Consider the problem of contact forces. Now, I have very little experience with making physics simulators, but when I talk to colleagues with sim experience, contact forces make them break out in cold sweats. It’s simply very hard to exactly model the forces between objects that are touching each other. Somehow, there are always more interactions that are not properly modelled by the simulator.
The domain randomization viewpoint is that if you randomize parameters for interactions that are modeled (like friction, for example), then your model should generalize to real world dynamics without issue. And sometimes, this works. More commonly, the complexity that isn’t correctly modeled by your simulator is enough of a problem that you aren’t able to recover, no matter how much you randomize the simulator at train time.
Think of it this way. Suppose we were trying to model the movements of two magnets in simulation, but our simulator doesn’t model electromagnetic forces. It doesn’t matter how much you randomize friction or mass, you’re never going to predict the movements of those magnets to any reasonable degree of accuracy.
Obviously, actual simulators will model these forces and other ones if there’s reason to believe they’re relevant to the task. I’m just bringing it up as an example of a known unknown within the simulator. But what about the unknown unknowns?
If you look through sim2real papers, it’s not a coincidence that many of the best sim2real results are about sim2real transfer of vision, for tasks where dynamics either don’t matter or are simulated pretty well. When transfer learning is bottlenecked on vision, domain randomization is great! Convolutional neural nets are absurdly good, random textures and lighting is something almost all simulators support, and it seems like they do generalize pretty well.
Unfortunately, practically all interesting robot manipulation problems are bottlenecked on dynamics.
Bringing it back to the Rubik’s Cube result, there are a lot of simulator details mentioned in the paper. I missed this last year, but the same was true in the Learning Dexterity paper. There was some calibration to get the simulated hand to be reasonably close to the real one. From Appendix C.3 of the Learning Dexterity paper:
The MuJoCo XML model of the hand requires many parameters, which are then used as the mean of the randomized distribution of each parameter for the environment. Even though substantial randomization is required to achieve good performance on the physical robot, we have found that it is important for the mean of the randomized distributions to correspond to reasonable physical values. […] For each joint, we optimize the damping, equilibrium position, static friction loss, stiffness, margin,and the minimum and maximum of the joint range. For each actuator, we optimize the proportional gain, the force range, and the magnitude of backlash in each direction. Collectively, this corresponds to 264 parameter values.
The Rubik’s Cube paper further improves on the simulated hand model by adding some tendons and pulleys to the simulated hand model, to better match how the real Shadow Hand works. Appendix D.1 of the Rubik’s Cube paper helpfully includes the performance before and after this change.
An increase from 4.8 face rotations to 14.30 face rotations seems like a pivotal jump to me. For the cube itself, they mention needing to model the bevelled edges that appear on the real Rubik’s Cube, because otherwise the model is too unforgiving.
My understanding is that OpenAI treats zero-shot sim2real transfer as a non-negotiable part of their projects. This is consistent with their research direction: throw a bunch of compute at something that easily scales with compute, and see what you can do in that regime. If you want to do this in robotics, you have to mostly use simulation, because real robots don’t scale with compute.
So really, once you strip everything old away, remove the solver controversy, and exclude the setup-specific simulator design, what are we left with? We’re left with the story that automatic domain randomization is better than plain domain randomization, and domain randomization can work for solving a Rubik’s Cube, if you calibrate your simulator to be sort-of accurate, and instrument enough randomization parameters. Like cube size, and action delay, and action latency, and simulator timestep length, and frictions, and mass, and action noise, but remember that action noise from a random network works best, and, well, you get the picture. There’s a lot. It’s an undeniably cool results, and I’m impressed by the effort that went into setting up the simulator and randomization, but I’m not impressed by the improvements from that randomization.
Domain randomization isn’t a tool. Domain randomization is a paradigm, and a very useful one at that. But at a high level, it doesn’t fully remove simulator design. It just trades some software engineer design time for GPU training time, and the conversion rate depends on whether it’s easy to model a reasonable version of your task in simulation, and whether you have any systematically bad unmodeled effects. From this view, it doesn’t seem like the domain randomization reaches much further than it did last year. Instead, it’s mostly driven by learning how to set up a Rubik’s Cube simulator.
I’m glad that someone is looking at zero-shot sim2real transfer, but I’m unconvinced that it’s the right trade-off. Even a small number of real samples can be really useful for learning algorithms, and it seems like a waste to ignore them at learning time. Arguably, OpenAI is using these samples, when calibrating their simulator, but surely there should be a way to use those real samples in other parts of the learning process as well, in a way that lets you actually see some of the unknown unknowns that you can’t observe in a zero-shot setting.
Then again, I could be wrong here. Maybe compute grows fast enough to make domain randomization good enough for everything. I see two endgames here. In one, robot learning reduces to building rich simulators that are well-instrumented for randomization, then using ludicrous amounts of compute across those simulators. In the other, randomization is never good enough to be more than a bootstrapping step before real robot data, no matter what the compute situation looks like. Both seem plausible to me, and we’ll see how things shake out.
Sorta Insightful turns four years old today! Whether you were here from the beginning, or just discovered this blog, thanks for reading.
I like to do an annual meta-post every year about blogging. This year’s post is a bit rushed since I’ve been doing MUMS Puzzle Hunt the past 2 weeks, but I knew the risks going in.
Last year, I wrote 24,449 words. This year, I wrote 21,878 words.
In all honesty, I’m surprised it isn’t lower. I haven’t spent as much time blogging this year.
I wrote 10 posts this year, under my trend of 1 post a month on average. This counts the AlphaStar post as a single post, even though it was split into 2 parts, because both parts were written and released at the same time.
These are the view counts from August 18, 2018 to today, for the posts I’ve written this year.
When you remove outliers, this is higher than last year. I’m pleasantly surprised my Neopets economy post got solid viewership. It’s one of my favorites. I mean, on a reread, I’d change a bunch of things about the writing, but that’s what normally happens when you read your old work.
Time Spent Writing
Excluding time spent on this post, my time tracker says I’ve spent 88 hours, 12 minutes writing for my blog this year. This is about 2/3rds of the time I spent last year, and explains why it felt like I didn’t blog as much this year: I genuinely didn’t!
The word count doesn’t reflect this, but that’s expected. The rather frustrating thing about writing is the complete lack of correlation between time spent and words produced. If anything, you spend a lot of time making sure you use fewer words. The better you do this, the less people recognize it.
If I am to speak ten minutes, I need a week for preparation; if fifteen minutes, three days; if half an hour; two days; if an hour, I am ready now.
(Woodrow Wilson. Well, maybe..)
Posts in Limbo
To be honest, I do still want to write that post about Gunnerkrigg Court. I’ve been talking about writing that post for over two years. One day, it’ll happen. It has to.
Oh, past me, how wrong you were.
My interest in Gunnerkrigg Court has cooled down a lot ever since the conclusion of the Jeanne arc. Don’t get me wrong, it’s still a great webcomic. However, it’s been less emotionally resonant for me this year. There have been a few big events, but right now it feels like the comic is in search of a new overarching mystery to guide the plot. The chapters so far have felt like aftershocks of the Jeanne conclusion, rather than a new plot. I’m sure it’ll get to a new driving question soon, but with the story unfolding at Webcomic Time, it’s going to take a while to get there.
I still want to write a Gunnerkrigg Court post. But I want to write it in the sense that I like the idea of writing it. That isn’t the same as actually doing it.
One way to view decision making is that your choices depend on two things: your mental state, and your environment. Side projects, like this blog, usually don’t depend on the environment, because they have little to no deadlines or accountability.
I, at some point, reach a mental state where I think “Gee, I should write a post about Gunnerkrigg Court”. But I’m busy, or want to write other posts, so I don’t write about Gunnerkrigg Court. This decision sends me on a trajectory that arrives at the same mental state where I think “Gee, I should write a post about Gunnerkrigg Court”.
Because there’s no accountability, unless something has changed in my thinking since last time, my next decision will be the same: I’ll say I’m busy and do something else. Since my choice was the same, it sends me on the same trajectory as before, that arrives at the same point as before, creating a loop. Until something changes, my behavior will follow the same cyclic pattern where nothing happens.
For decisions that depend on the environment, you can break these loops by changing the environment. Deadlines are a good example of this. I know that I want to make a post for my blogging anniversary. This deadline is part of the environment. It arrives on August 18th, it’s inevitable, and there’s no way to change it. That deadline stops me from putting it off forever, because it changes the state. First I have a week to write the post. Then, I have six days. Five days, four days - eventually, the time limit becomes dire enough that I can’t avoid it. If there isn’t a deadline, or if I don’t respect the self-imposed deadline, then this doesn’t work. It’s easy to get stuck when the state doesn’t change.
The hard way to break these loops and not get stuck is to simply resolve not to get stuck and Actually Do Things. This means recognizing that if you don’t work on something now, you may never work on it in the future, because you’re implicitly deciding what you’ll do in any future decision point sufficiently similar to the current one. Maybe you’re okay with that. Maybe you aren’t. If you aren’t, then you should just Do The Thing, or at least Start Doing The Thing. Of course, it’s hard to Actually Do Things, but it’s why I respect people who do so.
(A lot of this section is inspired by conversations I’ve had with friends that self-identify as rationalists, who tried explaining timeless decision theory to me. I didn’t get timeless decision theory at all, but it turns out that you don’t have to understand something to get useful ideas out of it.)
I’m planning to make more of an effort on blogging this year. In case I don’t write any of the planned posts, here are some notes on posts I’ve been meaning to write. That way, if they never materialize, you can fill in the blanks yourself.
Topic: A post about measurement. Not at the level of “how is a meter defined”, although I do like that discussion. It would be more about how measurements are inherently imperfect pictures of the reality we’re trying to measure, where imperfections are created by the process generating the measurements. Understanding this is important for explaining why statistics and machine learning can give misleading results, as well as the limitations and risks of ML.
Why I want to Write It: It’s a useful mental concept that explains why more data isn’t always good. It’s something that experienced ML people understand, but it takes a while for the lessons to sink in. As machine learning gets applied to more of society, it’s crucially important that non-ML people become ML-literate. Not at the level of understanding the math behind machine learning, but at the level of understanding the concerns of blindly applying machine learning to arbitrary problem domains.
Estimated Time to Complete: Very high. I’ve thought about this a while, and could write a bad post pretty quickly, but I want to explain it right, and that’ll require a lot of careful writing.
Odds I Write This, Eventually: 95%
Odds I Write This by the End of 2019: 20%
Topic: A post about Gunnerkrigg Court. What it does well, why I find its story fun, and why you should read it.
Why I Want to Write It: Gunnerkrigg Court is one of my favorite webcomics of all time, and although I have a short blurb about it on my Webcomics Recommendation page, I’ve felt I have more to say about it than just that blurb suggests.
Estimated Time to Complete: Medium-High. I’m not sure what I want to say but I don’t think there’s a lot of it.
Odds I Write This, Eventually: 70%
Odds I Write This by the End of 2019: 25%
My Little Pony
Topic: A post about attending BronyCon 2019, and my thoughts on what My Little Pony means to me and the community that grew around it, along with what I’ve learned about how communities are created and become self-sustaining.
Why I Want to Write It: I’ve been meaning to write a post about My Little Pony for a while, but had trouble deciding where to begin. It was too big for me to explain. I think I can use BronyCon 2019 as a lens to focus thoughts into more specific ideas, and if I have unused material I can make follow-on posts later.
Estimated Time to Complete: High. Will likely involve lots of introspecting, and figuring out the right phrasings for implicit concepts that I haven’t put into words before.
Odds I Write This, Eventually: 85%
Odds I Write This by the End of 2019: 50%
Topic: A post about the history of Dominion Online, and the many trials it’s been through. This is the same post I mentioned in a previous post.
Why I Want to Write It: I’ve been part of the competitive Dominion community for many years. The list of people who have both lived through all the drama and want to document it is very, very short. It’s pretty much just me, as far as I know, so if I don’t write it, no one will.
Estimated Time to Complete: High. I know exactly what I want to say, but there are a lot of sources to look up, and the sheer length of what I want to write is daunting. The only reason this isn’t “Very high” is because I have the first 1/3rd written already.
Odds I Write This, Eventually: 85%
Odds I Write this By the End of 2019: 35%
What Are Lagrange Multipliers?
Lagrange multipliers are a tool for doing constrained optimization. Say we are trying to minimize a function , subject to the constraint .
To solve this, you define a new function.
The optimum lies at a stationary point of (a point where the gradient in and is both zero).
This is all true, but it doesn’t explain why it works.
Why Do Lagrange Multipliers Work?
Let’s consider a variant of the problem. We want to minimize subject to .
Let’s define the following min-max optimization problem.
I claim the solution of this optimization problem occurs at the smallest that satisfies the constraint . Why?
As written, we first choose an , then choose a that maximizes the objective, and we want to choose an that minimizes how much an adversarial can hurt us. Suppose we are violating the constraint . Then we have . At such an , , so we can pick to drive the objective value to .
If we are not violating the constraint, then is or negative, and since is constrained to , the optimal is , giving objective value .
In other words, the solution of the unconstrained problem is the same as the solution to the original constrained problem.
This handles the case. What if we have multiple constraints?
We can define a similar min-max problem by adding a Lagrange multiplier for each constraint .
By a similar argument, the optimal solution to this min-max problem occurs at , and at the solution of the original constrained optimization problem, assuming it exists. If either constraint was violated, then we could have driven the corresponding to , like before.
What if we have a constraint ? We can rearrange it to the constraint .
This is solved the same way.
If we have a constraint , we can negate both sides to get . If we have a constraint , we can rearrange it to . This reduces everything to the case we know how to solve.
Bringing it back to the equality case, let’s say we have a constraint . How do we reduce this to the previous cases? This equality is the same as the pair of constraints , which are only both satisfied at . This time, I’ll write the Lagrange multipliers as and
Like before, if we ever have , then we can choose such that the objective value shoots up to . But, since and are just negations of each other, we can simplify this further.
It’s possible to make equal any real number while still satisfying , so let’s just replace with , and say can be any real number instead of only nonnegative ones.
Now, we’re back to the equality case we started at. The solution to this must lie at a saddle point where the gradient with respect to and is both zero.
Why Do I Like This Explanation?
When I was re-learning Lagrange multipliers a while back, I was upset that all the most popular results were targeted towards people taking their first multivariate calculus course. These explanations exclusively covered the case, and the constraints I wanted to add were more like .
It’s a shame that most people’s first introduction to Lagrange multipliers only covers the equality case, because inequality constraints are more general, the concepts needed to understand that case aren’t much harder, and it’s clearer how you’d apply an optimization algorithm to solve the resulting unconstrained problem. Like all other min-max optimization (GANs, actor-critic RL, etc.), doing alternating gradient descent updates on then is both simple and works out fine.
This procedure doesn’t guarantee that your constraint is met over the entire optimization process, but it does quickly penalize the optimization for violating the constraint, since the for that constraint will quickly rise and keep rising until the constraint is satisfied again, at which point will quickly regress towards until it is needed once more.
As one final detail, handling inequalities requires handling optimization over . The common trick is to rewrite as or , then do gradient descent over variable instead.