OpenAI recently announced that a team of five Dota 2 agents has successfully beaten an amateur team. It’s a pretty exciting result and I’m interested to see where it goes from here.
When OpenAI first revealed they were working on Dota 2, there was a lot of buzz, a lot of hype, and a lot of misunderstanding that compelled me to write about it. This time, I have fewer questions and less compulsion to set the record straight, so to speak. The blog post has enough details to satisfy me, and the reaction hasn’t been as crazy. (Then again, I haven’t been reading the pop science press, so who knows…)
I’m pretty busy this week, so instead of trying to organize my thoughts, I’m just going to throw them out there and see what happens. This post is going to be messy, and may not make sense. I typed this out over about an hour and didn’t think too hard about my word choice. Everything in it makes sense to me, but that doesn’t mean anything - everything you write makes sense to you.
(If you haven’t read the OpenAI announcement post, you should do so now, or else this will make even less sense.)
* * *
This result came a bit earlier than I thought it would, but not by a lot. I’m not sure exactly when I was expecting to hear that 5v5 was looking solvable, but when I heard the news, I realized I wasn’t that surprised.
The post clarifies that yes, the input is a large number of game state features coming from the Dota 2 API, and isn’t coming from vision. The agent’s ability to observe the game is well beyond any human capability. I said this before and will say it again: this is totally okay and I have no problems with it.
On the communication front, I was expecting the problem to require at least some communication. Not at the level of the multi-agent communication papers, where people try to get agents to learn a language to communicate goals. I was thinking of something like every agent receiving the actions every other agent made at each time step. That isn’t happening here. It’s just five LSTMs, each deciding its own actions. The only direct encouragement for teamwork is that the reward of each agent is defined by a “team spirit” parameter that decides how important the team’s reward is to the individual. The fact that a single float is good enough is pretty interesting…
…Well, until I thought about it a bit more. By my understanding, the input state of each agent is the properties of every unit in the team’s vision. This includes health, attack, orientation, level, cooldowns of all their skills, and more. And your teammates are always in your team’s vision. So, odds are you can reconstruct the actions from the change in state. If they changed location, they moved. If they just lost mana and one of their spells’ cooldowns just increased, they just used a skill.
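To make that concrete, here is a minimal sketch of inferring a teammate’s action from two consecutive observations. The `UnitState` fields and the `infer_actions` helper are hypothetical names I made up for illustration; they are not the Dota 2 API or anything from the OpenAI post, and a real observation has far more fields.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class UnitState:
    # Hypothetical, heavily simplified slice of one unit's observed state.
    position: Tuple[float, float]
    mana: float
    cooldowns: Dict[str, float]  # skill name -> seconds of cooldown remaining


def infer_actions(prev: UnitState, curr: UnitState) -> List[str]:
    """Guess what a unit did between two observations from the state change."""
    actions = []
    # A change in location implies a move.
    if curr.position != prev.position:
        actions.append("moved")
    # Mana went down and a cooldown went up: that skill was probably used.
    lost_mana = curr.mana < prev.mana
    for skill, remaining in curr.cooldowns.items():
        if lost_mana and remaining > prev.cooldowns.get(skill, 0.0):
            actions.append(f"used {skill}")
    return actions


before = UnitState(position=(0.0, 0.0), mana=100.0, cooldowns={"q": 0.0, "w": 0.0})
after = UnitState(position=(1.0, 0.0), mana=60.0, cooldowns={"q": 8.0, "w": 0.0})
print(infer_actions(before, after))
```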
In this respect, it feels like the state definition is rich enough that emergent cooperative behavior isn’t that surprising. There’s no theoretical limit to the potential teamwork - what would team captains give to have the ability to constantly understand everything the API can give you?
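For reference, here is one way a single “team spirit” float could mix individual and team reward. This is a sketch under my own assumption that the mix is a linear interpolation between each agent’s own reward and the team average; the exact formula OpenAI uses may differ, and `blend_rewards` is a made-up name.

```python
from typing import List


def blend_rewards(individual_rewards: List[float], team_spirit: float) -> List[float]:
    """Blend each agent's own reward with the team's mean reward.

    Assumed form (not confirmed by the source): team_spirit = 0 means every
    agent only cares about itself, team_spirit = 1 means every agent only
    cares about the team average.
    """
    if not 0.0 <= team_spirit <= 1.0:
        raise ValueError("team_spirit must be in [0, 1]")
    team_mean = sum(individual_rewards) / len(individual_rewards)
    return [
        (1.0 - team_spirit) * reward + team_spirit * team_mean
        for reward in individual_rewards
    ]


# One hero got a kill worth 1.0; the other four got nothing this step.
rewards = [1.0, 0.0, 0.0, 0.0, 0.0]
for spirit in (0.0, 0.5, 1.0):
    print(spirit, blend_rewards(rewards, spirit))
```

Under this form the total reward across the team is the same for any value of the parameter. Only how it is split between agents changes.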
Compute-wise, there’s a lot of stuff going on: 256 GPUs, each contributing to a large synchronous batch of over a million observations. That is one of the largest batch sizes I’ve seen, although from a memory standpoint it might be smaller than a large batch of images. A Dota 2 observation is 20,000 floats. A 256 x 256 RGB image is approximately 200 thousand bytes.
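Here is the back-of-the-envelope arithmetic behind that comparison. It assumes 4-byte (float32) observations and 1 byte per color channel for the image; neither is stated in the OpenAI post, so treat the numbers as rough.

```python
def observation_bytes(num_floats: int = 20_000, bytes_per_float: int = 4) -> int:
    # Assumption: each observation feature is stored as a 4-byte float32.
    return num_floats * bytes_per_float


def rgb_image_bytes(height: int = 256, width: int = 256, bytes_per_channel: int = 1) -> int:
    # Assumption: 3 color channels at 1 byte (uint8) each.
    return height * width * 3 * bytes_per_channel


obs = observation_bytes()
img = rgb_image_bytes()
print(f"one observation: {obs:,} bytes")
print(f"one 256x256 RGB image: {img:,} bytes")
print(f"image / observation ratio: {img / obs:.2f}")
```

So a single observation is a bit less than half the size of a single small image under these assumptions, which is why a million-observation batch is less scary memory-wise than it sounds.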
(I assume the reason it’s using synchronous training is because async training starts getting really weird when you scale up the number of GPUs. My understanding is that you can either hope the time delays aren’t too bad given the number of GPUs you have, or you can try doing something like HOGWILD, or you can say “screw it” and just do synchronous training.)
Speaking of saying “screw it” and doing the thing that will clearly scale, it’s interesting that plain PPO is just good enough so far. I’m most surprised by the time horizon problem. The partial observability hurts, but empirically it was doable for the Dota 1v1 bot. The high dimensional action / observation space didn’t feel like obstacles to me - they looked annoying but didn’t look impassable. But the long time horizons problem felt hard enough that I expected it to require something besides just PPO.
This seems to have parallels to the Retro Contest results, where the winning entries were just tuned versions of PPO and Rainbow DQN. In the past, I’ve been skeptical of the “hardware hypothesis”, where the only thing stopping AI progress is faster computers. At the time, I said I thought the split in AI capabilities was about 50-50 between hardware and software. I’m starting to lean towards the hardware side, updating towards something like 60-40 for hardware vs software. There are an increasing number of results where baseline algorithms just work if you try them at the right scale, enough that I can’t ignore them.
One thing I like to joke about is that everyone who does reinforcement learning eventually decides that we need to solve hierarchical reinforcement learning and exploration. Like, everybody. And the problem is that they’re really hard. So from a practitioner perspective, you have two choices. One is to pursue a risky research project on a difficult subject that could pan out, but will likely be stuck on small problems. The other option is to just throw more GPUs at it.
It’s not that we should give up on hierarchical RL and the like. It’s more that adding more hardware never hurts and likely helps, and even if you don’t need the scale, everyone likes it when their models train faster. This makes it easier to justify investing time into infrastructure that enables scale. Models keep getting bigger, so even if it doesn’t pay off now, it’ll pay off eventually.
* * *
I’d like to end this post with a prediction.
The team’s stated goal is to beat a Pro team at The International, August 20-25, with a limited set of heroes (presumably the same hardcoded team mentioned in the footnote of the post). I think OpenAI has a decent shot, about 50%.
To explain my thinking a bit more, everything about the progress and skill curves so far suggests to me that the learning algorithm isn’t hitting a plateau. For whatever reason, it seems like the Dota 2 skill level will continue to increase if you give it more training time. It may increase at a slower rate over time, but it doesn’t seem to stop.
Therefore, the question to me isn’t about whether it’s doable, it’s about whether it’s doable in the 2 months (60 days) they have left. Based on the plots, it looks like the current training time is around 7-19 days, and that leaves some breathing room for catching bugs and the like.
Funnily enough, my guess is that the main blocker isn’t going to be the learning time, it’s going to be the software engineering time needed to remove as many restrictions as possible. For the match at The International, I’d be very disappointed if wards and Roshan were still banned - it seems ridiculous to ask a pro team to play without either of those. So let’s assume the following:
- Both wards and Roshan need to be implemented before the match.
- The policy needs to be trained from scratch to learn how to ward and how to play around Roshan.
- After wards and Roshan get implemented, there will be a crazy bug of some sort that will hurt learning until it gets fixed, possibly requiring a full restart of the training job.
Assuming all of the above is true, model training for The International can’t proceed until all this software engineering gets done, and that doesn’t leave a lot of time to do many iterations.
(Of course, I could be wrong - if OpenAI can finetune their Dota 2 bots instead of training from scratch, all the math gets a lot nicer.)
Whatever way the match goes, I expect it to be one-sided, one way or the other. There’s a narrow band of skill level that leads to an even match, and it’s much more likely that it falls outside of that band. Pretty excited to see who’s going to win and who’s going to get stomped!
In the span of just under a month, I attended two conferences, ICLR 2018 and ICRA 2018. The first is a deep learning conference, and the second is a robotics conference. They were pretty different, and I figured it would be neat to compare the two.
From the research side, the TL;DR of ICLR was that adversarial learning continues to be a big thing.
The most popular thing in that sphere would be generative adversarial networks. However, I’m casting a wide umbrella here, one that includes adversarial examples and environments with competing agents. Really, any minimax optimization problem of the form $\min_x \max_y f(x, y)$ counts as adversarial learning to me.
I don’t know if it was actually popular, or if my memory has selective bias, because I have a soft spot for these approaches. They feel powerful. One way to view a GAN is that you are learning a generator by using a learned implicit cost instead of a human defined one. This lets you adapt to the capabilities of your generator and lets you define costs that could be cumbersome to explain by hand.
Sure, this makes your problem more complicated. But if you have strong enough optimization and modeling ability, the implicitly learned cost gives you sharper images than other approaches. And one advantage of replacing parts of your system with learned components is that advances in optimization and modeling ability apply to more aspects of your problem. You are improving both your ability to learn cost functions and your ability to minimize those learned costs. Eventually, there’s a tipping point where it’s worth adding all this machinery.
From a more abstract viewpoint, this touches on the power of expressive, optimizable function families, like neural nets. Minimax optimization is not a new idea. It’s been around for ages. The new thing is that deep learning lets you model and learn complicated cost functions on high-dimensional data. To me, the interesting thing about GANs isn’t the image generation, it’s the proof-of-concept they show on complicated data like images. Nothing about the framework requires you to use image data.
There are other parts of the learning process that could be replaced with learned methods instead of human-defined ones, and deep learning may be how we do so. Does it make sense to do so? Well, maybe. The problem is that the more you do this, the harder it becomes to actually make everything learnable. No point making it be turtles all the way down if your turtles become unstable and collapse.
There was a recent Quanta article, where Judea Pearl expressed his disappointment that deep learning was just learning correlations and curve fitting, and that this doesn’t cover all of intelligence. I agree with this, but to play devil’s advocate, there’s a chance that if you throw enough super-big neural nets into a big enough vat of optimization soup, you would learn something that looks a lot like causal inference, or whatever else you want to count as intelligence. But now we’re rapidly approaching philosophy land, so I’ll stop here and move on.
From an attendee perspective, I liked having lots of poster sessions. This is the first time I’ve gone to ICLR. My previous ML conference was NIPS, and NIPS just feels ridiculously large. Checking every poster at NIPS doesn’t feel doable. Checking every poster at ICLR felt possible, although whether you’d actually want to do so is questionable.
I also appreciated that corporate recruiting didn’t feel as ridiculous as NIPS. At NIPS, companies were giving out fidget spinners and slinkies, which was unique, but the fact that companies needed to come up with unique swag to stand out felt…strange. At ICLR, the weirdest thing I got was a pair of socks, which was odd but not too outlandish.
Papers I noted to follow up on later:
- Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play
- Learning Robust Rewards with Adversarial Inverse Reinforcement Learning
- Policy Optimization by Genetic Distillation
- Measuring the Intrinsic Dimension of Objective Landscapes
- Eigenoption Discovery Through the Deep Successor Representation
- Self-Ensembling for Visual Domain Adaptation
- TD or not TD: Analyzing the Role of Temporal Differencing in Deep Reinforcement Learning
- Online Learning Rate Adaptation with Hypergradient Descent
- DORA The Explorer: Directed Outreaching Reinforcement Action-Selection
- Learning to Multi-Task by Active Sampling
ICRA 2018 was my first robotics conference. I wasn’t sure what to expect. I started research as an ML person, and then sort of fell into robotics on the side, so my interests are closer to learning-for-control instead of making-new-robots. My ideal setup is one where I can treat real-world hardware as an abstraction. (Somewhere, a roboticist weeps.)
This plus my spotty understanding of control theory meant that I was unfamiliar with a lot of the topics at the conference. Still, there were plenty of learning papers, and I’m glad I went.
Of the research that I did understand, I was surprised there were so many reinforcement learning papers. It was mildly entertaining to see that almost none of them used purely model-free RL. One thing about ICRA is that your paper has a much, much better chance of getting accepted if it runs on a real-world robot. That forces you to care about data efficiency, which puts a super heavy bias against doing only model-free RL. When I walked around, I kept hearing “We combine model-free reinforcement learning with X”, where X was model-based RL, or learning from human demonstrations, or learning from motion planning, or really anything that could help with the exploration problem.
At a broader level, the conference has a sense of practicality about it. It was still a research conference, and plenty of it was still very speculative, but it also felt like people were okay with narrow, well-targeted solutions. I see this as another consequence of having to use real hardware. You can’t ignore inference time if you need to run your model in real time. You can’t ignore data efficiency if you need to collect it from a real robot. Real hardware does not care about your problems.
(1) It Has To Work.
(2) No matter how hard you push and no matter what the priority, you can’t increase the speed of light.
This surprises a lot of ML people I talk to, but robotics hasn’t fully embraced ML the way that people at NIPS / ICLR / ICML have, in part because ML doesn’t always work. Machine learning is a solution, but it’s not guaranteed to make sense. The impression I got was that only a few people at ICRA actively wanted ML to fail. Everyone else is perfectly okay with using ML, once it proves itself. And in some domains, it has proved itself. Every perception paper I saw used CNNs in one way or another. But significantly fewer people were using deep learning for control, because that’s where things are more uncertain. It was good to hear comments from people who see deep learning as just a fad, even if I don’t agree.
Like ICLR, there were a lot of companies doing recruiting and hosting info booths. Unlike ICLR, these booths were a lot more fun to browse. Most companies brought one of their robots to demo, and robot demonstrations are always fun to watch. It’s certainly more interesting than listening to the standard recruiting spiels.
At last year’s NIPS, I noted that ML company booths were starting to remind me of Berkeley career fairs, in a bad way. Every tech company wants to hire Berkeley new grads, and in my last year, recruiting started to feel like an arms race on who can give out the best swag and best free food. It felt like the goal was to look like the coolest company possible, all without telling you what they’d actually hire you for. And the ML equivalent of this is to host increasingly elaborate parties at fancy bars. Robotics hasn’t gone as far yet. It’s growing, but not with as much hype.
I went to a few workshop talks where people talked about how they were using robotics in the real world, and they were all pretty interesting. Research conferences tend to focus on discussing research and networking, which makes it easy to forget that research can have clear, immediate economic value. There was a Robots in Agriculture talk about using computer vision to detect weeds and spray weed killer on just the weeds, which sounds like all upside to me. Uses less weed killer, kills fewer crops, slows down growth of herbicide resistance.
Rodney Brooks had a nice talk along similar lines, where he talked about the things needed to turn robotics into a consumer product, using the Roomba as an example. According to him, when designing the Roomba, they started with the price, then molded all the functionality towards that price. It turns out a couple hundred dollars gives you very little leeway for fancy sensors and hardware, which places tight limits on what you can do in on-device inference.
(His talk also had a rant criticizing HRI research, which seemed out of place, but it was certainly entertaining. For the curious, he complained about people using too much notation to hide simple ideas, large claims that weren’t justified by the sample sizes used in the papers, and researchers blaming humans for irrational behavior when they didn’t match the model’s predictions. I know very little about HRI, so I have no comment.)
Organization-wise, it was really well run. The conference center was right next door to a printing place, so at registration time, the organizers said that if you emailed a PDF by a specific date, they would handle all ordering logistics. All you had to do was pay for your poster online and pick it up at the conference. All presentations were given at presentation pods, each of which came with a whiteboard and a shelf where you could put a laptop to play video (which is really important for robotics work).
Papers I noted to follow up on later:
- Applying Asynchronous Deep Classification Network and Gaming Reinforcement Learning-Based Motion Planner to a Mobile Robot
- OptLayer - Practical Constrained Optimization for Deep Reinforcement Learning in the Real World
- Synthetically Trained Neural Networks for Learning Human-Readable Plans from Real-World Demonstrations
- Semantic Robot Programming for Goal-Directed Manipulation in Cluttered Scenes
- Interactive Perception: Leveraging Action in Perception and Perception in Action
I’ve played Magic: the Gathering on and off for nearly 15 years. It’s a great card game, with tons of depth. Its only downside is that it can get pretty expensive. So when I heard Wizards of the Coast was working on a free-to-play version called MTG Arena, I signed up for the beta. I was lucky enough to get an invite, and the beta recently went out of NDA, so I figured I’d give some first impressions.
This is the first digital implementation of Magic I’ve ever played. So far, the implementation feels smooth. The animations add a bit without getting in the way, and rules-wise I’ve yet to hit any problems. I have run into issues with the auto-tap though. In one game, I was playing a UB control deck, and played one of my finishers with Cancel backup. I didn’t realize the auto-tap had tapped all my Islands until my opponent’s turn, and it made me lose. But I chalk that up to unfamiliarity with the interface. It was a one-off mistake and I haven’t lost like that since.
At times, I’ve also found it annoying to clear the stack when a lot of triggers happen at once, but I’m willing to accept that as a consequence of Magic’s rules engine. It doesn’t “pop” as much as Hearthstone does, but the core gameplay is a lot more interesting to me, and that’s what’s bringing me back.
The experience right after the NDA drop was pretty magical. Everyone’s account got wiped, and you can’t spend real money yet. Not only was everyone on a level playing field, no one could pay-to-win their way to a strong deck. The end result was like a massive, worldwide Sealed league. For the uninitiated, Sealed is a Magic format where every player gets 6 booster packs and builds a deck out of those cards. A Sealed league is a Sealed tournament that runs over several weeks. Every 1-2 weeks, players add an extra booster pack to their card pool.
Arena is working in almost exactly the same way, thanks to the slow unlock rate. And therein lies the problem. Most people have terrible decks, because it’s currently very difficult to build good ones, even if you spend a lot of time playing the game.
Now, I was planning on writing a post complaining about the economy, but then I read a really good post that covered everything I wanted to cover, and I realized I had nothing to add. Instead, I’ll share some points I realized.
I have a lot more respect for Hearthstone’s core economy. I don’t like where Hearthstone’s gameplay has gone, but the core dusting and crafting mechanics are well designed. 30 card decks with at most 2 copies of a card makes it easier to build a collection. The 3rd copy of every card can be disenchanted for free, and the 1st and 2nd copy can be disenchanted too if you don’t think that card will be useful in the future. In MTG Arena, I have to open 4 copies of a common before I can make progress towards the Vault, which is Arena’s equivalent of disenchanting.
The developers of MTG Arena said they decided against a disenchant system because it created feel-bad moments when people disenchanted cards they needed later. That’s true, but in its place they’ve created feel-bad moments when players open cards they don’t want, with little choice on how to turn them into cards they do want. I own several commons where I have 3 unplayed copies of the same card, and I can’t do anything with them.
At a broader level, I’ve started appreciating the fragility of things. The best part of my MTG Arena experience was at the beginning, when everything was new, people were still figuring out the meta, and draft chaff decks were competitive. Nothing about that environment was going to last, but I didn’t expect it to. In many ways it reminds me of the early years of the brony fandom. A ton of ridiculous stuff happened, and no one knew where the fandom was going, but just being on the ride was exciting. The first BronyCon must have been insane, because I doubt there was a good understanding of what a brony convention should aspire to be.
The fandom has cooled down since then. Content creators settled in. Conventions have become more like institutions. Season 3 didn’t help, given that it was disappointing compared to Seasons 1 and 2. The fandom’s still going - Season 8 premiered last week - but it’s condensed into something that’s lost a lot of its initial magic.
The question is whether people should have expected the brony fandom to keep its magic forever. On reflection, ponies were never going to stay as culturally visible as they were in 2011 or 2012. I feel a heavy part of the fandom’s growth was its unexpectedness. Very few people expected a reboot of My Little Pony to actually be good, and it was that surprise that pulled people in. Now that people know it’s a cartoon that people like, there’s less pressure to see what all the fuss is about.
There’s nothing wrong with that. Cultural touchstones come and go. But if your definition of “fandom” is calibrated to the peak insanity of that fandom, then everything afterwards is going to be a disappointment. I saw a Reddit post asking if research in deep learning was slowing down. I don’t think it is, but I do feel there have been fewer fundamental architecture shifts. There were a few years where every state-of-the-art ImageNet model introduced a new idea, and if you were watching the field at that time, the field would have looked ridiculously open. It’s a lot less ridiculous now.
I’m not a big fandom jumper. I tend to get into a few fandoms, and then stick with them for a long, long time. And for a while, I looked down on people who did jump fandoms. It felt like they were chasing the high of the new thing, that they were in love with the collective enthusiasm of fandom, instead of the work the fandom was based on. I didn’t see them as “true fans”, because as soon as the next new thing came around, they’d leave. I don’t look down on this behavior anymore. If that’s what people are looking for, who am I to judge?
It’s just that if the community does get worse, I don’t think it’s productive to complain about “the good old days.” Analyzing it or trying to fix it is fine, but I suspect that many communities start because of forces outside of their control. People get pulled in, those outside forces go away, and then when things change, people blame the community for “ruining things”, or “destroying the fandom”, instead of blaming the disappearance of the outside forces that made the community grow in the first place.
The thing that gets people into a community doesn’t have to be the thing that gets people to stay. There’s even a TVTropes page for this. If the community starts getting worse, maybe the problem isn’t the community. Maybe the problem is that you were pulled in by something that the community was never really about. And if you can’t change that, then the easiest thing to do is to leave with the memories of the good times you had.