• Recent RL Papers I've Worked On

    I’m a coauthor on two RL papers that went on arXiv recently.

    The first is The Principle of Unchanged Optimality in RL Generalization, which I co-wrote with Xingyou Song. This evolved out of discussions we had about generalization for RL, where at some point I realized we were discussing ideas that were both clearly correct and not written down anywhere. So, we wrote them down. It’s a very short paper that draws comparisons between supervised learning generalization and RL generalization, using this to propose a principle RL generalization benchmarks should have: if you change the dynamics, make this observable to your policy, or else your problem isn’t well-founded. It also talks about how model-based RL can improve sample efficiency at the cost of generalization, by constructing setups where modelling environment-specific features speeds up learning in that environment while generalizing poorly to other environments.

    The second is Off-Policy Evaluation via Off-Policy Classification. This is a much longer paper, written with Kanishka Rao, Konstantinos Bousmalis, Chris Harris, Julian Ibarz, and Sergey Levine over many more months. It’s about off-policy evaluation, the problem of evaluating an RL policy without directly running that policy in the final environment. This is a problem I was less familiar with before starting this paper, but I now believe it to be both really important and really understudied. Our aim was to evaluate policies using just a Q-function, without importance sampling or model learning. With some assumptions about the MDP, we can use classification techniques to evaluate policies, and we show this scales up to deep networks for image-based tasks, including evaluation of real-world grasping models. I’m working on a more detailed post about off-policy evaluation and why it’s important, but it needs more polish.

    Let me know if you read or have comments on either paper. Both will be presented at ICML workshops, so if you’re going to ICML you can go to those workshops as well - see the Research page for workshop names.

  • Thoughts on ICLR 2019

    ICLR did terrible things for my ego. I didn’t have any papers at ICLR. I only went to check out the conference. Despite this, people I haven’t met before are telling me that they know who I am from my blog, and friends are joking that I should change my affiliation from “Google” to “Famous Blogger”.

    Look, take it from me - having a famous blog post is great for name recognition, but it isn’t good for much else. Based on Google Analytics funnels, it doesn’t translate to people reading other posts, or reading any more of my research. I’ve said this before, but I blog because it’s fun and I get something out of doing it, even if no one actually reads it. I mean, of course I want people to read my blog, but I really go out of my way to not care about viewership. Sorta Insightful is never going to pay my bills, and my worry is that if I care about viewership too much, I won’t write the posts that I want to write. Instead, I’ll write the posts that I think other people want me to write. Those two aren’t the same, and caring about viewers too much seems like the first step towards a blog that’s less fun for me.

    Favorite Papers

    Every conference, when I’m doing small-talk with other people, I get the same question: “What papers did you like?”. And every conference I give the same answer: “I don’t remember any of them.”

    I try to use conferences to get a feel for what’s hot and what’s not in machine learning, and to catch up on subfields that I’m following. So generally, I have a hard time remembering any particular poster or presentation. Instead, my eyes glaze over at a few keywords, I type some notes that I know I’ll never read, I make a note of an interesting idea, and if I’m lucky, I’ll even remember what paper it was from. In practice, I usually don’t. For what it’s worth, I think this conference strategy is totally fine.

    People still like deep learning. They still like reinforcement learning. GANs and meta-learning are both still pretty healthy. I get the feeling that for GANs and meta-learning, the honeymoon period of deriving slight variants of existing algorithms has worn off - many more of the papers I saw had shifted towards figuring out places where GANs or meta-learning could be applied to other areas of research.

    Image-to-image learning is getting pretty good, and this opens up a bunch of interesting problems around predictive learning and image-based models and so forth. And of course, generating cool samples is one of the ways your paper gets more press and engagement, so I expect to see more of this in the next few years.

    It’s sad that I’m only half-joking about the engagement part. The nature of research is that some of it looks cool and some of it doesn’t, and the coolness factor is tied more to your problem than to the quality of your work. One of these days I should clean up my thoughts on the engagement vs quality gap. It’s a topic where I’ll gladly preach to the choir, if it means that it gets through to the person in the choir who’s only pretending they know what the choir’s talking about.

    Can We Talk About Adversarial Perturbations?

    Speaking of cool factor, wow, there were a lot of papers about adversarial perturbations. This might be biased by the poster session where all the GAN and adversarial perturbation papers were bunched together, making it seem like more of the conference than it really was (more about this later), but let me rant for a bit.

    As a refresher, the adversarial perturbation literature is based around the discovery that if you take images, and add small amounts of noise imperceptible to the human eye, then you get images that are completely misclassified.

    [Figure: Adversarial Perturbation. Source: (Goodfellow et al, ICLR 2015)]

    These results really captured people’s attention, and they caught mine too. There’s been a lot of work on learning defenses to improve robustness to these imperceptible noise attacks. I used to think this was cool, but I’m more lukewarm about it now.

    Most of the reason I’m not so interested now is tied to the Adversarial Spheres paper (Gilmer et al, 2018). Now big caveat: I haven’t fully read the paper. Feel free to correct me, but here’s my high-level understanding of the result.

    Suppose your data lies in some high-dimensional space. Let $S$ be the set of points your classifier correctly classifies. The volume of this set should match the accuracy of your classifier. For example, if the classifier has accuracy $p$, then the volume of $S$ will be a fraction $p$ of the total volume of the data space.

    When constructing adversarial perturbations, we add some noise to the input point. Given some correctly classified point $x$, we can find an adversarial example for $x$ if the $\epsilon$-ball centered at $x$ contains a point that is incorrectly classified.

    To reason about the average distance to an adversarial example over the dataset, we can consider the union of the $\epsilon$-balls for every correctly classified $x$. This is equivalent to the set of points within distance $\epsilon$ of any $x \in S$.

    Let’s consider the volume of $S$, once it’s expanded by $\epsilon$ out in all directions. The larger $\epsilon$ is, the more the volume of $S$ will grow. If we pick an $\epsilon$ that grows the volume of $S$ from a fraction $p$ of the space to all of the space, then we’re guaranteed to find an adversarial example, no matter how misclassifications are distributed in space, because every possible misclassification is within $\epsilon$ of $S$. The adversarial spheres paper proves bounds on how much the volume increases for different $\epsilon$, up to the point where it covers the entire space, specifically for the case where your data lies on a sphere. Carrying out the math gives an $\epsilon$ below the threshold for human-perceptible noise. This is then combined with some arguments that the sphere is the best you can do (by relating volume increase to surface area and applying the isoperimetric inequality), and some loose evidence real data follows similar bounds.

    Importantly, the final result depends only on test error and dimensionality of your dataset, and makes no assumptions about how robust your classifier is, or what adversarial defenses you’ve added on top. As long as you have some test error, it doesn’t matter what defenses you try to add to the classifier. The weird geometry of high-dimensional space is sufficient to provide an existence proof for adversarial perturbations. It isn’t that our classifiers aren’t robust, it’s that avoiding adversarial examples is really hard in a 100,000-dimensional space.

    (High-dimensional geometry is super-weird, by the way. For example, if you sample two random vectors where each coordinate comes from $N(0, 1)$, they’ll almost always be almost-orthogonal, by the law of large numbers. A more relevant fun fact is that adding $\epsilon$ noise to each coordinate of an $n$-dimensional point produces a new point that’s $\epsilon\sqrt{n}$ away from your original one. If you consider the dimensionality of image data, it should be more intuitive why small perturbations in every dimension give you tons of flexibility towards finding misclassified points.)
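    Both fun facts are easy to check numerically. Below is a minimal sketch using only the Python standard library. The helper names, the dimensions tried, and the noise size of 0.01 are arbitrary choices for illustration, and the standard normal is assumed as the coordinate distribution.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0 means orthogonal)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def random_gaussian_vector(n, rng):
    """Vector with n coordinates drawn independently from N(0, 1)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def perturbation_distance(n, eps):
    """Euclidean distance moved when eps is added to every coordinate."""
    original = [0.0] * n
    perturbed = [x + eps for x in original]
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(perturbed, original)))


rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (10, 1000, 100000):
    u = random_gaussian_vector(n, rng)
    v = random_gaussian_vector(n, rng)
    # Cosine similarity shrinks towards 0 as n grows, while the distance
    # from a per-coordinate nudge grows like eps * sqrt(n).
    print(n, cosine_similarity(u, v), perturbation_distance(n, 0.01))
```

    The cosine similarity has a spread of roughly $1/\sqrt{n}$, which is why the vectors look more and more orthogonal as the dimension goes up.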

    The fact that adversarial examples exist doesn’t mean that it’s easy to discover them. One outcome is that adversarial defense research successfully finds a way to hide adversarial examples in a way that’s hard to discover with any efficient algorithm. But, this feels unlikely to me. I think defenses for adversarial attacks will be most useful as a form of adversarial data augmentation, rather than as useful stepping stones to other forms of ML security. I’m just not too interested in that kind of work.

    A Quick Tangent on Zero-Knowledge Proofs

    I don’t want this post to get hijacked by an adversarial examples train, so I’ll keep this brief.

    Ian Goodfellow gave a talk at the SafeML ICLR workshop. I’d encourage listening to the full talk; I agree with most of it.

    In that talk, he said that he thinks people are over-focusing on adversarial perturbations. He also proposed dynamic defenses for adversarial examples. In a dynamic defense, a classifier’s output distribution may change on every input processed, even if the same input is replayed multiple times. This both breaks a ton of assumptions, and gives you more flexible and expressive defense models.

    This may be a completely wild connection, but on hearing this I was reminded of zero-knowledge proofs. A lot of zero-knowledge proof schemes integrate randomness into their proof protocol, in a way that lets the prover prove something is true while protecting details of their internal processing. And with some twisting of logic, it sounds like maybe there’s some way to make a classifier useful without leaking unnecessary knowledge about how it works, by changing in the right way on each input. I feel like there might be something here, but there’s a reasonable chance it’s all junk.

    Poster Arrangements

    Hey, do you remember that comment I made, about how all the adversarial example and GAN papers were bunched up into one poster session? At this year’s ICLR, posters were grouped by topic. I think the theory was that you could plan which poster sessions you were attending and which ones you weren’t by checking the schedule in advance. Then, you can schedule all your networking catch-ups during the sessions you don’t want to visit.

    I wing all my conferences, getting by on a loose level of planning, so that’s not how it played out for me. Instead, I would go between sessions where I didn’t care about any of the posters, and sessions where I wanted to see all the posters. This felt horribly inefficient, because I had to skip posters I knew I’d be interested in reading, due to the time crunch of trying to see everything, and then spend the next session doing nothing.

    A friend of mine pointed out another flaw: the posters they most wanted to see were in the same time slot as their poster presentation. That forced a trade-off between presenting their work to other people, and seeing posters for related work in their subfield.

    My feeling is that ICLR should cater to the people presenting the posters, and the experience of other attendees should be secondary. Let’s quickly solve the optimization problem. Say a subfield has $N$ posters, and there are $k$ different poster sessions. As an approximation, every poster is presented by one person, and that person can’t see any posters in the same session they’re presenting in. We want to allocate the posters such that we maximize the average number of posters each presenter sees.

    I’ll leave the formal proof as an exercise (you’ll want your Lagrange multipliers), but the solution you get is that the posters should be divided evenly between the poster sessions. Now, in practice, posters can overlap between subfields, and it can be hard to even define what is and isn’t a subfield. Distributing exactly evenly is a challenge, but if we assign posters randomly to each poster session, then every subfield should work out to be approximately even.
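    For anyone who would rather check the claim than prove it, here is a brute-force sketch. It assumes the simplified model above: a presenter in a session holding `n` of the subfield's posters can visit the remaining `total - n`. The function names and the 12-poster, 3-session example are made up for illustration.

```python
from itertools import product


def average_posters_seen(session_sizes):
    """Average number of subfield posters a presenter gets to visit."""
    total = sum(session_sizes)
    # Each of the n presenters in a session misses all n posters there
    # (including their own), so each one can see total - n posters.
    seen = sum(n * (total - n) for n in session_sizes)
    return seen / total


def best_allocation(num_posters, num_sessions):
    """Try every way to split the posters and keep the best-scoring one."""
    best = None
    for sizes in product(range(num_posters + 1), repeat=num_sessions):
        if sum(sizes) != num_posters:
            continue
        score = average_posters_seen(sizes)
        if best is None or score > best[0]:
            best = (score, sizes)
    return best


print(best_allocation(12, 3))             # (8.0, (4, 4, 4))
print(average_posters_seen((12, 0, 0)))   # 0.0: full clustering is the worst case
```

    The average works out to the total minus the sum of squared session sizes divided by the total, so maximizing it means minimizing the sum of squares, which is what an even split does.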

    To me, it felt like the ICLR organizers spent a bunch of time clustering papers, when randomness would have been better. To quote CGP Grey, “Man, it’s always frustrating to know that to literally have done nothing would be faster than the something that is done”. I’m open to explanations why randomness would be bad though!

    The Structure and Priors Debate

    This year, ICLR tried out a debate during the main conference. The topic was about what should be built into machine learning models as structure or priors about the world, and what should be learned from data. I got the impression that the organizers wanted it to be a constructive, fiery, and passionate debate. To be blunt, it wasn’t.

    I’m in a slightly unique position to comment on this, because I actually took part in the ICML 2018 Debates workshop. I’d rather people not know I did this, because I was really, really winging it, armed with a position paper I wrote in a day. I’m not even sure I agree with my position paper now. Meanwhile, the other side of the debate was represented by Katherine and Zack, who had considerably more coherent position papers. It was like walking into what I thought was a knife fight, armed with a small paring knife, and realizing it was an “anything goes” fight, where they have defensive turrets surrounding a fortified bunker.

    But then the debate started, and it all turned out fine, because we spent 90% of our time agreeing about every question, and none of us had any reason to pull out particularly heavy linguistic weaponry. It stayed very civil, and the most fiery comments came from the audience, not from us.

    When talking to the organizers of the ICML debates workshop after the fact, they said the mistake was assuming that if they took people with opposing views, and had them talk about the subject they disagreed on, it would naturally evolve into an interesting debate. I don’t think it works that way. To get things to play out that way, I believe you have to continually prod the participants towards the crux of their disagreements - and this crux is sometimes not very obvious. Without this constant force, it’s easy to endlessly orbit the disagreement without ever visiting it.

    Below is a diagram for a similar phenomenon, where grad students want to work on a thesis right up until they actually sit down and try to do it. I feel a similar model is a good approximation for machine learning debates.

    [Figure: PhD orbit comic. Source: PhD Comics]

    Look, I’m not going to mince words. Machine learning researchers tend to be introverted, tend to agree more than they disagree, and are usually quite tolerant of differing opinions over research hypotheses. And it’s really easy to unintentionally (or intentionally) steer the conversation towards the region of carefully qualified, agreeable conversation, where no one remembers it by tomorrow. This is especially true if you’re debating a nebulous term like “deep learning” or “structure” or “rigor”, where you can easily burn lots of time saying things like, “Yes, but what does deep learning mean?”, at which point every debater presents their own definition and you’ve wasted five minutes saying very little. The context of “we’re debating” pushes towards the center. The instinct of “we’re trying to be nice” pushes way, way harder away from the center.

    I think ML debates are cool in theory, and I’d like to see a few more shots at making them happen, but if it does happen again, I’d advise the debate moderators to go in with the mindset that ML debates need a lot of design to end in a satisfying way, with repeated guidance towards the crux of the debaters’ disagreements.


    ICLR was pretty good this year. New Orleans is a nice convention city - lots of hotels near the convention center, and lots of culture in walking distance. I had a good time, and as someone who’s lived in California for most of their life, I appreciated getting to experience a city in the South for a change. It was great, aside from the weather. On that front, California just wins.

  • OpenAI Finals

    OpenAI just beat OG, champions of The International 8, in a 2-0 series. They also announced that in private, they had won three other pro series: 2-0 over Team Lithium, 2-0 over SG e-sports, and 2-0 over Alliance. Pretty cool! I don’t have a lot to add this time, but here are my thoughts.

    OG Isn’t the Top Team and That Doesn’t Matter

    After pulling off an incredible Cinderella story and winning TI8, OG went through some troubles. My understanding is that they’ve started to recover, but are no longer the consensus best team.

    To show this, we can check the GosuGamers DotA 2 rankings. This assigns an Elo rating to the top DotA 2 teams, based on their match history in tournaments. At the time of this post, OG is estimated as the 11th best team.

    I don’t think this really matters, because as we’ve seen with the 1v1 bot, the previous OpenAI Five match, and with AlphaStar, once you’re at the level of semi-pro, reaching pro is more a matter of training time and steady incremental training improvements than anything else. Going into the match, I thought the only way OG would have a chance was if the restrictions were radically different from the ones used at TI8. They weren’t. Given that OpenAI Five beat a few other pro teams, I believe this match wasn’t a fluke and there’s no reason they couldn’t beat Secret or VP or VG with enough training time.

    Reaction Times Looked More Believable

    I’m not sure if OpenAI added extra delay or not, but the bot play we saw felt more fair and looked more like a player with really good mechanics, rather than superhuman mechanics. There were definitely some crazy outplays but it didn’t look impossible for a human to do it - it just looked very, very difficult.

    If I had to guess, it would be that the agent still processes input at the same speed, but has some fixed built-in delay between deciding an action and actually executing it. That would let you get more believable reactions without compromising your ability to observe environment changes that are only visible for fractions of a second.
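    To make that guess concrete, here is a minimal sketch of such a delay. This is purely my illustration of the idea, not OpenAI's implementation: `DelayedActionAgent`, `delay_steps`, and the toy policy are all hypothetical names.

```python
from collections import deque


class DelayedActionAgent:
    """Wraps a policy so each decision executes `delay_steps` ticks later.

    Hypothetical sketch: the policy still sees every observation right
    away, only the execution of its chosen action is pushed back.
    """

    def __init__(self, policy, delay_steps, noop_action=None):
        self.policy = policy
        # Pre-fill the queue so the first `delay_steps` ticks emit no-ops.
        self.pending = deque([noop_action] * delay_steps)

    def step(self, observation):
        # Decide on the current observation with no perception delay...
        self.pending.append(self.policy(observation))
        # ...but execute the action that was decided `delay_steps` ticks ago.
        return self.pending.popleft()


agent = DelayedActionAgent(lambda obs: f"react_to_{obs}", delay_steps=2,
                           noop_action="noop")
print([agent.step(tick) for tick in range(5)])
# ['noop', 'noop', 'react_to_0', 'react_to_1', 'react_to_2']
```

    The point of this design is that short-lived events still get observed on the tick they happen, and only the response lags.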

    Limited Hero Pool is a Bit Disappointing

    I think it’s pretty awesome that OpenAI Five won, but one thing I’m interested in is the potential for AI to explore the hero pool and identify strategies that pro players have overlooked. We saw this in Go with the 3-3 invasion followup. We saw this in AlphaStar, with the strength of well-microed Stalkers, although the micro requirements seem very high. With OpenAI Five, we saw that perhaps early buybacks have value, although again, it’s questionable whether this makes sense or whether the bot is just playing weird. (And the bot does play weird, even if it does win anyways.)

    When you have a limited hero pool, you can’t learn about unimplemented heroes, and therefore the learned strategies may not generalize to full DotA 2, which limits the insights humans can take away from the bot’s play. And that’s a real shame.

    It seems unlikely that we’ll see an expansion of the hero pool, given that this is the last planned public event. It’s a lot more compute for what is already a compute heavy project. It would also require learning how to draft, assuming draft works the same as the TI8 version. In the TI8 version, the win rate of every possible combination of heroes is evaluated, and the draft is done by picking the least exploitable next hero. Given a pool of 17 heroes, the number of different hero combinations is small enough to be brute-forced. A full hero pool breaks this quick hack and requires using a learned approach instead. I’m sure it’s doable (there’s existing work for this), but it’s another hurdle that makes it look even more unlikely.
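    As a rough sanity check on the brute-force claim, here is a quick count. It assumes a "hero combination" means two disjoint teams of 5 drawn from the pool, with the two sides counted separately. That is my reading, and `count_matchups` is a made-up helper name.

```python
import math


def count_matchups(pool_size, team_size=5):
    """Ways to pick one team, then a second team from the remaining heroes.

    Assumption: sides are distinguishable, so swapping the two teams
    counts as a different matchup.
    """
    first_team = math.comb(pool_size, team_size)
    second_team = math.comb(pool_size - team_size, team_size)
    return first_team * second_team


print(count_matchups(17))   # 4900896
# Hypothetical larger pool, just to show how quickly the count grows.
print(count_matchups(100))
```

    A few million matchups is a tractable table to fill in with estimated win rates. The count grows steeply with pool size, which is why the same trick stops working for a full hero pool.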

    A Million 3k MMR Teams at Five Million Keyboards Have to Win Eventually

    At the end of the match, OpenAI announced that they were opening sign-ups to allow everyone to play against or with OpenAI Five. It’s only going to be up for a few days, but it’s still exciting nonetheless. I have no idea how much the inference will cost in cloud credit (which is presumably why it’s only running for a few days).

    I fully expect somebody to figure out a cheese strategy that the bot has trouble handling. I also expect every pro team to try beating it for kicks, because if they can beat it consistently, can you imagine how much free PR they’d get? If they don’t beat it, they don’t have to say anything, so it seems like a win-win.

    There is a chance that the bot is genuinely too good, in an “AlphaGo Master goes 60-0 against pros” kind of way, but that was 60 games, and way more than 60 people are going to try to beat OpenAI Five. They’re not all going to be pros, but scale is going to matter more than skill here.
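    The scale argument is just the usual "at least one success" calculation. The sketch below assumes every game is an independent attempt with the same small win probability, which is a simplification, and the 0.1% figure is a made-up number for illustration, not an estimate of anyone's real odds.

```python
def prob_at_least_one_win(per_game_win_prob, num_games):
    """Chance that at least one of `num_games` independent games is a win."""
    return 1.0 - (1.0 - per_game_win_prob) ** num_games


# Hypothetical: each game has a 0.1% chance of beating the bot.
for num_games in (60, 1000, 10000):
    print(num_games, prob_at_least_one_win(0.001, num_games))
```

    With those made-up odds, 60 games will most likely end without a single win, while ten thousand games almost certainly include at least one.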

    When OpenAI let TI attendees play their 1v1 bot that beat several pros, people were able to find all sorts of cheese strategies that worked. It was an older version of the bot, so perhaps that history isn’t a reliable precedent, but I’m going to guess somebody is going to figure out something sufficiently out-of-distribution.

    We Still Take Pride in Few Shot Learning

    In the interview with Purge after the match, OG N0tail had an interesting comment:

    Purge: If you guys got to play 5 matches right now against them, do you think you could take at least 1 win?

    N0tail: Yeah, for sure. For sure 1 win. If we played 10, we’d start winning more, and if we could play 50 games against them, I believe we’d start winning very very reliably.


    He later elaborated that he felt the bot had exploitable flaws in how it played around vision, but I think the more important note is that we take pride in our ability to actively try new things based on very few examples. The debate over how to do this is endless, but it makes me think that if somebody manages to demo impressive few-shot learning, we’ll start running out of excuses about AI.