I’m a coauthor on two RL papers that went on arXiv recently.

The first is The Principle of Unchanged Optimality in RL Generalization, which I co-wrote with Xingyou Song. This evolved out of discussions we had about generalization for RL, where at some point I realized we were discussing ideas that were both clearly correct and not written down anywhere. So, we wrote them down. It’s a very short paper that draws comparisons between supervised learning generalization and RL generalization, using this to propose a principle RL generalization benchmarks should have: if you change the dynamics, make this observable to your policy, or else your problem isn’t well-founded. It also talks about how model-based RL can improve sample efficiency at the cost of generalization, by constructing setups where modelling environment-specific features speeds up learning in that environment while generalizing poorly to other environments.

The second is Off-Policy Evaluation via Off-Policy Classification. This is a much longer paper, written with Kanishka Rao, Konstantinos Bousmalis, Chris Harris, Julian Ibarz, and Sergey Levine over many more months. It’s about off-policy evaluation, the problem of evaluating an RL policy without directly running that policy in the final environment. This is a problem I was less familiar with before starting this paper, but I now believe it to be both really important and really understudied. Our aim was to evaluate policies using just a Q-function, without importance sampling or model learning. With some assumptions about the MDP, we can use classification techniques to evaluate policies, and we show this scales up to deep networks for image-based tasks, including evaluation of real-world grasping models. I’m working on a more detailed post about off-policy evaluation and why it’s important, but it needs more polish.

Let me know if you read or have comments on either paper. Both will be presented at ICML workshops, so if you’re going to ICML you can go to those workshops as well - see the Research page for workshop names.