• So Hey, That Open AI LP Thing

    I originally wrote this post March 12, 2019, the day after OpenAI announced they were creating a “capped-profit” company. I decided not to share it very widely, because it was a touchy subject, lots of people were yelling, and I did not want to get involved in the cross-fire.

    I remembered that post recently, and after a re-read I still agreed with most of what I wrote. I feel it’s worth having it public somewhere, even if no one cares about the Open AI LP debate anymore. Here’s the original post, with some editing. For context, this was written a month after GPT-2, before the Rubik’s Cube result, and before the Microsoft deal.

    * * *

    Okay, so OpenAI announced that they were legally restructuring from a nonprofit into a “capped-profit” company. I am not a lawyer, and I’m especially not a lawyer familiar with business law, but here is my understanding. OpenAI LP is a for-profit company. OpenAI Nonprofit is the nonprofit side. The OpenAI LP is managed by the nonprofit’s Board of Directors. Members of the nonprofit board are allowed to own stakes in OpenAI LP, but only a minority of the Board of Directors is allowed to do so. Investors and employees sign agreements that following OpenAI’s principles takes priority over profit. Investors are capped at receiving back 100x what they invest. This is on the order of “returns from investing early in a really good startup, but not as much as investing early in a stupidly amazing startup”.

    The company is now roughly divided as follows: the for-profit side employs the robotics, DotA 2, language, safety, and policy teams. The nonprofit side employs the educational programs and policy initiatives, although I don’t understand what “policy initiatives” means here, if the policy team is under OpenAI LP.

    Why is this happening? The given argument is that recent results suggest that new AI systems and any approaches that could lead to AGI will need lots of compute, and also need to attract lots of talent, and the investment required is going to be much larger than what is already dedicated to the company, so part of the company needs to become for-profit to attract investors to front the required funds. And if they succeed, they’ll make way more than 100x return, by orders of magnitude.

    Rather predictably, reactions have ranged from “yeah makes sense” to “wow WTF this is bullshit.” My reaction is roughly at the “ehhhhhhhhh, okay” side of things. I am a big fan of Buffy Speak, but at some point you gotta stop using Buffy Speak and gotta start explaining stuff using actual words.

    The “wow WTF this is bullshit” side seems to revolve around feelings of betrayal that OpenAI the nonprofit would turn into a for-profit company. The criticism feels like, “OpenAI was supposed to not be beholden to profit, and would stand above the incentives that for-profit companies are under. It’s good for nonprofit AI companies to exist, and now there is one fewer.” There are enough for-profit AI startups and AI consulting gigs out there. If you saw the nonprofit nature of OpenAI as a core principle of its existence, then turning into a for-profit feels disingenuous. What distinguishes OpenAI from every other AI startup trying to make a quick buck? Moreover, given one restructuring from nonprofit to capped-profit, what systems are in place to prevent further slides to for-profitness? How do we know the 100x figure won’t change to 1000x later on? Are we guaranteed that the minority of the board with stakes in the for-profit OpenAI LP will not be able to leverage their wealth to take over the board’s decision making? Culture doesn’t die overnight. It’s something that grows organically and bleeds organically. What if this is the first step towards OpenAI turning into something it shouldn’t be?

    Meanwhile, the “yeah makes sense” crowd is generally (but not entirely) a group that believes AGI should be considered a serious possibility in the near term (where serious is something like “at least 5%”), and thinks that turning into a for-profit company is a necessary evil to raise sufficient capital to make sure OpenAI has a credible chance of creating AGI and having a seat at the table of any policy decisions. Opinions on that argument are highly correlated with how much you think people should care about AI safety, and now we’re in that minefield.

    My impression is that what OpenAI is doing is consistent with the belief that AGI could happen sometime in the near term, by combining lots of compute with a few (but not a lot of) insights into learning. Under that belief, it makes sense that they need more money for new hardware and talent. It makes sense to structure things such that investors have a reason to give them money. Under that belief of plausible near-term AGI, these are all necessary things OpenAI must do to continue to have any influence over the trajectory of machine learning, and then all the other stuff with the capped profit and nonprofit board and so on is about trying to minimize the risk of losing their original mission statement.

    And honestly, I don’t expect anything different from OpenAI. Look, when your company gets initial investment from Elon Musk the Futurist and Sam Altman the head of Y Combinator, and your founding statement is “ensuring AGI helps humanity”, it’s going to primarily attract people who think AGI is credible enough to merit decisive action. So talented people who believe in shorter AGI timelines move towards OpenAI. This makes it more likely OpenAI does things consistent with belief in shorter timelines, which attracts people with short AI timelines and pushes away people with longer timelines, and so the feedback loop begins. It’s not a very strong feedback loop, but it’s there. I remember talking to someone about why only OpenAI and DeepMind worked on AI safety, and what I said to them was, “it’s because, under your definition of AI safety, people at OpenAI and DeepMind care about it the most, and so they flock there. But this doesn’t imply no one else cares - it just implies they don’t care enough to drop their current work to join those safety teams.” (Editing note: not to mention that what “AI safety” means is itself a nebulous concept. For example, I consider fairness and interpretability as part of AI safety. I know people who’d disagree.)

    Based on what previous and current employees of OpenAI are tweeting about, there was a lot of internal discussion about the restructuring, but everyone who is talking about it is voicing cautious support about the arrangement. Of course, there are obvious sampling biases here.

    I can’t help but feel like the attacks on OpenAI LP are just an extension of the attacks around GPT-2, that accuse them of building hype narratives for profit. A lot of people were not happy with how GPT-2 was handled, and although they’ll often say OpenAI does good research, they also dislike that OpenAI has excessive amounts of PR around their work. They’ll accuse this PR of overemphasizing the research importance and engaging in media-trolling to drive attention, rather than substance. For them, OpenAI LP is just another dot in an on-going trend of burning research goodwill to funnel money into GPUs. Meanwhile, there are counterarguments in the form of blog posts defending OpenAI, accusing OpenAI’s detractors of not taking OpenAI at face value, and lamenting the state of ML discourse.

    ML arguments are pretty terrible these days, I’m not going to defend that. But for trust, well… The thing about company trust is that other people have no obligation to take OpenAI at face value. Historically, I assume (but do not know for sure) that people who have been in ML for longer have seen lots of companies that were all fluff and no substance, that deliberately misled people about what they could do to continue getting investors and donors to give them money, and in the end, those investors were left with nothing. From the outside, OpenAI’s behavior is not too different from prior snake-oil salesmen, so why should we believe anything about what they say?

    I can see how this would be incredibly frustrating to people who do take OpenAI’s explanation at face value, because if you take OpenAI at face value, everything they’re doing seems reasonable. It must feel like the people who don’t get it are being incredibly obtuse. This is how you get fanboy blog posts - people build an incorrect model that detractors don’t understand OpenAI’s argument, and if they just explain it in a blog post, then it’ll get more people to care about AGI and improve the world. The key difference is that it’s not that people don’t understand OpenAI’s argument, it’s that they don’t think OpenAI believes their own argument. They think it’s just a front to hide their true goal of staying financially solvent.

    It’s all typical mind fallacy. One side can’t believe someone could genuinely believe AGI is coming soon, and the other side can’t believe someone could genuinely believe AGI has no chance of happening in the next 50 years. Really, the most upsetting thing for me is that the most invested people have so little self-awareness that they can’t see they aren’t arguing about what they should be arguing about.

    For what it’s worth, my timelines are a bit long, 10% chance of AGI by 2045. (Editing note: I recently updated my forecast to 10% chance by 2040. Maybe I’ll explain why in a future post.) This is long enough that most of AI safety research doesn’t feel worth it to me, but I’m willing to entertain the possibility I’m wrong, and believe that reasonable people can believe AGI will happen within the next 10 years. I also think many of those reasonable people are at OpenAI. I also do wish there was less negativity in ML discourse, and more good-faith arguments, so if you want to write a post explaining why OpenAI is acting reasonably, feel free. It’s not actually going to change any minds, but it probably won’t hurt.

    As for what happens next? I see a few scenarios.

    If AGI happens soon, then it will have been good that OpenAI did this, because they would have needed the money. I do trust the company-which-already-has-an-AGI-safety-team to be a good player to have in the game.

    If AGI doesn’t happen soon, then a bunch of investors will have given money to OpenAI and not received anything in return, and some interesting AI research will get done. I guess this means other speculative research won’t get investment, and that’s sad. But it could also have meant that those investors didn’t invest in some random Silicon Valley startup that builds a website that generates a lot of money without creating social value. I have trouble seeing OpenAI turning into the next Theranos - if they did, they’d bleed AI talent so fast to every other lab. Sure, the investors will lose money, but really, that’d be their fault for betting on uncertain technology. You’re not supposed to feel bad when venture capitalists make a bad investment. They knew the risks. If they didn’t, they’d be bad VCs.

    As for the nonprofit to profit aspect of things, I don’t have many opinions. I understand the concerns there. I also think it’s entirely possible for for-profit companies to care about societal implications of their work. Literally all the FAANG machine learning labs have groups for this. I don’t see any reason this couldn’t include the existential threats people worry about. You can’t make money if you’re dead. (Editing note: to clarify - yes, it gets much harder to coordinate when profits get involved, but I don’t think it’s impossible, and if you think that for-profit companies are fundamentally incapable of handling existential risk, then you have bigger problems to worry about than OpenAI LP.)

    From my perspective, all OpenAI LP means is that more money will go into OpenAI, rather than other AI startups, and although I have some disagreements with how OpenAI does things, I do trust them to do interesting foundational research. TL;DR, “yeah okay, let’s see where this goes.”

  • Our Generation's Chernobyl

    This post has several spoilers for the HBO mini-series Chernobyl. In some ways, you can’t really spoil a show about a historical event, but you may want to skip this post if you haven’t seen it before.

    Coronavirus is the new Chernobyl. Both are crises caused by something we can’t see with the naked eye. Both started in authoritarian countries, and both have made scientists the heroes and trusted officials of the day.

    Mechanically, they are very different. Chernobyl is about an exploded reactor pouring out radioactive fallout, while COVID-19 is about a living virus. But the response has felt similar, and that’s what’s making me disappointed.

    The Chernobyl mini-series on HBO is really well made, and although it is not always historically accurate, it’s accurate enough to paint some pretty stark parallels.

    Everything is Normal

    Craig Mazin, the writer and executive producer of Chernobyl, did a podcast about the series, where he talks about the show and the historical event. In episode 2, he mentions one streak of bad luck that didn’t make it to the show.

    The Chernobyl accident happened on April 26. Five days later was May 1, International Workers’ Day. This was a big holiday for the Soviet Union, and in the days before May 1, party officials understood the scope of Chernobyl’s danger. They wanted to cancel holiday celebrations, to reduce exposure to radioactive dust. They failed, because the Kremlin told them that Everything Was Normal. Why would you take precautions when there’s nothing to worry about?

    We’re talking about parades. In Kiev, in Minsk, there were party officials who, it honestly seems to me, begged, BEGGED, to cancel the parade. They were told that not only would they not cancel the parade, you’ll be walking in it too. And they did.

    (starts at 22:10)

    Everything’s fine - until it isn’t. Craig emphasizes that people in the Soviet Union bureaucracy weren’t monsters. They knew it was dangerous to be outside, but were overruled by those who cared about reputation more than safety.

    Never Mind, This Isn’t Normal

    Plenty of Soviet disasters got covered up and were only declassified long after they occurred. Chernobyl couldn’t be kept a secret, because the wind carried radioactive particles to Sweden and other European countries.

    Radioactive fallout is not the same as a virus. A single case of the virus can multiply and turn into a million cases. Radiation does not replicate this way. Chernobyl is special because people close to the reactor were exposed to so much radiation that they ended up giving off radiation themselves. The firefighter uniforms from that day are still in the basement of the abandoned Pripyat hospital, too radioactive to approach without the right protective gear.

    Nearby countries told their citizens to avoid eating wild game and vegetables, and to follow basic decontamination measures.

    There’s radioactive dust, she said; close all the windows and plug all the cracks. Later, my anxiety grew when I saw her husband Andrei taking off his clothes and putting them in a plastic bag before entering his apartment.


    As for the Soviet Union, to their credit, they did take drastic action once it was clear they had to. Thanks to an authoritarian government and a culture that pushed the collective over the individual, the evacuation and cleanup was fairly orderly.

    So they finally evacuate Pripyat. One thing I was struck [by] was how orderly it was. I was like, oh you’re evacuating an entire town 50,000 people, and I could only think of what that would be like if they tried to do that to a similar town in America. People would be yelling, people would be complaining, people would be demanding they were allowed to bring that […] Everybody just said “alright” and climbed up onto the [evacuation] buses.

    The citizenry, by all accounts except one, was incredibly orderly. Again, reflective of the society in which they lived and grew up. The police said, “You’re coming with us. You’re coming on the bus. You can take one suitcase and no pets, and you’ll be back in a few days.” And everybody just said, “Okay, I’ll get on the bus.” […] And they never, ever, ever came back.

    (starts at 28:55)

    Scientists Become Heroes

    Chernobyl made everyone care a lot more about nuclear physicists. Coronavirus is making everyone care a lot more about epidemiologists. Valery Legasov is the protagonist of the Chernobyl mini-series, and in real life he was the chief of the commission investigating Chernobyl. For the West, he became one of the faces of the Soviet response, since he presented the Soviet report at a meeting of the International Atomic Energy Agency in Vienna, detailing what happened in Chernobyl. He was well-respected for acknowledging failures in the Soviet response, although his public testimony covered up design flaws in Soviet nuclear reactor design.

    Now we have Fauci, who is generally popular (56% trust rating as of May 2), and one of the figureheads of the Coronavirus Task Force. The CDC playbook for communication mentions the importance of having a single lead spokesperson who’s a scientist, not a politician. Having a single spokesperson increases trust because of familiarity, and making that person a scientist reduces risk of politicizing the disease. If half the country trusts the CDC less because of a culture war, it’s a disaster for public health. See this New Yorker article for more.

    According to the YouGov study above, Fauci’s 56% trust rating is split 68% among Democrats and 48% among Republicans. It is already too late for the United States.

    Reality Doesn’t Care About Politics

    The Chernobyl mini-series is ostensibly about the events of Chernobyl, but it’s really more about how people responded to Chernobyl. That’s the aspect I was reminded of first, and the reason I started writing this post.

    Chernobyl and COVID-19 aren’t really about people. Sure, people are part of both, but their fundamentals are grounded in physical reality: infectious diseases for COVID-19, and radiation for Chernobyl. It’s like the classic Philip K. Dick quote: “Reality is that which, when you stop believing in it, doesn’t go away.” A worryingly large number of decision makers aren’t respecting reality.

    I’m not sure of the historical accuracy of this moment, but in Episode 3 of the mini-series, “Open Wide, O Earth”, Legasov and Scherbina are arguing over the size of an evacuation zone. Legasov learns the evacuation zone has been set to 30 km, and wants it to be much, much larger. He is overruled.

    Legasov: “How did this happen? Who gave them this idea?”

    Scherbina: “Are you suggesting I did?”

    Legasov: “Well someone decided the evacuation zone should be 30 km, when we know– (points to map) Here! Cesium-137 in Gomel District. Two HUNDRED kilometers away!”

    Scherbina: “It was decided.”

    Legasov: “Based on WHAT?”

    Scherbina: “I don’t know.”

    Legasov: “Forgive me. Maybe I’ve spent too much time in my lab. Or maybe I’m stupid. But is this really how it all works? An uninformed, arbitrary decision that will cost who knows how many lives is made by some apparatchik? Some career Party man?”

    (screenplay linked here)

    This is worth emphasizing: no one is forced to do things that make sense. To be a politician, the only thing you have to understand is people. Who they are, how they think, what they believe, and how to convince them to support you. That’s certainly not easy, it requires you to be shrewd and to have a good read of people. But there’s no particular reason to expect politicians to be good at anything besides navigating structures of power. They don’t have to be well-informed about anything, unless it’s politically expedient to be well-informed. Unfortunately, that approach is exactly what doesn’t work for the coronavirus.

    If the coronavirus was a people problem, maybe you could use charm and wit to defuse the situation. But you can’t. You can’t talk to the coronavirus to understand what it does and doesn’t want. You can’t work the room to get the disease to spread slower. You can’t cut a deal with coronavirus to make it kill fewer people. No, it’s there, it exists, and you have to deal with it - or not. This has been repeated elsewhere, but all political instincts are wrong at the start of a pandemic, and you pay a price if those instincts aren’t overruled.

    You hope you get a politician who understands the problem, has the grit to take unfavorable actions, and uses their experience in navigating deals to get what matters to the people who need it. By the end of the series, Scherbina is that man.

    Scherbina: I’m an inconsequential man, Valery. I hoped one day I would matter, but I didn’t. I just stood next to people who did.

    Legasov: There are other scientists like me. Any one of them could have done what I did. You– everything we asked for: men, material, lunar rovers. Who else could have done these things? They heard me, but they listened to you. […] Of all the ministers, and all the deputies, the entire congregation of obedient fools, they mistakenly sent the one good man. For God’s sake Boris, you were the one who mattered most.

    There’s this phenomenon, where people in politics and PR will repeat something that isn’t true, and if they do so often enough, people will believe it. Those people will even start using motivated reasoning to create their own justification for what you’re saying. This works for subjects that are complicated enough to have ambiguity in their causality, but it doesn’t work very well for the narrow slice of reality that COVID-19 occupies.

    In episode 4, “For the Happiness of All Mankind”, the Soviet Union explores using robots to clear debris off the Chernobyl reactor roof. Radiation damages electronics. The Soviet robots they had from the Space Race could withstand some radiation, but not the highest levels of radiation detected on the roof.

    Through tense off-screen negotiations, they get a robot from West Germany which can withstand the reported numbers. They try it out, and the robot fails immediately. Scherbina soon learns why.

    Scherbina: The official position of the State is that a global nuclear catastrophe is not possible in the Soviet Union. They told the Germans that the highest detected level of radiation was 2000 roentgen. They gave them the propaganda number. That robot was never going to work.

    Think about this for a second. Someone high up in the government decided to give a propaganda number, instead of a real one. That filtered all the way down the bureaucracy, down to the people figuring out how to borrow a robot from the West. That one lie, from someone who cared more about reputation than accuracy, wasted the time of the negotiators, of the robot constructors from West Germany, of the technicians who operated it - all of it, gone.

    I cannot think of a better argument for why you should care who your boss is, and who your elected officials are. If technology is a multiplier of both good and bad, then power is too. What does it say when both Republican and Democrat governors hid flight details of testing and PPE shipments from the federal government, to avoid confiscation? It’s just absurd.

    The central theme of the Chernobyl mini-series is truth, and the lies surrounding it. This was why the showrunners tried to stay historically accurate. They thought it would be cheap to send a message about truth that was wrapped in artistic lies, and reality was dramatic enough. When the inevitable COVID-19 documentaries arrive, I hope they make the same decision.

  • A Reinforcement Learning Potpourri

    I’ve fallen behind on RL literature from the past few months. So, I’ve decided to catch up with a bunch of recent papers.

    First Return Then Explore

    Let’s start with First Return Then Explore, by Ecoffet et al. This is a continuation and extension of the Go-Explore work from UberAI.

    When Go-Explore first came out, I was very excited by its announced results, but got upset by how they were presented. I wrote a post attempting to explain that tension - that I really liked the paper’s ideas, and really disliked its media strategy. The media strategy for First Return Then Explore is comparatively muted. For one, this time they have a draft on arXiv. (Sorry, I’m never going to stop ribbing them for that.) They’ve also been more careful in their claims, and have improved their previous results.

    Both First Return Then Explore and Go-Explore aim to first return to a state that has been visited before, then explore from that state. To make this more efficient, states are grouped into “cells” through some encoding. In the original Go-Explore paper, these cells are defined by downsampling by a fixed factor. First Return Then Explore changes this to tune the downsampling factor online, by doing a small search to maximize normalized entropy across a fixed budget of cells. There are also more heuristics on choosing which cell to return to, instead of uniformly at random.
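
    To make the cell idea concrete, here is a minimal sketch of the fixed-factor version: shrink a grayscale frame by averaging blocks, quantize the result, and use it as a dictionary key into an archive. Everything named here (`downsample_cell`, `Archive`, the `levels` parameter, keeping the highest-scoring state per cell) is an illustrative assumption, not the paper's code. It also skips the online tuning of the downsampling factor and the smarter cell-selection heuristics, and just samples a cell uniformly at random.

```python
import random


def downsample_cell(frame, factor, levels=8, max_value=255):
    """Map a grayscale frame (list of rows of ints) to a hashable cell key.

    Averages each factor x factor block, then quantizes the mean into
    `levels` intensity buckets. All parameters here are illustrative.
    """
    height, width = len(frame), len(frame[0])
    cell = []
    for top in range(0, height - factor + 1, factor):
        row = []
        for left in range(0, width - factor + 1, factor):
            block = [frame[y][x]
                     for y in range(top, top + factor)
                     for x in range(left, left + factor)]
            mean = sum(block) / len(block)
            row.append(min(levels - 1, int(mean * levels / (max_value + 1))))
        cell.append(tuple(row))
    return tuple(cell)


class Archive:
    """Keeps one representative state (and its best score) per cell."""

    def __init__(self, factor):
        self.factor = factor
        self.cells = {}  # cell key -> (score, state)

    def add(self, frame, state, score):
        """Store `state` under the frame's cell if it beats the current best."""
        key = downsample_cell(frame, self.factor)
        best = self.cells.get(key)
        if best is None or score > best[0]:
            self.cells[key] = (score, state)
        return key

    def sample_state(self, rng=random):
        """Pick a cell uniformly at random and return its stored state."""
        key = rng.choice(list(self.cells))
        return self.cells[key][1]
```

    Two frames that differ by a little pixel noise land in the same cell, so the archive grows with the number of meaningfully different situations rather than the number of raw frames. The "return" step is then whatever gets the agent back to the state from `sample_state`, either a simulator restore or a learned policy.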

    Besides this change, the Atari experiments mostly operate the same way: they assume a simulator or deterministic environment, learn the policy by leveraging the determinism, then do a robustification step where they try to reproduce behavior in a stochastic version of the environment.

    The part I care about is the part they call Policy-based Go-Explore. My main criticism of the original Go-Explore paper was that it required access to a deterministic analogue of your final environment. They proposed learning a goal-conditioned policy to return to previous states, instead of following a memorized trajectory, which lets you handle stochastic environments at training time. However, they left it as future work.

    Well, now they have results. It worked, but it was only tested on Montezuma’s Revenge with domain-specific features. I view papers through survivorship bias: if there’s an experiment that’s natural in the paper’s context, but isn’t in the paper, then it probably didn’t work, because if it worked, it’d be in the paper. So for now, I’m assuming it didn’t beat SOTA with domain-agnostic features.

    My final verdict is that the updated paper improved its strengths, but only mildly improved its weaknesses. The paper is an even stronger case that good exploration can be reduced to learning to quickly return to states you’ve visited before, and exploration algorithms without this capability have failure modes that First Return Then Explore fixes. Learning that return policy, however, is still an open problem for general domains. The reduction is valuable, and I hope it encourages more work on efficiently learning goal-conditioned policies.

    Data Augmentation

    The new hotness in RL is data augmentation. Three papers came out on arXiv in the past week: Contrastive Unsupervised Reinforcement Learning (CURL), from Srinivas and Laskin et al, Image Augmentation is All You Need (DrQ) from Kostrikov and Yarats et al, and Reinforcement Learning with Augmented Data (RAD) from Laskin and Lee et al. It also made it to VentureBeat of all places.

    These three papers all find that for image-based RL, data augmentation gives very large gains on several tasks. Now at this point, I should mention that CURL and RAD are from people I know from UC Berkeley, and DrQ is from people I know from Google, so I’m going to step very carefully…

    CURL learns a representation by contrastive learning. Two randomly sampled data augmentations are applied to the same image, and their representations are encouraged to be close to one another through an InfoNCE loss. (See the SimCLR paper for an ablation showing this contrastive loss does better than other ones.)
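
    For reference, here is a small sketch of what an InfoNCE loss computes over a batch of paired embeddings. Treat the details as assumptions for illustration: the plain dot-product similarity, the `temperature` value, and the function names are mine, and CURL's actual similarity function and encoder setup are not reproduced here.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(queries, keys, temperature=0.1):
    """Mean InfoNCE loss for a batch of paired embeddings.

    queries[i] and keys[i] are embeddings of two augmentations of the
    same image (the positive pair); keys[j] for j != i act as negatives.
    """
    losses = []
    for i, query in enumerate(queries):
        logits = [dot(query, key) / temperature for key in keys]
        # Numerically stable log-sum-exp over all candidate keys.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy where the "correct class" is the matching key.
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)
```

    The loss is just a softmax classification problem: given one augmented view, pick out the other view of the same image from the rest of the batch. It is near zero when each query is far more similar to its own key than to anyone else's, and large when the pairing is scrambled.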

    RAD compares just using data augmentation, without any contrastive losses, and finds that it outperforms CURL on the DMControl Suite. The theory is that in these environments, RAD beats CURL because it only optimizes for the task reward we care about, while CURL has to balance RL and contrastive learning. An ablation of the data augmentations used finds that random cropping is by far the most important data augmentation.

    DrQ also does data augmentation, using random shifts. This is the same as padding the image, then doing a random crop. In an actor-critic framework, they sample data augmentations to estimate the Q-value, sample other data augmentations to estimate the target Q-value, and do a critic update that’s now regularized by the data augmentation.
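
    The "pad, then random crop" equivalence is easy to see in code. Below is a rough sketch on a 2D image stored as a list of rows. The helper names, the choice to pad by repeating border pixels, the default `pad=4`, and the `averaged_q` helper are all assumptions for illustration, not DrQ's implementation.

```python
import random


def pad_edges(image, pad):
    """Pad a 2D image (list of rows) by repeating its border pixels."""
    padded_rows = [[row[0]] * pad + list(row) + [row[-1]] * pad for row in image]
    top = [list(padded_rows[0]) for _ in range(pad)]
    bottom = [list(padded_rows[-1]) for _ in range(pad)]
    return top + padded_rows + bottom


def random_crop(image, out_height, out_width, rng=random):
    """Crop a random out_height x out_width window from a 2D image."""
    top = rng.randint(0, len(image) - out_height)  # randint is inclusive
    left = rng.randint(0, len(image[0]) - out_width)
    return [row[left:left + out_width] for row in image[top:top + out_height]]


def random_shift(image, pad=4, rng=random):
    """Shift by up to `pad` pixels: pad the image, then crop back to size."""
    return random_crop(pad_edges(image, pad), len(image), len(image[0]), rng)


def averaged_q(q_function, image, action, num_augmentations=2, rng=random):
    """Average a Q estimate over several randomly shifted copies of the image."""
    values = [q_function(random_shift(image, rng=rng), action)
              for _ in range(num_augmentations)]
    return sum(values) / len(values)
```

    Here `q_function` is a stand-in for any callable that maps `(image, action)` to a number. Averaging over shifted copies on both the prediction side and the target side is the sense in which the critic update ends up regularized by the augmentation.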

    Now, are these results surprising? Uh, kind of? It isn’t surprising because data augmentation isn’t new. Specifically doing random cropping isn’t new either - the QT-Opt paper I worked on 2 years ago used random cropping. Other groups have used data augmentation as well. The surprising part is the effect size. These papers are the first to carefully design an experimental setup that lets them isolate and measure the gains from data augmentation.

    It’s the sort of paper that makes you feel dumb you didn’t write it yourself. I’ve run very similar data augmentation ablations before, with results that were consistent with theirs, but I never did it on standard RL benchmarks and I never dug into it more. If I had, I probably could have written this paper. Ah well, live and learn.

    I’m very big on data augmentation. It just seems like the obvious thing to do. You can either view it as multiplying the size of your dataset by a constant factor, or you can view it as decreasing the probability your model learns a spurious correlation, but in either case it usually doesn’t hurt and it often really helps.

    AI Economist

    Salesforce put out a paper that uses reinforcement learning to design tax policy in a toy economic environment, and they argue their tax policies give better equality-productivity trade-offs, compared to the Saez framework.

    I do not understand tax policy very well, but my first instinct is that the economy is really complicated, a model of the economy has to be too simplistic somewhere, and therefore the results should be taken with massive caveats. The authors are aware of this, and the ideas the paper plays with are interesting. I’ve found papers like this are best viewed as idea generators. Within a model, the AI discovers a new strategy, which could be useful in the more complex environment, but you will get better results by asking a human to consider whether the AI’s strategy makes sense, instead of applying the AI’s strategy directly.

    Within the simulated economy, the agent preferred higher tax rates for the top brackets and lower tax rates for the middle class. So that’s interesting.

    It’s very unlikely this makes it to actual tax policy anytime soon. The real economy is more complicated, the politics is a nightmare to navigate, and the people in charge of economic policy probably care more about the perception of a good economy than the reality of a good economy. Given the ethics questions surrounding economics experiments, perhaps that’s for the best.

    Offline Reinforcement Learning

    Some colleagues from Google Brain and UC Berkeley have put a tutorial for Offline Reinforcement Learning on arXiv.

    By offline reinforcement learning, they mean reinforcement learning from a fixed dataset of episodes from an environment, without doing any additional online data collection during learning. This is to distinguish it from off-policy learning, which can happen in an offline setting, but is commonly used in settings with frequent online data collection.
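
    As a toy illustration of the setting (not an algorithm from the tutorial), here is tabular Q-learning run purely over a fixed list of transitions. The function names, the `(state, action, reward, next_state, done)` tuple layout, and the hyperparameters are assumptions made for the sketch. The point is only that the training loop never touches an environment.

```python
from collections import defaultdict


def offline_q_learning(dataset, actions, gamma=0.9, learning_rate=0.5, sweeps=200):
    """Tabular Q-learning over a fixed list of transitions.

    Each transition is (state, action, reward, next_state, done). The
    loop only replays `dataset`; it never queries an environment.
    """
    q_values = defaultdict(float)  # (state, action) -> estimated return
    for _ in range(sweeps):
        for state, action, reward, next_state, done in dataset:
            if done:
                target = reward
            else:
                # Bootstraps from the best action at next_state, even if
                # that (state, action) pair never appears in the dataset.
                target = reward + gamma * max(
                    q_values[(next_state, a)] for a in actions)
            key = (state, action)
            q_values[key] += learning_rate * (target - q_values[key])
    return q_values


def greedy_action(q_values, state, actions):
    """Return the action with the highest estimated value in `state`."""
    return max(actions, key=lambda a: q_values[(state, a)])
```

    In this tabular toy, unseen pairs just default to zero. With function approximation, the same `max` can latch onto wildly wrong values for actions the dataset never covers, and with no new data collection there is nothing to correct them, which is a big part of why the offline setting is hard.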

    Offline RL is, in my opinion, a criminally understudied subject. It’s both very important and very difficult, and I’ve been talking about writing a blog post about it for over a year. Suffice it to say that I think this tutorial is worth reading. Even if you do not plan to research offline RL, I feel the arguments for why it’s important and why it’s hard are useful to understand, even if you disagree with them.