Improving Deep Reinforcement Learning via Quality Diversity, Open-Ended and AI-Generating Algorithms
Date Posted:
May 8, 2023
Date Recorded:
May 2, 2023
Speaker(s):
Jeff Clune, Associate Professor, Computer Science, University of British Columbia; Canada CIFAR AI Chair and Faculty Member, Vector Institute; Senior Research Advisor, DeepMind
Description:
Abstract: Quality Diversity (QD) algorithms are those that seek to produce a diverse set of high-performing solutions to problems. I will describe them and a number of their positive attributes. I will summarize how they enable robots, after being damaged, to adapt in 1-2 minutes in order to continue performing their mission. I will next describe our QD-based Go-Explore algorithm, which dramatically improves the ability of deep reinforcement learning algorithms to solve previously unsolvable problems wherein reward signals are sparse, meaning that intelligent exploration is required. Go-Explore solved all unsolved Atari games, including Montezuma’s Revenge and Pitfall, considered by many to be a grand challenges of AI research. I will next motivate research into open-ended algorithms, which seek to innovate endlessly, and introduce our POET algorithm, which generates its own training challenges while learning to solve them, automatically creating a curricula for robots to learn an expanding set of diverse skills. Finally, I’ll argue that an alternate paradigm—AI-generating algorithms (AI-GAs)—may be the fastest path to accomplishing our field’s grandest ambition of creating general AI, and describe how QD, Open-Ended, and unsupervised pre-training algorithms (e.g. our recent work on video pre-training/VPT) will likely be essential ingredients of AI-GAs.
Bio: Jeff Clune is an Associate Professor of computer science at the University of British Columbia and a Canada CIFAR AI Chair at the Vector Institute. Jeff focuses on deep learning, including deep reinforcement learning. Previously he was a research manager at OpenAI, a Senior Research Manager and founding member of Uber AI Labs (formed after Uber acquired a startup he helped lead), the Harris Associate Professor in Computer Science at the University of Wyoming, and a Research Scientist at Cornell University. He received degrees from Michigan State University (PhD, master’s) and the University of Michigan (bachelor’s). More on Jeff’s research can be found at http://www.JeffClune.com or on Twitter (@jeffclune). Since 2015, he has won the Presidential Early Career Award for Scientists and Engineers from the White House, had two papers in Nature and one in PNAS, won an NSF CAREER award, received Outstanding Paper of the Decade and Distinguished Young Investigator awards, and had best paper awards, oral presentations, and invited talks at the top machine learning conferences (NeurIPS, CVPR, ICLR, and ICML). His research is regularly covered in the press, including the New York Times, NPR, NBC, Wired, the BBC, the Economist, Science, Nature, National Geographic, the Atlantic, and the New Scientist.
CEDRIC: Thank you, everybody, for coming to the CBMM Quest seminar. We're really happy today to have Jeff Clune. I'll just make a very brief introduction. I am not sure when I first met Jeff. It might have been at the Uber AI launch party. Is that possible?
Yeah, so he was one of the founding members of Uber AI's team. That was like an interesting middle point in his career. I first became aware of his work when he published a paper that was one of the original adversarial images projects, showing that deep neural networks for vision -- standard classification networks -- could be fooled into making very confident predictions on images that were procedurally generated by a kind of evolutionary program and looked nothing like natural images.
So, to me, this is probably still one of the most interesting adversarial image results, if not the most interesting. And I was just asking Jeff if he thinks, with all the progress recently, that's been solved. And he thinks, no, probably not. And I think that's probably right, too. I doubt that's what he's going to talk about here, but it was an example of really creative out-of-the-box thinking.
And it reflects, I think, some of his background and the ways that he's approached really interesting problems in AI from an evolutionary standpoint, whether it's biological or cultural evolution or just thinking about various dynamics of self-improvement that are really, I think, some of the most interesting ideas out there. They're not the mainstream ideas, although, as I think we'll see in this talk, he's also nimbly bootstrapped off of some of the recent exciting developments in other more mainstream areas of AI.
But for his whole career, as far as I can tell, he was an early professor at the University of Wyoming. He's now at University of British Columbia. In addition to being at Uber AI, he was at OpenAI. He's part of the Vector Institute. He's gone through many different places in his career, but he's really had his eye on the big, hard questions of intelligence and what is it to have intelligent agents that can really learn for themselves and what kind of, again, broader evolutionary dynamics can capture some insights from how evolution has happened in the natural and artificial worlds, but also maybe take AI into new places that are different from the ones that happened in biology.
So, anyway, I'm really happy to have him here and have him be telling us about some of his work for the last two years and maybe some future directions. And we'll have him speak for maybe 55 minutes or so and then leave time for some questions. And I bet that will be a particularly good part of the session. So without further ado, I'll hand it over to Jeff Clune.
[APPLAUSE]
JEFF CLUNE: Thank you very much, Josh, for the invitation and to Cedric for organizing and for the lovely introduction. So it's an honor and a pleasure to be here. Today the main goal is to share some work that I think represents a different style than you're probably used to in terms of approaches to reinforcement learning. That's going to take us through quality diversity algorithms and open-ended algorithms.
And then I want to take a step back at the end of the talk and really think about the overall grand ambitions of the field and how we might accomplish the grandest ambitions of the field. And that will take us into this thing that I'm calling an AI generating algorithm.
So we've got a lot of ground to cover, but before I do any of that, I want to quickly give a shoutout to my longstanding collaborator, Jean-Baptiste Mouret, Ken Stanley, and Joel Lehman, who have really been instrumental in developing a lot of the ideas and the work that I'm going to tell you about today.
OK, so there's a paradox in life that is pretty interesting. And that is, if you try really hard to solve a very difficult problem, you will fail. Ironically, if you don't try to solve that problem at all, you're much more likely to succeed. So as a metaphor for that, think about this maze. If you're this agent and you're trying to get to this goal, and you only reduce your distance to the goal-- you do things that get you closer to the goal-- then that's what classic optimization does. You optimize towards the goal, and you get this distribution of policies that kind of bang their head against that wall forever and never solve the maze.
But you can trivially solve this problem by just going and exploring and saying, I just want to go to new places. And if you optimize a policy to go to new places, you trivially solve this maze. Now, you might write that off as a simple caricature, but I will argue, and my colleagues argue, that this happens all throughout human society and also in evolution.
For example, if you go back thousands of years to this cooking technology and you say, I am a queen and I will only fund scientists in my kingdom who give me the ability to cook faster and with less smoke, you will never produce the microwave because to produce the microwave, you had to have been working on radar technology and notice that a chocolate bar melted in your pocket.
Similarly, if you start with the abacus, and you fund scientists who get you more compute per hour, you will never invent the modern computer because to do so, you had to have been working on electricity and vacuum tubes, which were not invented because of the benefits they provide vis a vis computation. So the conjecture here is that the only way to solve really hard problems is to kind of invent the problems as you go inspired by new things that have happened recently and to goal switch between problems as you go.
So the idea of goal switching came from this work that we did in 2016 with the same authors of the paper that Josh was referring to. And the idea is if you're trying to work on task A and you end up making progress on task B, also start optimizing for task B. For example, if you're a roboticist and you're trying to get your robot to do bipedal locomotion, and you suddenly see it balancing on one leg or crawling, don't throw that behavior out because it doesn't quantifiably improve your ability to walk. You should save that task and that skill and practice it more because those might be important stepping stones to ultimately learn how to walk.
And so, basically, the idea is we want our algorithms to capture serendipity on the wing as they go through their process. And so this brings us to this family of algorithms that's called quality diversity algorithms that my colleagues and I have been pioneering and creating. And the idea is that we want these algorithms not to do what traditional optimization does, which is to return the single best solution it has found, but instead to return a diverse set of solutions, where every one of those solutions is as high quality as possible.
So, here, for example, you see all these totally different solutions, if you will, engineering marvels. But each one of them is really good at whatever it is it does in the natural world. So probably the most famous algorithm in quality diversity right now is an algorithm called MAP-Elites that was invented by Jean-Baptiste Mouret and myself. And it's really, really simple in how it works. So I can explain it to you.
What we're first going to do is we're going to have some performance measure, like I want to solve some problem, like I want a robot that's really fast. But I will also pick dimensions of variation that I think are interesting. So here in this example, height and weight, for example. And what I really want this algorithm to do is get me the fastest robot at all possible combinations of those dimensions of variation.
So for example, what I will do is I will initialize the algorithm by producing a random agent. This could be a deep neural net or some other thing. I will evaluate it. I will get its performance. Maybe here, it was speed. Also its phenotypic properties, or you could call them behavioral characteristics, whatever you want. And these are height and weight.
And then I will put that agent wherever it belongs in the cell-- for example, here. And I'm coloring performance with this heatmap here. Then the algorithm will do the following. It'll just loop forever, grab something out of the map, perturb it a little bit, reevaluate it, put it wherever it belongs. And if it's the best thing you've ever seen in that cell, according to the performance criterion, then you swap it in. So it's like a map of elites, the best you've ever seen at every point in this grid. And you just run that forever, and you fill up the grid.
And what's very different from an optimization algorithm is it returns to you this surface or this set that teaches you a lot about your search space, that illuminates your search space. It tells you, oh, there's like high modes here and here. You can't do very well in the middle. These are not connected, et cetera. It tells you a lot more about the problem that you're trying to solve.
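In rough Python, the MAP-Elites loop just described might look like the following minimal sketch. The solution representation, mutation operator, evaluation function, and grid resolution here are illustrative placeholders, not details taken from the talk or the paper.

```python
import random

# Minimal MAP-Elites sketch. `evaluate` is assumed to return a fitness value
# plus a low-dimensional behavior descriptor with values in [0, 1]
# (e.g., normalized height and weight); `random_solution` and `mutate`
# are placeholder operators supplied by the user.

def map_elites(random_solution, mutate, evaluate,
               bins_per_dim=(20, 20), iterations=100_000):
    archive = {}  # grid cell (tuple of bin indices) -> (solution, fitness)

    def cell_of(descriptor):
        # Discretize the behavior descriptor into a grid cell.
        return tuple(min(int(d * b), b - 1) for d, b in zip(descriptor, bins_per_dim))

    def maybe_insert(solution):
        fitness, descriptor = evaluate(solution)
        cell = cell_of(descriptor)
        # Keep the solution only if it is the best ever seen in that cell.
        if cell not in archive or fitness > archive[cell][1]:
            archive[cell] = (solution, fitness)

    # Initialize with random solutions, then loop: select, perturb, re-insert.
    for _ in range(100):
        maybe_insert(random_solution())
    for _ in range(iterations):
        parent, _ = random.choice(list(archive.values()))
        maybe_insert(mutate(parent))

    return archive  # the "map of elites": best solution found in every visited cell
```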
And so here is a problem that we had worked on before MAP-Elites, where we produced these kind of soft robots whose job it was to run. And we put this video out that had this nice diversity of creatures. But there was a cheat here, and that is that each of these creatures came from a separate run of optimization. Within that run-- it was an evolutionary-based run-- all of the organisms in that population were slight variants on each one of these. So the only real diversity was from different random restarts of the algorithm.
But we don't want that. Evolution doesn't work that way. Instead, what we want is we want to have diversity. And so the dimensions of variation here are the number of voxels and the amount of this dark blue bone material. And if you run classic optimization here, you can see low performing solutions that don't search the search space very much. If you add diversity in the same two dimensions, what you get is higher performing solutions, so you see some yellow here, but it really doesn't explore the search space much at all. But when you run MAP-Elites, you get this total sea change in what the algorithm does.
So there's the exact same amount of compute going into this map and this map, and look at the difference in the amount of exploration it did in the search space, which is pretty remarkable. And what I'm going to show you for the rest of the middle section of the talk is that you can do a lot once you have this map. You can use it in many interesting ways.
So, one thing I like to do is to go into the map. And because nearby points are similar-- remember, they have similar heights and similar weights or whatever it is you're measuring-- you can actually kind of see variations on a theme. So here is a creature that has a big bony shell, and it kind of is like a turtle walking. And as you can sweep down this dimension here, you can see it's basically the same motif, same design, but with less and less shell. And you can do the same thing with this creature, with this creature, et cetera.
And so basically, this is kind of like looking into the natural world and seeing the cat idea, but at different sizes and in different colors and maybe with different length claws, et cetera. And so it's kind of-- I think it's interesting and suggestive that we're starting to get some of the properties we see in the natural world inside of these algorithms. One other thing that's really essential is that goal switching turns out to be critical to the success of these algorithms.
So if you take any point in the map, you can consider that its own optimization problem. I wanted the best thing right there, and you look back through the history of where did the parents of that final thing, where did they visit in the map? What you see are these long, circuitous paths through the search space. Ahead of time, you would not have known that the best way to get the right solution in the top right corner was to traverse in this particular way. There might have been other ways, too, but what we know is if you only tried to stay in the top corner, you wouldn't have performed very well.
So what I want to tell you now is this paper that we had in Nature in 2015 called "Robots that adapt like animals." And it basically is going to use the ideas of quality diversity algorithms to get robots to adapt really quickly. You can actually see the map. We snuck our data onto Nature, which they normally don't let you do, but we made it beautiful. All right, and it was with these lovely authors here.
So here's our robot. It's going to become damaged. And the question is if the robot is out in the field finding landmines or survivors or whatever, you want it to continue on with its mission despite injury. So think about what an animal would do when it becomes injured. If you're in a forest, for example, and you sprain your ankle, what you don't do is what a traditional RL algorithm would do, which is try a billion different permutations that are very similar to the current policy that you started out with.
Instead, an animal like yourself would probably try to walk on that ankle and say, ow, that hurts. And then you might try a different idea, like, why don't I walk on the side of my foot? And you're like, nope, that still hurts. And you'll say, OK, how about my tiptoes? No, that still doesn't work. And they say, all right, whatever, I'll just hop out of the forest on one leg, which you already knew how to do because you have a high quality set of diverse policies in your brain from when you were a kid playing on the playground or whatever.
And so that's what we want our algorithm to do here. And so we need intuitions about different ways to move. And we want to conduct a few intelligent tests and then pick a behavior that works despite the injury. Here is our robot that we're going to work with. And the way that we're going to get the intuitions about different ways to move comes from that MAP-Elites algorithm. So we're going to have this six-dimensional space. And each dimension is the percent of time that each leg of the robot is touching the ground.
And then we run MAP-Elites in this space, and its job is to go through ahead of time in simulation and find the best way to walk while using each leg that particular percentage of the time, et cetera, et cetera, et cetera. And so it goes from a search space that has more gaits than there are molecules in the solar system to ultimately returning about 13,000 high quality diverse gaits. OK, here are some of the examples of these gaits. So here's a robot that walks with this leg 10% of the time and the other leg some other percent, et cetera. And it goes through, and we have these different ones.
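As a small illustration of that six-dimensional behavior descriptor -- the fraction of time each leg touches the ground -- here is a hypothetical helper, assuming we have a per-timestep log of ground contacts from the simulator:

```python
import numpy as np

def gait_descriptor(contact_log):
    """Fraction of time each of the six legs touches the ground.

    `contact_log` is assumed to be a (timesteps x 6) boolean array recorded
    during a simulated gait; the returned 6-D vector (each entry in [0, 1])
    is the point used to place that gait in the MAP-Elites grid.
    """
    contact_log = np.asarray(contact_log, dtype=float)
    return contact_log.mean(axis=0)  # one duty factor per leg
```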
A fun story is that a literal corner case in the hypercube was the challenge of having the robot walk without any of its legs touching the ground, which we thought was impossible. And we thought it was buggy when we saw performance in our chart in that corner, but we looked, and it figured out how to do this. So as they say in Jurassic Park, life will find a way.
OK, so these are intuitions for how to move on the undamaged simulated robot. So now we have two gaps to cross: to reality and to the damaged robot. How will we do that? Well, we can use Josh's favorite tool, Bayesian optimization, to go through and intelligently sample different points in the search space. And what's really great about this idea is that because nearby points in the map are similar types of gaits, once we do an experiment in the real world and it doesn't work, we don't just update our posterior prediction of how poor that gait is, but as a function of space in the map, we can rule out that entire family of gaits.
I can say, oh, all the gaits that are like that probably won't work, as a function of distance in this space. We can update the posterior from the prior that comes from MAP-Elites on the simulated undamaged robot. So you can try this one. That doesn't work. Try this one, and bounce around the space. And then eventually we have a really simple stopping criterion that I won't even get into because it's so simple.
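Here is a heavily simplified sketch of that adaptation loop: a Gaussian process whose prior mean is the simulated performance stored in the map, updated after each real-world trial, with an upper-confidence-bound rule for picking the next gait to try. The kernel, hyperparameters, and stopping rule below are placeholder choices, not the exact ones from the Nature paper.

```python
import numpy as np

def adapt(descriptors, map_perf, evaluate_on_robot,
          kappa=0.05, sigma2=0.01, length_scale=0.3, max_trials=20):
    """Sketch of the adaptation loop: Bayesian optimization over the map.

    descriptors          : (N, 6) behavior descriptors of the gaits in the map.
    map_perf             : (N,) simulated performance of each gait (the GP prior mean).
    evaluate_on_robot(i) : runs gait i on the damaged robot, returns measured performance.
    """
    def kernel(a, b):
        # Squared-exponential kernel over the 6-D behavior space.
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length_scale ** 2)

    tried, observed = [], []
    mean, var = map_perf.copy(), np.ones(len(descriptors))

    for _ in range(max_trials):
        # Upper confidence bound: try gaits predicted to be good OR still uncertain.
        nxt = int(np.argmax(mean + kappa * np.sqrt(var)))
        tried.append(nxt)
        observed.append(evaluate_on_robot(nxt))

        # GP posterior, with the simulated map acting as the prior mean;
        # a bad real-world trial lowers predictions for all similar gaits.
        X = descriptors[tried]
        K = kernel(X, X) + sigma2 * np.eye(len(tried))
        Ks = kernel(descriptors, X)
        resid = np.array(observed) - map_perf[tried]
        mean = map_perf + Ks @ np.linalg.solve(K, resid)
        var = np.maximum(1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T)), 1e-12)

        # Placeholder stopping rule: quit once the best measured gait is close
        # to the best performance the model still thinks is out there.
        if max(observed) >= 0.9 * np.max(mean):
            break

    best = int(np.argmax(observed))
    return tried[best], observed[best]
```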
So here's a video. You have the original gait. It's a classic hand-coded tripod gait. It's 0.25 meters per second and moves straight. Then damage occurs. So now we need to launch the algorithm. Sorry, we haven't launched the algorithm yet. This is just what happens with the damage on this thing. So the original gait is now not performing well. So we launch the algorithm. First thing I want to do is point your attention to this clock. This is the number of seconds of evaluation time.
So in the first trial, the behavior was curved, 0.13 meters per second. So we go up into the map here, and we're going to update the prior to become the posterior in light of that evidence, and you can see therefore there's kind of like more blue flowing through the map because we're learning that many gaits don't work. This is a six-dimensional map, so it's actually kind of spread throughout the entire thing.
Now I think we're on our fourth trial here, 19 seconds into the thing. And we have something that's almost recovered the original speed and is mostly straight, but our stopping criterion thinks there's still something out there that's better. So we'll try one more. Here we go. And now 24 seconds in, we've learned this dynamic limping gait that has basically recovered the performance of the original policy.
If you did like policy gradients or something, that would probably take tens of thousands of evaluations to work. And here, we did it in five or six or something. So we tried it with different damage conditions on different robots, and in all cases, this algorithm works quite well.
And this is probably the most important plot. We have many different variants of the algorithm. These two here are Bayesian optimization in the original search space or policy gradients, which is a modern RL algorithm in the original search space. Neither of them worked very well. All of the variants that start with the intuitions coming out of the quality diversity algorithm MAP-Elites are the ones that work really well and really quickly. Yes.
AUDIENCE: Quick question, but how do you measure the distance between gaits?
JEFF CLUNE: So we just have this six-dimensional space, which is the percent of time that you use each of your legs. And so you probably just have a Euclidean distance in that discretized space.
OK, so to conclude this little mini chapter of the talk-- the talk has many chapters-- we call this the intelligent trial and error algorithm. It shows in my opinion the power of quality diversity algorithms, like what you can do if you have this repertoire of diverse gaits. And in this case, you can kind of hand it off to some other algorithm here, Bayesian optimization, which can do wonderful things once you have this diverse, high quality set of gaits ahead of time. And that allowed us to have robots adapt in two minutes.
OK, so the next paper I want to tell you about also uses the principles of quality diversity, but in a totally different context. This was a paper we also had in Nature, this time in 2021, and it was done with these wonderful colleagues. And so one of the Achilles heels, in my opinion, in learning in general and deep reinforcement learning is the problem of exploration. What do you do when you're trying to learn via trial and error, but every time you try things, you get no reward from the system?
And so, these are called hard exploration problems. There's two types. One is sparse, where you're just basically almost never getting anything back from the reward. It's not telling you good or bad because you're not doing whatever it is that the reward function wants you to do.
And the kind of classic example of this is the game Montezuma's Revenge, which, for years, we were getting basically zero points on as a community, even though we were getting really high performance on other Atari games, because in this game, you have to take all sorts of actions and jump over this skull, et cetera, to go get this key. And you don't get any reward until you go and you get that key. So that had become like a mini grand challenge in the field to try to make some progress on this game.
An even harder game that wasn't even the grand challenge because we were way worse at it was this game Pitfall, and that's where the reward function is actively lying to you. So in this game, almost everything you do results in negative points if you don't know what you're doing. So you run into enemies, you fall down traps, you fall into the tar pits, whatever, it's like negative, negative, negative, negative. So every algorithm at the time just learned to stand still because that's like the only thing to do in this cruel, horrible world, where I just get punished every time I do anything.
Now, obviously, that's not the true solution. The true solution to the game is to ignore those negative rewards, go explore, figure out how to play the game and ultimately get high rewards, but that's not what RL algorithms do. So we call these deceptive problems. OK, so classic deep RL algorithms on Montezuma's Revenge were in the range of 0 to 2,500 points, and on Pitfall, no algorithm had ever scored a single point on the game.
And so what you're probably thinking is like, OK, well, the classic answer here to these kind of problems is this thing called intrinsic motivation. Like, I'm not getting external reward. So why don't I have internal motivation to just go to new states, explore the problem and the game, and then I'll hopefully figure out what these external rewards are. Well, people have tried this, myself included, for a really long time, and the short answer is it kind of works. I mean, it definitely helps. It's better than not doing it.
And for example, three weeks before we put our paper on arXiv, there was this huge fanfare about a paper out of OpenAI called RND, which, as you can see here, was a really big step up from all the mostly failure cases on Montezuma's Revenge. It got all the way to 11,000 points. But if you watch humans play the game, they get way more than 11,000 points. So we're really far from human level play on this game.
And so I spent a lot of time at Uber AI labs, as Josh mentioned, trying to think about why isn't intrinsic motivation working. Like, it should work. It's a really good idea. Why isn't it working better? And I think we ultimately identified two pathologies at the heart of intrinsically motivated RL algorithms. One of them I called detachment. And it's best illustrated with this cartoon here.
So imagine you're an agent. You start here. You're in this maze, and we're coloring green all of the places the agent hasn't been yet. OK? So it will get an intrinsic motivation. It will be happy if it goes to that place because it's new. All right, so let's see what happens. The agent starts going over here. And every time it goes to and touches some green, I turn it purple. It's consumed that.
Sorry, that's not true. Everywhere the current agent is going is purple. So the current agent is going all the way to right here, touching all of this green, yeah. But in intrinsically motivated algorithms, we always have to discover new green and new green and new green. So the way that we do that exploration, even to get the intrinsic motivation, is we sprinkle in random stuff, like random actions or posterior sampling, et cetera. So we're randomly perturbing this agent.
And so maybe, by chance, we flip the agent that was going to right here, we flip it to this side, and it happily goes and it consumes all the green over here. But now what happens? OK, this policy has been trained for a long time to figure out how to go to the right, go to the right, get all this stuff. And now there's no green left over here. And so we're randomly perturbing it, and it's flailing here or here or here or here, flailing here, even flailing here. It's never getting any more green because it's been to all of those places.
And so what we say is that it has detached from the frontier of promising exploration here. It used to know how to get here, and it knew that it was producing a lot of green. But it has completely forgotten about that because it has detached from that promising frontier. So our proposal is just to explicitly remember promising locations for exploration and how to get back to them so we can get back to them later.
All right, the second pathology that we identified, we call derailment. And it's best illustrated by this fun drawing here. So imagine that you're an agent, and you're trying to explore this platform here. You're trying to get somewhere. There's some reward. You don't know where it is, somewhere on this platform. Imagine you climb all the way up to right there. Now, that is a very impressive thing to do. And you've got intrinsic motivation when you got there because you had never been there before, but then you fell off.
OK, so what would modern RL do? Well, it would say, that was really good. We got somewhere new. Let's give the policy reward. And then we'll launch the policy again to go back there, but we're always sprinkling in those random perturbations to try to find new stuff. And so the problem is that we're doing that throughout the entire trajectory. And so probably what we're going to do is knock it off the wall before it ever even gets back to this place because that was a really hard thing to do in the first place. And we're making it actively harder for the agent to do that, which I think is a pretty terrible idea.
This becomes really pernicious, especially as the length, complexity, and precision of the sequence required to get to a hard place get bigger, which means it gets more and more damaging to us as the agent is learning more and more complicated skills. So the insight that we had, which ended up becoming the title of the paper in Nature, was first return, then explore. That's it. That's all you need to do. And good things should happen.
So our idea is we want to first return to this without adding any perturbations whatsoever. And then we will explore from there. And from here, it's not that hard to maybe figure out that there is a reward on top of this thing. Now, you might counter that that would hurt robustness because we're not practicing in the face of noise. I would agree with you, but remember, we're exploring. We don't know where the reward is. Maybe the reward is over here, or it's over here, et cetera. So let's first go find out what's rewarding. And then once we know where the rewards are in this domain, then let's practice that in the face of stochasticity and figure out how to do that reliably.
OK, so that leads to this two-phase solution. First, we'll try to find out how to solve the problem. Then we will, what we call, robustify that policy in the face of noise. So here's how the algorithm works. We ended up calling this algorithm Go-Explore. So we're going to initialize by taking random actions and storing the states that we visited in an archive. And then we're going to loop this thing until we solve the problem. We'll just pick a state from the archive. We'll go back to it with no exploration added. We'll explore from it.
And then if we find new states at any point in time, then we will just add them to our archive, with one extra twist, which is, if we find a better way back to an existing state that we already visited, like it got more points or it was more efficient and fast, then we'll swap that in as the preferred trajectory back to that particular location. Then we'll just run that forever.
I'm hoping that this is seeming a little familiar to you because this is the idea of quality diversity. It's now just being done within state space. What we're doing is we're having the algorithm go out and find a diverse set of high quality policies to visit different states within the state space. So it's basically a novel version of MAP-Elites.
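A minimal sketch of that first, exploration phase might look like the following. It assumes a simulator that exposes save_state and restore_state methods (not part of the standard Gym API) plus a cell_of function like the downsampling one described a bit later; the uniform cell selection and simple bookkeeping here stand in for the more careful weighting used in the actual work.

```python
import random

def go_explore_phase1(env, cell_of, iterations=100_000, explore_steps=100):
    """Phase 1 of Go-Explore (sketch): build an archive of cells and how to reach them."""
    obs = env.reset()
    # Each entry: cell -> saved simulator state, the action sequence that reached
    # it, and the best score seen when reaching it.
    archive = {cell_of(obs): {"state": env.save_state(), "actions": [], "score": 0.0}}

    for _ in range(iterations):
        # 1) Pick a cell from the archive (uniform here; the paper weights
        #    less-visited, more promising cells more heavily).
        cell, entry = random.choice(list(archive.items()))

        # 2) First return: restore the simulator to that cell, with no noise added.
        env.restore_state(entry["state"])
        actions, score = list(entry["actions"]), entry["score"]

        # 3) Then explore: take random actions from there.
        for _ in range(explore_steps):
            action = env.action_space.sample()
            obs, reward, done, _ = env.step(action)
            actions.append(action)
            score += reward
            c = cell_of(obs)
            known = archive.get(c)
            # Add new cells; also keep better (higher-scoring or shorter) ways
            # back to cells we already know about.
            if known is None or score > known["score"] or (
                    score == known["score"] and len(actions) < len(known["actions"])):
                archive[c] = {"state": env.save_state(),
                              "actions": list(actions), "score": score}
            if done:
                break

    return archive  # trajectories to robustify with imitation learning in phase 2
```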
OK, so I think now it's really instructive to contrast intrinsic motivation and what it does with Go-Explore. So intrinsic motivation is always being drawn to new areas. It's got a policy that's like going over there, going over there, going over there, because that's where I recently found intrinsic motivation. So it acts like a narrow beam flashlight that's pointing over there, and then it's slowly drawn around the room or the house, if you're trying to explore a house, by green appearing on the periphery of that beam. But it's always going in one place.
In contrast, I think of MAP-Elites as like turning on the lights in the living room and then its adjacent rooms and then their adjacent rooms and like an expanding sphere of illumination and knowledge until you fully know the building or the house. So it's a very different way to explore. OK, so one of the things you have to do is you can't run this algorithm in the original state space. It's too high dimensional.
And so we just have to do some abstraction or conflation where basically similar states should be counted the same, and different states should be different. So we want interestingly different states to map to different cells. So we actually came up with a really simple approach, which is: we'll just take the original frame and downsample it. And if two different frames in the downsampled version are identical, we call those the same thing. So minor adjustments will hash to the same thing. And if a frame hashes to something new, we count that as a new state. And surprisingly, that worked quite well.
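As a sketch of that downsampling idea, the following maps a frame to a coarse, hashable cell; the target resolution and number of gray levels are placeholder values rather than the exact settings from the paper.

```python
import numpy as np

def cell_of(frame, size=(11, 8), gray_levels=8):
    """Map an Atari frame to a coarse, hashable cell.

    Frames that differ only slightly land in the same cell; interestingly
    different frames land in different cells.
    """
    frame = np.asarray(frame)
    if frame.ndim == 3:                      # RGB -> grayscale
        frame = frame.mean(axis=2)
    h, w = frame.shape
    th, tw = size
    # Crude area downsampling: trim to a multiple of the target size,
    # then average over blocks.
    frame = frame[: (h // th) * th, : (w // tw) * tw]
    blocks = frame.reshape(th, h // th, tw, w // tw).mean(axis=(1, 3))
    quantized = (blocks / 256.0 * gray_levels).astype(np.uint8)
    return quantized.tobytes()               # hashable key for the archive
```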
To return to the cells, we can either exploit the fact that we're usually in a video game or a simulator, and we can literally just resurrect the cell. Or if you don't like that for a variety of reasons, we can train a goal-conditioned policy where you tell it to go to that particular situation, and it has to train to get back there, which is less efficient, but as I'll show you later, works just fine.
OK, so interestingly, in Go-Explore, the first phase has no neural networks and no generalization. It actually just learns an open loop set of controls, like up, up, down, down, that will get you to a particular place. But then we go to a second phase, where we will train a deep neural network to robustly perform that behavior via imitation learning in the face of stochasticity and noise. So when we ran Go-Explore, we immediately saw four times the previous state of the art, just like way better and actually above the human expert on the game, which is the definition of solving this challenge. So we went from not performing very well to solving the challenge.
But we can even do better. So my ideal algorithms don't require any domain knowledge, but when you add domain knowledge, they can take advantage of it. The easy way to do that in Go-Explore is we can just say, oh, what really should count as a new situation is when you get to a new X, Y location, or you get a new number of keys, or you get to a new room. Really easy to specify. If you do that, then things get quite ridiculous.
Now we can have a policy that, on average, gets 660,000 points, so we had to break the plot's axis. And basically, the best neural network we trained could basically play the game for as long as Adrien was willing to let it run. It ended up scoring like 18 million points and beating the human world record of 1.2 million points. I have no idea who spent so much time on Montezuma's Revenge, but I'm very sorry that we stole your world record.
OK, Pitfall, as I mentioned, had never had a single positive point from any RL algorithm before. And Go-Explore ends up getting a max score of effectively within a couple of thousand of the max possible score on the entire game and dramatically advances the state of the art here. We ended up running it on all of the Atari games in the benchmark.
And on every single game that we ran it on, most of the time, we got state of the art. And every single time, we beat the human level performance, which was the definition of solving the game. So this, in effect, solved the Atari benchmark suite with some caveats. And then we went to all of the unsolved Atari games and did the robustification thing, and it also solved them. So effectively, we're confident that Go-Explore can solve all of the Atari games.
I mentioned that you could, if you want to, you can train a policy to go back to a particular situation without resurrecting from the simulator. You just give it that goal and train it to do that. There are some benefits to that I won't go into for time, but I wanted to show you that also works, and it beats the state of the art and average human performance on these hard exploration games. So it works quite well.
OK, we also applied it to a robotics task. And I just want to highlight this is a more real-world example, where you have a hard exploration problem, like you want your robot to clean your room or to make you an omelet, and you don't want to give it all these intermediate rewards. And you just want it to go explore and solve that task. And it works. Here, we're only giving the robot a reward if it puts its toys away. And basically, policy gradients with intrinsic motivation doesn't do anything on this problem. Go-Explore solves this problem quite easily and quite efficiently.
OK, there's lots of future directions here. I just want to quickly mention them because I know there are some graduate students in the room that might be looking for a project. One of them that I think is fascinating is trying to learn the cell representation, instead of using a generic one or a hand-engineered one. [? Akarsha ?] and I were talking about that earlier today. I think that's a really good direction.
Another one that no one's worked on yet, but I think would be fascinating, is run Go-Explore on like a million or like 2 million different hard exploration problems. Each one of them, you could quickly and efficiently get a solution demonstration. Then distill all of those different solutions into one policy. And now you have a generalist policy like GPT-4, but for robots, that can just zero-shot a whole bunch of new tasks. I think that would be really profitable and interesting.
And then another one is that I think you could try to learn to explore like Go-Explore, like do what it's doing because it has a lot of principles for how to explore domains. Get that into a policy that can do it in a flexible and generic way. And actually, [? Akarsha ?] and Phillip Isola and I are collaborating on that. So I got to talk about that today during our visit, which is pretty fun.
OK, so to conclude this mini chapter of the talk, I think that Go-Explore is a new approach for hard exploration domains. So I think independent of the overall theme of this talk, I think it's interesting. Yeah, one second. I think it opens up many new exciting research directions. But I think for the purposes of this talk and the overall theme, it once again shows the value of these quality diversity ideas. Like, what can you do if you have this diverse set of high quality things? I have now shown two different ways you can leverage that archive. Yeah.
AUDIENCE: Do you want to take questions now?
JEFF CLUNE: Sure.
AUDIENCE: So I assume that Atari games are designed to be solvable by humans, at least ultimately.
JEFF CLUNE: Definitely.
AUDIENCE: And it seems like if you took the same platform, you could design things that humans could never solve. And so I'm wondering, do you think you could design games that your methods couldn't solve?
JEFF CLUNE: Yeah, I think that no free lunch teaches us you can always be particularly cruel to algorithms and find their weaknesses. One of the nice things about Go-Explore versus methods where they kind of learn a prior of exploration is that they are kind of agnostic. So as long as their cell representation actually was willing to recognize new things as new, they might do well in these weird games that humans would find difficult because they're not kind of leveraging prior experience.
But then you run into this challenge of, how are you going to recognize something as interestingly new? And I'm sure that you could design games where what counts as interestingly new is like a leaf being one centimeter over, and that's the key to the game. And if you conflate that away as uninteresting, then you're never going to solve that. So, yes, the short answer is, yes, I think you could construct cruel games on which even Go-Explore would fail. It would all come down to what you define as interesting. Yeah, thank you.
OK, I'll keep going. So I hope that I've planted a seed that QD algorithms are interesting and worthwhile. They produce a diverse set of high performing agents or policies or whatever. And they harness goal switching to do really high quality exploration. And so the question is kind of what's missing, and I think the answer to that question is a problem that plagues almost all RL, which is that it's stuck in a single environment.
Traditionally, in machine learning, we pick a problem like Go. We come up with an algorithm that's good at learning Go. We let it run for as much time as the CEO of our company will give us compute for, and it works really well. And then we call it a day. We move on to the next project. Like, that Go player will only become better and better at Go with more compute. It won't go learn calculus or learn how to make me an omelet or invent a new form of art.
So what I'm interested in and my colleagues have been interested in for almost 15 years now is what we are calling open-ended algorithms. And these are algorithms that truly innovate forever. And when I mean forever, when I say forever, I mean forever. So evolution has been going for 3.5 billion years and continues to surprise us, right, with things like COVID. We, in my career, have never had algorithms that were worth running for very long. When I was an early grad student, there was no algorithm worth running for more than like a day.
We've made some progress. I think we now have algorithms that are worth running for a few months. But I think we should reach for the stars and say, could we create algorithms that are interesting to run for a billion years? You'd want to come back and check on because they endlessly innovate. And so some examples of algorithms that do that are natural evolution, as I've said. Here is the tree of life, and it keeps doing wonderful things. Inside, there are jaguars and hawks, the human mind, duckbilled platypuses, all sorts of weird things.
But human culture is also one of these algorithms. In science and technology and art, we make new stuff. We come up with new technologies that allow us to ask new questions. Those new questions cause us to invent new technologies to answer them. And we have this gigantic expanding set of niches and cross-pollinations.
So nature, for example, came up with the, quote unquote, problem or opportunity of leaves high up in trees, and then the solution to that problem in the form of caterpillars and giraffes. And then there are now creatures that can live inside of giraffes or predate on giraffes, et cetera, et cetera, and then just kind of this expanding explosion of innovation. And we don't know how to put that inside of computer algorithms. But I think that is a fascinating question to dedicate one's scientific career to, which is why I'm doing that.
And so our most recent salvo to try to produce these open-ended algorithms is this work called POET that was done with these wonderful collaborators here, and I want to tell you about it now. The overall goal of the paper is summed up in the subtitle of the paper, which is "Endlessly generating increasingly complex and diverse learning environments and their solutions." So let me unpack that.
So this algorithm is also pretty simple. We're going to start by parameterizing the environment. We're going to allow a vector of numbers to describe what the environment is. And then we're going to generate new learning environments and add them to this growing, set-like archive of challenges and environments, if they're not too easy and not too hard for the current agents and they're new-- they're somehow interestingly diverse or novel. And then we're just going to optimize agents to better solve each one of these environments. And crucially, we're going to allow goal switching between them for the reasons I've been trying to explain in this talk.
So let's see how this works. So I'll start with phi 1 here. This is a vector of numbers that specifies an environment, somehow. And then I also have theta 1, which is a neural network, a deep neural network that's playing or learning to do well on this challenge here. And I optimize theta 1 to get better and better on this environment. Eventually, let's assume theta 1 is good enough on phi 1. I now copy phi 1 to phi 2. I perturb it a little bit. And now I have a different environment. So I don't have to start from scratch from a new neural network. I can actually just copy in theta 1 and then start finetuning it to this new environment.
And I could try again. I eventually make this environment here. That's too hard, so I throw it out. So I make a new environment. Now I have a choice. Do I seed from theta 1 or theta 2? We decide, let's just try both. You can get smarter about that, but you want to try multiple things. Whichever is the best thing to invade this new niche, we keep that. This doesn't have to be linear. I could end up going and making a new environment branching off of here. This one could become the parent, et cetera, et cetera, et cetera.
So imagine eventually you get this environment here, phi 6. It's super hard. Maybe we'll call this Montezuma's Revenge or whatever. And nothing is currently doing well. But I try all these policies, and it looks like theta 5 is the best. So I copy theta 5, and it's currently using its skills to play this environment. But it could get stuck on a local optimum, and no matter how much optimization I do, it's never going to do better.
So after a while, we can see some goal switching happening. Maybe theta 4 has some innovation on this environment that lets it invade and replace whatever was in this niche. And then after some time in this environment, this thing now has skills that come through and allow it to replace theta 6. So just like in natural evolution or even in science and technology, like an idea like CRISPR invented over here comes and revolutionizes this field over here, you're allowing all of that kind of stuff to happen within your algorithm.
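Putting those pieces together, here is a compact sketch of a POET-style outer loop. The environment encoding, mutation operator, inner-loop optimizer, and the "not too easy, not too hard" thresholds are all placeholder assumptions, and the novelty check on new environments is omitted for brevity.

```python
import random

def poet(init_env, init_agent, optimize, evaluate, mutate_env,
         min_score=50, max_score=300, generations=1000, reproduce_every=10):
    """POET sketch: co-evolve a population of (environment, agent) pairs.

    optimize(agent, env) : improves the agent a little on that environment (inner loop).
    evaluate(agent, env) : returns the agent's score on the environment.
    mutate_env(env)      : perturbs the environment's parameter vector.
    """
    pairs = [(init_env, init_agent)]

    for gen in range(generations):
        # 1) Optimize every agent a bit on its paired environment.
        pairs = [(env, optimize(agent, env)) for env, agent in pairs]

        if gen % reproduce_every == 0:
            # 2) Generate a child environment and keep it only if it is neither
            #    too easy nor too hard for the best current agent (in practice
            #    you would also check that it is novel, and copy the agent).
            parent_env, _ = random.choice(pairs)
            child_env = mutate_env(parent_env)
            best_agent = max((a for _, a in pairs),
                             key=lambda a: evaluate(a, child_env))
            score = evaluate(best_agent, child_env)
            if min_score <= score <= max_score:
                pairs.append((child_env, best_agent))

            # 3) Goal switching: for each environment, let whichever agent now
            #    does best there invade and replace the incumbent.
            pairs = [(env, max((a for _, a in pairs),
                               key=lambda a: evaluate(a, env)))
                     for env, _ in pairs]

    return pairs
```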
So let's see this in a fun, little cartoon world. So here is our challenge world. The vector of numbers specifies whether you have stumps, how tall they are, whether you have gaps, how wide they are, whether your terrain is rugged, et cetera, et cetera, et cetera. And what we see is every single environment you're seeing here was generated automatically by the system.
And so initially, it started off with each challenge by itself, small stumps or small gaps. Eventually, as the agents mastered those simple skills, it started making the things harder, like taller stumps or bigger gaps, whatever. And then it starts to combine them. And what you saw just now is the hardest level almost that's possible with this little legged robot, where it has big gaps and rugged hills and big stumps to jump over.
And so, to show that this is a good idea, what we can do is we can take any one of the final environments produced by the system that was also solved by the system, and we can plot the performance of POET on that environment. And then we can just take a neural network from scratch, the same exact neural net optimized with the same optimization algorithm, but started anew in that environment, which we call direct optimization. It doesn't get the curriculum or anything. And what we show is in all cases, it totally fails, right? If you have no curriculum, then you can't learn to solve these problems. They're pretty hard.
You say, OK, that's cheating. You didn't have a curriculum. And I say, OK, fine. I'll take those environments, and then I will take the parameter vector here, the parameter vector here. I'll linearly interpolate along the way. I'll create some environments, and I'll let a curriculum occur where an agent can hop from here to here to here to here to here in a straight line. It seems like it will work. It failed every single time. Every single time, there was ultimately one link in that chain that was a bridge too far, and the agent couldn't figure out how to solve it.
And that's it. You have one curriculum. You're stuck if it doesn't work because you don't have this kind of goal switching, right? It's like you put all your eggs in one basket on one technology. And if it doesn't work, you're toast because you don't have a collection of technologies being developed throughout all of science and engineering. So goal switching is quantifiably essential in these systems.
So I want to show you one of these little anecdotes because I think it really fuels intuition, and it's fun. Here is a little agent. It's walking on this ground. It has a performance of 298. This is the simplest environment we seed the algorithm with. Eventually, the system creates this environment. It has little tiny stumps, and this knee dragging, toe dragging thing doesn't work very well. It trips on the stumps. So in this environment, it learns to stand up, and it gets a better score. And it's now going over these things with ease.
But the system is always looking for goal switching opportunities, right? And so it actually takes this agent here and transfers it back to this original environment because it's now better than the original because it knows how to stand up. And so with further optimization, you get all the way to 349 on this particular flat ground. Now, interestingly enough, this is a local optimum for whatever reason, and we did the counterfactual. We just took this agent and ran it forever in this environment. And eventually, it got a little bit faster, 309. But it never learned to stand up, and it never learned to get very fast.
And I think this underscores the reason why goal switching is so essential because who would have thought that you had to go work on a harder problem to come back and get good at this simpler problem? And who would have known that it was that particular problem that would have helped in this case?
So we had a follow-up paper called Enhanced POET, but I won't get into the details, although I think they're pretty fun. But anyway, it produced all of this diversity in one run. And I think that's what I want to really highlight here, is these are now algorithms that we start them, and the result is an explosion of diversity and complexity within one run, whereas we weren't getting that typically with optimization, which is pretty fun.
This is probably one of my favorite plots in my entire scientific career because I have been searching for this since I was a first year grad student. This is a deep phylogenetic tree of environments and the agents that solve those environments within one run of the system. Typically, in an optimization system, especially an evolutionary algorithm, the system is usually-- like, you don't get a lot of diversity and it collapses.
So you basically get a trunk, right? You get little branches off, and then collapse to the best thing you found, little branches off, collapse to the best thing you find. What you don't get is deep sustained diversity, where I might have a whole family of ants and bears and cheetahs all within the same run. But now we're starting to see that kind of stuff happen in these systems, and I think that's terrifically exciting.
So practically, this stuff is also really good because it produces a huge diversity of environments. And we're starting to learn in RL that if you train on a huge diversity of environments, you can do really well, including transferring and zero-shotting new problems. So here are some people who basically took POET, ran it in simulation, and it transferred from simulation to this robot walking around in really challenging terrains. It looks like Big Dog, but unlike the Big Dog videos, there's learning going on in this system. And to me, this seems like the kind of stuff of science fiction. I would have said that a few years ago, and now here we're seeing these algorithms be able to produce it. So this was really cool work.
I also want to kind of pump your intuitions and say, I think humans need constant reminding about exponential growth. We will have exponentially cheaper and more compute coming soon. And soon, the environments won't look like the ones I've been showing you, but instead, it'll be these rich three-dimensional environments. So like, what would happen if POET was unleashed in a world here, where you could go to the market and negotiate and barter, form a team and have to deal with pickpockets, have aerial predators and boats and naval situations? I mean, the sky is the limit, I think, when we start to put these algorithms into really rich environments.
OK, so to conclude this chapter of the talk, I think POET, once again, shows the benefits of QD algorithms in terms of generating challenges, but adding to that, the ability to generate challenges while you solve them. And so if you apply the principles of QD and now the ability to generate environments, I think exciting things start to happen. OK, so I promised you at the beginning of the talk that towards the end of the talk, I would take a step back and think about the grandest ambitions of the field. So we have now arrived at that moment.
So all the way back to Turing, we've been fascinated with the concept of, could we create human level intelligence, which is now often called AGI. If we should make that, which is an important conversation that we should have but it's beyond the scope of this talk, then I think we probably all agree we have a long way to go, although it's getting shorter by the week. But the question is, how will we get there? How will we get all the way there, not just to the next paper at the next conference?
And so I think if you study what's happened in the field of machine learning basically since its inception, almost all of the work is what I would call the manual path to artificial intelligence. So at almost every conference, what people do is they say, I'm going to try to identify a building block of AI.
And if you look at papers-- I took like a random conference and looked at a bunch of the papers-- basically, you see people saying, here's an existing building block. I have a better version of it. Or here's a new building block I think we're going to need, whether that's multimodal fusion or trust regions or capsules or [INAUDIBLE] or whatever. Here's a piece of the overall puzzle, and I've done some work to show that that is valuable.
But I think that raises some questions. Like, how many more pieces are there out there? Are there hundreds? Are there thousands? And are we going to find them all manually, like we have been doing? But even if we could find them all, then we have to deal with this phase 2 problem, which is that somehow we're going to have to put all these things together into some complex Rube Goldbergian thinking machine. That's a very, very difficult task. It would require changes in our scientific organization, like a Manhattan Project or an Apollo project. And just think about how complex and nonlinear that thing will be in its behavior. And therefore debugging it will be very hard, et cetera.
So it's not impossible, but I think that we should be clear-eyed about how difficult this path is going to be to tread. Now, I want to add to that this idea: there is a clear, undeniable trend in machine learning, and that is that hand-designed pipelines give way to entirely learned pipelines as we have more compute and data, right? We've seen that with features within neural nets. We've seen that in neural net architectures. Hyperparameter and data augmentation pipelines are now being automated. And even RL algorithms themselves are being meta-learned.
And so I think this suggests an alternate path towards ambitious, powerful AI. And that's what I called AI-generating algorithms back in 2019. And the idea here is to learn as much as possible and to bootstrap from simple AI all the way through to AGI. I think that we need to do that with a really expensive outer loop. We might call that pretraining nowadays, which is searching over the space of architectures and challenges, et cetera. But the end result of that expensive process is a very sample-efficient learner.
And we have an example that this can work because natural evolution-- Darwinian evolution, which is a really unintelligent algorithm-- plus a lot of compute produced you. And you are the most sample-efficient learner we know of in the universe. And so if we want to make progress on this, I think we need to push on three pillars simultaneously. One of them is we need to meta-learn the architectures. The second is we need to meta-learn the algorithms themselves. And the third is that we need to automatically generate the training environments and the challenges.
And so handcrafting each of these would be really slow and would be limited by our intelligence and time. It's better to just go all in and learn them all and let ML and compute do the heavy lifting. So I have done work and so have many others on these two pillars. I'm not going to talk about that at all today, but today, most of the talk has been here on this third pillar of automatically generating environments with quality diversity algorithms, open-ended algorithms, and POET kind of all woven together.
I also want to point out that there's been some recent exciting work combining these pillars. OpenAI had this wonderful paper on the Rubik's Cube, which you probably saw. What was really cool there is that they meta-learned the learning algorithm itself, so it would learn to explore and then exploit, and they also automatically generated different versions of the challenge for the system to learn on, with a curriculum. And it worked really well.
And then the interestingly named Open-Ended Learning Team at DeepMind-- we were really happy to see there was a team with that name-- did this super futuristic work called XLand, where they generated a huge procedurally generated space of possible, very difficult problems, and then trained one neural net to basically meta-learn to explore in these environments and figure out how to solve them. They put a huge amount of compute into that, but in the end they had an agent that could solve really hard problems.
And then just a couple of months ago there was this paper, which I think is probably the most important ML paper in recent years that is not getting talked about enough: ADA, from DeepMind, the successor to the paper I just told you about, this time from a team called the Adaptive Agents team. And basically, they did the same thing with a couple of tweaks, and they put it in a transformer.
But again, it's a huge procedurally generated set of challenges, just like pillar 3 of AI-GAs: automatically generated challenges with a curriculum. And then one agent learns to explore in these environments, learn from the data it gathers, and then go perform very well. And here is the blockbuster: it is human-level in its sample efficiency at solving these new problems.
So just like the AI-GA formula, where you spend a lot of compute during pre-training but eventually get a sample-efficient agent, they actually got an agent that is as good as a human at exploring and then exploiting to solve these very hard problems. That is a blockbuster result. And in the paper they even talk about how this is evidence of the power of combining pillars 2 and 3 of the AI-GA paradigm.
OK, I want to quickly point out that even if you're not sold that AI-GAs are the fastest path to AGI, I still think it's a worthwhile path, for a lot of reasons. One, I think it's really fascinating to study scientifically how simple processes can bootstrap themselves up to produce complexity. Like, what are the necessary, sufficient, and catalyzing factors that allow that to happen?
Also, I just think it's the most human thing to want to know how we got here, where we came from, and what produced us. So by studying open-ended algorithms and AI-generating algorithms, we're studying our own origin story and coming to understand our own existence, which I think is interesting. Also, by understanding how these processes work, how likely they are to work, and what can accelerate them, we could kind of start to think about how likely it is for this process to have happened on other planets in the universe, like how likely alien cultures are. Maybe we could fill in a couple of terms in the Drake equation.
And if we can get this to work, then-- it's so crazy. It's basically like inventing alien travel. We can go and visit different cultures. They're synthetic, but they're wildly different, not made in our image, and we can study what they're like. What kind of art do they make? What are their mathematics like? We can start to understand the general space of intelligence that's out there, because we're sampling different points in that space, as opposed to using the single point of human intelligence, which is all we've had so far.
So I actually put this idea out in a talk at ICLR, with Josh and me on stage in the plenary debate, in-- I forget what year that was. Do you remember?
JOSH: 2019.
JEFF CLUNE: There you go, 2019, the same year I put the paper online. And Josh asked a very, very smart question of me on stage. He said, OK, this sounds really interesting, but how are you going to pull this off without a planet-sized computer? Fair question. So I've thought about that. First of all, I had to go to DALL-E and ask it for an image of a planet-sized computer. It gave me these two, both of which were so good that I had to put both of them up. So the question Josh asked me then, which I will try to answer now, is: how are we going to do it if we don't have one of these?
JOSH: [INAUDIBLE]
JEFF CLUNE: [LAUGHS] Fair enough. Also, it turns out I was wearing the same shirt, so I guess this is my lucky shirt, I'm embarrassed to admit. [LAUGHS] Josh, I see you have more than one fancy shirt. All right, anyway, my hypothesis for how to answer this question is: I do think we can do it, and I think there are many different answers. For example, we can get better abstractions. In computer science, that's usually how we make progress, so there's lots of work to be done there.
We're also not trying to produce, to my sadness, duck-billed platypuses and birds of paradise. We can focus on intelligence, which shaves off a lot of work. But here is the one I really want to focus on in the last mini chapter of the talk. And that is, to borrow from Newton, I think we can stand on the shoulders of giant human data sets. That has been the lesson of the modern AI revolution, most recently with GPT-4, et cetera.
And so we had a paper on this from my team at OpenAI, and I want to thank all of the awesome collaborators on that paper. The paper is called Video Pre-Training, or VPT, and I think it is my best current answer to how I would practically address Josh's question. So hard exploration, as I've talked about, is a real tax on learning in general; RL can't solve problems that require hard exploration.
There are a lot of hard exploration problems out there, and it's an extremely large tax if you want to create open-ended algorithms, because now I don't just want to solve Montezuma's Revenge; I might want to solve Montezuma's and then Pitfall and then 100 different problems in this domain, and then also learn how to do calculus, et cetera. And each of those might involve hard exploration. So if I always have to solve hard exploration problems from scratch, then maybe I can't even get an open-ended system off to the races.
But one thing we can do is cheat, and just steal-- we can level ourselves up to humanity by using the entire body of human knowledge. Just like GPT didn't learn on its own to become really smart-- it trained on all the human data and basically got almost right up to human-level intelligence-- we can also start our agents off with human data and human intelligence. It turns out the internet is full of videos and tutorials if you want to train an embodied agent, including on Minecraft. So that's the domain we chose as a proof of concept for this idea.
So we want to basically have our agent Clockwork Orange itself to YouTube for like eight years and watch all the Minecraft data out there and then just suddenly know how to play Minecraft. Then we could put it in an open-ended world and have it create society and civilization and figure out all sorts of new awesome things that we want the open-ended algorithm to go invent. And so the challenge is that videos on the internet are unlabeled.
Unlike with text or music or images, where the next word, the next note, or the next pixel comes for free, here we need the actions taken at each time step of the video, and those are not provided in the video itself. You don't see the thumb pressing each button on the PlayStation controller, or whatever, for a video game.
So we came up with a pretty simple solution, which is to train an inverse dynamics model. It has quite an easy job: it gets to see the past and the future. That's the idea here. Its only job is to tell us what action happened in the middle. So if you saw this frame and then you saw this one, you can guess that the player must have hit the jump button, OK? And if I can train a model that looks at the past and the future and can guess the action in the middle, now I can go label all of YouTube. For any video that comes through, I can provide these pseudo-labels: up, down, right, right, right.
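As a minimal sketch of that inverse-dynamics idea in PyTorch: everything here (layer sizes, the action vocabulary, the eight-frame context window) is an illustrative assumption, not the actual VPT code. The model stacks frames from before and after a timestep, is trained with plain cross-entropy to classify the action in the middle, and can then pseudo-label unlabeled web video.

# Minimal sketch of an inverse dynamics model (IDM): given frames from
# before AND after a timestep, predict the action taken at that timestep.
# All names and sizes here are illustrative assumptions, not the VPT code.
import torch
import torch.nn as nn

class InverseDynamicsModel(nn.Module):
    def __init__(self, n_context_frames=8, n_actions=128):
        super().__init__()
        # The IDM sees a window of past *and* future frames stacked on channels.
        self.encoder = nn.Sequential(
            nn.Conv2d(3 * n_context_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.LazyLinear(n_actions)  # logits over discrete actions

    def forward(self, frames):  # frames: (batch, 3 * n_context_frames, H, W)
        return self.head(self.encoder(frames))

def train_step(idm, frames, true_actions, optimizer):
    # Supervised classification of the "missing" middle action on a small
    # labeled dataset; afterwards the IDM can pseudo-label any web video:
    #   predicted_action = idm(window_of_frames).argmax(dim=-1)
    logits = idm(frames)
    loss = nn.functional.cross_entropy(logits, true_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

The key design point is the non-causal context: because the IDM sees the future as well as the past, its job is much easier than the policy's, so a relatively small labeled dataset suffices to train it.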
Now I can take as many hours of video off of YouTube as I want, and I have the labels for them. Then I can do basically the same thing as GPT: GPT says, OK, if I see these 50 words, here's the 51st word, and then the 52nd. We do the same thing: given this much video, what's the next action to take, and then the next action, et cetera. So we trained a 500-million-parameter neural net to go from history to next action. And then you can run that autoregressively in an environment: just roll it out and see if it can play Minecraft.
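Here is a minimal sketch of that behavioral-cloning step, again with assumed names and sizes rather than the real VPT architecture: a causal transformer sees only the past, predicts the pseudo-labeled next action with cross-entropy (the GPT recipe, with actions in place of words), and is then rolled out autoregressively. The embed helper and the gym-style env interface are assumptions for illustration.

# Minimal sketch of the behavioral-cloning policy: unlike the IDM above,
# it only sees the PAST, and is trained to predict the pseudo-labeled next
# action. Model sizes and the environment interface are assumptions.
import torch
import torch.nn as nn

class BCPolicy(nn.Module):
    def __init__(self, frame_dim=1024, n_actions=128, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=frame_dim, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(frame_dim, n_actions)

    def forward(self, frame_embeddings):  # (batch, time, frame_dim)
        T = frame_embeddings.shape[1]
        causal_mask = nn.Transformer.generate_square_subsequent_mask(T)
        h = self.trunk(frame_embeddings, mask=causal_mask)
        return self.head(h)  # action logits after each frame

def bc_loss(policy, frame_embeddings, pseudo_actions):
    # Next-action prediction on IDM-labeled video, exactly the GPT objective.
    logits = policy(frame_embeddings)  # (B, T, n_actions)
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.shape[-1]), pseudo_actions.reshape(-1))

def rollout(policy, env, embed, max_steps=1000):
    # Autoregressive rollout: feed the growing history back in, sample an
    # action, step the (assumed gym-style) environment, repeat.
    obs, history = env.reset(), []
    for _ in range(max_steps):
        history.append(embed(obs))  # embed: raw observation -> (frame_dim,) tensor
        logits = policy(torch.stack(history)[None])[0, -1]
        action = torch.distributions.Categorical(logits=logits).sample()
        obs, reward, done, info = env.step(action.item())
        if done:
            break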
So it turns out that it can. Here is a classic OpenAI scaling plot, with the data on a log scale. What you find is that for a really long time you only get very simple things in the game, like this crafting table here. But as you really crank up the data, you start to see more and more difficult skills coming out of the network zero-shot. For example, making stone tools in this game takes over two minutes, about 140 seconds, and almost 3,000 consecutive actions to pull off. It's a really, really long sequence of things. But our model just does that zero-shot after watching all the data we got off of YouTube.
But what's even cooler is that it knows a lot about the world, and we can have it go learn new tasks. If it were an open-ended system, it could continuously generate new challenges to expand its set of skills, harnessing what we call the behavioral priors that come out of the system. So here is our RL setup. There's a tech tree here, and to get these diamond tools, you have to do all of these things in sequence. We just reward the agent for doing any one of these things.
And here's the TL;DR. If you try to learn this from scratch, you learn basically nothing. If you do the pre-training with VPT, video pre-training, then you can learn all the way to this really hard thing, getting a diamond tool, which had been a longstanding challenge for people working on this domain. And it takes over 20 minutes and 24,000 actions to pull off.
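A minimal sketch of that reward scheme, assuming hypothetical item names and an info["inventory"] convention rather than the actual Minecraft interface: the wrapper pays a sparse bonus the first time each tech-tree milestone is reached in an episode, which is the "reward for doing any one of these things" idea described above.

# Sketch of a milestone-style reward: the agent is rewarded the first time
# it obtains each item along the tech tree. Item names and the
# info["inventory"] convention are assumptions, not the real interface.
MILESTONES = ["log", "planks", "crafting_table", "wooden_pickaxe",
              "cobblestone", "stone_pickaxe", "iron_ore", "furnace",
              "iron_ingot", "iron_pickaxe", "diamond", "diamond_pickaxe"]

class TechTreeReward:
    """Wraps an environment and pays out each milestone once per episode."""
    def __init__(self, env):
        self.env = env
        self.claimed = set()

    def reset(self):
        self.claimed.clear()
        return self.env.reset()

    def step(self, action):
        obs, _, done, info = self.env.step(action)
        reward = 0.0
        for item in MILESTONES:
            if item not in self.claimed and info.get("inventory", {}).get(item, 0) > 0:
                self.claimed.add(item)
                reward += 1.0  # sparse bonus for each new stage reached
        return obs, reward, done, info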
So think of all the things you can do on a computer in 20 minutes. By the way, this agent is playing with just the mouse and keyboard, so in principle it could learn to do anything humans do on a computer. And think of all you could accomplish in 20 minutes. There are now a lot of good priors that allow these systems to go learn a huge variety of tasks, almost anything we do on a computer, and you could plug that into an open-ended algorithm. So the conclusion here is that pre-training massively speeds up RL, and that lets us speed up open-ended learning.
OK, I'm now at the conclusion, with a few concluding slides for the talk. Quality diversity algorithms explore really well by creating and collecting this kind of growing archive of stepping stones, which I think is really important. They harness the principle of goal switching, and they automatically invent effective, counterintuitive curricula. And we can use them for open-ended algorithms, which is this really ambitious quest to produce algorithms that will literally learn and innovate forever, like natural evolution and human culture.
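For concreteness, here is a minimal MAP-Elites-style sketch of that archive-of-stepping-stones idea (a generic illustration, not the specific systems from the talk): candidates are binned by a behavior descriptor, each bin keeps only its best solution, and goal switching falls out for free because offspring generated while improving one niche can end up seeding another.

# Minimal MAP-Elites-style sketch of a quality diversity archive. The
# evaluate/mutate/random_solution functions are placeholders for a real
# domain; evaluate returns (fitness, hashable behavior bin).
import random

def map_elites(evaluate, mutate, random_solution, n_iterations=10_000):
    archive = {}  # behavior bin -> (fitness, solution)
    for _ in range(n_iterations):
        if archive and random.random() < 0.9:
            # Pick a random elite from the archive as a stepping stone.
            _, parent = random.choice(list(archive.values()))
            candidate = mutate(parent)
        else:
            candidate = random_solution()
        fitness, behavior_bin = evaluate(candidate)
        incumbent = archive.get(behavior_bin)
        if incumbent is None or fitness > incumbent[0]:
            archive[behavior_bin] = (fitness, candidate)  # new elite for this niche
    return archive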
And the question I'd ask all of you is: what's missing? What are we missing from the ingredient list that would allow these algorithms to truly innovate forever? Because we don't know. I think we need quality diversity principles and environment generation and maybe some pre-training on YouTube if you want, but what else? Because nobody has ever created an open-ended algorithm, and that is a fascinating thing to try to do. I actually think there's a Turing Award or a Nobel Prize, or both, out there for whoever can solve this problem. So I offer that to you as an intellectual question to ponder. Josh, you can figure it out.
All right, and then I also introduced these ideas of AI-generating algorithms. I think they're likely the fastest path to AI, and they're worthwhile even if they're not. And, as I showed with VPT as a partial answer to Josh's question, pre-training can help speed them up and maybe make them practical to work on now, within the current span of science. So with that, I want to thank my collaborators and funding sources, and especially Cedric and Josh for the invitation and all the hard work on the organization. Thank you, Cedric, as well as my main collaborators, and all of you for listening. Thank you.
[APPLAUSE]