Modular Learning and Reasoning on ARC
Date Posted:
February 11, 2021
Date Recorded:
February 9, 2021
CBMM Speaker(s):
Andrzej Banburski ,
Simon Alford All Captioned Videos CBMM Research
Description:
Abstract: Current machine learning algorithms are highly specialized to whatever it is they are meant to do — e.g. playing chess, picking up objects, or object recognition. How can we extend this to a system that could solve a wide range of problems? We argue that this can be achieved by a modular system — one that can adapt to solving different problems by changing only the modules chosen and the order in which those modules are applied to the problem. The recently introduced ARC (Abstraction and Reasoning Corpus) dataset serves as an excellent test of abstract reasoning. Suited to the modular approach, the tasks depend on a set of human Core Knowledge inbuilt priors. We implement these priors as the modules of a reasoning system and combine them using neural-guided program synthesis. We then discuss our ongoing efforts extending execution-guided program synthesis to a bidirectional search algorithm via function inverse semantics
HECTOR: I'd like to welcome everybody to the first CBMM research meeting of the spring semester. Today, we'll hear from Andy Banburski, who is a post doc in Tommy Poggio's lab. And he will be joined by Simon Alford, who is a master's student working closely with Andy.
So today they're going to tell you about some of the most recent work, and interesting work, that has been going on in module 1, and within the Poggio lab. And so I think I'll just turn it over to Andy now.
ANDRZEJ BANBURSKI: Thank you, Hector, for introducing us. Now this is somewhere that we've been working in collaboration with quite a few people. I'll just-- we have them listed here as-- [INAUDIBLE], and Tommy Poggio. And we might have more people working on this. We're partially funded by DARPA. I think this is all we have to say from the credits at the beginning, and we'll just jump right in.
The talk is modular learning and reasoning on ARC. Basically, we're going to start with some motivation for why we care about this and then talk about what ARC is, which is the abstraction and the reasoning corpus. Then we'll talk a little bit about how one could go about solving this with program synthesis, and talk about some work we've been expanding on on the work that has been done already in CBMM in [INAUDIBLE] lab. And then we'll briefly touch upon some preliminary work that we're currently doing on bidirectional program synthesis.
I don't think I have to say much to people here. Deep learning has kind of been successful in the last decade. But it's kind of interesting to look at what actually deep learning is good for, and maybe what is not so good for. So we know it's a pretty decent at object classification, something like ImageNet, it does pretty well on it.
It's also pretty good at tasks like segmentation, semantic segmentation, giving us hope that we could have things like self-driving cars, and other automated technologies. And then it's been having a really great success at tasks that we would say really are requiring some kind of intelligence, things like beating the best humans at Go, or at games like StarCraft. And these have been, for a long time, kind of taken as somehow nice benchmarks for defining what we consider to be actual real intelligence.
Now, visual intelligence is something that we would say is harder to somehow quantify, and we don't really have a grasp on, either people are better at visual intelligence than others. But it's been somehow this big intuition that the really really smart people are capable of-- you have to really spend a lot of time and training to get good at something like chess, or go, or StarCraft. And somehow beating humans in battle would be a good benchmark for saying, hey, this thing is intelligent.
And more recently, GPT3 and GPT2, in the last year or so have been doing really impressive stuff both at the level of language, and even applying this to visual tasks, or either visual completions, or using text prompts to generate images. And this is getting to something that we will say, hey, this is maybe doing some kind of more higher level reasoning than just the simple side of things we've seen before.
But is this exactly how we do all kinds of reasoning? Can we get to general intelligence this way? Can we get to something that will make us obsolete? Could we have an automated researcher, just by scaling up deep learning, or is there something more to it?
So we know that deep learning actually has different [INAUDIBLE] modes. There's very simple ones, which is if you train these things on a lot of data, they don't necessarily extrapolate too far beyond that. This is an example of something that was trained for image captioning. You know, it's never seen a baby with a toothbrush before. So one of the guesses is, a boy holding a baseball bat. This is something that would have been in a data set.
Some other stuff that has been done in the CBMM. It's been shown that it is very difficult to train neural networks, especially [INAUDIBLE] neural networks, to solve problems like figuring out whether you are somewhere inside or outside of a shape. This is connected to the bigger issue, which is it is difficult to learn solid rules with the gradient descent on deep networks or recurrent networks.
So in this sense, things like teaching a neural network to add numbers is actually a big challenge. In fact, most of the time we cannot successfully train neural networks to perform addition, exactly. It does it within the range of data that has been exposed to, but then it fails to extrapolate to numbers that are bigger than it has seen before.
Which tells us that it has learned something, it has associated somehow these training examples to something that does reasonably well, but then it really has failed to somehow get to the heart of the matter of what it is that-- what addition is, or what counting, checking the parity of a bit string is. It has just learned to explore a little bit outside of its range of things that it has to.
So we could even go further and say that in all of these successful cases, what we actually have are algorithms that are very skilled, but at one specific task. And you might start asking yourself, is this what we would want the intelligence to be? Because none of these systems really generalize beyond the scope of what they're trained at, as I mentioned.
An example, this is the fact that AlphaGo, it can beat the best human at Go, but it cannot play go on board sizes that it was not trained on. Whereas, if you give an un-standard size of Go board to a professional, or even an amateur, they're going to pretty well extrapolate the technique that they've used successfully in normal board sizes. So that is kind of something that you immediately see that these things are maybe in some ways skilled, but if they have any intelligence, that intelligence is very fragile.
And what we're going to be quoting out throughout this talk is this paper by Francois Chollet of the [INAUDIBLE] on the measure of intelligence. And in this paper-- if you haven't read it, I highly recommend you to read it, because it's one of my favorite papers in the last few years. And it's basically a big criticism of the whole field, in a way. And specifically in how we treat intelligence and how we define it.
And the [INAUDIBLE] he proposes that intelligence-- this is nothing, I think, that would be dramatic in any way-- but we should be measuring intelligence instead of being really skilled at some sort of tasks, intelligence should be considered as efficiency in acquiring new skills. And then in the paper, he kind of proposes a mathematical framework in which this can be well defined, and also proposes how to test this.
But maybe to be a bit specific in what we actually want to be talking about technically is we talked about generalization. And then we can actually talk about the different levels of generalization, of what kind of generalization in the AI system has. And actually in a way, if you look [INAUDIBLE], you can think about the fact that current AI systems are kind of on this lowest rung of this ladder. It's kind of local generalization, in the sense of being able to generalize to new images within the same distribution.
For example, being able to do relatively well on the test set of ImageNet. That is somehow local generalization. But then you want to have systems that are capable of generalizing to maybe a broader range of variation that it hasn't encountered before. But in the most extreme sense, you would want a learning system to have something called developer aware generalization.
And that is to handle situations that neither the system, the AI system, not the developer who has been creating the system has encountered before. And in this sense you would require a system that would be capable of discovering the rules and discovering the relevant features and concepts within the task or a set of tasks that it's looking at, without it being somehow primed for them. Now, this sounds maybe impossible, and in some sense it is.
There is no free lunch theory, I might tell you but there is nothing that you could have as a general algorithm like this. But maybe within the set of tasks that we care about, which are the ones that we encounter in the real world, there is no restrictions like that, because we are capable of these things. In this sense, we can think about the kind of deep learning systems as these things that learn by taking massive amounts of data, and they basically do this kind of associative reasoning.
But if we're defining intelligence as efficiency and skill acquisition, we would want to not have massive amounts of data. Instead, we want to go for the less is better road. So the question here is, how can we efficiently learn from small amounts of data, so efficient learning. And how can we efficiently learn the rules? And how can we efficiently learn things that we can then apply exactly, in some sense?
And the obvious thing here is you would want to have an architecture that is capable of reasoning about some abstract versions of entities that are relevant for objects or just general abstractions. And there is this, within the field of cognitive science, there is these dual reasoning system theories in which there is this concept of having a fast intuitive reasoning that is based on experience. And which allows you to respond and react rather quickly.
And I would claim that deep nets are exactly this kind of system that just associates and basically has slightly more fancy versions of lookup tables, and which it is capable of extrapolating within its experience. But then there is this slower system that is responsible for our language abilities, for our logic and mathematics abilities, and for our general reasoning that is a more linear and step by step based process. And in a way, I think everyone will agree, that deep learning doesn't exactly do this.
And the question is then, how do we get to something that would be capable of doing this kind of slower but more abstract based reasoning? And this is what this talk will be about. Now, when we talk about, we want to have a system that is doing abstract reasoning. The question is, the first question you should have is, what is a task or a set of tasks that you are going to be doing this on?
Is this going to be some kind of visual, logical puzzles? Is this going to be like IQ tests? Are we going to be using the data set of all SAT math problems? What is it exactly that we want to do it here? And which way can we control, somehow, and specifically talk about generalization performance?
You know, a lot of these things, specifically things that are in the SAT domain or even the [INAUDIBLE] puzzles depend in some very specific way on our experiences, both culturally and kind of as you grow up, and being exposed to different concepts. And you would want to be very clear about what it is that comes in as a prior knowledge that is behind, that is somehow at the basis of then doing actual intelligence work on top of it. And this is where the abstraction and the reasoning corpus comes in.
This is, again, from "On The Measure of Intelligence" paper. And to paraphrase, its a general artificial intelligence benchmark, in a way, or a program synthesis benchmark, or psychometric intelligence test. And it's something that is both targeted at humans and intelligence systems, and Chollet propose it as a sort of set of IQ-like tasks. And it's meant to measure exactly the skill learning efficiency, rather than the skill level at the end of your training.
Now, I'll show you all these tasks look like, but he clearly identified a set of core knowledge priors that are responsible for doing well on this data set. And we already have a question. Doesn't the general intelligence also need to be able to decide which job be delegated to the [INAUDIBLE]?
Yes, I would say so. And then maybe you're kind of reducing this problem of having a system that is capable of doing this kind of abstract reasoning to having some sort of central, maybe specific, modules that are neural networks, and then having some sort of central controller that will be figuring out which ones to use. And in that sense, that would be a similar kind of system that we're going to be talking about. I don't know if that answers your question, Kent.
Well now, going back to ARC, to solve these tasks you need a number of, in both priors, some of these are just the notion of having an object, the fact that objects persist, they can interact, and the ability to count, the basic concept of numbers, basic concepts of geometry and topology, and a few others. And this is an example of [INAUDIBLE].
Here we have on the left three different samples, three different examples for a given task from which you have to figure out a rule that takes the input to output. And you have three examples like this. And then you're given the test grid, which is here, and you're meant to figure it out what's the test output of this should be.
So this is an example of a single ARC task. In a way, you can kind of readily see that you have to look at the fact that you have this grid that is kind of split in two. And you somehow have to mirror inverse one on top of the other, and kind of add them up. This is an example of how to solve this [INAUDIBLE].
But there is a wide variety of tasks, actually, within ARC. In fact, there's 400 different training tasks. Each of them has about three to four training examples. Again, each task is completely different, and on each task you have to completely learn what the rule for it is. And you have to do so from only three or four examples.
And then you have one or two test outputs, and for each of these, you have three attempts at solving it. If you don't solve it three attempts, then you failed. So you basically can have top three predictions. Then apart from the 400 training tasks, there is 400 evaluation tasks, and then a hidden test set that only Francois Chollet has access to, and it has 200 tasks. Now, by all accounts, the hidden test set is very different from the training set. And in fact, even the evaluation set is quite different from the training set.
Here's a few of these priors that you might need for solving this. This is a task where you need to be able to recognize that there are objects, and be able to count these objects, and find the most common objects, for example. So this is a set of the three training examples and a test set. Here's another one that requires you to understand the concepts of symmetry and completion of the symmetric patterns.
Another one is some sort of goal directedness, where you're having to connect two objects that you somehow have to discover in the image by a line that's made out of segments. And this-- if you start thinking, how would you write a program to do something like this? Sure, you can write it down. But can you do this in a way that is not hard coded, that could be discovered? This is an interesting challenge.
And this need for generalization that I mentioned, here's an example from the evaluation set of a task that needs a completely new set of rules that you have to learn that is not present in the training set. This requires some kind of non-trivial stacking that you can kind of see visually. Take all of these color picture and smoosh them into a 3 by 3 square. But how do you define this and how do you write down this set of rules nicely? It's a bit complicated. Again, you need to learn this only from three examples.
So back in March last year, there was a [INAUDIBLE] Kaggle competition in which this thing was a challenged. And out of that there was one solution that was like winning. And that achieved 20% on the evaluation set. And what it did, effectively, is it created-- the guy who made this, he goes by icecuber, he created a domain specific language which apply some number of transformations to the inputs. It starts this list of images, moves them to origin, it colors some pixels, it resizes them to fit, an output size that you kind of guess, and it doesn't feel at all like anything that is a [INAUDIBLE] general intelligence.
So how do we design something that would be a satisfying human life solution to that? Well--
AUDIENCE: [INAUDIBLE] I ask a good question?
AUDIENCE: Yes, a conceptual question. And if you prefer, we can just kick it down to later. But I'm wondering about the principle here when we look at ARC as a kind of test of general intelligence. Because as far as I understand, the suite of tests is defined-- intelligence here is defined, basically, extensionally rather than intentionally, right. So it's saying, here's a set of tasks, and if you can do well in these tasks, or learn efficiently on these tasks, then we can declare that you're an intelligent agent.
Rather than saying, here's a definition of intelligence, do you satisfy that definition? That would be the kind of intentional definition of intelligence. And I guess the problem with extensional definitions is that it seems hard to kind of reassure other people that your extensional def-- your selection of tasks is the right one. Like how do you know that, in terms of generalization, how do you know that there's not some other task that doesn't quite fit with the set of tasks that were chosen here, and yet another, an observer, would say, that's also a reasonable test of intelligence. See what I'm saying?
ANDRZEJ BANBURSKI: I see what you're saying, Sam. And I will agree with you wholeheartedly that this is not all intelligence is. This is a nice small benchmark for intelligence. This data set, by the way, is growing over time, so that more and more tasks like these within this kind of grid [INAUDIBLE] something can be added.
This was trying to remove the need for things like math abilities and so on. But I agree with you that I don't think doing well on ARC is going to suddenly give us an AGI.
AUDIENCE: I guess the-- I guess my concern here is kind of more general than this particular set of tasks, which is that everyone wants to find something like artificial general intelligence. But the way that we go about doing it is by defining benchmarks. And so I totally get that this is a benchmark. And it seems like a good benchmark. I'm not criticizing it as a benchmark.
AUDIENCE: Hey, Sam. Can I just jump in for a second. Because, I mean, I agree this is a good question. But I think to some extent-- I think Andy already said what I am about to say, but I'll just add this-- that part of why I find ARC interesting and why we've also been working on it is that Francois does have a positive, a more positive thesis here of human-like intelligence, which is very consistent with the views that some of us have, including you and me that we wrote about in our BBS paper, for example, that basically these tests are designed to start off with core knowledge representations, the kind of things that seem like they're built into human brains.
And then to have some general-- few-shot learning and compositional abilities, compositionality in expressiveness representation. And the thesis is that you can put those things together to learn how to solve these tasks. So I just think it's true that it's just an extension of definition in terms of a benchmark, but it's an especially interesting one because of how it reflects I think a very cognitively informed thesis on what might be the power of human intelligence.
AUDIENCE: I mean, I totally agree with that. I guess my concern is, are we going to always be in the regime of I know it when I see it, or are we going to have an actual definition where we can state that a system actually satisfies that definition?
AUDIENCE: I don't know if there's going to be a definition or more gradations of dimensions, but what's nice about it is that there is, I think, there is a scientific thesis behind why this benchmark was set up in the way that it was. Anyway, let's not-- I don't think we should get too distracted on this. But it's a good question. And I think there's, to some extent, that's a part of what's going on here.
ANDRZEJ BANBURSKI: Yes, this is an excellent question and I don't think we have an answer for it. I agree that this is a nice attempt at making more abstract reasoning benchmark. But it's just that. It's just a benchmark. And we're going to have to have many more of them. I don't think you can define a consistent set that would be a good test of general intelligence.
In a way, the "Measure of Intelligence Paper", it starts by bashing on the Turing test. And I don't think that it's-- in a way, you would want to have-- maybe what you do want to have is a benchmark that is not gameable in some way. And in a way, I suspect parts of ARC are maybe not as gameable as some other benchmarks, and that's what's positive about it.
So as I was saying, how do we design a solution that we would be happy about? And, again, there is another question from Kent, which is, we're trying to benchmark something that humans are good at, but machines are not? Basically, yes. We would like to somehow get to what we call actual intelligence for us, with the fact that we are capable of generalizing far.
It seems the difficulty can be turned up and down. Has this been explored? By Kent. Yes, in the sense-- so you could look at just subsets of ARC. And, for example, one subset of ARC that you can do reasonably well on-- and this is something that we just chatted about I think Tuesday morning on slack, or something like this.
If you restrict yourself to grid size that is 10 by 10-- and Evan is mentioning this-- if you restrict to the 10 by 10 great sizes, you can do reasonably well with a differential and neural computer and transformers. So you can do like 80% of the task solved within restricting to one specific grid site. And then you don't need anything to be more fancy than-- arguably the most advanced [INAUDIBLE] networks we can use. But you would do well with just the deep learning.
And this is where I'm going to end and let Simon take over. Where he'll start talking at some of our work on expanding some of the results from Josh's lab, and then some of other stuff that we were also doing. And Simon, take over.
SIMON ALFORD: Let me just get this sharing going again, and then we'll get going. [INAUDIBLE] this chat, Google, too. So yeah. So Andy introduced the ARC dataset better. I'm just going to go down into the trenches, and forget a little bit about the-- maybe not forget, but focus a little more on just how can we design something that performs well in ARC, and what sort of things are needed, and where that takes us in terms of engineering design.
So, first, I'll start with just a basic program synthesis approach to ARC. Francois Chollet in his paper mentioned program synthesis as a field that's especially applicable to the ARC benchmark. And then we'll go from there to the content of what was in our workshop paper from the endurance learning meets combinatorial algorithms workshop, and show how DreamCoder can be applied to ARC. And then we'll do some-- if we have time, we'll do some preliminary extensions. Which is just work in progress and just different thoughts we have about it.
So where do we start? So you look at some of these tests. I guess before I start, I should say I'm going to be presenting a lot of these task grid pairs. So I'll explain how this works. Say for task 173 here, the top three grids here are the input grids, and the bottom three are the output grids.
So when I show a picture like this, the goal is to convert. Each column is an input, output pair, and the goal is to produce the output. I'm also only showing the training examples. In reality, there's the test examples for which you have to produce the solution.
So for your task 173 here on the left, the solution might be something like, we want to get the object that has vertical symmetry. Whereas this test on the right, the solution might be to get the largest object in the grid. And so a basic approach to solving ARC tasks is to say something like, all right, we're going to solve these tasks, again, in programs synthesis. We're going to solve tasks by writing programs to convert the input into the output.
And programs are useful because they capture a lot of things that we want that Josh mentioned, like compositionality, flexibility to extend and generalize, and also rule-based reasoning. So that's the approach we're going to start with. Program synthesis is a very fascinating, interesting field with lots of different techniques.
And one of the most common ways of approaching a benchmark like this is to develop-- you start with some domain specific language, a set of functions that are widely applicable. And you can combine in different ways to produce solutions to different tasks. And then given this set of functions and given a task, then the problem becomes to do some sort of combinatorial search to produce a function that uses your functions from your domain specific language, and produces a solution to the task.
So there's a bit of challenges here. The main challenge is that this combinatorial search is a very challenging problem. There's an exponential space of possible programs. And it's like finding a needle in a haystack. Especially, it's very easy to see the program here, it's a somewhat-- it may seem short, but from a search perspective it's a very long program.
So the simplest way of doing this is to do something like a brute force search, where you just combine all possible programs. In order to make this tractable, it's appealing to make your programs short. So to design a domain specific language such that different solutions to task can be expressed succinctly.
So instead of writing the program like this, we might introduce a new function like, get the largest object in a grid. And so now we can shorten our program. And so it's very tempting to just design a very large, very powerful, domain specific language that can cover all possible ARC tasks. You might look through different examples and get ideas on what types of things are tested. And then you can try and see how it performs.
Will this work? I would argue no, and I think Francois would agree. The key thing to remember here is that the evaluation test sets feature tasks which are very different from the training set. So if we're using the training set to develop this language and functions, they may not generalize to the test set. The programs there will be just as long as they were originally.
And so if we really want to do well, you need to sort of learn the concepts from the examples, as we were saying. And the fundamental challenge, just to outline it again, is that there's this exponential search problem. And without a carefully crafted DSL, many tasks will have solutions that are too long to be discoverable.
At the same time, we need to perform well on the hidden test set, with unknown types of tasks. So how can we do this? One solution is to say, OK, we want to learn these new concepts from the examples, which is a great idea. And actually, I think there's a lot of evidence that the ARC benchmark was designed with this in mind.
There are certain tasks which are very simple, like this task where to get the solution you just crop the grid. And then the tasks sort of build in difficulty. So you might introduce another task, which introduces the concept of taking a grid and combining it horizontally with itself.
And then after you've learned those two concepts, you might combine the two. So first you crop the input, and then you combine horizontally. And this is a really small example, but throughout ARC there's this idea of building to ever more complex combinations of things. And this supports the ability to start simple, learn those concepts, and then reuse them for the harder tests.
Also, I realize there's stuff going on in the chat. If anyone has a question, they can just speak up. I'll just let the chat discuss things on the side.
ANDRZEJ BANBURSKI: Yeah, I think the chat is also having a discussion in parallel to this talk at the same time.
SIMON ALFORD: OK. So let's see. Is there anything else? So here's just one more example of this, where a crop is used in tandem with another concept. And in addition, you might see different concepts represented in different ways on different tests. So here there's the concept of taking an object and deflating it. So where there is previously blocks of pixels, you reduce it to a single pixel.
And so it doesn't simply suffice to say, whenever you solve a task, learn that as a function and apply it. You need some more sophisticated concepts learning. So this is going to introduce us to DreamCoder. This is a really exciting work that came out last year from Tenenbaum's lab from Kevin Ellis, who is now incoming at Cornell.
And the idea behind this work is exactly the setting we've been talking about. You have a set of tasks that you're trying to solve that range from easier to more complex tasks. And you want to learn a library of concepts that enable you to solve progressively more challenging tasks.
So one way to think about DreamCoder, I won't go into the full details. I highly recommend checking out the paper presentation on it if you're interested in it. But the idea is that you have some search algorithm that solves tasks. And then in tandem with the search algorithm, for whatever tasks you solve, you run this abstraction algorithm which creates new functions that are based off concepts reused between the test results so far.
These functions are added to the domain specific language and can be used to solve progressively harder tasks. So the example here, the figure shows the problem of sorting a list and how by, starting with easier tasks like filtering a list, finding the maximum element of a list, and then to the nth largest element of the list, you can express the solution to sort as a very short combination of these learned concepts. Whereas the original program shown here at the bottom was a very long program. So this is really exciting.
It's a very intriguing work. And it seems almost perfectly appropriate for ARC. so one of the things we just set off trying to do is just to see how well does it work applying DreamCoder to ARC? This is kind of what our workshop paper was on. And, I mean, nothing crazy here. But just some evidence of applying DreamCoder to ARC, and what things we saw.
So here's the first experiment we did, where we generated these simple synthetic tasks. And we said, OK, you'll have two different actions, move down and draw a line down. And the different tasks will be to apply that action in different directions. So you could move left, move right, move up, move down, and then likewise for drawing a line.
And the way we enable this is we provide the ability to rotate the grid. And so doing something like moving left, if you have moved down, you can first rotate the grid counterclockwise and then move down, and then rotate clockwise at the end. And in this way, you can apply an action in any given direction with just these original primitives.
And so if we task DreamCoder with this really simple synthetic task, indeed it discovers the programs. Here on the right, you can see the programs that are discovered. They're written in lambda calculus due to how DreamCoder represents programs. And then after running the abstraction algorithm, we learn these new concepts, which are created from the tasks solved.
So the new concepts we learned are applying an action left, applying an action right, and applying an action up. Of course, down we already had from the start. So it's just the identity function. So the takeaway here is that we're not just copying individual tasks. Given a set of solutions, we're constructing a more generalized version of the function that's shared in common between all of the solved programs.
Kent asked if DreamCoder also looks for shorter formulations that give similar results. The abstraction algorithm is a little technical. So it does have a preference for compressing the library. And it also looks at refactoring, so semantically equivalent, but syntactically distinct programs, and this can enable it to find shorter solutions, although it's within a certain bound. So, sort of, yes.
And so the second experiment we did was actually run on real ARC tasks. And, again, a relatively simple experiment. We just provided six different initial primitives, functions, that do different symmetry related operations, different Richard transformations. And then we collected a subset of ARC tasks that do different relevant-- test different relevant things as seen here, Different example tasks.
And, yeah, let's see what happens. So first iteration of search, we solve 16 tasks. You can see here, they're all relatively short programs. And then we plug this into the abstraction algorithm, and what do we get? We get these new functions learned.
I'm not sure with these box is. But, basically-- ignore the blue box. If you look at some of these new functions, the lower functions are sort of noisy and aren't interpretable concepts. But these new functions we learned are. One of them rotates the grid 180 degrees. Another function does the equivalent of a horizontal flip.
So I forgot to mention, but in the functions we provided, we only-- we didn't include all directions, similar to the previous experiment. We just included the ability to flip a grid vertically and stack grids vertically. And this is interesting, because now we've learned to function which is able to do this horizontal flip. In addition, it could do a diagonal flip.
And not only are these interesting that it's learned these concepts, this will also enable it to solve progressively more challenging tasks. So the second iteration of search, we solved three additional tasks that we couldn't solve before. One of these tasks is, for example, tasks 248 shown here. Which I showed before, this is just combining a grid horizontally with itself.
Which, if you had that primitive, it would be a very nice, very short program. But we didn't start with that primitive. Instead, this is expressed by this following program. Which if you rewrite these learned primitives from the previous iteration, you can write it as this. First, you diagonally flip the input, then you stack vertically the diagonally flipped input with itself, and then you do another diagonal flip. And this is equivalent to a horizontal stack.
So kind of cool that it can write things that way. But the really important thing is that after we solve these tasks here, the other three tasks that were solved also relating to horizontal motions and stacking. The new attraction that you learn is, indeed, things related to these horizontal things, as well as the ability to rotate counterclockwise.
And then the last iteration you just do one more round in search, and now you're able to solve these more difficult tasks where it requires mirroring over the x and y-axis, and combining both vertically and horizontally. So this shows the original program expressed in terms of the original primitives, but indeed, most of the primitives here are actually the learned concepts from earlier examples. For example, to combine horizontally, it just called some learned abstraction from before.
Taking stock of the situation, so it seems like DreamCoder is working well.
AUDIENCE: Did you did you run more than three iterations, or did it get stuck and not solve anymore? [INAUDIBLE].
SIMON ALFORD: Exactly. Exactly. I mean, I don't think we by any means exhausted squeezing all levels of performance out of it, but in this case, yes, it did plateau.
AUDIENCE: OK, good. That's all I want to know. Thanks.
SIMON ALFORD: Yeah, so again, it's pretty cool that it can learn these concepts and that they can be applied to new situations. And that it enables solving progressively more challenging tasks. But there's still a limitation. I mean, in the very simple experiment we did before, this task 210 was unable to be solved efficiently by DreamCoder search algorithm even though we have these new higher concepts.
This task, it involves too many different possible reflections for the reasonably, the fairly small search, that we were doing to discover it. And this is related to this type of search that DreamCoder is using when it's discovering programs. It's using neural guided synthesis where you evaluate the input, output grids with a neural network, and use this to guide the search in the form of a distribution over programs.
And this distribution over programs helps narrow down, for example, which functions might be applied for a given task. But it lacks the granularity to say specifically, oh, I'm going to apply this function, then this function, then this function. This is an intentional choice for DreamCoder because it enables efficient search. But on the other hand, it's not fully taking advantage of the different visual signals, visual cues that the ARC tests offer.
So another example of this is this task here, task 15, where the solution expressed at least in this way could be written as like mapping different colors to each other. And to map them, you could do these eight mappings. And this is a really long program. And there's not really a way to shorten it, although you could do some fancy DSL primitive to help with things.
And in general, the search isn't as strong as it could be. So what we're going to look at perhaps next with the remaining time is possible extensions that will enable DreamCoder to use it's really exciting concept learning ability, but also scale well to the complexity of ARC tasks. I also noticed that Kent had a question for how do you bucket the abstracted functions?
AUDIENCE: Yeah, sorry. I just mean like, so you're talking about using network to-- or a learned function to decide what the distribution over function usage is. I'm just wondering how do you factor in new functions that are just-- if there are four axiomatic functions or whatever, how do you bucket the new-- how do you bucket the next function?
SIMON ALFORD: I can't say this with 100% certainty, although to my to my knowledge, you just increase the size of the neural network. And you just train that channel, or that part of the network, from scratch for the new primitive to predict it. So what you do is you can, both for the tasks that you solved and you can generate random tasks to see it in action and understand the semantic effects it has.
AUDIENCE: I'm not sure if this is already done, but I think it might be interesting to do a covariance-- look at covariance matrix between the functions so you have just a smaller number of weights so that it it'll converge to the right general idea faster.
SIMON ALFORD: Right, so it's not being wasteful with a neural network, somehow reusing between functions.
AUDIENCE: Yeah, basically. Just like you probably don't need that-- you probably don't need a whole neural network worth of functions, or worth of weights.
SIMON ALFORD: So this is the point of the talk where we sort of move to more speculative things. This is research we've been working on that's still in progress. So please bear with us with the lack of concrete results, and still just the thinking of three different ideas. But what we're going to look at next is just a taste into, how can we search for programs the way humans do?
There is the learning through concepts from examples, but as Andy also explained, there's the sort of step-by-step reasoning involved to solve ARC tasks. And we still need to formulate an approach that can do this in a tractable and effective way. So what we propose here is a sort of execution-- again, a program synthesis technique where we use-- we are guided by execution, and we also do bidirectional search to discover solutions.
So I'll introduce this, just we have execution-guided program synthesis. Also work done by Kevin Ellis, although multiple people have done this. And the idea is that we don't only use a neural network at the beginning of a task. The neural network, which can be thought of as intuiting what to do next, is called multiple times during the reasoning process as you're constructing the program.
So for another symmetry task like the one shown, you might first call a function which flips the input horizontally. And this produces a partial evaluation which is not the final solution, but it's part of the way there. And it's another grid that you can look at. And then you can base your recognition by looking at the whole state, and guess another function to do, which is to combine the grid horizontally with the original input grid.
After you did this, then by looking-- if you just think about the task of given this partial evaluation, how do you get to the grid? The solution is just to stack it with itself. So somehow just going from the personal evaluation to the solution is easier than doing the whole thing at once from the beginning. So this is getting us towards a step-by-step reasoning that can get us to the solution.
Some things to note about this, first, this is, in the context of program synthesis, this is like a bottom-up search. We construct programs starting from the leads of the abstract syntax tree, and we finish by connecting to the output. So the final program here, hstack, hstack of hflip. The first thing we did [INAUDIBLE] the leads here, where we hflipped the input.
Another thing to note is that you may have to adjust your DSL so that your functions can be evaluated as you go. You don't want to have a function where you have to supply a bunch of arguments beforehand, and then you'll get the output all at once. So it's just the design, a little bit.
But now we're going to look at just some go full, just introspection. And look at some motivating examples of multi step-reasoning in ARC. So here's a task we shared before. How does this step-by-step reasoning process go when we solve a task like this?
At least to me, the first thing I notice when I'm solving this is that the output grid is present in the input grid. And once you notice this, it kind of helps you get closer to the solution. The question then becomes, OK, which object from the bridge are we choosing?
And this sort of changes the search problem. After we do this, you can find the solution, which is to say that the object chosen is the largest object. Or equivalently, in this case, you can say it's the object that's different from the other objects, actually.
And so here's another example, again, from before. A similar thing we first notice is that the output is one of the objects present in the input grid. Then we reframe-- this is sort of the step-by-step reasoning that Andy was talking about. And we think about which object was chosen.
Again, the chosen object differs from others in its vertical symmetry. So how does this connect with the execution guided synthesis? One way to do it is to say, OK, first we get the objects from the input. That's an evaluation, we can partially evaluate it, and then reassess how we get to the solution.
But this next step when we notice the output is one of the objects present in the input grid, it's almost like an action in the opposite direction. We start from the output grid, we know how to generate the output grid given the input if we knew some other argument. So in this case, we can say the output is-- if you filter the input with some criteria. And now we just need to determine what that criteria is.
In this case, the criteria was that the object has bisymmetry. In the previous case, the criteria was that it was the largest object. And then here's the last example, this one says, somewhat easier task, you could say, OK, first thing you notice is that the solution is a three by three grid, and it's just one color filled in. Then the question is, which color should that example become?
The solution, in this case, is the most common color, or the color of the largest object, equivalently. So, again, we're speculating a function, but this is at the end of the program. It's not a forward evaluation. And we can evaluate this function in reverse to say, if we knew the color, we could solve the task. And then the question becomes, how do we find what the color is?
So here's just a sketch of what we were thinking about for this. The idea is that certain functions are invertible and can be evaluated in reverse. And when we're solving ARC, I guess, in general, different concepts are invertible. And we can utilize this when we're looking at input output pairs to try to determine what the solution is.
But not that many functions are invertible. Like in this case, block, given the output grid you can tell what the width, height, and the color is. But most functions are more complicated than this. And in addition to directly invertible functions, there's also conditionally invertible functions. Which can be evaluated in reverse, conditioned on one or more of the inputs being provided.
So in this case, here's [? hstack, ?] where you combine your grids horizontally. If you're given what the left grid should be, and you're given the output, there's a one-to-one relationship between the output and the right argument. So this is a conditionally invertible argument. It's also known as a-- the function that produces the right argument is known as a witness function. This is terminology from a program synthesis word called FlashMeta, which utilizes the invertibility of functions and the conditional inverses to do a efficient top down search for different programs synthesis domains.
So what we were thinking would be applicable as a reasoning approach for ARC would be a bidirectional search, where you start with the input grid, you start with the output grid. At each step, you can either evaluate a function forward, just the typical execution guided tests.
Or you can evaluate a function backwards either by choosing an output node from the set of backward evaluations and calculating the inverse directly, or via conditional inverse where you first provide one of the input arguments, and then you can calculate the inverse. And you succeed once you have a path from input to output, where a path is a fully formed program.
And so I'll just show one example of how we do that here with a more interesting task. And just show some of the interesting techniques you can do with this approach. So this task is actually one of the harder tasks, I think. I think the key to solving it is to notice that-- to think of the output grid not as like a bunch of three by three blocks, but as a--
Oh, sorry. Kent had a question. We want our abstractions to be intuitive, non-invertible? I don't understand.
AUDIENCE: Sorry, I was just, I think I just misinterpreted why you were introducing the invertible concept.
SIMON ALFORD: Yeah, so, again here, I think the key to solving it is to say that-- realize the output can be thought of as the three by three grid, where each of these three by three blocks is like a single pixel. And so if reframe it, instead of creating the output grid, it's creating a three by three grid. Then you can notice that this super three by three grid is actually present in the input grid, just like before. It's one of the objects.
We can reframe again, which three by three grid should we choose from the input subgrids? And the solution, in this case, is that only one of the inputs subgrids has area four. Or equivalently, only one of them is lacking cyan pixel.
And so just to sketch this out in the bidirectional search framework, we start with our tree like this. First, we evaluate this function in reverse that gets rid of the grid lines. Then we might evaluate another function that deflates the grid to a three by three grid.
So notice here, these are technically inverses if you condition on the color of the grid lines or the amount that you're deflating. So those are constants we provide, which any constants can be thought of as part of the forward evaluations. And then the next step is to say, OK, we get the objects in the forwards direction. I'm being a little handwavy with the exact semantics of these functions, because it doesn't really matter.
And then we're going to do the filter inverse, where the filter conditional inverse, what it does is it takes the input list and it knows that the output is one of the input lists items. And so it creates a new target output, which is true where the object was chosen, and false otherwise. So this is, again, helping these functions be evaluatable.
So now our task is, how do we go from the objects to true, true, false. So, again, we could look at it like this. And we have these three objects-- I'm not showing all nine just for clarity-- and then you can evaluate a function in the forwards direction, which is area, which gives you four or five and five. And so you know this matches up, actually matches up perfectly, because the item which is true is distinct from the others. So it's almost like a look up problem.
I may be hiding a bunch of details there. I just wanted to give this high-level view of what we were thinking. The only last thing to talk about is how does this combined with DreamCoder? DreamCoder doesn't talk about inverses at all. So one way to think about it is, any function in which we learn as an abstraction, we can attempt to find the inverse just as a synthesis problem.
Sample some input output examples, see what the output is when you plug in the inputs, and then see if, given the output, you can synthesize the inputs, or conditionally synthesize some of the inputs as shown in the example here. So then there's some implementation details. This is still a work in progress. I don't think the implementation matters as much, but we're doing it in a reinforcement learning framework, and it's very similar to that done by Kevin Ellis's reinforcement learning execution guided synthesis paper.
So, yeah. I might conclude there. Basically, this is just saying what's been said before. We show that we need to learn new concepts from examples, and you also need a reasoning approach to do well. And that's what we're working in progress on right now. So, yeah, I hope this is a nice way of introducing some of our ideas to you guys, and sharing some of the work we found really exciting and interesting that's out there. So yeah, thank you.
HECTOR: So if you have any questions, go ahead and ask us. This could be a nice discussion, because there is many things that you might think about ARC, and how you would actually try to evaluate things. I mean, I have things that I'm particular interested in here that we didn't even touch upon. But maybe we can get to in the questions.
AUDIENCE: Yeah, I have a question. I think DreamCoder and kind of any search technique requires some kind of base DSL of some notion of completeness. In some sense, like are all tasks in ARC captured by the DSL that you have? And when you can find it, how do you quantify that is it just your search algorithm was not able to find it, or is it just kind of like impossible to begin with?
And so essentially I was thinking is there some kind of a universal DSL that can capture all of our problem? But that's probably some really low-level pixel manipulation language, which may or may not be useful. So what do you think about that?
SIMON ALFORD: That's exactly right. You could provide some sort of-- a bunch of different general functional programming operations, and then pixel manipulation, and hopefully, theoretically, you can express anything with that. There's no way of knowing if you're verifying that you cannot produce a solution with the given set of primitives. At least, there's nothing attempted to do so. So I think it's a--
ANDRZEJ BANBURSKI: I just want add, especially since you might accidentally find that some operations you learn from the given training examples in combination with the primitives you have. So you can accidentally learn things like translations, even never having included them and the original operations. And that's purely just because the inputs had a very specific symmetry, but allowed for translations.
That's one example that we've run into, that we've discovered that you could learn [INAUDIBLE] vertical translations without putting that in, but purely because the examples have them.
AUDIENCE: Sorry, I'm not sure if there's a follow up question to that. Feel free to interrupt me if there is. Somewhat, fairly unrelated. I'm just curious, do you think there's value in trying base functions that are useful in classical signal processing and stuff? Do you think that's a-- it seems like you could apply similar things to ImageNet or something with a lot more data, and still do this few shot [INAUDIBLE] learning idea if you just had like-- if you just started with higher [INAUDIBLE] functions, I guess, that are useful in signal processing in general.
SIMON ALFORD: So you're saying, apply to more perceptual benchmarks where your set of functions are from signal processing?
AUDIENCE: Yeah, just like prior work in the field, or something. I think ARC is good because it's a minimal example and it's easy to think about and stuff. But if the ultimate goal is to make machine learning and DNNs better and stuff, it seems like starting with something has a better history of being applied in high-level tasks, to start with that and then kind of build up from there. So I guess my question is, why are we starting so low on the pixel manipulation tiers?
ANDRZEJ BANBURSKI: Maybe I'll say something. I mean, ARC is not really a visual intelligence task, in a way. These pixels, they only have 10 possible colors. And don't think there is a point in applying many of the visual image processing techniques. The stuff that could be relevant is like de-noising. You can use some techniques from there, but I strongly suspect that signal processing is not the correct language for solving something like ARC.
AUDIENCE: Sure, but, I mean, the ultimate goal isn't to solve ARC, right. The ultimate goal is to get something that's useful. I mean, I guess it's a step toward ARC AGI, which I guess is also understandable.
ANDRZEJ BANBURSKI: What is the eventual goal is to be able to do operations on abstracted identities. And in a way, there is a jump from taking something that looks at normal visual scenes, and then creates a simplified representation, on which you could do something like we're doing with ARC. But I would say that this will be a multi-step process. You look at the visual scene, you see things in it, and then you have this abstract object, and relations between objects, you can count things and so on. But that is at a higher level, in a way, than just at the pixels base.
AUDIENCE: OK, but-- sorry, what about, you're trying to simplify-- it's like you have a neural network that's obviously way too small for the task, or something, and then you're trying to simplify an image down so that it can fit in a smaller neural net, or something like that. That seems like another way to do this kind of synthesis, like removing of other external variables, so you're not just brute forcing things, you know.
ANDRZEJ BANBURSKI: Simon, do you have anything to say about this?
SIMON ALFORD: I would have to think about it. Yeah, I don't know. But I do agree in general. This would be the hope, that this-- any insights could be applied to other domains, and not just have some very specific techniques just for ARC.
ANDRZEJ BANBURSKI: That I definitely agree with. I mean, there are datasets like CLEVER, where you're trying to do some kind of simple relations between objects that are geometrical in nature on semi-realistic looking images. And you would hope that if you can do something, do well on ARC, then a combination of these techniques we use here with a CNN will do well on CLEVER. I mean, that's the hope. But you would first have to find something that would actually work well on this.
AUDIENCE: Can I ask a question about the ARC challenges, the ARC tasks? So are there higher level tags that group or cluster like tasks together?
ANDRZEJ BANBURSKI: Nope.
AUDIENCE: So is it the case that when you go into the ARC dataset, you're treating each of these tasks as individual and sharing no commonalities?
ANDRZEJ BANBURSKI: Well, that's not represented to you. You could go the way that we have gone, and create a massive document that somehow classifies these different tasks into different categories. But this is not provided to you in the dataset.
AUDIENCE: OK, thanks. I'm going to think about that for a second.
ANDRZEJ BANBURSKI: To answer Eric's question, doesn't this end up potentially being limitations to benefits? We might come up with solutions that only fit ARC without any deeper intelligence, or general intelligence. hands Hence the hope that the test set is sufficiently different from the training set. If you really have to learn new things as they're solving the test set, actually, then hopefully you're actually doing some kind of intelligent behavior. Now, if you can game ARC because you've discovered some nice set of primitives that somehow can solve these grid world tasks nicely, then you need a better benchmark.
Has someone taken data on human eye movement while serving ARC? I don't-- so I remember, Evan, you guys were looking at some basic psychophysics, some kind of like, describing how solutions are done. And how someone basically getting a DSL based on behavior, can [INAUDIBLE] actually trying to solve the tasks?
AUDIENCE: It's a bit weaker than that. I would say we haven't arrived at a DSL yet.
ANDRZEJ BANBURSKI: I'm saying you are attempting to get there.
AUDIENCE: Yeah, we are. Kind of the computation problem. ARC is very different from the kind of computation problem that computer can do. And then we're like, well, our computer is too weak. So we wanted to run ARC programs on people. So the DSL is like natural language, and the computer is like a person.
So what we did is we have one person synthesize the program from some input output, represented as natural language. And then we give that program to a different person, and they have to solve the task on the new input. And so what we find is human can communicate fairly well, I think. Like I would say 70%, 80% of the task can be communicated, at least on the training set, with natural language.
But looking at the natural language doesn't directly give you what the programmatic structure might be. That's why I'm very keen on asking the kind of question of what is the actual DSL? Because right now, it's like very late in somebody's head right now. It's like a single dude, and he's on his Microsoft Paint, whatever he used, right.
So yeah, we're thinking about getting more fine grained annotations for these tasks. Like maybe we would be-- so, yeah, Kent was saying eye tracking. So, basically, we can get additional annotation of what are objects that you use in solving these tasks. So, basically, once we kind of pool into more of the details of the operational specifics of solving this task and constraining them the right way, maybe we can get those into the computer and basically have a computer able to execute these whatever programs. But, yeah, we don't have a DSL.
ANDRZEJ BANBURSKI: I would say that given the fact that even with this national language descriptions, you're not getting to 100% reproducibility of the given code. It is really difficult to communicate some of these solutions that you come up with. I suspect that using human eye movement data, I don't know how much that will help you. I mean, that would probably inform you that there are certain shapes that we discover as objects to be useful, but I don't know if it will actually help us in trying to find a solution to ARC.
I think it's-- you could abstract it away by saying, hey, we need some kind of object detector that is going to discover separate objects within the examples, but I'm not sure if you will get something from the series of eye movements on the image. It's kind of like, would you get anything from any movements by looking at the eyes of a chess player? Perhaps, but it would be difficult to decode them. But that's my personal opinion.
AUDIENCE: I have a question. Could you go back to the slide where you show your approach that you're working on now, and you have the example of the three different objects?
SIMON ALFORD: This one?
AUDIENCE: No, I think maybe back a little bit. Yeah, this one. Next one, the one-- yeah, this one. So if I understood correctly, the idea is that you've got a multi-step process. And the first step you're saying, oh, it seems like we're taking one of the objects. And then the question is, which object? And the reason you want it to be invertible is that you can then reconstruct from the data which are those objects it is.
I guess my question is, in this first step when you are discovering that it's one of the objects that you are extracting, do you imagine finding the concrete functions? So in the first image I'm taking the pink object and the second image the yellow object. And then from those instances, abstracting the fact that I'm taking one of the objects, or do you imagine directly go into this partial function, which is saying I'm extracting one of the objects? Does that make sense?
SIMON ALFORD: I don't understand, sorry.
AUDIENCE: So at some point, you're going to have to construct this function with an unknown input, right, which is that we have this function and it has this unknown input. And I guess the question is, how do you get to that function?
SIMON ALFORD: So I guess that function is a function that could be applied if you knew the arguments. But it's also a function that we know that if one of the arguments and the output, you know what the second argument has to be. And so in this case, we know what the output is and we can provide one of the arguments. And then we evaluate the witness function to produce the second argument.
AUDIENCE: So in this case, the second argument is what, exactly?
SIMON ALFORD: So the function here, again, would be filtering. Again, I sort of handwaved on the details. Let's assume that filter returns a single item instead of a list. So provided the input list and the output object, the witness function will produce the second argument, which is another list, which is a list of Booleans. So filter in the normal sense, it'll take a list of objects and a list of Booleans and return all objects in the first list for which the corresponding Boolean was true.
AUDIENCE: I see. I guess my question is like, when you proposed this partially defined, or filter, without-- with some of the inputs missing, I guess my question is, how do we get to this stage? How do we propose this as a candidate function to explore?
SIMON ALFORD: So indeed, in order to propose it, we do need to have what the list of input objects, and we need to have the output. So those already have to be constructed either from partial forward evaluations and partial inverse evaluations, or just being the input or the output. And once those are there, then we can call this function. And we choose that its arguments, a forward item and the backwards item, and then it'll evaluate in reverse. Does that answer your question?
AUDIENCE: Maybe. I need to think about it a bit more. But thanks.
ANDRZEJ BANBURSKI: I mean, you can think about the fact that [INAUDIBLE] if you especially go to the later slides, you try-- you're given the input and output, and you can [INAUDIBLE] apply these operations into different directions. And you end up with this kind of big bag that grows from both ends. And some parts are not going to be useful for you. But you might be able to find the part of the next input output.
The question is, is this more useful than just going one way? I mean, the hope is yes. [INAUDIBLE] And the hope really has to be yes. Because if you think about it, you're given the output, you're given the training, the input outputs, and you're now getting the output at the end from the test.
But you're getting the answers that-- you make an attempt at solving the test task, and you failed. That's a terrible-- you get the signal there, which is no you didn't do well, but you don't get signal as to what you did badly. But you still intuitively know that some things you definitely done right.
And this is like maybe not exactly getting fully to getting this kind of attempts at figuring out which parts make sense, but at least within this going back, you know that these parts that we have that are correct, at least within some sentence, or something like that.
AUDIENCE: Yeah, right. I mean, I think it's a good idea. I'm partial to this idea of going backwards and forwards. Just wanted to clarify the issue. I guess one other question is, if you go back-- not literally, but if you think back to one of the examples you had at the start with the planning, this little agent that went up and to the right, do you think this approach can conceivably scale to that kind of problem, or would that require some different ideas?
SIMON ALFORD: In general, I don't think this is like a comprehensive approach. There's definitely plenty of tasks for which the type of reasoning we perform to get to the solution is slightly different. For that specific task, I have thought about it and there is a way to do it. But it's somewhat-- again, you get to the point where the type of-- I'm just not even going to go to the slide, because it'll take too long. But the type of invertible operations you have almost are very carefully crafted in order to be invertible. And that starts to violate the spirit of learning concepts from examples, again. And it's unclear whether synthesizing the inverse would be able to achieve something like that. So, yes, there is a way to achieve that sort of thing. But in general, no, you can't do everything with it. And, yeah.
AUDIENCE: Cool. Thanks.
HECTOR: Felix has another question. But maybe Evan also has a question, and they're going to fight over this.
AUDIENCE: No, I don't have a question. I was just going to comment a particular nine by-- three by three grid thing. That program is so freaking hard. So if you could get that, that'll be quite impressive. This one. This one's brutal. A lot of the human annotator don't really know how to do it. And when they do it, they don't know how to communicate it. So this one is particularly hard. So it's a good one, if you could do it.
[INTERPOSING VOICES]
HECTOR: Go ahead.
SIMON ALFORD: I was just going to say, I think the problem is that it relies on this previously learned concept of deflating the output grid, which is present in other examples. But random humans solving this aren't necessarily people who have gone through and solved the tasks from easiest to hardest. So that's almost an advantage that our agent would have in that it has seen these examples before.
So like someone like me who's been researching this, you look at the output grid, and you can immediately say, OK, one of the possibilities is that it's this little three by three grid. And I think to the general population, that concept is not immediately available.
AUDIENCE: I think the hard part on the solving-- so there's two part, right. You can fail if first you fail to solve, second you fail to communicate. The solving part is really hard. You need to notice that it's a yellow square in the center, that's important. A lot of people don't notice that.
And even if you do notice that, how do you communicate that output? Yeah, so then inflate, it's not in the-- if you look at the [INAUDIBLE] available to the people, There's [INAUDIBLE] color, there's like flood fill. And so inflate is not one of the operators available. So it's really hard for them to communicate that particular function when it doesn't exist. You need to kind of describe what the function do. But this is a really good one.
SIMON ALFORD: Felix's question. How did the surface dynamics change as the DSL grows? So that's a good question. Intuitively, if you have more primitives, the search space is larger and it will drag you down. This is ameliorated by the fact that we have a neural network guiding recognition.
So for a given task, you're predicting which functions are applicable to that test. Which means that for a given task, you'll only use some of the full library. So in theory, it could scale to an unlimited number as long as for one specific task there's only a small set used.
I don't know whether that's true in practice or not. We have found that it does slow down. But on the other hand, you are learning functions that can express-- that can search the search space more efficiently. So usually it seems to work out.
AUDIENCE: Another way to phrase the thing that I'm curious about is-- because we were talking about brute force approaches earlier-- so in light of that, how do you expect a properly grown DSL clustering coder to behave with the evaluation set, this hidden test set? Would it be incredibly slow?
Because at the same time, if there's no-- I'm curious if there's no higher level data about like, oh, this is a reflection task. This is rotation task. Or this is a whatever task that limits or softly types the DSL, or gives you a subset of the DSL. I'm just curious how it doesn't become some enumeration problem.
SIMON ALFORD: I think there's two-- you will-- performing well on the test set is indeed a learning problem. So optimally, there would be some sort of acclimatization to the distribution of tests seen on the test set. I'm not saying how, exactly, that learning approach would work. But it's-- I mean, I think, I guess the confusing thing is that for humans, we can solve the test set right off the bat like we do the training set. So it's not clear where to take inspiration from the type of learning that needs to be done on the test set. But I think there are possible ways of doing it. But I agree that it is a potential problem.
AUDIENCE: I have another question for-- it's related to this topic. I think it's more for Evan, in this case, since you're doing those psychophysics experiments. In the natural language description tasks, have you asked anybody to describe the type of the task? Or a high-level description, like as I just said, this is a rotation task, or this is a mirror task.
Simon, you just gave a little bit of info related to this, where you said you looked at this task on this slide. And you're like, oh, the moment I saw this, I figured it would include this kind of solution, or that it would include this kind of function. So I'm just wondering if there's a way that--
AUDIENCE: We don't ask them to annotate that. We thought about restricting the keyword that we use so that they would try to use the same words that previous users have used. But that wasn't too helpful. I think these tagging system are very helpful if you do them right. But they also require the right community of people that moderates can make sure that all the tasks are label in a consistent way.
So some people might create a tag of square, but other people may refer to them as box. So do you merge these two things? Do you some word embedding to automatically treat those two tag as the same one? It's really, really noisy. Yeah, I haven't figured out how to do the social aspect of it.
I think you could do something like a GitHub, where there's a few mods that controls a report of what to get tagged. And people can propose and then you can-- I think those kind of system needs to be set up in order for these tags to be helpful. Otherwise, you just end up with a bag of redundant tags.
And then the boundary between the tags are not very clean. So is this like a rectangle enclosing task, or is this a square exclusion task? Those things are really, really, hairy. And so we haven't figured out what to do with that respect.
AUDIENCE: I'll just say before you wrap up, Andy and Simon, really cool work, really nice stuff. Obviously a lot of connections to things we've done and are also working on. So I just wanted to say, thanks for that. And we'll follow up by email. I saw your response to [? Will. ?] We'll find a time Thursday, hopefully, to touch base more about what we've been up to in the space, and just talk about ideas. Because I think there's many connections.
I would also just say, one of the last points you guys said about-- it's interesting how much humans can just-- like I agree that the ability to learn and adapt from training to test is an important part of flexible intelligence. And it's also really interesting, as you said, that humans can just do it out of the box. And in fact, they can solve the training set out of the box, too.
It's kind of useful, it makes sense that there's a training set and the development set and the test set. But really, like from humans, it's all of our previous experience plus evolution that lets us do all of these problems. Or if not all of them, definitely the far majority of them, with no training at all. Which it's quite striking how much we can do that.
So I think it's a big enough data set that it is interesting, and it's worth trying to bootstrap learn, as you've been doing, and we've been trying to do also, using the training set as a training set. And it's designed for that, as you pointed out. It kind of has a curriculum built in. But when we think about human intelligence, it is so striking that we can do-- we have the abilities for really far transfer and abstraction that lets us do all of this without having done any of these tasks, anything even like it, basically.
ANDRZEJ BANBURSKI: I mean, these are very good points. And I'm definitely looking forward to discussing with you more. Let's see how [INAUDIBLE] on this thing. I mean, this is a very interesting dataset, in this sense. It's a nice place to think about ideas.
AUDIENCE: Absolutely, absolutely. Have you guys talked to Shimon at all about this? Shimon Ullman? Because, again, I think he would be very interested. And there's connections, whether it's to his interest in programs and routines, but also bidirectional search. That seems like a very-- I'm sure you know that-- obviously I'm sure you're thinking. [INAUDIBLE] I was thinking of in the background.
It just it sort of connects the dots between a bunch of different people's interests in CBMM. So it's cool that this could be also a place where various different parts of [? CMM ?] could come together on some shared effort, as well.
ANDRZEJ BANBURSKI: This would definitely connect to Shimon's ideas of routines. One thing that maybe it doesn't directly connect as in sense that, we've been mostly using modules that have been hardcoded, rather than neural networks. But there is no reason that any of these primitives that we're using could not be encoded as neural nets.
And maybe that would be the obvious place to connect. But even at this abstract level, I completely agree that this is something that we've been, at least partially, inspired by a Shimon's work.
AUDIENCE: One final question, if there's any part with respect to ARC in DreamCoder that you wish you could replace with a human, which part would it be?
SIMON ALFORD: It seems like you could just say, the search algorithm should be a human.
AUDIENCE: I see. So you want the search algorithm to provide you a solution. So you give me a DSL, and then I can propose a solution for you. That would be helpful. I see. We can get a label for that, I think.
SIMON ALFORD: I don't-- I would have to think about that more, though.
ANDRZEJ BANBURSKI: I think I agree with Simon, as well. Because let's say that you have a DSL that has been pre-fixed, then could you give this DSL to a human to just search for a solution to any task? I think, yes.
AUDIENCE: I think so, if you communicate the semantics of the DSL carefully enough so that a [INAUDIBLE] can understand right. I see. So the difficulty here is-- I guess a [INAUDIBLE] can also try within a DSL and tell you, this didn't work out. And then you will have a notion that, oh, maybe our DSL is not expressive enough to capture these kind of operations, I guess. OK, I think I'm happy with that answer. Thanks.
ANDRZEJ BANBURSKI: I mean, at the lowest level, I can probably show that, yes, this is definitely a search problem. Because the DSL could be just the pixel operations.
AUDIENCE: Yeah, but the thing is, you want-- so there is different-- I guess, yeah, if you represent it like that, you could get a car-- this is just me introducing this as an [INAUDIBLE]--
AUDIENCE: No, no, no. I understand.
ANDRZEJ BANBURSKI: I agree with Simon.
SIMON ALFORD: I originally interpreted your question as what would boost the results the most, and not, what's the most useful to your research. So that's why I hesitated.
AUDIENCE: I think that's a good one. So it's like, DreamCoder cost a lot of money for Kevin to run. So if you want to measure it in that sense, if you could outsource part of the DreamCoder computer to [INAUDIBLE]. Like which part would you like to outsource?
Because at the end of the day, we want to strike the balance of finding a program in a space. A program, and then some of that effort is best done by enumeration, and within something like, where do you put these kind of ratchets in the right place? I feel part of the ratchet could be people. And so if they are good at solving some particularly very deep search problem that require probably some crazy learned heuristic that you don't have, then maybe they can give you a few seeds of like, this is how you navigate this search space. And the you could use an imitation somehow. That might be helpful. But, yeah, I see. Yeah, cool. Thanks.
ANDRZEJ BANBURSKI: OK with this, we can officially close the first CBMM meeting of this year.
Associated Research Module: