Shimon Ullman: Visual Understanding: State of the World, Future Directions
Date Posted:
June 9, 2014
Date Recorded:
June 9, 2014
CBMM Speaker(s):
Shimon Ullman (Brains, Minds and Machines Summer Course 2014)
Description:
Topics: Overview of visual understanding; object categorization and variability in appearance within categories; recognizing individuals; identifying object parts; learning categories from examples by combining different features (simple to complex) and classifiers; visual classes as similar configurations of image components; finding optimal features that maximize mutual information for class vs. non-class distinction (Ullman et al., Nature Neuroscience 2002); SIFT and HoG features; Hmax model; state-of-the-art systems from the Pascal challenge vs. human performance; deep learning and convolutional neural nets (e.g. ImageNet); unsupervised learning methods; fMRI and EEG studies indicating high correlation between informativeness of image patches and activity in higher-level visual “object” areas (e.g. LOC); recognition of object parts with hierarchies of sub-fragments at multiple scales (Ullman et al., PNAS 2008); object segmentation (e.g. Malik et al.; Brandt, Sharon, Basri, Nature 2006) using top-down semantic information to enhance segmentation; future challenges include recognizing what people are doing, interactions between agents, task-dependent image analysis e.g. answering queries, visual routines, and using vision to learn conceptual knowledge in a new domain
SHIMON ULLMAN: Good morning, everybody. I'm sorry that I arrived a little bit late to the school. I had some other constraints, so I still don't know you, the audience. I hope I will have an opportunity to meet some of you later on this week. Today we're going to talk primarily about vision and about language as connected to vision. And within vision, the emphasis will be not on low-level image processing, but on what we call visual understanding. You will see, as we go along, what I mean by that. Really, we try to see how to extract high-level, useful information to understand the world through images.
And I will talk for an hour and a half. I will divide it. Most of it I will try to present as sort of the state of the art. What can we do today? What do we understand in this area of high-level visual understanding? The state of the art. I will not have time to go into many technical details. What I'll try to do is describe for you some of the main ideas that proved useful in this direction. And then I would like to spend some more informal time, at the end, on future directions and areas where I feel that things are, more or less, completely open. There is a lot of interesting, good work that needs to be done. And basically, this is what you should be doing or may be doing next, those of you who are interested in this area.
So when we look at an image, we can get, immediately, a very large amount of information, and complicated information. You look at an image like this. The contrast is not great here, but I assume that you can still see and understand what's going on here, what happened before, why these people are hanging out on the electric wires, and what these people are doing here, and so on. People who are not dealing with vision are maybe not surprised enough by this, but what we have to do is start with an image.
We have pixels. We have, basically, electromagnetic radiation. Something like this. And this is a digital representation of what the camera delivers to a computer and also, more or less, what the retina of our eyes delivers to the brain. And then the task of the computer, or the brain, or any intelligent system using vision is to say what's in here. Who is there? What are they doing? How are they feeling? Their goals, intentions, and so on. So it's a tremendous task. And it's really the gap between the input and what we can derive in a fraction of a second that's the problem we are confronting.
In the last 10, 15 years, there has been a lot of work in this area, but it focused, primarily, on a subset of the problems here. Mainly things directly involved with understanding objects. Looking at single objects or objects embedded in images, but basically focusing on an object and trying to say what the object is, to label the object. And this includes a number of sub-problems. The main one, on which most of the community spent most of the time, was object classification or categorization. And the task here is to look at a single image, one of these images, and decide which category of object, which class of object, the image belongs to. These are dogs, cats, trees, and so on. So that's the problem.
And the big problem here is, of course, the variability. You look at the class of dogs. There are so many dogs seen from so many different directions and so on, that looking at an image and understanding that this is an image of a dog is a very difficult problem in the face of this large variability. A related problem is to not just classify an object, but also recognize an individual object despite the variations in appearance that the object might have. That all of these are the same person, all of these are the same type of car, and so on. So this is a more precise recognition.
In addition, even when dealing with a single object, the problem is more difficult than just producing a label that this is a car or this is an airplane. When we look at an image, we sort of know almost every pixel and what it represents. This is the windshield, or the door, and so on. So we would like, eventually, to get the complete interpretation of the object in terms of everything that the person might get out of the image. And this is a difficult problem and relatively little progress has been achieved. And I've mentioned a little bit about this before.
In addition, a major part of the problem is not just being able to produce something that works, but there is a learning issue. You don't want to program, for each and every class by hand, how to proceed and how to analyze a particular image. What we would like to do is to mimic what people can do. We show the system images, and we show people images, and they learn, after a while, to recognize them. So a typical setup is that you give a learning and vision system training images, typically labelled as belonging to the class or not belonging to the class. So all of these are class images and non-class images. And from then on, the learning system should do the job on its own. It should somehow discover all the differences between the class and non-class images. It would build some kind of a classifier as it does so. So the input is images and the output is a classifier that's sort of ready to go, and look at new images, and produce labels for them.
Now for humans, we need even less. The images are typically not completely labelled, certainly not each and every image. If you think about a baby learning to understand the world, we don't tell them, this is a car, this is a motorcycle, and so on. We do it sort of sparsely. Here and there. And they can also do things in a completely unsupervised way. So the unsupervised problem is more difficult. And the task that people usually look at is--you have again the images, the digital images, class and non-class, and the system takes it from there and produces the output, which is a classifier.
There has been quite a lot of work recently also on unsupervised classification, but I will not talk about it today. Typically, the success rate of the performance in the unsupervised case is more limited, but there has been quite a bit of work on it. And this is a result of our own work in which we just gave the system a lot of images, and a good fraction of them contained horses. And the task was then to detect horses. And this is, in fact, the output on images that the system did not see before. And without supervision, it managed to discover the class of horses. So some work has been done on the unsupervised case as well.
And a terminology that has come up--and it's not just terminology. There's a conceptual, natural divide in the whole stream of doing classification between producing features for classification, and then using these features in order to reach a decision and make the classification. And it's a real divide in the sense that, for example, you can use the same features, but try them in different classifiers and see which classifier is best. Or you can use, very often, a single classifier, like an SVM or something, and try it out with different features. You can sort of mix and match, and decide on the features for the task and, later on, on the optimal way of combining, using, and deriving the information from the features to produce the final decision.
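To make this divide concrete, here is a minimal sketch in Python, assuming scikit-learn is available: the classifier (a linear SVM) is held fixed while different feature representations are swapped in. The two feature functions are illustrative stand-ins, not any particular published scheme.

```python
# Sketch: keep the classifier fixed, swap the features, and compare.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def raw_pixel_features(images):
    # Simplest possible representation: flatten the pixels.
    return images.reshape(len(images), -1)

def coarse_grid_features(images, cell=8):
    # Slightly richer representation: mean intensity over coarse cells.
    n, h, w = images.shape
    return images.reshape(n, h // cell, cell, w // cell, cell).mean(axis=(2, 4)).reshape(n, -1)

def compare_features(images, labels):
    # Same classifier, different feature extractors.
    for name, extract in [("raw pixels", raw_pixel_features),
                          ("coarse grid", coarse_grid_features)]:
        X = extract(images)
        score = cross_val_score(LinearSVC(max_iter=5000), X, labels, cv=3).mean()
        print(f"{name}: cross-validated accuracy {score:.2f}")
```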
And this has been a main challenge in the field for many, many years. For those of you who do not do vision, it's a big field, recognition. There are probably tens of thousands, maybe more than that, of papers dealing directly with object recognition. So it's something that has been studied a lot. And a large proportion of it, for many years, has been looking for good features to derive from the image and then use for representation. The features that were proposed over the years range enormously in scope and in size, from very simple to very complex ones. In fact, most of them occupied the low point and the high point of this range. Some of them were very simple, some of them were complex. There's relatively little in between.
When I say simple, I mean things like wavelets, Gabor functions. These are sort of mathematical functions that look sort of like sine waves, or something similar. And you can describe the intensity distribution rather than giving the intensity at each pixel; you can look at the neighborhood and use these as sort of [INAUDIBLE] functions.
At the other extreme, a very popular scheme, until about 15 years ago, 12 years ago or something, was to describe images in terms of higher-level, three-dimensional primitives, including things like boxes and cylinders and cones, and things like that. So you look at an image, you produce a three-dimensional reconstruction, you try to fit the three-dimensional primitives to the image. And then you look at them and try to classify the object in terms of the collection of the three-dimensional primitives, called geons, that you see here. In cognitive science, there were books and articles. And if you look at the earlier review papers, people said that they roughly understood object recognition. At least we know that everybody agreed that it's done by going through this kind of a representation.
But this got stuck, and you will not hear about it or read papers--in the last 10 years, you would not even hear the name geons mentioned. So there was a large transformation in the field in trying to find good features for representation. And the features that are being used today are all, as you will see, variations on directly looking at pieces of images, typically taken from examples of training images. And the intuition behind it, I think, is sound and clear. When you look at a visual class, it's almost defined by being a similar configuration. All of the images in the same class are similar configurations of shared components.
So if you look, for example, at the class of faces-- and I'm not talking about an individual face, but just saying, here is a face in the image. Imagine that these are training images and then you see a new image. This is a face that you have not seen before. So the face may be new, but typically, part of the image, like the hairline, would be similar to the hairline that you have seen before. And the eyes would be similar to regions of eyes you have seen in other faces.
If we want to think about it more biologically or [INAUDIBLE], you can think of cells in your brain that have receptive fields sensitive to the region of the hairline, or the eye, or other parts of the face. And then when a new face comes up, many of these neurons will be activated. And the collective activation of these feature detectors will be a good signature if you again encounter a face [INAUDIBLE].
Now if you go along this direction, the interesting question that comes up is, what are the natural building blocks for a new category? So if I give you, for example, a collection of horses, what would be a good way of describing a horse in general, and the entire class of horses, in terms of the appearance of local image features? If you think about it, we can take pieces from the image--these may be parts of training images. Again, in the case of faces, we can take different pieces, or fragments, of faces, bigger ones, smaller ones. Not all of them are expected to be equally good.
Let me turn the question to you. If you take a large part--maybe a full face, or half a face, or three quarters--is this likely to be a good feature for detecting new faces in general, or not an optimal feature? Any ideas on that?
AUDIENCE: Non optimal.
SHIMON ULLMAN: Non optimal because--
AUDIENCE: It doesn't feel like it would generalize.
SHIMON ULLMAN: That's right. The problem would be generalization. This is a very distinct feature. If you see this in the image, you're probably looking at a face, but the probability of seeing something like this in new faces is, of course, low. Since you started, what about a very small feature? Like a piece of an eyebrow may appear in many, many faces, if not all of them. Would this be a much better feature? Any other thoughts?
AUDIENCE: [INAUDIBLE].
SHIMON ULLMAN: Exactly. This is going too much in the other direction. Because a small feature like this, you're likely to see in plants, and in this general picture, and so on. So intuitively, you would think that optimal features for building blocks, sort of an alphabet for a new class, would be something intermediate between the two. And you can formalize this. You're looking for a feature that will be highly informative for the class. It will deliver to you as much information as possible about the class. Yes?
AUDIENCE: So it seems clear that very small features are very diagnostic when they're the only feature, but it's not clear to me that configurations on such features wouldn't necessarily--
SHIMON ULLMAN: Right. A single one, at least, will not be. And you're trying to look for features-- this is maybe even a part of the insight. In some sense, the pixels themselves are already carrying all the information. The configuration of the pixels is carrying all of the information, in effect, by mathematical theory. You cannot, by looking at some configurations of pixels and so on-- you'll never increase the information. So the collection is not more informative, but then you have an enormous task of dealing with this. And you want to meet the problem sort of half way and say, let's look for features that, individually, will be highly informative and then combine them.
I agree that it's not the only possible way of thinking about it, but since configurations of pixels proved impossible to use directly, it made sense to sift the information out into highly informative features. And then you have a small number of highly informative features and combine them. And in fact, this proved to be very good.
And I will not go into the details of information theory and so on. I will assume that most people know, roughly, what information is and what mutual information is, how you deal with information. If not, it's just not the time to do it now; I'd be glad to explain it to people individually. But for those of you who know, roughly, what information is: if you can compute information between two variables, you can look at a set of images basically as a random variable, which is 1 for a class image and 0 otherwise. Similarly, for a feature, you can think of it as a binary variable, which is 1 if the feature is present in the image and 0 otherwise. So if you have, say, 200 training images associated with this, then for any potential candidate feature you have its signature and what kind of variability it has. And you can see how informative it is, how much information it delivers for the classification.
And this is the mathematical formula. But intuitively, you basically look for a feature that will be highly correlated with this. It will have 1s when the image is really a horse and 0 otherwise. But it turns out that you don't want to do correlation, but in fact, measuring the information is better because you can show, mathematically, that this will reduce the classification error. And I will not go into that.
And you can do it now automatically and mechanically. Suppose that you had class images, 200 faces, and say some number of images which are non-faces. And the question is, what's a really great feature? If you needed a single feature to decide, based on just this one feature, whether there is a face in the image or not, what would be the feature? It turns out, you can solve this, at least for the data given, by looking at all the possible sub-images. So each one of them-- each candidate feature like this, like this. You can look at how often it appears in the class set and how often it appears in the non-class set. From this, you can plug it directly into a measure of information and you can find the best, or the second best, or the [INAUDIBLE] best features for finding faces in the images.
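To make the procedure concrete, here is a rough sketch of the fragment-selection idea in Python: treat the class label C and the presence of a candidate fragment F as binary variables over the training set and rank fragments by the mutual information I(C;F) = sum over c,f of p(c,f) log[p(c,f) / (p(c)p(f))]. The presence test used here, normalized correlation above a threshold, is a simplification for illustration, not the exact procedure from the paper.

```python
import numpy as np

def mutual_information(c, f):
    """c, f: binary arrays over the training images. Returns I(C;F) in bits."""
    mi = 0.0
    for cv in (0, 1):
        for fv in (0, 1):
            p_cf = np.mean((c == cv) & (f == fv))
            p_c, p_f = np.mean(c == cv), np.mean(f == fv)
            if p_cf > 0:
                mi += p_cf * np.log2(p_cf / (p_c * p_f))
    return mi

def fragment_present(fragment, image, threshold=0.8):
    """Crude presence test: does the fragment match anywhere in the image?"""
    fh, fw = fragment.shape
    best = -1.0
    f = (fragment - fragment.mean()) / (fragment.std() + 1e-8)
    for y in range(image.shape[0] - fh + 1):
        for x in range(image.shape[1] - fw + 1):
            w = image[y:y + fh, x:x + fw]
            w = (w - w.mean()) / (w.std() + 1e-8)
            best = max(best, float((f * w).mean()))
    return best > threshold

def best_fragments(candidates, images, labels, top_k=5):
    """Rank candidate fragments by how informative they are about the class."""
    scores = []
    for frag in candidates:
        f = np.array([fragment_present(frag, img) for img in images], dtype=int)
        scores.append(mutual_information(np.asarray(labels), f))
    order = np.argsort(scores)[::-1]
    return [(candidates[i], scores[i]) for i in order[:top_k]]
```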
You have to deal with redundancy. You don't want features that look like copies of one another. I will not get into dealing with this redundancy, but the general idea is that you look for patches which tell you as much as possible about the presence of a face, or a class, in the image. And as we expect from the intuitive argument before, you get local image appearances, local image regions. Sometimes you get low-resolution, large fragments. Sometimes they're smaller, sometimes they're slightly bigger. But you get typical sub-regions of classes, which turn out to be the most informative features for finding the classes.
And based on these considerations, either directly or indirectly, all the systems in existence now basically try to describe images, or class images, by dividing them into localized parts. And the parts are not necessarily parts in the semantic sense. It tends to be an eye, or it tends to be a nose, or it tends to be roughly the head of the horse, in the case of the horse. You get pieces of its general appearance, and each one of them turns out to be informative and useful. And a collection of these localized features can be used very successfully for discovering new members of the class that you have not seen before.
This is typically combined with at least some degree of relative position information. Because if you look for, say, a horse, you would like the head to be relatively high, the legs to be below it, and so on. So you can produce, based on learning, a representation which tells you what the main features are and what the relative arrangement in space of these features is. And this is the kind of configuration that becomes the object representation. When you look at a new image and you want to see if there is a horse there, you basically look for this configuration, this arrangement of localized parts, which you have derived during learning.
So a typical algorithm would be to look for each of the parts individually. Each part produces sort of a vote as to where the center of the object should be. There is some kind of star configuration of the various features relative to a central location. And if this combined measure is high enough, it's typically sufficient for recognition.
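Here is a rough sketch of this kind of star-model voting; the learned offsets and detection scores are assumed inputs for illustration, not the specific scheme of any one paper.

```python
import numpy as np

def vote_for_center(image_shape, detections, learned_offsets):
    """
    detections: list of (part_name, (y, x), score) found in the image.
    learned_offsets: dict part_name -> (dy, dx) from the part to the object center.
    Returns a vote map over image positions.
    """
    votes = np.zeros(image_shape)
    for part, (y, x), score in detections:
        dy, dx = learned_offsets[part]
        cy, cx = y + dy, x + dx
        if 0 <= cy < image_shape[0] and 0 <= cx < image_shape[1]:
            votes[cy, cx] += score          # each part votes with its detection score
    return votes

def detect_object(votes, threshold):
    # If the strongest combined vote is high enough, report a detection there.
    cy, cx = np.unravel_index(votes.argmax(), votes.shape)
    return (cy, cx) if votes[cy, cx] >= threshold else None
```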
So people do not necessarily use the direct measuring or deriving of the mutual information of each feature, but the common aspect of existing schemes is that they're all using the same family of features, deriving them either in this direct way or somewhat indirectly, by trying different features individually and seeing if they help classification or not. There are various ways of doing it. And some people just know that if you take intermediate features--if you take an [INAUDIBLE] which is 150 pixels on a side, and you build parts which are 30 pixels on a side--you're roughly correct and can proceed with that.
There has been very rapid progress in this area since people started to use more sophisticated models and classifiers based on features like this. This was taken from a paper that was written circa 2003 or so, when the work on classification started to take off. And this was described as something that computers will not be able to do in the near future.
People showed it in the paper and said, here are different airplanes. How on earth can you learn to automatically discover and recognize that all of these things are airplanes? And already in the same year, people produced systems that, without seeing these images, classified all of them as airplanes. It's interesting, because you don't see it very often in computational theories of cognition and so on, that you have such a rapid transition from a problem that you say will not be solved to producing workable solutions, within the span of a couple of years.
A related and important question that comes along with this is, how do you measure the similarity between, say, a feature and an image--or how do you represent the image. You don't necessarily do it on the basis of pixels. And it turned out, and this is by trial and error, that a very good way of locally representing the image, and comparing whether a new patch that you see is sufficiently similar or not to a part of a stored image, is by using the local gradients and some combination of those.
If you read the literature, you will see two terms. One of them is called SIFT and the other one is called HoG. HoG is histogram of oriented gradients, or something like this, but people already treat them as words in the English language, SIFT and HoG. And they're doing a very similar thing which, I think, again, is very intuitive. And you can get a lot, I think, by just using some healthy intuition and then formalizing it.
It turns out, and people have known it in biology for a long time, since the work of Hubel and Wiesel, that what's important in images is not the intensity values--that a pixel is now 117--because this can change because of the ambient light. It's the edges and gradients in the image, the orientation of the boundaries, that are important. So you want to represent the local region by the direction of the local edges, if they exist. And then, in order to allow some flexibility and distortion in the images, you cannot expect, when you see a similar image again, that the edges will be in exactly the same position that they were in before.
So you want to allow some slop and some position tolerance in where you expect to see an edge that you've seen before. So an intuitive and plausible way of doing it is that you produce local descriptors in which you look for the local orientations, and then you allow them to move a little bit, and say, I'm looking for this roughly in this location. And this is what the HoG descriptor is doing, and all the other useful descriptors. They look for the local edge directions, the gradient directions, and they produce a local histogram that doesn't care about the exact location of these edges. And a vector sort of concatenates these local histograms together into the HoG descriptor. And this is the way images are compared. These local descriptors, of the patches, are being compared in most of the computational schemes.
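Here is a simplified, HoG-style descriptor in Python to illustrate the idea: gradient orientations are pooled into histograms over small cells, which gives the position tolerance just described, and the cell histograms are concatenated into one vector. Real SIFT and HoG implementations add block normalization and other details omitted here.

```python
import numpy as np

def hog_like_descriptor(patch, cell=8, n_bins=9):
    gy, gx = np.gradient(patch.astype(float))             # local gradients
    magnitude = np.hypot(gx, gy)
    orientation = np.mod(np.arctan2(gy, gx), np.pi)       # unsigned orientation, 0..pi
    h, w = patch.shape
    histograms = []
    for y0 in range(0, h - cell + 1, cell):
        for x0 in range(0, w - cell + 1, cell):
            ori = orientation[y0:y0 + cell, x0:x0 + cell]
            mag = magnitude[y0:y0 + cell, x0:x0 + cell]
            bins = (ori / np.pi * n_bins).astype(int) % n_bins
            hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
            histograms.append(hist / (np.linalg.norm(hist) + 1e-8))
    return np.concatenate(histograms)                      # descriptor used for comparison
```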
This is, for example, from a real paper. It's showing how the system describes bicycles, for example, in terms of this local representation of each local image region. The image is then divided into parts, as I described before, and local parts described by these rough, local orientations proved to do the job about as well as we can. This produced also an interesting convergence between biology and computation. Because while people were busy doing the computational theories, the biological modeling--in particular, the Hmax scheme that [INAUDIBLE] and his group developed--used a more directly biological way of modeling object recognition. But if you look at the kind of descriptors and the way things are combined in order to do classification, they are very similar to the kind of schemes that people were working with in computational algorithms. And many of them, without knowing much about the biology, figured out they could use this approach.
Maybe you've already heard the description of Hmax, but again, it starts with local orientations, it then lets them move a little bit, and produces a local descriptor that looks for an orientation within a local region, and these are combined in the same way that the HoG descriptor would be combined. So there is a really natural match between some of the most successful computational schemes being employed today and Hmax and similar models that try to explain and follow the architecture of the object recognition system in the human brain. And maybe here, using the biology, you can also get some plausible estimates of how large the receptive fields should be and so on. So it's an interesting convergence between the two.
Let me show you quickly a sort of state of the art, where we are, what the rough level of performance is these days. And it's also interesting to mention the way the sociology of this field operates. There are yearly competitions and, until recently, the main one was the so-called Pascal Challenge, in which a central site puts out challenges like, let's see who is the best in the world in recognizing airplanes. And they give you a data set of airplanes. Some of them are kept secret. You cannot use them, but you can train your system on the images you've been given. And then, at the end of the year, when the time is up for the competition, you submit your algorithm and people run the algorithm on the test set. And they produce curves and see whose curve is the best. And then they publish the results. I think they get a $100 prize for being successful, and they put it on their CV, and it's a nice thing to do.
And it produced huge bursts of activity. Many people are submitting entries to this competition. You can see here, there are various curves. The higher, the better--you can see that somewhere here are the winners. And you can see the range of images that you have to treat this way. So these are different images from the test set and you can see that they are very, very different from one another. So the system that we have to learn, the classifier in the competition, needs to be able to handle this large variability in appearance.
These are also labels. I will not go through the labels. But by the placings here, and so on, you can see that there has been a nice progression. Initially, many of the systems were not that good, but over the years, we're getting better and better. So that's something interesting. And we discussed this yesterday, maybe something like this will be useful for other areas of computational cognitive science. If you do it correctly, it generates some healthy competition. Yes?
AUDIENCE: Where's the human performance curve?
SHIMON ULLMAN: Sorry?
AUDIENCE: Where's the human performance curve?
SHIMON ULLMAN: The human performance would be close to here, on this curve. 100% or perfect is this corner. They may miss one or two. Some of the images are so difficult that you may miss them. So it will not be 100, but it would be 99 or something like that. So they will be better. And it's an interesting question, which I will mention a little bit at the end: if we continue with this competition, are we going to push this from here to here, to human level, or are we stuck here? There is an indication, maybe, of [INAUDIBLE]. And I think there is an [INAUDIBLE]. There's a real gap, which is not going to be solved by the current method, but it's an interesting and important part.
And in continuation of this, some classes are still not as good as others. So this, in the meantime, became somewhat better. But for some classes, the gap between human performance and computer vision is still high. And it's interesting to inspect which ones came closer to human performance and which ones did not. You can see in the numbers what the average precision was. So for airplanes, it's 53, but some other things--plants, just looking and saying that there is a plant in the image--were much lower. So there's still a way to go in terms of this.
Let me mention quickly, as we describe these systems--there has been a lot of buzz recently, which is an interesting buzz, on the use of so-called Deep Learning and convolutional neural networks to do object classification. So in the last couple of years, in fact, the winners of some of these competitions, like Pascal and similar competitions, were systems that come under the general label of Deep Learning and convolutional neural networks. How many of you have not heard the name Deep Learning in the general press and so on? Right. So that shows something. The general public hasn't heard the name HoG or SIFT descriptors, but they did hear, I think, the term Deep Learning. It was all over the place in the general press and so on.
Also, it got to the point that Google bought Geoffrey Hinton and his operation. And then Facebook bought [INAUDIBLE] from NYU and his operation. So it also spilled over from the more academic sphere to the big companies. And they're all now trying to solve some of the intelligent tasks they have using Deep Learning. And I think that that's, just by itself, interesting, even as a discussion: which part of this buzz is [INAUDIBLE], where do the problems still lie, and so on.
But let me give you some examples of what it produces in the field of object recognition. The big breakthrough for Deep Learning in recognition and vision was very recent. In 2012, it won, by a large margin, one of the biggest competitions in vision classification. It was an interesting competition, and that was probably a part of why it captured the attention.
In trying to do object classification, people moved from a small number of classes to increasingly larger and larger numbers of classes to recognize. So you're given an image and the question is not, is this a bike, yes or no? But here is a list of possible categories; show us a location and tell us about the object at this location. Which object in the list of possible objects does it belong to? And it started with maybe 100 categories or so. Then for several years, the main competition was on the database from Caltech with 256 categories. And then it jumped to the recent one, which was, if I remember the number correctly, 15,000. Is that the number in the collection called ImageNet?
It said ImageNet here. So there was an ImageNet competition, and [INAUDIBLE] and other people at Stanford collected and, laboriously, using the mechanical [INAUDIBLE], labelled images for 15,000 categories. And there were several thousand images in each category, so this is a database of many millions of annotated images. And again, you can do a competition by revealing half of the database to people who want to develop classifiers. They develop the classifiers and then they are tested on the other images to see if they recognize all the 15,000 categories. Maybe the competition was slightly less, maybe 10,000, but that's the rough number. It was a very large number of different classes. Some of them were very similar to each other, and it's a big challenge. And then you do not get these curves which are relatively high up. It's very difficult to get good average results on such a complicated task.
So this was one of the tasks. People competed and they started to take it more and more seriously. And then came a family of algorithms started by, as I said, various people--including [INAUDIBLE], who I think coined the term Deep Learning. These classifiers consisted of many layers. Not just two or three layers, but many layers, in which you try to automatically discover a whole family of features starting from very simple ones, then to more complex ones, and more complex ones, until you get to the very top. And at the very top, each of these nodes corresponds to one of the categories you wanted to recognize.
And in fact, people who know the literature from the '80s--going back to the '80s, this is very similar to the multi-level, multi-layer [INAUDIBLE], the kinds of connections and networks that people used and developed during those days. It's very similar, conceptually. It's with bigger computers, larger computers, faster computers, and a few additional tweaks, but not very many. So basically, it is the revival of a method. The revival was based not on a conceptual breakthrough, as far as I can see, but on using the same methods, but bigger, and faster, and stronger. I will not go into a real description of how the system worked, but maybe we have some numbers with it--I don't see it. I'll show you in a minute.
Typically, there's a huge number of parameters in this network. Basically, it turns out that there is by now a kind of standard architecture for these deep networks. It's a repetition of three operations. You do some linear convolution, where you apply filters. And filters mean, for those who are not versed in the terms, just a linear combination of pixels in a region. So that's a linear operation. Then you apply some nonlinearity. You had, for example, a filter which represents an eye; you can convolve the image with an eye piece and then you set a threshold to see if the response is above a certain threshold. And then you do some pooling over a neighborhood, and then you go up to the next layer. In the next layer, you again apply, to the output of the previous layer, some linear filtering, a nonlinearity, pooling over some space, and you repeat.
So it's a huge network that does these repeated operations, but you have to decide on the nature or the weights of all the parameters. There can be millions and millions of parameters. You have to optimize them. You optimize them basically by gradient descent of some sort. You initialize them and you start to change them in a way that will improve the performance until you stop [INAUDIBLE] or when you run out of time.
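As a minimal sketch of this repeated convolution, nonlinearity, pooling recipe, here is a toy network in PyTorch; the layer sizes are arbitrary choices for illustration, not those of any competition-winning system.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, padding=2),   # linear filtering (convolution)
    nn.ReLU(),                                    # nonlinearity (thresholding)
    nn.MaxPool2d(2),                              # pooling over a neighborhood
    nn.Conv2d(16, 32, kernel_size=5, padding=2),  # repeat at the next layer
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                    # top nodes: one per category (32x32 input assumed)
)

# Training is plain gradient descent on a classification loss.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def train_step(images, labels):
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()            # gradients through all the layers
    optimizer.step()           # adjust all the parameters a little
    return loss.item()
```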
In 2012, this deep network achieved 16% error, 84% correct results on average. And the runner-up was at 26% error. These numbers were very impressive, and people who do these competitions looked at these numbers and were impressed by them. Typically, the year-to-year progression was very small: 1%, 1.5%, 0.8%. So suddenly having a competition in which the gap between number one and number two was so large made a big splash in the community. And people saw that as a very big advance. And in practical terms, it did produce results which were better than the other ones.
I should mention that because the task was very difficult, one way of pushing the numbers up was that instead of looking at the percent correct, they looked at the percent correct within the top five suggestions. So the network is allowed to produce not one answer, but five. And if one of them is right, that's considered OK. So when they say 84% correct, it means that [INAUDIBLE].
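For concreteness, the top-five criterion can be computed as in this small sketch (illustrative, not the official evaluation code):

```python
import numpy as np

def top5_accuracy(scores, true_labels):
    """scores: array [n_images, n_categories]; true_labels: array [n_images]."""
    top5 = np.argsort(scores, axis=1)[:, -5:]          # five best guesses per image
    hits = [true_labels[i] in top5[i] for i in range(len(true_labels))]
    return np.mean(hits)                               # fraction counted as correct
```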
What I find interesting also is--when you have this network of multiple levels of features that were learned simply by some gradient descent on a huge network, you can have a look at the different nodes within the network. And this is maybe an interesting general point, because some people say that when you produce these huge networks by huge optimization, you don't really understand what's going on. So in what sense did you produce science, even if the percent correct is impressive?
I think one advantage is that, unlike the brain, we can easily poke each one of the nodes and take away each one of the pieces of the network. And if it works well, you can investigate it more and more, and get an understanding of what's going on inside the system, more easily than in the biological system. And you can look at different levels within this Deep Learning network, and look at the sort of artificial nodes within this network, and see what their receptive fields are, what they like, what they respond optimally to.
And when you look at the low-level layers in this multilevel architecture--it's not surprising, but it's interesting to look at what they respond to. They respond to sort of local features, edges, uniform regions, maybe some bars of different orientations. So the kinds of things we find in V1, maybe V2, of the primate. This is layer three of one of these networks. And each three-by-three sub-matrix here shows the strongest stimuli that activated a particular neuron or node in this network. So this three-by-three, bluish corner here shows the optimal stimuli for activating one of the nodes in layer three of this network.
And you can look and see--it may be difficult from where you're sitting. They are not exactly objects, but you have sort of a dark, shaded object on a uniform background. These are sort of intermediate features, more complex than just lines and so on, but less than an object. This is layer five out of nine. And you can see that you get flower-like stimuli and various other things. So there is a systematic progression in the complexity and also, in some sense, in the semantics--the preferred stimuli become more recognizable and more object-like.
I should mention, briefly, some of the big limitations. One thing didn't work out--people tried hard, but what didn't work well in this kind of network, at least in my view, is doing this in an unsupervised way. Initially, when the sub-field of Deep Learning emerged--Hinton started it and worked very hard on it--it was supposed to be a model of the brain being able to learn about the world around us in an unsupervised way.
You build this deep network with all of these parameters you have to tune. And then you turn it to the world, and you start collecting images, and you hope that the high level nodes of the system, in some unsupervised way, will learn to discover repeating, interesting structures in the world. And in this sense, they will learn about objects around you without laboriously giving them millions and millions of training images which are labelled.
And people tried to do it in an unsupervised way. This is one example of a large-scale attempt at building one of these networks, feeding it lots and lots of data, and seeing whether, just by some internal optimization, it will start to discover semantic categories. It was an effort combined between Stanford University and Google. They did it in 2012. You can see here some of the numbers, just to see what's involved in a typical network like this. It had a billion parameters, this network. They trained it with images from 10 million YouTube videos. They used [INAUDIBLE]; they trained on 1,000 machines with a total of 16,000 cores, running over three days. So this is the training. And the question is, what will it learn after such a tremendous training?
It learned about faces. So you didn't tell it about faces and it turned out that one of the nodes in the network, or some of them, responded specifically to faces. So somehow the network learned about faces.
It also learned about two other things after all of this effort. Maybe these images are not doing it justice. They are not great, but they are real images taken from the system. It discovered also faces of cats, not only human beings, and it discovered the rough outline of the human body. So these are the three things that it discovered well above chance. Nothing really more than that. Again, this was described in the open literature, in The New York Times, and other places. It was presented as a big success. It told us that Google can learn to discover cats and can learn about objects without being taught.
But if you look at the size of the effort and the fact that it discovered these structures-- And by the way, cats is not arbitrary, as I understand. Any guess why cats were discovered and not other things?
AUDIENCE: Cats videos are extremely popular.
SHIMON ULLMAN: Right. It turns out that the most popular object on YouTube is faces of people and the second one is cats. So basically, if something is really highly repeatable and it just appears over and over again, this was able to pick it up. But compared with all the subtle things that we pick up, and infants pick up, after a few exposures and so on, again, the gap is so huge that I would call it, basically, a failure. This network and these architectures, as we understand them now, cannot even begin to mimic or to explain how we discover things in an unsupervised manner.
One interesting direction is not in machines, but in brains and in cognition. So an interesting part is to relate various aspects of emerging concepts and theories from computational vision and compare them to the brain. Let me show you, very quickly, some examples of our work in collaboration with fMRI and EEG, in this case, that is trying to relate some of the findings from computational vision to the brain. In this case, the question was whether higher-level regions in the brain, object-related areas in the brain, are also sensitive to local, highly informative pieces of images. And it was a challenging question because we still don't know--we know only, in a very limited way, even after this study--what the features are that drive higher-level areas in the visual cortex.
So we know about V1. We know that you can drive it with certain orientations and directions of motion. But if you have to predict which visual feature would be very good at driving area LOC, for example, which is an object recognition area in the visual cortex, it's very difficult. If I give you two images and ask you, can you guess which feature will be better at activating LOC? This was not a question that people could answer. And the natural thing to try was to measure how informative different local regions are for the purpose of recognition, and see whether this objective measure of information for recognition can predict the activation of higher-level visual areas in the brain.
So here's an example of one experiment that involved different configurations and different variations. But here, you can see an example of the stimuli given to people inside an MRI machine. And the contrast was between more informative and less informative features. So in one array, you have a collection of features that, if you measure mathematically, are more informative than the patches here. And you're trying to correlate the response of different brain regions with the measure of information. And let me show you the kind of results you get. I'll maybe show you different brain areas for different categories. So the experiment involved faces, horses, and car images. This is what's shown here.
And for each one of them, you can look at these regions, which are more or less informative, and then you can see--you have, for example, the neck of a horse and the legs of the horse. And you want to say which one will activate LOC more. And you look at it and you really don't know. If you measure it, you'll find that, for example, the neck is 50% more informative than the legs. And you can try to see it. And the bottom line is that there is a high correlation between the objective measure of information for categorization and the activity in high-level visual areas. And importantly, you do not see it in low-level visual areas.
So when you look at V1, these features which are being compared are identical; they are equivalent. So in terms of local edges, contrast, and energy in the image, they are counterbalanced and you cannot distinguish them. You can see here, for example, for the horse, the more informative and the less informative give identical responses, but when you go to LOC, there is a relatively big and significant difference between the more informative and the less informative ones.
Similarly, I found this relatively impressive. This is an EEG study in which images were just randomly-- they took random pictures from the same categories, measured the amount of information for recognition, and divided them into 5 bins, from the most informative to the least informative. There are bins one through five. Five being the most informative. And then we look at-- yes?
AUDIENCE: I have a question about--do you have any particular task for your subjects when you present your [INAUDIBLE]? In the MRI.
SHIMON ULLMAN: You can look. We tried both. Just passive looking was one. Recognizing was another one. And doing a one-back identification--is it the same as the one you saw before? And we got the same results in all of these conditions. So in all these conditions, in LOC, the graph will look very similar.
AUDIENCE: So I'm curious about-- I know that you defined how informative a feature is from a mathematical perspective, but I wonder if that corresponds to the intuition that we, as [INAUDIBLE]?
SHIMON ULLMAN: It could be.
AUDIENCE: [INAUDIBLE]. And if there's a mismatch, what if you re-analyze your data according to this perceived information that people report?
SHIMON ULLMAN: We can discuss this. We also asked people to categorize them, and we got some intermediate results. They weren't that good. If you ask them, by the way, which one do you think is more representative of a horse, you don't get very good results. But one of them is more informative than the other, nevertheless, and when you try the machine, one is more informative. So I can show you, and I'll give you the reference. Many of these controls, at least, were tried--there are more controls and you can look them up. These are good questions. I agree.
You can see here--it's a similar study, but here it was divided into five bins. We looked at a particular wave in the EEG, which is known to be correlated with classification. And the size of the peak pointing downwards is the magnitude of the brain response. And you can see it all very nicely: the most informative is the strongest. This is bin number five, bin number four, three, two, and one. They really all line up very nicely. There was a very good correlation between the objective amount of information and the brain response.
Let me skip over that. And I'll just mention a half-open question, but not a completely open question, which is to produce not just a label for the object, but a full interpretation of the object. So if you look at a face, when you are done, you want to be able to say that this is a face, but also that this patch belongs to the eyebrow, and this belongs to the nose, and this belongs to the upper lip, and so on. So you would like to look at images, many faces, and eventually have as output what is termed a full interpretation rather than a label for the class.
This is, again, from our work, and it's done in an unsupervised way except for labelling class versus non-class images, faces versus non-faces. So there were no labels for the eyebrow, lids, and so on. The basic idea is that in the same way that you can decompose an image into its useful components, its most informative components, you can take patches like this and divide them into their own informative sub-components. And you can repeat this until you basically get no gain in information by repeating the process. So there is a relatively simple mathematical procedure that will decompose a patch into its sub-patches, based not on semantic labels, but on measuring how to do it in the most informative way.
In fact, it's a repetition of the same algorithm that decomposes objects into informative patches in local regions. The same algorithm can be used to decompose the patches into their [INAUDIBLE]. And then you get a hierarchical representation that looks like this, which we think is more biologically plausible--the representation of the world in the brain is hierarchical in this way.
And when you do this, you gain somewhat in classification performance, but perhaps more importantly, you gain in the sense that you can produce much more, much richer descriptions than just labelling the objects. You can find the various parts and sub-parts. And then the representation looks something like this. So this is how the representation of a class of objects looks in this model and may look in the brain. It becomes very similar to the work that Tommy has done previously based on biological modelling. Again, it's part of the convergence that I described earlier, that you get this kind of hierarchical representation.
Mathematically--the way that you describe it in computational terms, for those who are interested in the terms--these are probabilistic models in which you have a node here for the class, and then you describe the probability of the children of this higher node given the presence of the parent. And then, hierarchically, you have to measure the conditional probabilities of, say, node x5 given x1. I don't see it here at the bottom for some reason. The full mathematical description is this: you describe the probability of a particular configuration in terms of the class, the main parts, and the sub-parts. This becomes the product of the probability of the class, times the probability of the children given the class, times the probability of each of the lower nodes given [INAUDIBLE].
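In symbols, the tree factorization being described is roughly the following (the variable names are illustrative, since the slide's notation is not in the transcript): with C the class node, X_i the main parts, and X_{i,j} the sub-parts of part i,

```latex
P(C, X_1, \ldots, X_n, X_{1,1}, \ldots) \;=\; P(C)\,\prod_{i} P(X_i \mid C)\,\prod_{i,j} P(X_{i,j} \mid X_i)
```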
And the nice thing about this, and that's the only thing I will say about this kind of representation, is that it's very efficient to do inference and find the full interpretation of an image given this [INAUDIBLE]. So the problem you are faced with when you try to find all the parts of an object, given a model like this, is that you're given just an image and you're trying to find the best assignment of which parts appear in the image and at what locations, based on this probabilistic model. It turns out that you can get the globally optimal solution using just a single cycle of bottom-up and top-down [INAUDIBLE].
So you start with feature detection in the image. You go up and find what the probability is of this node, and what the activation and probability of the higher ones are, and so on all the way to the top. You get a decision at the top. Following the decision at the top, you update the probabilities going down. And you can show, mathematically, that a single cycle for such a model is going to give you the optimal assignment of the parts and their locations, based on the features present in the image.
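A minimal sketch of this single up-and-down pass, for a one-level star model, is the following max-product message passing on a tree; the data structures are assumed for illustration and are not the code from the paper.

```python
import numpy as np

def interpret(local_scores, pair_scores):
    """
    local_scores: dict part -> array of local evidence scores per candidate location
    pair_scores:  dict part -> matrix [root_state, location] of geometric consistency
    Returns the best root state and the best location of each part given it.
    """
    # Bottom-up: each part sends to the root, for every root state,
    # the best (location score + consistency) it can offer, remembering the argmax.
    messages, backptrs = {}, {}
    n_root_states = next(iter(pair_scores.values())).shape[0]
    root_score = np.zeros(n_root_states)
    for part, scores in local_scores.items():
        combined = pair_scores[part] + scores[None, :]   # [root_state, location]
        messages[part] = combined.max(axis=1)
        backptrs[part] = combined.argmax(axis=1)
        root_score += messages[part]
    # Decision at the top.
    best_root = int(root_score.argmax())
    # Top-down: propagate the root decision to fix each part's best location.
    assignment = {part: int(backptrs[part][best_root]) for part in local_scores}
    return best_root, assignment
```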
So unlike the initial recognition, in which you only want to know if the object is there in the image, yes or no--if you want to also produce the most reliable interpretation of the full image, it turns out that you also need the top-down phase. And a single one, in this tree-like architecture, is enough. You have to go up once, down once, and you're done. You cannot improve it further by doing additional cycles and iterations in the structure. Yes?
AUDIENCE: So in the top down, you combined the probabilities that you got from the bottom up and then you combined with the top down?
SHIMON ULLMAN: Yes. Roughly.
AUDIENCE: And then how do you set this probability?
SHIMON ULLMAN: You have to learn these probabilities during training. So you have 100 or 200 images. And based on the images--it becomes a bit technical--you have to find features in the images and you have to estimate the probabilities. And that's part of the training. So the training is somewhat lengthy, but you can do it [INAUDIBLE] and find the probabilities. The actual recognition, including the full interpretation, is relatively fast because it's a single cycle. The [INAUDIBLE] paper from 2008 can give you all the technical details. You can look it up.
And it really does a good job, because you can look at images like this. You see here that the [INAUDIBLE] parts are interesting. You look at this, you know where the eyes are or the mouth is. But each one of these parts, when you look at it individually, is sort of meaningless. You have to understand, and find, and detect the parts within the configuration of the larger structure. It's impossible to do it, in many situations, based on the [INAUDIBLE] forward, limited information. The full interpretation does it correctly. So you can find the various parts of the face, even though they are, locally, very difficult and [INAUDIBLE], by the full cycle of going all the way up to the face and then propagating information down.
The bridge of the nose is labelled in each one of these images. This is one of these red rectangles. It's labelled correctly, it's found correctly. There is no way that you can find the bridge of the nose by building a bridge-of-the-nose detector, but with the full bottom-up, top-down cycle, it works.
A final thing that I wanted to say about vision, and then a few words about challenges, open things, and things that we cannot do. What humans can do with images and machines still find difficult is full segmentation. When we see an image, we know that this is a horse and that this is a horse. But it's not just the label. And it's not even just knowing where the head is and where the legs are. We can really delineate--we can draw a boundary around the object. We know which region in the image belongs to the object and which does not. This is called segmentation. It's a very tough problem.
And people treated it primarily by looking at image properties. This is sort of a bottom-up approach in which you look at semi-uniform regions and aggregate pixels that belong to semi-uniform regions into regions. Like in this horse, you will end up having a number of regions. For example, some are the white ones, some are the brown ones, and so on. So there are sophisticated ways of doing bottom-up image segmentation, where you divide the image up into a small number of regions, and then you can try to do the recognition based on these regions, because at least one [INAUDIBLE] combined segmentation and recognition in some way.
And you can see some results here of the best segmentation algorithms. In particular, something based on an elegant mathematical technique called graph cuts, which was developed by [INAUDIBLE] and is used widely now. Even in PowerPoint you can do an image segmentation, and it's basically using a variation of this algorithm.
Now we know that segmentation also depends a lot, very often, on more semantic, top down information. How many of you have not seen this image before? This is a popular image. Oh, quite a few. For those of you who have not seen this image before, how many of you still cannot see what this image is? Please, don't be shy. Raise your hands. Almost all of those who haven't seen it before.
This is an image of a cow's head. And we can see here an image of the cow head. And these are two line drawings produced by people who were asked, just draw what you see. And some people produced something like this and some people produced something like this. Obviously, these are the people who saw the head, these are the people who did not see the head.
Just to help you--here is sort of top-down information in vision--let me start with this point here, the top of the ear. It's here in the image. The ear is like this. Then the snout is going down right here. It's going up here. And here's the top. Just roughly. How many people can now see the cow that did not see it before? This was not a great sketch.
So you can look at it. This was a difficult image, but as you can see, it's sort of a good way of showing the [INAUDIBLE]. You really know that there is no way that you can segment this image correctly into an object and background without knowing about cows. To this extent, this image is particularly difficult, but the fact that very often you have to combine segmentation with object knowledge is true in general.
And here is a very simple way of using top-down, semantic information to complement the bottom-up processes of image segmentation. I said that you can do recognition by having object components, object parts, which you look for when you try to recognize the image. So when you recognize the image, you basically tie the image to these components that you store in your memory and look for when you do object recognition.
So imagine for a minute that for these components in your internal representation, you already know, somehow, the segmentation. You know which pixels belong to the object and its sub-parts, and which pixels belong to the background. If you know this, then once you tie the image to the object components stored in your library of object representations, the union of all the object pixels defines for you what the full object is in the image you're looking at. So you can induce the full top-down segmentation based on the segmentation that you had for the stored image pieces in your memory.
How do you get the segmentation of the images you store in your memory during learning? It turns out that you can build this library based simply on the variability within and outside the object region. If you look, for example, at a piece of a horse-- you have a segment like this, a patch like this-- and you look for it in many, many horse images, the actual horse part will be more repeatable, while the background will be very variable from image to image.
So you can identify the more repeating part, which is typically the object, and the more variable part, which is typically the background. In this way, you can do learning for the things you see [INAUDIBLE]. You train with them and you can produce a reliable segmentation of these components. And using these segmented components, you can now approach a new image and do the recognition, and together, recognition and segmentation become a mixed process. You identify local features that give you the correct classification, you find them in an input image, you do the classification based on these features, and the same features are used for segmentation, so you get a segmentation.
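Here is a minimal Python sketch of that scheme. It is only an illustration of the idea, not the actual system described in the talk; the variance threshold and the shape of the inputs are assumptions made up for the example.

    import numpy as np

    def learn_fragment_mask(aligned_patches, var_threshold=0.1):
        """Estimate which pixels of a stored fragment belong to the object.

        aligned_patches is a list of equally sized patches cut from many class
        images at the locations where this fragment was detected.  Object
        pixels tend to repeat (low variance across examples); background
        pixels vary a lot.  The threshold here is an arbitrary illustrative value.
        """
        variance = np.var(np.stack(aligned_patches), axis=0)
        return variance < var_threshold                # True = likely object pixel

    def top_down_segmentation(image_shape, detections):
        """Union of the figure masks of all fragments detected in a new image.

        detections is a list of (row, col, mask) tuples: where a stored
        fragment was matched and its learned figure-ground mask.  Detections
        are assumed to lie fully inside the image.
        """
        segmentation = np.zeros(image_shape, dtype=bool)
        for row, col, mask in detections:
            h, w = mask.shape
            segmentation[row:row + h, col:col + w] |= mask
        return segmentation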
So let me not show this slide, and I'll go to the last part before the break. Let's take a peek at some future challenges-- what we cannot do, what I think is cognitively very important, and what are really good open questions for the near future. I said that the focus so far was to take images and produce class names-- what are the objects in the image. But clearly, in vision, we do much, much more than that. We really get very rich, conceptual information. And let me show you some examples. This is one.
What is common to all of these images? What would you say is [INAUDIBLE]?
AUDIENCE: Drinking.
SHIMON ULLMAN: Yes. That's good. Say that slightly more loudly. What was the answer?
AUDIENCE: Drinking.
SHIMON ULLMAN: Drinking. Right. All of these are drinking people, but you can see that the variability is so large that it goes beyond what happens in object recognition. In object recognition, as I said at the beginning, you can expect repeating configurations of repeating elements. So in a car, you have the two wheels, and you have the roof, and so on. And these kinds of features repeat themselves in roughly the same places, again and again.
When you think about drinking, for example, it's no longer like this. It's really much more variable. You can spend a little time and try to train an image classifier with drinking people versus non-drinking people, and you don't get anything reasonable out of it. You can have men, and women, and children drinking with the right hand, with the left hand, with both hands, with a cup, with a bottle. It's just endless. You really have to understand that drinking has to do with something more conceptual. You can see that in non-drinking, the rough configuration is very similar to what happens in drinking.
You can see here some non-standard drinking. And we know that all of these are drinking, or at least [INAUDIBLE]. So we know that all of these are drinking as well. We know that drinking has to do with something like bringing liquid to the mouth, and you have to be able to analyze this. And it's not going to be just by seeing parts of some pictures in the right location. So understanding the activity of people in images is not just more difficult and more variable-- my own feeling is that it's a more complicated problem in a deep conceptual way, and it cannot be solved by just tweaking classifiers.
Another important example, which I think has started to receive more and more attention and deserves even more, is understanding how we perceive interactions between people in the world. It's related to actions-- you can say that arguing or hugging is an action, but it's a special kind of action. It's not a relation between an agent and an inanimate object, it's a relationship between agents. And people have done some psychophysics on things like this. It turns out that in roughly the same time that people can say whether they're looking at an airplane or a car, they can say whether people are hugging or disagreeing, and so on.
And in this case, again, it will be very difficult to come up with a standard configuration of what you expect-- which visual feature you expect where-- when you have people disagreeing or quarrelling. So we don't know what the visual features are. In this center, which is sponsoring this meeting and the school here, we are very interested in the computational aspects, and there is also a big effort doing brain studies and fMRI studies on agent interactions. What happens in the brain when we see people interacting, and different kinds of interaction? Where is it in the brain?
We can go even farther when we look at agent interactions. These on the left, we can say, are hugging, and these people are quarreling. But even if we look just at hugging, what would you say is the difference between the top row and the bottom? Any description of what's the difference between these two things? They're all hugging. They're all instances of hugging. You are laughing-- what? What's the difference?
AUDIENCE: The top row is more of a formal--
SHIMON ULLMAN: Sure. It's easy to see. And somebody told me it's a politician's hug. But let's call it here not really intimate. So we can say things not only about the type of interaction-- hugging as opposed to quarreling-- but about the tone. Is it warm? Is it close? Is it just cold and formal? And the interesting thing is that the tone often comes to you, as a person, as fast as, and sometimes even faster than, the exact type of interaction. So we know whether two people are in a close, positive, intimate interaction or not. People do it in a flash, under very limited [INAUDIBLE] conditions.
So thinking about the visual features for a positive interaction-- it becomes even more difficult to conceive what they might be. So I think that developing algorithms and computational schemes that will be able to deal with agent interactions, including both the type and the tone of the interaction, would be very interesting. In a study that we started, we looked at a number of these interactions. In addition to taking images from the web, we also worked with professional actors and asked them to play a certain episode-- for example, one person interviewing another person for a job, or something like this. So one of them was assigned to be the more dominant figure, and the other, just by the nature of the task that we assigned to them, was the more subordinate one.
It's pretty easy to get this in short videos, but you can also get much of it in single static images, I think. If you had to say, in this image on the left, who is the more dominant person and who is the less dominant one, which one would you say is more dominant? The left or the right?
AUDIENCE: Left.
SHIMON ULLMAN: How many think left? Anybody think right? No. And it's true-- I mean, true in the sense that we have the ground truth, because we told the people, you are the interviewer. In some of them it's more difficult, and from here it may be more difficult. In the interest of time, I will not show you the video. It becomes more apparent in the video-- even in a short video, it's clear from the motion which one is the more dominant person. It would be interesting to know-- we know that standard classifiers do not work. If you give one images, and you tell it, here is the dominant person, label it, and then try new images, it doesn't work well. It's not clear what exactly the features are.
And it's known that this comes very easily to people. It also comes out very early. You would perhaps be surprised to know that infants can give you an estimation of who is the more important person, who is the more dominant person, at the age of nine months or so. They already make some interesting, reliable decisions about who is the dominant person.
AUDIENCE: You're driving us crazy. We want to see the video.
SHIMON ULLMAN: For some reason, it's difficult to-- let's see. What will you say now?
[LAUGHTER]
Any guess? This time, it [INAUDIBLE]. The person on the left, who looks very different from before, is now more crouched and submissive.
I'll skip a little bit. The next thing I wanted to mention is that I think that if we want to understand human vision, we have to work on task-guided and task-dependent analysis of images. Because the schemes we have now take an image, start to work bottom-up on the pixels, and basically produce a labelling of the objects in the image. But we can look at an image as a question-answering task-- sort of a Turing test. You can answer arbitrary questions posed about this image.
For example, these are people looking. This is a person lecturing or giving a talk to a group. And I can ask you, is everybody looking at the speaker? What would you say? Is everybody looking at the speaker? This is a view from the speaker's podium, looking at the audience in front of them. Is everybody looking at the speaker?
AUDIENCE: No.
SHIMON ULLMAN: No. For example, this person is not. And I could ask you whether anyone has eyeglasses, or how many males are in the image. So given a single image, I can pose an unbounded set of different questions that could be of interest in any particular situation. Once the question is posed, you can answer it by looking at the image. And this is very different from what we can do today. And the question of how you can guide the entire process of getting information out of the image, so that it will get the right information and not all of the other things-- I think it's something very deep, in the same sense that we look at the Turing test as a good test for intelligence, for what happens in humans' brains when they have to understand things.
I think that if we understand how we can shape and guide the extraction of visual information based on the task at hand, this will be very important. And clearly, you immediately start to devise an algorithm. If you want to know whether everybody is looking at the speaker, you have to do something like: find a face, and test whether that face is looking at the speaker. If the answer is no, you can stop-- not everybody is looking at the speaker. If the answer is yes, then you have to find the next face, and you have to make sure that it's not a face you have already seen. And if all the faces are exhausted, you say yes.
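That sequential routine can be written down almost directly. Here is a minimal Python sketch; the helpers find_next_face and looks_at_speaker are hypothetical placeholders for real visual operations, not functions from any existing system.

    def everybody_looking_at_speaker(image, find_next_face, looks_at_speaker):
        """Sequential visual routine for 'is everybody looking at the speaker?'.

        find_next_face(image, seen) returns a face not yet in seen, or None
        when all faces are exhausted; looks_at_speaker(face) tests gaze
        direction.  Both are placeholders standing in for visual operations.
        """
        seen = []
        while True:
            face = find_next_face(image, seen)
            if face is None:
                return True                # all faces exhausted: the answer is yes
            if not looks_at_speaker(face):
                return False               # one counter-example is enough to stop
            seen.append(face)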
Once I give you the task, you go in your mind through something like this. You need to compile a certain sequential routine of what is necessary to achieve the goal of this particular task. Then you need to be able to map it to the visual system-- each one of these steps is now written as a command, but you have to do some visual processing in order to execute it. So you have a query, and then you have to apply some visual routine, a visual sequence of operations, that will produce the answer. This is one of the things that a group of people here, part of the center, is very interested in: the connection between language and vision.
So a query will be posed in English, in natural language, it will be understood, and it will be mapped into a set of visual operations that together can give you the required information out of the image. So this is language. And then there is some kind of cognition that maps it into the right kind of process that will give you the information. And then you apply real vision to it, look for the right features and so on, and then say yes or no, who's looking at [INAUDIBLE].
And I wanted to mention, briefly, that although this is not how many of the simple object recognition tasks that we know are usually treated, even in simple tasks, very often, our extraction of information from the image is shaped, in an unconscious way, by some kind of a visual routine that applies a sequence of operations and gets an answer. For example, this is from psychophysics we did many years ago, in which people look at a display and the task is not object recognition-- there are always some lines and some dots, and you have to say whether the two red dots are on the same curve or not. So here the answer is yes and here the answer is no.
And when you ask people, how did you do it? They say, well, I looked at it and it's obvious. Here it's yes, here it's no. But if you measure it, it turns out that the response time increases linearly-- let's take just the yes answers-- with the length of the curve between the two red dots. And this was later on tested physiologically as well. This is some physiological work, which I will not go into, showing that receptive fields along the trajectory, along the curve, are activated sequentially. So there's good evidence, both psychophysical and physiological, that the way we answer this question is that we apply a routine: find the first red dot, maybe it's this one or this one, then trace along the curve. And if it's a long curve, it will take longer. And if you hit the second dot, fine, they're on the same curve. Otherwise, they are not.
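The tracing routine itself is easy to sketch. The following Python fragment is only an illustration of the idea; next_point_on_curve is a hypothetical local operation, and a complete routine would also trace in both directions from the starting dot.

    def same_curve(start_dot, other_dot, next_point_on_curve):
        """Curve-tracing routine: march along the curve from one red dot and
        report whether the other red dot is reached.

        next_point_on_curve(point, previous) stands in for a local visual
        operation that returns the adjacent curve point beyond `point`, or
        None at the end of the curve.  Tracing time grows with curve length,
        which matches the linear response times observed psychophysically.
        """
        previous, point = None, start_dot
        while point is not None:
            if point == other_dot:
                return True                # reached the second dot: same curve
            previous, point = point, next_point_on_curve(point, previous)
        return False                       # curve ended without reaching the dot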
And you would be hard pressed to find a set of visual features that just tells you that. As I mentioned, it would be interesting if people would try-- I think this is something that would really be worth trying-- to train a deep network to distinguish between things like that. This is just one example, but there are things like this that people can do very quickly, and it will be important for extracting different types of information from the image.
The last thing that I will just say a few words about-- and this is also by way of introduction to the talk by Dan [INAUDIBLE] later on-- is that a very interesting question for human cognition and intelligence is not just how to produce a system that can do a particular task, but how does it all start? How does it all start in humans as babies? You have unsupervised learning from scratch. You are a baby-- you've got a baby system, a baby intelligent computer that you try to build. You are looking at the world, and you see pixels, you see images, and the images change over time.
And from the streaming pixels around you, you want to build a complete understanding of the world. That's what we'd like to do computationally: to build a system in which we build in some capabilities, we show it months of video, and after these months the system has representations of the world around it-- different objects, and agents, and goals, and so on-- in the same way that people do it. That's, of course, a long-term goal. And when I say understand, it's not just labelling images. For example, here is a difficult action: the action of pouring liquids. You take something and pour the liquid out of it. If you think of standard computer vision, then it's a problem of classification. You see examples of pouring, and you have a new image, and you have to say whether there is somebody pouring liquid or not.
But if you think about the developmental trajectory and how we get to learn about the world-- when we see people pouring liquid, we don't just label new instances of pouring, but we learn about containment, and about liquid, and about gravity, and about flowing. All of these concepts become clearer and clearer to us. And we'd like to be able to learn them from vision.
Of course, babies learn not only from vision. They learn from other senses and from manipulation, but clearly vision plays an important part of it. Those of us who are interested in the visual part would like to know how vision can generate this kind of understanding by watching, and then produce, as I said, not only labels, but a complete understanding of physical entities like this, their properties and interactions.
So let me skip this one. I think that object recognition is really just a small part of it. We want to deal with rich, conceptual domains-- with actions, with goals, with social interactions. We want to be able not just to always process an image in the same, fixed way, but to be able to have tasks, and then, whenever a task is posed, to be able to fulfill this task, or to answer a query about the image, and shape the visual process in the way that will produce the right answer. And we want to use vision in order to build conceptual knowledge in new domains, like the domain of liquids and so on.
So we imagine a system which starts the way babies do-- and this applies equally to cognition and to intelligent systems. We want the system to have some innate capacities that make it all possible, and then we want it to watch a lot of sensory information, and then, as a result, produce representations and understanding of, as I said, the goals, and tools, and social interactions that can be derived from the images. Let me just say this: this is, of course, a central issue in cognition, and I think it's a very interesting direction for building intelligent systems.
[INAUDIBLE] proposed this already-- in his paper, it turns out that he actually suggests that maybe the way to build an intelligent system is to build a baby system and to understand learning, and then combine the two, rather than directly produce the mature intelligent system-- to follow human cognitive development and produce an intelligent system this way. This is a good point to stop. So thank you. And you had a question?
AUDIENCE: Yes. I was just wondering if you have a particular proposal for compiling the routines for answering particular queries?
SHIMON ULLMAN: It's a big question, so I cannot give you-- I'll say two sentences and we can take it offline if you are interested in more. I think there are repeating motifs and repeating patterns, if you look at them in the right way. For example, is everybody looking at the speaker? If you try to write a logical formula for this-- you have a relation, everybody is looking at this one. And you ask if there is a unique object A, such that all the other ones, say X1 to Xn, have a particular relation to it.
So the formula will be: there exists an A, which is this one, such that for all X, where the X are faces, the relation 'looks at' holds between X and A [INAUDIBLE]. Now if you ask a completely different question-- which one is the tallest marker on this table? I can see that it's this one here-- it has the same structure: there exists an A, which is this marker, such that all of the other ones on the table fulfill the relation 'taller than' with it. So I think there is a library of patterns, which are more abstract than 'looking at' and so on, which invoke a very similar kind of routine-- in this case, go one by one, check for the relation, and you can stop if there is a violation. We are thinking in terms of having a library of these relatively abstract types of tasks, and ways of mapping particular queries and saying, ah, this has a natural relation to one of the pre-compiled, abstract routines.
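To illustrate the shared abstract pattern, here is a minimal Python sketch; the function name, the relation arguments, and the toy height example are made up for illustration and are not part of any actual compiled-routine library.

    def exists_distinguished_element(items, relation):
        """Abstract pattern behind both queries: is there an A such that every
        other item X satisfies relation(X, A)?  The same routine serves both
        'everyone is looking at the speaker' and 'the tallest marker on the
        table' by swapping in a different relation."""
        for i, a in enumerate(items):
            if all(relation(x, a) for j, x in enumerate(items) if j != i):
                return a                   # found the distinguished element
        return None                        # no such element: the answer is no

    # Hypothetical usage: 'tallest', where relation(x, a) means 'a is taller than x'.
    heights = [1.2, 0.8, 1.9, 1.1]
    tallest = exists_distinguished_element(heights, lambda x, a: a > x)
    print(tallest)                         # 1.9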
We think of this as the beginning, and then maybe, later on, you can put together two of those to create something new. But at the moment, we think about a library of abstract ones, ways of mapping concrete queries to abstract ones, and then invoking and mapping the more abstract one onto the one that you have to [INAUDIBLE]. But does this make sense?
AUDIENCE: Yes.
AUDIENCE: Some of your examples didn't seem like something a dog could do. The visual routines-- do you think the visual routines are uniquely human, or are there other kinds of routines that--
SHIMON ULLMAN: It's a good question, and I don't know about the whole tree of species and so on, but here is an interesting anecdote. I was talking to Herrnstein and his group [INAUDIBLE]. Herrnstein was a very famous pigeon person, and he worked a lot on the cognition of pigeons. Pigeons can recognize faces, and they can recognize different types of fish in images, and can do-- Herrnstein's motto was that pigeons are 95% human. He said, whatever humans can do, a pigeon will not do as well, but if you train them with food and so on, in six months, they will do it. Eventually.
So I asked him about inside-outside: a dot is either inside a closed curve or outside, which I thought would require [INAUDIBLE]. And he gave up. He trained a pigeon for four months, and reported it as a task that pigeons cannot do.
So I think pigeons can do fragment-based recognition, but they cannot do routines. I'm sure that people have shown it in monkeys-- even physiologically, the [INAUDIBLE] primates, of course, can do routines, fairly complex ones. As for what animals can do-- I think we do have a distinction. Certain animals can do it very little or not at all, although they have fairly impressive visual cognition in some sense. So I think it's a distinguishing property between some species-- it's more advanced [INAUDIBLE] an animal than these impressive visual capabilities.