Successes and challenges in modern artificial intelligence
Date Posted:
March 22, 2019
Date Recorded:
March 19, 2019
CBMM Speaker(s):
Amnon Shashua All Captioned Videos CBMM Special Seminars
Description:
Dr. Amnon Shashua, President and CEO Mobileye, an Intel company; Senior Vice President, Intel Corporation; Sachs Professor of Computer Science | Hebrew University
This talk is co-hosted by the Center for Brains, Minds, and Machines (CBMM) and MIT Quest for Intelligence.
Speaker Biography: Professor Amnon Shashua is senior vice president at Intel Corporation and president and chief executive officer of Mobileye, an Intel company. He leads Intel’s global organization focused on advanced driving assist systems (ADAS), highly autonomous and fully autonomous driving solutions and programs.
Professor Shashua holds the Sachs Chair in computer science at the Hebrew University of Jerusalem. Prof. Shashua’s field of expertise is computer vision and machine learning. Prof. Shashua received the MARR Prize Honorable Mention in 2001, the Kaye Innovation Award in 2004, and the Landau Award in Exact Sciences in 2005. Since 2010, Prof. Shashua has been the co-founder, Chief Technology Officer, and Chairman of OrCam, an Israeli company that recently launched an assistive product for the visually impaired based on advanced computerized visual interpretation capabilities.
ANTONIO TORRALBA: Well hello, everyone. Welcome to this event. This event is hosted by the Center for Brains, Minds, and Machines and the Quest for Intelligence. I'm Antonio Torralba, the director of the Quest for Intelligence, and it's my pleasure to introduce you to the speaker today, Amnon Shashua.
While most of us are struggling to have a career, he managed to have two amazing careers, both in academia and industry. So in academia, he was actually a student here at MIT. He did his PhD with [INAUDIBLE] but it was mostly [INAUDIBLE] by [INAUDIBLE]. And then he stayed for a postdoc with [INAUDIBLE]. And then he became a professor at the Hebrew University. He has made many contributions in computer vision and he has a ton of awards; I'm not going to list any of them because the list is just way too long.
In industry, he also has an amazing career. He founded several companies, one of them, Mobileye, which is a great company that implemented computer vision techniques in order to solve the task of detecting pedestrians and vehicles in applications for autonomous driving. He is also a senior vice president of Intel, so as you can see, you know, very successful careers in both domains, industry and academia. So without saying more, because I think we really want to hear him and not myself, let's welcome Amnon Shashua to the podium. Thank you.
[APPLAUSE]
AMNON SHASHUA: OK, hello, everyone. It says, [INAUDIBLE] driverless cars good or bad? I'm afraid most of the answers to the questions here are yes, not no. But you know, it is what it is. But I'm sure it's less sinister than this thing that we have here.
So I thought to build this presentation-- the bulk of the presentation is going to be on autonomous driving. But first, start with a bit of science, 10 minutes. Then we'll move to autonomous driving, the bulk of the presentation, and then I'll go to another area of AI called AI as a Companion: if AI were on us, what could we do with it. And if we have time remaining, I'll talk a bit about NLU, Natural Language Understanding, which I believe is the next frontier to kind of capture broader intelligence.
So in terms of the first part, the scientific part, I have a very neat way to kind of show in a pictorial manner what the mysteries of deep learning are. You know, deep learning is very, very successful in industry, and engineering and technology are really leading the way in applications, but science has a lot to say as well, and will have a lot to say in the future. But even in the present, there are some mysteries of deep learning that are still mystifying everyone, and I'll show it in kind of a pictorial manner. The first slide has equations, but then in the rest of the presentation, there are no equations.
And then I'll go into a few slides about how deep learning is impacting other areas of science, specifically condensed matter physics, quantum mechanics. It's really a very exciting and surprising connection. Then autonomous driving, wearable computing. So let's start with the mysteries of deep learning from a scientific point of view.
So I'll first set up-- the problem of machine learning can be-- the setup is not more than one slide. We have an input space. The input space could be images. So let's say a 100-by-100-pixel image. So it'll be a 10,000-dimensional input space. Then y would be the output space, say labels. Labels could be whether the image contains a cat or doesn't contain a cat. So plus or minus 1. Or labels could be, you know, 1,000 different labels. What kind of category the image contains, whether it's a whale or a cat or dog, car, pedestrian, and so forth.
Then there is a distribution from which the examples are coming-- the input-output examples are coming. Normally, this distribution is unknown. And then we have a loss function. So we have the true label and we have the estimated label and there is a kind of a penalty if the true label deviates from the estimated label. And what I have here is the simple squared l2 norm: take the difference of the true and the estimated label and square it.
So what is the task? We are given a training sample of pairs of input-output, say m pairs of input-output, and they are drawn from this unknown distribution. And the learner is supposed to produce a hypothesis. It's supposed to produce a function that maps input to output. So given an image, it would say whether there is a cat in the image, yes or no.
And we want to minimize the penalty. So how would we minimize the penalty? We'll take the expectation of the penalty over all the examples that we could see, drawn from this unknown distribution. So this is the task. The problem is that this distribution is unknown.
So the closest thing that we can do, we can minimize the penalty over the training set. So we take the expectation or the sum over all the penalties and we divide it by the number of examples. And this is what we want to minimize and this is called a training loss. All right? So this is the setup of machine learning. It's quite a simple setup.
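For reference, the setup just described can be written in standard notation (a reconstruction; the exact symbols on the slide may differ): the expected loss over the unknown distribution D, which is what we would like to minimize, and the training loss over the m sampled pairs, which is what we can actually compute:

\[
L_{\mathcal{D}}(h) \;=\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\!\left[(h(x)-y)^{2}\right],
\qquad
L_{S}(h) \;=\; \frac{1}{m}\sum_{i=1}^{m}\left(h(x_i)-y_i\right)^{2}.
\]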
So we have here four different entities. First entity is what is the ground truth. So this is the space of all possible functions, input to output. What we normally do in machine learning, we tie our hands and look for a subspace of all possible solutions. We call this the hypothesis space. It is this one here. And then there are reasons for doing that. That's for generalization.
So this is the space from which we want to find a function that will optimize the penalty. So the optimal solution in this space is this point here. The difference between these is called the approximation error. That means, the more I tie my hands, the bigger this approximation error is going to be. But since we are optimizing only on the training examples, not on the true unknown distribution, I may end up here. The difference between these two is called the estimation error. When the estimation error is big, it means that I over-fit. It means that I cannot generalize from the data.
Now if the optimization function is non-convex, I may not even get here. I'll get to some place here, some local minimum, and the difference between these two is the training error. So these are the three errors that we have in machine learning. And then you can see something quite intuitive here. If I increase my hypothesis space, I tie my hands less and less, the approximation error would reduce-- in this pictorial setting here-- but the estimation error would grow. That means I'll be able to generalize less. This is why we are tying our hands. If we tie our hands stronger, then we'll generalize better but we'll have a higher approximation error. So these are the three entities.
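Written out in the same hedged notation, with h^* the best possible function, h^*_H the best function inside the hypothesis class H, \hat h the minimizer of the training loss within H, and \tilde h the solution the optimizer actually returns, the three errors in the picture add up as:

\[
L_{\mathcal{D}}(\tilde h) - L_{\mathcal{D}}(h^{*})
=
\underbrace{L_{\mathcal{D}}(h^{*}_{\mathcal{H}}) - L_{\mathcal{D}}(h^{*})}_{\text{approximation error}}
+
\underbrace{L_{\mathcal{D}}(\hat h) - L_{\mathcal{D}}(h^{*}_{\mathcal{H}})}_{\text{estimation error}}
+
\underbrace{L_{\mathcal{D}}(\tilde h) - L_{\mathcal{D}}(\hat h)}_{\text{training (optimization) error}}.
\]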
Now the classical machine learning. What we know about classical machine learning: with the support vector machines of the '90s, the hypothesis space was the space of all linear functions. And there was a trick called kernel machines, or kernel functions, that allowed us to get nonlinear functions but still in this linear setting. All right?
Now, in classical machine learning, what we had was the following. First, the function that we want to optimize was convex. Therefore, there was no training error. We would reach the global minimum of the function that we want to optimize. Then, two things happened which are very, very intuitive. If we expand the hypothesis space, we reduce the approximation error. This gets smaller. And we increase the estimation error. We generalize less. And the opposite: if we shrink the hypothesis space, we generalize more because the estimation error reduces, and the approximation error gets bigger because we tied our hands stronger. So this was the setting. This is the classical machine learning. Everything is well-understood.
Now comes deep learning. With deep learning, you have these networks-- fully-connected networks, recurrent networks, convolutional networks, and since 2015 or so, there is a zoo of different topologies, inception networks, ResNets, and so forth. And they're all very, very successful.
What happens with these networks? What happens with these networks is the following mystery: we use stochastic gradient descent, a noisy gradient descent, and even though the function is non-convex, we reach good local minima. Always, we reach good local minima. Second, what happens is that when we expand the hypothesis space-- the hypothesis space is the weights of the network. We build a topology of the network and the parameters of this network are the possible solutions. Setting the parameters of the network, the weights of the edges of the network, is our hypothesis space. So if we build bigger and bigger networks, we are extending the hypothesis space.
Now in classical machine learning, what would happen is, you make the hypothesis space bigger and bigger and bigger, you over-fit. You generalize less and less. So what happens-- and this is still a mystery-- is that when we extend the hypothesis space, we reduce the approximation error, as you can see here pictorially, but we also reduce the estimation error. We don't over-fit. And you know, by and large, this is still a mystery, because even though we have this tool that works very, very well, there are fundamental mysteries that remain even today-- I saw a poster in [INAUDIBLE] lab that is dealing exactly with this problem.
So the next thing is how deep learning is impacting other branches of science. Three more slides and we'll get to autonomous driving. So this is work that we have done in finding a connection between deep networks and quantum physics. In machine learning, with pictures, you have pixels that can be correlated. So two neighboring pixels are usually correlated. In physics, when you have particles, particles are correlated. It means that they're entangled. So physics calls correlation among particles entanglement.
And there is a connection between them, which I'll say in a moment. What physicists are-- they want to simulate systems with many, many particles, and normally, the systems of interest-- the natural phenomena that are interesting-- they have a high entanglement. Means the particles behave together. And it's not a cascade. It really behaves together. In phase transitions, they behave together. It's called high entanglement. Many, many particles behave together.
And physicists want to simulate systems with many, many particles. Why is that? Because if you take a small system, say 10-by-10 particles, the edge effects-- the boundary conditions-- dominate all the observations. So you need to build large systems. And for these systems with many, many particles, the dimension is exponential and simulating it is tricky.
So comes the following. In machine learning, we have these deep networks and we have the RNNs. In physics, people use what they call tensor networks. It's a completely different structure. They represent this high-dimensional space of interacting particles in something which is polynomially-sized, which approximates the true system through a graph, where nodes in the graph represent tensors of lower dimension.
Well what we show here is that we have a one-to-one correspondence between these two structures, and the one-to-one correspondence shows that we can use tools from machine learning, from deep networks, and create systems in quantum physics that are polynomially more expressive than anything else people have been using. And it was recently published in Physical Review Letters.
And I am saying that this is interesting because it shows this interdisciplinary effect that is happening in modern times. You know, I don't think that 10 years ago, computer scientists would publish a paper in one of the premier journals of physics. It's not because of us, it's because of the field. The field is becoming very, very interdisciplinary, and anyone who is studying natural sciences should study more than just one topic because they all influence each other. That was the purpose of all of this. OK.
So let's go to autonomous driving. And the point I want to make on autonomous driving is that this is really one of the best sandboxes for AI. It really encompasses-- it's a microcosmos of all the interesting problems you have in AI, problems that people deal with in other domains, and problems that people decide not to deal with at all, like ethics of AI. You know, people talk about the ethics of AI, but nobody really wants to deal with it. It's a hot topic but nobody wants to deal with it. But in autonomous driving, you must deal with it, and I'll show you how and why and how we deal with it.
So why is it a microcosmos of artificial intelligence? First, you have perception. Perception is one of the major pillars of AI. You know, we have our eyes and our brain. A significant chunk of our cortex is engaged with visual processing. With understanding of our visual world.
And in an autonomous car, you have cameras. You have other sensors, like radars, like laser scanners that are called LiDARs. All of this raw data goes into high-performance computing with the purpose of understanding our environment. Understanding all the road agents, like cars and pedestrians and cyclists and motorcyclists and trucks. Static data, like drivable paths, traffic signs, traffic lights, where the free space is. So really understanding the surroundings in great detail, and it has to be very, very accurate. And the reason for accuracy, I'll mention in a moment.
The next thing, which is also a pillar of AI, is decision making. Once the car understood its environment, it needs to make decisions. It needs to decide whether to change lane, which vehicle to give way to, how to plan a trajectory that will be safe and comfortable, and how to negotiate with other road users. With other drivers.
And you Bostonians, you know, here, you really negotiate, right? When you drive in congested traffic, it's a struggle. It's not something that you breeze through. And you need a machine to do it as well and this is part of AI. It's reinforcement learning. The machine needs to learn how to do these things.
And this is where ethics will come in because we need to guarantee something, right? We can't put a machine that can kill people without guaranteeing that it will not kill people, right? So there is ethics involved. How do you make these guarantees in a way that makes sense?
So let's look at the challenges. When you talk about perception, we need to reach, and very likely surpass, human-level accuracy. I'm talking about not matching human-level perception-- that's too difficult-- but matching the end result, which is the probability of making a mistake. So if there is a perceptual mistake that leads to a fatality, that leads to an accident, we will need, as technology suppliers, to prove that that probability is below the probability of humans making mistakes that lead to a fatality.
And why is that? Because the burden of proof must be higher than for a human driver. Why is that? Because society has no choice. It has to allow humans to drive because we need to allow people to move from place to place and this is how the economy flourishes. Otherwise, we wouldn't have the standard of living that we have today. So human driving is a given. This is why we tolerate mediocre drivers that gradually improve over time. You know, young kids, 16, 17, are mediocre drivers and they will improve over time. We tolerate that.
But society does not need to tolerate machines that will drive. As a society, we can come and say, we don't want machines to drive autonomously. So the burden of proof must be higher. So we can't have machines that will make perceptual mistakes that will lead to an accident at a rate higher than humans. Maybe as technologists, we'll say it's fine, but society is unlikely to accept it.
The next thing is the driving policy. The decision making, right? We can have a lapse of judgment making a decision that will lead to an accident. This is something that simply cannot happen. So this should never happen, period. This has to come with a guarantee that this machine will never make a decision that will cause an accident. And this is very, very tricky, as I will mention in a few more slides. How to frame it in a formal sense. OK? So these are two major challenges in making a car drive autonomously.
So just to remind everyone how driving looks, I'm going to show you a view from a drone above my way to work. And it's very, very similar to Boston, so you will relate to it quite easily. One moment, it's not working. Yeah. So you'll see this car squeezing its way, right?
The next example, this car would not succeed in changing lane. We are going to fast-forward the clip and focus on this car, and the poor guy will not be able to-- nobody lets him. So it's tough, right?
Now the next one, it's a long truck. It will take minutes until the guy changes lane to the left. And see we're fast-forwarding and it's negotiating its way and negotiating. This is in Jerusalem, but it's similar to Boston. You know, Israeli driving is similar to France, Italy, Boston. It's not for the weak-hearted. OK.
So let's start with a driving policy. Let's assume that our perception is perfect. We see our environment, we understand the environment, we detect everything and all the details. So perception is perfect and let's focus on the driving policy challenge. What about decision making?
So here are two questions. Do we allow an accident due to a lapse of judgment? You know, with humans, there is lapse of judgment, but as I said, with humans, we don't have a choice. With machines, we do have a choice. So I believe, no, we cannot allow lapse of judgment.
Can we frame it in a statistical sense? So say there is a lapse of judgment but it happens with infinitesimally small probability. Well if you do this back-of-the-envelope statistical analysis, you would see that the amount of hours-- the size of the validation set that you would need-- would be unreasonable. I'll do this back-of-the-envelope calculation when I come to perception and then we will see how it reflects on this. So the answer is, no, it doesn't make sense to look at it statistically.
So this is where ethics comes in. So we said that the safety of this technology must be handled outside of machine learning. So it's like the three rules of robotics of Asimov. Most of the people here are young so I'm not sure that you know who Asimov is, but there was a movie, I think, 10 years ago. So the idea is to set rules. So you can say, OK, well what's the big issue? So set rules.
It's a bit tricky. Why is it a bit tricky? Because when you look at the rules of driving, everything is clear until you reach the point of you need to be careful. Now what does it mean to be careful? We're talking about machines. You have to define everything mathematically, right? What does it mean to be careful?
So it's based on societal norms. Careful in Boston is not careful in California. It's societal norms. It's also based on legal precedents. That means there was an accident, someone was blamed. That person that was blamed was not careful. So now there's a precedent of what careful means. So now what are we going to do? How do we go and define careful? So this is really the crux of the issue. How do we define what it means to be careful?
So now we can come up with all-- so we can put lawyers together and philosophers, say, and come up with definitions of careful. Why do you need mathematicians here? You need mathematicians because you really need to optimize three axes. The first axis, let's call it soundness. Whatever definition of carefulness you make should comply with the human judgment of what careful is. If the model says, this is a non-dangerous situation, human judgment should comply with it and say this is a non-dangerous situation, and vice versa. If the model says, this is a dangerous situation, that means if you are in this situation, you are not careful, then human judgment should comply. So we need to somehow capture the common sense of human judgment of driving when we define what careful is. But still, lawyers and philosophers can do that.
Second is we need to be useful, right? Because we can be very, very careful and be useless. I'll give you an example of a useless definition of carefulness. Let's assume that I define carefulness by when I change lane, I should not interfere with the motion of any other road user. So they should continue their motion. That sounds like a perfect safety definition. I change lane, nobody is affected because they can continue their own motion.
Now obviously this is useless, right? Because in congested traffic, when I change lane, somebody needs to slow down. So I am affecting the motion of others. So it's not enough just to make a definition of carefulness. It has to be useful. It has to allow for agility, just like the clip that I've shown before. We need to negotiate and this negotiation should be assertive and agile. And this conflicts with the notion of safety. So this is a conundrum. So we need to optimize usefulness.
The third one, and this is where mathematics comes in: when the machine decides that an action is safe, to prove that the action is safe, if you need to roll out the entire future to see the chain of action, reaction, action, reaction that will or will not lead to an accident-- if you need to roll out the future, this is not reasonable. It's infeasible. So basically, you need to build a model that has the inductive principle embedded into it.
That means, if this step is safe and I continue making the same type of decisions, then I'm guaranteed that I'll never make an accident. And not every model complies with the inductive principle. I'll give you an example. Let's assume that I define carefulness as follows. If I'm tailgated, I need to be careful, so the definition of careful would be to accelerate a bit. You know, increase the gap.
Actually, we humans do it. If somebody tailgates me for a while, I accelerate [INAUDIBLE] So one can prove that if this is part of your definition of carefulness, you are going to violate the induction principle. You'll need to roll out the future for any decision you want to make. So you need to build a model that optimizes these three axes. It has to be sound, it has to be useful, and you have to have the inductive principle. OK?
So this is what we set out to do. So a year and a half ago, we published this, and the reason that we published it is that we told ourselves that if we keep this closed, kind of as IP or a competitive advantage, it will not help us, because we need to engage with regulatory bodies. Because the way this model works, it must have parameters-- and I'll mention in a moment what I mean by parameters-- and these parameters should be set by regulatory bodies.
So the way this model works is that there is one thing that we don't want to do. We don't want to predict what humans are going to do. Because you can think of, let's build a machine learning engine that will make predictions. If this guy is doing whatever it's doing now, then in the future, he will do this or that. It seems plausible from a machine learning perspective, but we can never really predict what humans are going to do.
A better way is to build a system which is based on a worst-case analysis but under reasonable conditions. So let's give an example. Let's assume a car is in front of me. We keep a safe distance to the car in front. The reason we keep a safe distance is that the car in front can unexpectedly apply emergency braking. And we don't need to predict that the car would apply emergency braking. It will happen unexpectedly. So we keep a safe distance so that in case the car applies emergency braking, we will be able to apply our brakes after a response time-- a certain response time-- and still not hit that car.
Now what is the reasonable assumption here? Well what is going to be the braking force of the car in front of me? Is it going to be infinite deceleration? 100g deceleration? Is it a formula-one car that decelerates at 5g? No, we can make a reasonable assumption that it's 1g. 1g braking is really, really tough, OK?
So this is the assumption, and then on that assumption, we take the worst case. The worst case is that this car would brake unexpectedly. So why is a regulatory agency needed here? To agree that it's going to be 1g. That's regulatory. The regulator can say, no, it's 1 and 1/2 g, and it means that I'll need to keep a much larger safe distance.
So the idea is to build a model which takes the worst case under reasonable assumptions, where those reasonable assumptions are parameterized, and you deal with regulatory bodies in order to agree on the setting of these parameters. So it has to be transparent. It has to be open and not part of an IP or part of a competitive advantage. And this is what we set out to do. It was a year and a half ago. And we are working with a Chinese standards body that decided to standardize this model. It's called RSS.
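For illustration, here is a minimal sketch of a longitudinal safe-distance rule in the spirit of the published RSS model described above. It follows the worst-case reasoning in the talk (the car in front brakes unexpectedly at its assumed maximum, we react after a response time and then brake at our guaranteed minimum); the parameter values are illustrative assumptions, not the ones a regulator would actually set.

```python
def rss_safe_longitudinal_gap(v_rear, v_front, response_time=0.5,
                              a_max_accel=3.0, a_min_brake=4.0,
                              a_max_brake=9.8):
    """Minimum safe gap (meters) behind a car we are following.

    v_rear, v_front : current speeds (m/s) of our car and the car in front.
    a_max_brake     : assumed worst-case braking of the car in front
                      (the "1g" parameter a regulator would agree on).
    a_min_brake     : braking our own car is guaranteed to apply.
    a_max_accel     : the most we might still accelerate during the response time.
    All values here are illustrative assumptions.
    """
    # Speed we may have reached by the end of the response time.
    v_after_response = v_rear + response_time * a_max_accel
    gap = (v_rear * response_time
           + 0.5 * a_max_accel * response_time ** 2
           + v_after_response ** 2 / (2 * a_min_brake)   # our stopping distance
           - v_front ** 2 / (2 * a_max_brake))           # front car's stopping distance
    return max(0.0, gap)

# Assuming a harder worst-case braking of the car in front (1.5g instead of 1g)
# forces a larger safe gap, as stated in the talk.
print(rss_safe_longitudinal_gap(20.0, 20.0, a_max_brake=9.8))    # ~47.7 m
print(rss_safe_longitudinal_gap(20.0, 20.0, a_max_brake=14.7))   # ~54.6 m
```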
In the US and Europe, we have industry actors that have joined. Israel is also an interesting sandbox. We created a joint venture with Volkswagen to commercialize ride-hailing, a robo-taxi mobility-as-a-service, starting from 2022 in Tel Aviv and then later scaling to all of Israel. So it's full stack. It's completely autonomous driving, no driver behind the steering wheel, 2022. And there is a business model. So it's not a pilot. It's a complete business.
Now why Israel? The reason is that the Israeli government created a task force of multiple ministries-- Ministry of Transportation, of Legal Affairs, of Finance and so forth-- to deal exactly with these issues. Because one needs a regulation. Otherwise, you put this car on the road, there is an accident, then what do you do? Right? And this is part of the ethics of AI.
OK. And the way this driving policy works, there are two machine learning elements to it. There is what we call strategy and then tactics. So strategy is the decision to change lane. That's a strategic decision. Tactics would be which vehicle to give way and which vehicle to take way. So normally, when you change lane, you fill in a gap between a vehicle that you gave way and a vehicle that you are taking way. And this decision is changing all the time because you decided to take way from a certain vehicle, but then you change your mind because that vehicle is not allowing you to take way. So it's dynamic.
And then path planning, together with RSS, guarantees that you'll never cause an accident because you are never getting into a dangerous situation, according to this model. So one can then provide a guarantee that if we all agree on the parameters of this model, the car would never cause an accident, if the perception is correct. So we basically move the decisions outside of the statistical machine learning domain.
So let's look at the perception challenge and then I'll show you some examples. So the perception challenge is the following. Now, we know that the world is rich with details, and they're dynamic and static: dynamic, like road users, pedestrians, vehicles, and static, like lanes and the semantic meaning associated with lane boundaries, whether it's a curb, a road edge, and so forth. And the basic question is, can AI-- AI that we know, the current machine learning, the state-of-the-art techniques-- can it cope with this richness?
Well the answer is yes but there is a snag. And the snag is accuracy. To understand the accuracy challenge, let's do a back-of-the-envelope calculation. The back-of-the-envelope calculation is that in the statistics of human driving, the probability of a fatality per one hour of driving is one in a million. So now let's exclude driving under the influence because machines are not going to drive under the influence. Let's exclude texting while driving because the machine is not going to be involved in that.
So I think it's safe to assume that the statistic is 10 times better. So it's 1 in 10 million. OK? So it means that the mean time between failures that we need to show as technology providers is 10 to the power of 7, 10 million hours of driving. We need to show a failure once every 10 to the power of 7 hours of driving.
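The back-of-the-envelope arithmetic, written out:

\[
\frac{1}{10^{6}\ \text{hours}} \times \frac{1}{10}
\;=\; \frac{1}{10^{7}\ \text{hours}}
\quad\Longrightarrow\quad
\text{required MTBF} \approx 10^{7}\ \text{hours of driving}.
\]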
Now to understand the weight of this, let's take driving assist. Driving assist is very, very mature technology. It is the camera and sometimes camera plus radar, which is front-facing in the car. It's detecting vehicles and pedestrians, making decisions of an imminent collision, and if there's imminent collision, the car would apply brakes and avoid the collision. And it comes almost with every new car.
And this technology has the state of the art of deep learning. Whatever you can imagine is there. So there we know what the statistics are. There are false negatives and false positives. A false positive is a braking actuation that should not have happened. I'm driving leisurely and all of a sudden my car brakes because there was a shadow on the road and the car thought it was an obstacle. And then there's the false negative. The car should have applied brakes but it didn't.
So the system is tuned to have a really infinitesimal false positive rate. Why is that? Because the driver is still responsible. The driver is driving the car. So if there is a false negative, the driver is still there. The driver could apply the brakes. A false positive is really scary. As I said, you are driving leisurely and all of a sudden, your car brakes. You'll never want to touch that car again. So the tolerance for false positives is very, very low.
So the question is, what is the rate of false positives in driving assist systems? Now the rate is very impressive. It's once every tens of thousands of hours of driving. That's really impressive. But not sufficiently impressive. We have three orders of magnitude here. 10 to the power of 4 versus 10 to the power of 7. And here, it's only to match human performance.
AUDIENCE: You said existing AEB systems?
AMNON SHASHUA: Yeah. Yeah, exactly. And this represents the state of the art. There is nothing there missing in terms of technology. It has the deep networks. It has everything that you can imagine it has. And still, you have three orders of magnitude difference.
So the way to handle this challenge, or the way we handled this challenge, is to break down the system into subsystems. Now this is tricky because you want to create subsystems that are as independent as possible in order to create this effect of multiplication of MTBFs. All right, so let's assume that I have two subsystems and they are statistically independent. OK? That will never happen, but let's assume they are statistically independent.
Therefore, what I need to show is the mean time between failures of each subsystem to be 10 to the power of 3.5 because it's the product that matters here. All right? Now 10 to the power of 3.5 is only about 3,000 hours of driving and that's a piece of cake. So in order to create subsystems that are as independent as possible, the best is to use different modalities. For example, one subsystem would be only cameras. Another subsystem would be radars and LiDARs. OK? So it's not 100% independent but it's quite close to being independent.
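Under the (admittedly idealized) independence assumption, the arithmetic of splitting the system into two subsystems looks like this:

\[
P(\text{system fails in a given hour}) \approx p_{1}\cdot p_{2}
= 10^{-3.5}\times 10^{-3.5} = 10^{-7}
\quad\Longrightarrow\quad
\text{each subsystem needs an MTBF of only } 10^{3.5}\approx 3{,}162\ \text{hours}.
\]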
Now why is this tricky? Because creating a system that is powered only by cameras is really, really difficult. First, we know it's possible because we are the proof of concept. We have eyes and a computing device called the brain and we can do it. But where does the challenge come from? The challenge is that cameras are two-dimensional devices. There is no direct access to depth, unlike radars and LiDARs, which for every pixel also have the depth. With cameras, you have pixels. You have no depth. Right?
Our depth perception comes indirectly, through perspective cues, through shading cues, motion cues. It's all indirect. There's no direct access. Even our stereopsis is only for very, very short distances. For 1, 1 and 1/2 meter distances. So being able to detect an object and put it in 3D at sufficient accuracy for safe and comfortable control, that's very, very tricky. Right? This is why, normally, suppliers who work in this autonomous driving space don't do that.
What they do is create a fusion system at the start, basically trading off each sensor's strengths. So cameras, because of the high resolution and texture sensitivity, would be good for detecting an object, and then you put a LiDAR to tell you what the range of the object is, for example. So you do this low-level fusion at the beginning, but then you have a validation problem. You'll need to validate this to 10 to the power of 7 hours of driving, which is very, very difficult, and it is very unlikely that you'll be able to drive 10 million hours without a single mistake. So this is the way it operates. So those are the two major challenges that you have for reaching autonomous driving.
So let me show you some examples. So what you're going to see here, while I'm showing this, this is the top view. So here are all the objects displayed top-view. So this is in 3D. The car, it's going to drive, and then on the highway, there is going to be a car stopped and the driver is coming out of the car. So then the system decides from a strategic point of view, we need to change lane. There is a blockage. We need to change lane.
So when it starts to change lane, it will need to make a decision which vehicle to give way to and which vehicle to take way from. So it will color-code the vehicle that it's taking way from in green, and in red, the vehicle that it is giving way to. And notice that it's going to change its mind while it's doing so. So let's have this run.
So all these are the vehicles around. This is the autonomous car. And you see here, the driver is coming out. So it's giving way to this one. It wanted to take way. Now it's taking way and was successfully taking way of this vehicle and moving into it. All right.
Let's see this from a top view, from the following here. So there's going to be a drone view so you'll get more sense of whether the autonomous car is human-like-- is driving like a human, yes or no. So this is the autonomous car. And there is a blockage here. We didn't orchestrate it. It's a true blockage. And you see all the other vehicles are kind of moving around. Now it will start negotiating its way out and doing it in a way that is more or less smooth with the entire traffic. So you need this agility.
This is from inside the car. So at some point-- see the red and the green? It was green. Now it is red because the car didn't allow us to take way. And here it succeeded.
Another example here of blockages inside the city. You'll have a truck here that's blocking traffic. It needs to make a quick decision whether it's a traffic jam that you need to simply stay behind, or it's a blockage and you try to overtake. So it looks at what other cars are doing in order to make that decision. And this is from inside the car. So it will color it in orange, meaning that it's a blockage, and then it will negotiate its way with other cars.
AUDIENCE: [INAUDIBLE]
AMNON SHASHUA: It rarely happens that we intervene, but then it means that we are in a too-easy scenario. So then we need to increase the level of difficulty of the scenarios. We are now at a position in which-- so we do it in two stages. We're focused now on Jerusalem and then moving to Tel Aviv. Jerusalem is more complicated than Tel Aviv. First, it's hilly, like San Francisco.
Second, it has lots and lots of pedestrians. All the ultra-orthodox there, and they don't respect the road, so they'll move in and out. So it's moving in narrow streets. So right now, we are at 90% of the types of scenarios that we want to be in. We have about another four or five months to finish that. Then the car would be handling the most difficult scenarios a car could ever handle. So some of those scenarios would be even scary for a human driver. If you come and rent a car and drive in those areas that we drive, at some point, you'll simply stop the car and get out of the car. So it's really scary.
Here is another example. This is the last example that I'll show. It's about right of way. So we have the right of way. This is our car. But we relented. But in this case, we are more assertive and we take the right of way. And it [INAUDIBLE] to smooth, because if you are too conservative, you'll simply block traffic.
And it's all under this definition of what danger is, of this carefulness is. And you must have it, because if you don't have a formal definition of what danger is, you'll become conservative because you don't know whether to be assertive or not because you have no formal definition of what danger is. OK.
So in the remaining 15 minutes, let me talk about another application of artificial intelligence. Of perception. But it's not only perception; it's also natural language understanding, natural language processing. And the idea is this: we call it AI as a Companion. So this is another company, OrCam, that I founded back in 2010. It has about 250 employees.
The idea is that you have a device that is always on you and it's always on in terms of sensing the environment. So sensing the environment is vision and sound. Always on you. So it can't be a smartphone, because a smartphone is sometimes in my pocket, so it doesn't see and doesn't hear anything. It can't be a smartwatch, because a smartwatch doesn't see. Right? It's all here. So it has to be a device on you facing forward.
So we call this sight and sound plus brain-- all the AI software. So we started working on this in phases. So the first phase-- and the phases are based on market segments, markets that really need this AI as a Companion. Normally, we are used to being around technology that we don't need, right? Trying to sell us things, right? But here, we are focusing on technology that people actually need in order to overcome disabilities, let's say.
So the first one is vision. You know, blind and visually impaired. So it's a unit that you'll see in a moment in a videotape. So it's a small unit of this size that you click onto your eyeglasses. There are two magnets. You simply click it on eyeglasses. And everything is done on the unit. Nothing is sent to the cloud because it has to work in real time and the bandwidth does not allow it-- it's not practical to send it to the cloud. And already, 100,000 units have been shipped. Blind and visually impaired.
So what it does. It does a lot of-- so I'll first show you how it looks like and then I'll show you under the hood. What's going on under the hood. So it'll recognize faces, because if you are visually impaired, someone is standing in front of you, if this person doesn't speak, you don't know who this person is, so it'll whisper to your ear what's the name of the person.
It will read text, wherever that text is, whether it's a newspaper, a book, a billboard, text in the wild. It will read your hand gestures to understand what you want. So if I'm pointing, it will focus on the area that I'm pointing at. If I'm doing like this, it will stop. If I'm raising my hand, it will tell me what the time is. So all sorts of hand gesture recognition.
And it also listens to me and understands what I say and helps me, for example, if I'm looking at a menu. There's no point in reading the entire menu. I want to ask a question. You know, is there a kids' meal here, and is there meat or fish, and so forth. Tell me what there is for dessert. So there's an interaction with the system. So let's see some examples of this.
[VIDEO PLAYBACK]
- [INAUDIBLE]
- Hi.
- Can you show us what the device does?
- Yeah, sure.
AMNON SHASHUA: That's the device. Yep.
- [INAUDIBLE]
- Oh. It helped me recognize you. Let's see who else.
- [INAUDIBLE]
- Hi [INAUDIBLE]
AMNON SHASHUA: OK, so that was fun. So she looked at Jonathan and it said Jonathan works [INAUDIBLE] She looked at him, it said Gil. So this works in real time. You don't need to tell it, recognize someone. It works in real time.
- Hi.
- What else can it do?
- So the device can read any printed surface and recognize colors. Let me show you how it works. Newspaper. And you read by pointing.
AMNON SHASHUA: That is the hand gesture. Recognize it.
- The most coveted free-agent slugger on the market has--
- How do you stop it?
- Open my hand and it just stops. It can also recognize any color.
[END PLAYBACK]
AMNON SHASHUA: So it will recognize money notes and recognize products. So let me show you how it looks under the hood. I think it will be more interesting. But before under the hood, this is the example of menus.
[VIDEO PLAYBACK]
- I see a menu. How can I help?
- What do you have for kids?
- There is a kid menu.
- Do you have something vegetarian?
- There are two options. First is kid burger-- beef or veggie. Second, macaroni and cheese.
[END PLAYBACK]
AMNON SHASHUA: OK, so the way this works is that the system also uses location services, because it's connected to a smartphone, so it first downloads all the menus in your area. Because all the menus of the restaurants are online, right? It downloads all of them. Then, from location services, it says, OK, you are in one of those four possible restaurants. So then it will match text to see which of their menus it is. And then it takes the entire menu-- so it doesn't need to read from the menu anymore. And then it only parses it. And then there is an interactive natural language exchange.
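A hypothetical sketch of the matching step just described, using fuzzy text matching; the function, data layout, and matching method here are assumptions for illustration, not the actual OrCam pipeline.

```python
import difflib

def pick_menu(ocr_text, nearby_menus):
    """Given the text the camera OCR'd from the physical menu and the online
    menus of the few restaurants near the user's GPS location, return the
    best-matching menu. Later questions ("what do you have for kids?") are
    then answered from the full structured menu, not from what the camera saw.

    nearby_menus: list of dicts like {"restaurant": ..., "text": ...}
    (a hypothetical data layout assumed for this sketch).
    """
    def similarity(menu):
        return difflib.SequenceMatcher(
            None, ocr_text.lower(), menu["text"].lower()).ratio()
    return max(nearby_menus, key=similarity)
```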
So how does this thing look under the hood? There is much more than-- the system needs to understand what's going on. So if we let this run, you see it knows that there is a door here. Let's see what more. And this is done in about 10 frames per second. So now it will show business cards. It knows it's a business card. It knows who that person is. Screen.
And it does the layout of the page. The stop gesture that you saw before. Let's give it a bit more. Now it will take some money notes. So it will recognize what that money note is. It has a vocabulary of about 500 different money notes. It also recognizes barcodes. It has a database of millions-- a size of, I think, 10 million different products-- just based on the barcode and so forth. So it's kind of lots and lots of visual understanding in order to help the visually impaired.
OK. It also recognizes logos. The reason for recognizing logos-- this will not be obvious for someone who is not blind. Say you're blind. If you are a blind person, you go to places that you are familiar with. And let's assume that you go to your favorite mall. Now you know that if McDonald's is here and Starbucks is there, you know where you are, right? So let's assume that you are disoriented. Now if I tell you, this is McDonald's and this is Starbucks, you're back to your orientation and then you can start moving. So being able to recognize logos around you is very important for orienting yourself as a blind person. OK?
OK. Phase number two. So this was phase number one. Phase number two, let's see what we can do with sound. So if we can combine sound with vision together, what can we do? So this is something that there is academic research on. We can solve this problem called source separation, or the cocktail party problem. I'll show you in a moment an example and this is something quite cute.
Now if you have hearing aids and you are in a situation in which many, many people are talking at the same time, your hearing aid amplifies all the voices because it doesn't know which voice is relevant. All right? As people with normal hearing, we are very, very good at this source separation. I'm looking at someone. I can focus only on the voice coming from this person.
So we can do this using deep learning. Train a network which accepts the sound wave and the video of the person in front of me, and then learns how to filter out only the voice coming from that person. So let me show you how this looks. So it's me now talking. While I'm talking, many, many people are talking at the same time. So I'm reading something in order to make this run smoothly.
[VIDEO PLAYBACK]
- These energy ratios will not always fit the perception of silence of each component within--
[END PLAYBACK]
AMNON SHASHUA: OK, so this is kind of the input. So it's a cacophony of sounds. And if we look what the system is doing--
[VIDEO PLAYBACK]
- These energy ratios do not always fit the perceptual assignments of each component within this.
[END PLAYBACK]
AMNON SHASHUA: So we see that the separation was kind of a miracle. You saw we put here the video. The network is probably focusing on the lip movements, even though it's different frequencies. You know, the sound and the video. The network is able to combine the two and use that in order to filter out the sound. And it's all on a device of this size. So you simply put it here, and now if I'm looking at someone, it will send through Bluetooth to my hearing aid and isolate only the voice of this person.
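A toy sketch of the kind of audio-visual mask-prediction network being described; the actual OrCam model is not public, so the architecture, feature sizes, and the assumption of an upstream lip-region feature extractor are all illustrative.

```python
import torch
import torch.nn as nn

class AudioVisualSeparator(nn.Module):
    """Predict a time-frequency mask for the target speaker's voice from the
    mixed-audio spectrogram plus per-frame visual (lip-region) features."""

    def __init__(self, n_freq=257, lip_feat=512, hidden=256):
        super().__init__()
        # Audio branch: encode each frame of the mixture spectrogram.
        self.audio_enc = nn.Sequential(
            nn.Linear(n_freq, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        # Video branch: encode per-frame lip-region embeddings
        # (assumes an upstream face detector / visual CNN provides them).
        self.video_enc = nn.Sequential(nn.Linear(lip_feat, hidden), nn.ReLU())
        # Fuse the two streams over time and predict a mask in [0, 1].
        self.fusion = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_spec, lip_emb):
        # mix_spec: (batch, time, n_freq) magnitude spectrogram of the mixture.
        # lip_emb:  (batch, time, lip_feat) visual features at the audio frame rate.
        a = self.audio_enc(mix_spec)
        v = self.video_enc(lip_emb)
        fused, _ = self.fusion(torch.cat([a, v], dim=-1))
        mask = self.mask_head(fused)
        return mask * mix_spec  # estimated spectrogram of the target voice
```

Training would then minimize the difference between the masked output and the clean target-speaker spectrogram; at inference, the masked spectrogram is converted back to a waveform and streamed to the hearing aid, as described above.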
So actually what we did-- and this is coming out around June. So now comes the third phase. The third phase is, let's try to load more capabilities on a device of this size. And we launched this with a Kickstarter. So we did a 1,000-unit Kickstarter. It's going to be shipped, I think, in two weeks' time.
So it's not a product. We call this a platform. To load as much AI as we can on a device of this size and then get feedback from users and then identify market segments. So what this device does-- it's not for the hearing disabled. It focuses on face recognition and text recognition. So if there is a name tag, it will associate the face and the name tag. A business card-- it will associate the information on the business card with the most recent face that the system saw.
It will also do NLP. It can transcribe; whatever I'm saying can be transcribed. It can do topic understanding. So it will extract what the topic of the discussion is. It can do event detection. If I'm saying, let's meet for lunch next week, it will add this, too. So add as much AI as possible into a device of this size and then gradually look for market segments.
So for example, the facial recognition-- the market segment that we're focusing on is people with early onset of Alzheimer's. You know, face blindness is one of the first things that happen in the course of Alzheimer's disease. And there are people with face blindness. If you google face blindness, 2.5% of the population have issues with recognizing faces. They know it's a face but they don't know who it is. So we are gradually building this vision of AI as a Companion, where you have something on you. So I'll stop here and I'll be happy to answer questions. Thank you.
[APPLAUSE]
AUDIENCE: It's well-known Google ran into problems with social acceptance with Google Glass, right? You--
AMNON SHASHUA: Yeah. Oh, that's a great question. So privacy issues. So that's right.
So phase one, the blind and visually impaired, there are no privacy issues. It's not sending anything to the cloud. It's all on the device. And if a blind person recognizes you, I think you're fine with it. So that's not an issue.
The hearing disabled, it's also not an issue. It's all localized on the device. It recognizes the face and uses that to know which voice signature to extract. You wouldn't have an issue with someone with a hearing aid and so forth.
This one here is a bit scary. So there were things I wanted to do that I'm not doing. For example, early, when we designed this thing, I had this picture of superhuman capabilities.
What is that? Let's assume I put this on and I become superhuman. Every stranger, I can recognize. Let's assume that many people have this. Any person that I tag goes to the cloud, and anyone with this device can then go to the cloud and match the face signature to the person. And then, lo and behold, you can recognize strangers.
Well, this is too spooky, right? I don't mind-- it's OK that I recognize strangers. I'm not sure that I'm OK that strangers recognize me. So it's a bit tricky.
So this is why the first thing that we said, we don't upload anything to the cloud. It all stays in the device and the smartphone, which is useful for display. But still, there could be privacy backlash.
This is why we started with the Kickstarter. We didn't go and launch a product; we try it with 1,000 people, early adopters. Let's see what excites them, what scares them. Let's start getting the feedback, and let's call this a platform, and only then go and look for market segments of people who need this capability, where it's not just a nice-to-have, like the Alzheimer's case and things like that.
So I think that Google Glass did a bad thing for everyone. There are so many things that we could do if we had [INAUDIBLE]-- if all this privacy thing could be pushed aside. So we have to do it carefully. This is why we did this with the Kickstarter, not sending anything to the cloud.
Because I believe that technology eventually cannot be stopped. It's just a matter of finding the right recipe, of making this work in a way that people feel comfortable with. So it will take time. Yeah.
AUDIENCE: You talked earlier about the MTBF between--
AMNON SHASHUA: Yeah. Mean time between failures.
AUDIENCE: --[INAUDIBLE] for Are we still in a bit of a honeymoon period when it comes to-- in apps and things like that, in the sense that you don't have an adversarial layer messing with the system, so you don't have--
AMNON SHASHUA: Cybersecurity kind of things?
AUDIENCE: Not so much cybersecurity. Just for example, a scenery impact. So for example, you'll have something that looks like a stop sign messing with the car [INAUDIBLE].
And if you start doing things like that, it's not even cybersecurity. [INAUDIBLE]. Just messing with signs. You could change the numbers quite quickly, right? So I just want to get your view--
AMNON SHASHUA: Well, so it's a very important question. The way the commercial deployment of a generation 1 robo-taxi is going to look, first, it's going to be level 4 and not level 5. So level 4 means that there is an operational envelope. And within that operational envelope, you drive without the driver. For example, the operational envelope could be that you don't drive when it's snowing. The operational envelope can be that you don't take unprotected left turns.
I don't know if you know that, but for example, FedEx. All FedEx drivers, by FedEx law, are not allowed to take unprotected left turns. So you can get from point A to point B without taking left turns. So that could be part of this level 4.
Level 4 can also include, in case you see someone like a policeman waving a traffic sign or something like that, you stop, and then a teleoperator makes the decision. So we are building teleoperations where you have one teleoperator per 10 vehicles. Because the teleoperator is not supposed to drive the vehicle or avoid accidents. When the car stops, asking for guidance, then the teleoperator gets into action.
So one of those situations where the car stops would be if somebody is waving a traffic sign. So somebody waving a traffic sign-- it could be a policeman, or it could be someone disguising himself as a policeman. So let's stop and have the human take that decision.
So in level 4, you have this flexibility to handle these edge cases through teleoperation. Level 5, I think, will come much, much later. I think level 4 will take us at least a decade of maturing, practicing, going from one teleoperator per 10 vehicles to one per 100 vehicles at the end of the maturity cycle. And then you can let go of teleoperators altogether for level 5. Yeah?
AUDIENCE: In negotiating in traffic, you explicitly control the signals that the car is sending? Its body language, if you like, [INAUDIBLE] that it's trying to move in?
AMNON SHASHUA: So the body language is very, very similar to human driving. So when you want to signal that you want to change lane, it's not enough to use your turn signal. It doesn't matter that you turn on your turn signal; nobody will pay attention.
So you start edging your way. So you go to the edge of your drivable path, and you are signaling to the car behind you that you want to take way. If the car behind you doesn't allow you to take way, then you stay there at the edge. You don't go back.
Cars that do like this, it's not human driving. And there are many operators of autonomous driving where, when you ride in them, they do like this. That's not human driving. You should stay, and you should persist, and gradually block the traffic until you enter. And this is exactly what the autonomous car is doing. It's very human-like.
AUDIENCE: I just bought a new car that has your autonomous driving in it. I've had it for four months now. I can't drive. So my question to you is, do you drive autonomously all the time, 100%?
AMNON SHASHUA: Well, the car that you saw here, I have it also. So all my routes from home to work and so forth are all mapped.
AUDIENCE: 100%?
AMNON SHASHUA: Yeah. So this is how the engineers are kept on their toes all the time. I see all their mistakes.
[CHUCKLING]
AUDIENCE: It's interesting, as you're well aware, the Boeing 737 [INAUDIBLE] MAX [INAUDIBLE] issue with-- or that seems to be an issue, maybe it's [INAUDIBLE] only, where pilots don't have to intervene for a long time, then they appear to lack certain skills. Do you see this as an upcoming problem with driving?
AMNON SHASHUA: Well, in autonomous driving, there's no driver. So we're not talking about a system in which the driver is behind the steering wheel and is asked to take control.
So let's put things in order. There is level 2 to level 5. I explained level 4, level 5.
So level 4, level 5, there's no driver behind the steering wheel. And level 4, there's a certain envelope of operation. Within that envelope, there's no driver behind the steering wheel. Level 5 is all conditions.
What is level 2, level 3? Level 2 is the driving assist systems that you have today. They will keep distance. It's called ACC. Of course, they'll do emergency braking before a collision.
They'll do lane-keeping. They'll center in the lane. There's a kind of force feedback.
And sometimes, like Tesla's Autopilot, you can take your hands off the steering wheel for limited periods of time. Or with the GM Super Cruise, they have a camera watching you. So as long as you're gazing forward, the system will allow you to have your hands off the steering wheel, and on mapped roads, so only on roads that are mapped.
In all these situations, the driver is supposed to be in control. And here is the danger. If you allow too much time out of control, the driver could get bored, and then will not take control back instantaneously when needed. This is why there's a camera watching the driver. Your gaze must be fully forward.
Or simply state that it's 30 seconds at a time. After 30 seconds, the car beeps that you need to take control. So that's level 2. Let's call that semi-autonomy.
Then there is level 3, which I believe will not happen. Level 3 is, the system has enough redundancies to allow you to have your mind off dri-- you are still behind the steering wheel-- have your mind off driving. So you can do something else. You can text or do something else. Your gaze doesn't have to be forward.
And when the system will need you to take over, it will give you a grace period. So it's not instantaneous. Grace period means, say, 10 seconds. So you have time to stop doing whatever you are doing and take control.
Sounds reasonable. The problem is that, from a cost perspective, the kinds of redundancies you need-- it's not only sensor redundancy and computing redundancy. It's also steering redundancy, because power steering fails rarely, but it can go off. And braking redundancy.
So you need almost all the redundancy of an autonomous car. And from a value proposition to the driver, the difference between this level 2-plus, in which, for 30 seconds, you can be hands-off, or on certain roads, while you are gazing forward, your hands are off, and this level 3, in which you can do something else, the value proposition doesn't justify the huge difference in cost.
So I believe that level 3 will not happen. What will happen is level 2, increasing in capability, also being able to do hands-free in urban settings, with this system that either gives you only 30 seconds at a time hands-free, or watches you and, as long as you are facing forward, allows you to continue.
And then the jump is going to be to level 4 or 5. And level 4 or 5 is going to start purely robo-taxi, not passenger cars, because of cost. With robo-taxis, you can build a business model of ride hailing. And we did all the math, all the business calculations, because we are going to do this in Tel Aviv as a commercial business. You can have a system that costs tens of thousands of dollars on top of the cost of the car and still make a very profitable business of ride hailing because there's no driver in the loop.
With passenger cars, you cannot put a system of tens of thousands of dollars. You can put a system of thousands of dollars. So you have an order of magnitude in terms of cost, and that will take time.
So first would be the robo-taxis because there's a business concept around them. And the business concept is really a game-changer. If you compare traditional ride hailing-- the Uber and Lyft that you have today-- to robo-taxi ride hailing, it's a game-changer.
AUDIENCE: But they have a steering wheel?
AMNON SHASHUA: You don't need to. Again, it's level 4. Within the envelope, you don't need a steering wheel. But the most sensible way is to buy the vehicle as a [INAUDIBLE], treat it as a capex. And those vehicles have a steering wheel, so it doesn't matter if they have a steering wheel.
And then, after maturity and cost reduction-- new-generation sensors that would be much, much lower cost, the next level of computing, and so forth-- you can bring the cost of the system to a few thousands of dollars. And then you can start seeing them in premium cars.
So for example, you buy a BMW 7 Series. Instead of paying $100,000, you'll pay $120,000. And those extra $20,000 will enable you to sit in the backseat and have the car take you to wherever you want to go.
But that's next-generation. That's 10 years from now, 15 years from now. It's not now. Now means the next few years. Robo-taxi. That can happen. Yeah. Yes.
AUDIENCE: [INAUDIBLE] for your level 4, where you're assuming some variety of teleoperator to step in and fix the problem, what kind of reaction time are you assuming?
AMNON SHASHUA: Sorry. When I mentioned-- the purpose of the teleoperator is not to avoid accidents. With all the stuff that I talked about in terms of validating the perception, validating the driving policy, the autonomous car should never cause an accident. The teleoperator gets engaged when the car stops and is confused about what to do. There are multiple choices, and the human operator tells the car what choice to take, or tells the car, stay there and I'll send some help to take you off the road.
AUDIENCE: Yeah, but if I get confused while I'm--
AMNON SHASHUA: Not while you're driving. After you stopped. The car stopped, and now it's confused. So [INAUDIBLE] a situation in which it doesn't know what to do. It stops safely. And then the human operator needs to select a choice for it.
AUDIENCE: So maybe, what's the definition of stops safely? I pull off to the side of the road?
AMNON SHASHUA: No, no, it stops. It happens with human drivers also. Your engine breaks down and you stop.
AUDIENCE: [INAUDIBLE] middle of the highway!
AMNON SHASHUA: Sometimes-- first of all, robo-taxi is not highway. Robo-taxis are urban.
AUDIENCE: [INAUDIBLE] dragged out.
ANTONIO TORRALBA: OK, this is a wonderful discussion, but maybe--
AMNON SHASHUA: Yeah, I know.
ANTONIO TORRALBA: --just for the interest of time, we should just thank Amnon for his--
[APPLAUSE]
AMNON SHASHUA: OK. Thank you.