Fit Without Fear: An Over-Fitting Perspective on Modern Deep and Shallow Learning
Date Posted:
April 19, 2018
Date Recorded:
April 18, 2018
Speaker(s):
Mikhail Belkin, Ohio State University
Description:
Abstract:
A striking feature of modern supervised machine learning is its pervasive over-parametrization. Deep networks contain millions of parameters, often exceeding the number of data points by orders of magnitude. These networks are trained to nearly interpolate the data by driving the training error to zero. Yet, at odds with most theory, they show excellent test performance. It has become accepted wisdom that these properties are special to deep networks and require non-convex analysis to understand.
In this talk I will show that classical (convex) kernel methods do, in fact, exhibit these unusual properties. Moreover, kernel methods provide a competitive practical alternative to deep learning, after we address the non-trivial challenges of scaling to modern big data. I will also present theoretical and empirical results indicating that we are unlikely to make progress on understanding deep learning until we develop a fundamental understanding of classical "shallow" kernel classifiers in the "modern" over-fitted setting. Finally, I will show that ubiquitously used stochastic gradient descent (SGD) is very effective at driving the training error to zero in the interpolated regime, a finding that sheds light on the effectiveness of modern methods and provides specific guidance for parameter selection.
These results present a perspective and a challenge. Much of the success of modern learning comes into focus when considered from the over-parametrization and interpolation point of view. The next step is to address the basic question of why classifiers in the "modern" interpolated setting generalize so well to unseen data. Kernel methods provide both a compelling set of practical algorithms and an analytical platform for resolving this fundamental issue.
Based on joint work with Siyuan Ma, Raef Bassily, Chaoyue Liu and Soumik Mandal.
TOMASO POGGIO: This is a talk hosted by the Center for Brains, Minds and Machines. I am Tomaso Poggio, and I am very glad to welcome Mikhail Belkin here.
Misha was a student of a student of mine, Partha Niyogi. Partha was one of the purest intellects that I've met in my life. Misha worked with Partha on Laplacian eigenmaps and semi-supervised learning, something that is still a milestone in modern machine learning. And Misha has always been extremely reliable, somebody you can trust with all the formal details of proving something correctly. And I think today, for the first time, he will speak about something which is a bit more crazy, and I'm not sure I would believe it. But we'll see.
[LAUGHTER]
MIKHAIL BELKIN: I hope not too crazy.
[INTERPOSING VOICES]
TOMASO POGGIO: Thanks for coming, Misha.
[APPLAUSE]
MIKHAIL BELKIN: I'll moderate the craziness level. Maybe.
OK, so what I would like to talk about, the title of the talk, Fit Without Fear. And this is actually kind of a technical description of what we do in machine learning now. I'll explain this in a minute.
But meanwhile, this is based on joint work with collaborators: my students Siyuan Ma, Soumik Mandal, and Like Hui. Siyuan, in particular, has done an amazing job on this. And also my colleagues Raef Bassily at Ohio State, Daniel Hsu at Columbia, and Partha Mitra at Cold Spring Harbor Laboratory.
OK. So, well, we all know that machine learning, AI, is becoming really a backbone of commerce and society. And this is changing the way we do a lot of things, both in commercial application, and science, and all kinds of domains.
Then what is behind it? Well, if you look at a typical architecture for, say, a vision problem, you have something like that. This is GoogLeNet from 2014. This is the network with about five million parameters. And as you can see from the picture, this is a very complex object.
So at this point, we have these objects which are kind of running our inferential tasks. But we don't necessarily understand what these objects do as well as we would like to. And there is a little bit of, well, fog of war, if you wish. There is a lot of excitement about it and a lot of success. But from a scientific point of view, we have to gain a better understanding of what they do and what the key aspects are. We have to isolate and analyze the components of how these methods work.
Let me very briefly state the problem of supervised machine learning, just to make sure we're all on the same page. I'm sure everybody knows it. Basically, you have data (x_i, y_i). x_i is a point in some d-dimensional space. These are features, say pixel values [INAUDIBLE]. And y_i, for simplicity, I am just taking to be binary, minus 1 and 1 values. They can be multiclass; in most practical [INAUDIBLE] they're multiclass.
The goal of machine learning is to construct a function given this data: a function from the feature space to R, or to {-1, 1} more specifically, that generalizes to unseen data. Well, it's kind of an interesting question what it means to generalize. But in most theoretical analyses, at least, what we mean is that there is some probability distribution for the data, and what the function does on data which we haven't seen should be reasonably good. It should basically perform well.
Now, how is it done? Well, the typical algorithm for this is empirical risk minimization. Well, strictly speaking, this is not an algorithm, but algorithms are based on it. You take a class of functions H. This may be neural networks, for example, and I'll talk in particular about one class of kernel architectures. And you minimize the loss function over that class of functions. f star, in this case, is simply the function from the class which minimizes the empirical loss. And the loss here may be a classification loss or a regression loss, like the square loss.
Now, there is a lot of theoretical analysis based on the complexity of H. In particular, some common analyses are based on VC dimension, VC theory, things like Rademacher complexity, and so on. And many of these analyses result in generalization bounds of this form.
You basically look at the expected loss of f star, this is the prediction on future data, and you say that the prediction on future data, this is generalization, is not much worse than the prediction on the training data, plus some term which is roughly one over square root of n. So when n is large, these two should be close. That's basically the upshot. There are other analyses, too; I'll maybe say something about this.
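Written out, the kind of bound being described has, schematically, the following shape. This is just the generic form of a uniform-convergence bound, with a placeholder complexity term; it is not a specific theorem from the talk:

```latex
\mathbb{E}_{(x,y)\sim P}\!\left[\ell\big(f^{*}(x),y\big)\right]
\;\le\;
\underbrace{\frac{1}{n}\sum_{i=1}^{n}\ell\big(f^{*}(x_i),y_i\big)}_{\text{training error}}
\;+\;
O\!\left(\sqrt{\frac{\mathrm{complexity}(\mathcal{H})}{n}}\right).
```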
OK. Now, I will talk quite a bit about interpolation and overfitting in this talk. So what is interpolation? Interpolation is the following. This is just a cartoon of my data. My feature here is one-dimensional, and the label is plus or minus 1. Interpolation is simply classical interpolation: I say that a function interpolates my data if f(x_i) is equal to y_i.
Overfitting is very similar, except I'm kind of interpolating the sign. So overfitting, to me, is when the sign of f(x_i) is equal to y_i. You can view it as zero-loss fitting if you don't like the term overfitting. Overfitting has some sort of negative connotation; in this case, I'm just using it as a purely technical term. There is nothing negative or positive about it for me in the context of this talk.
OK so that's zero loss fitting, or overfitting, interpolation. That's what they are.
Now, what about modern machine learning? Well, one very important feature of modern machine learning is massive over-parametrization. And I would argue that it's actually an innovation. Why is this an innovation? Well, it's not strictly speaking new, but it's kind of new. So let me first tell you what it is.
What is over-parametrization? Over-parametrization is simply when the number of parameters is larger than, or equal to, but in practice often much larger than, the number of training data. And if you look at this nice diagram of various architectures from Canziani et al., you'll see that the circles correspond to different neural net architectures. The size of the circle is how many parameters it has. So GoogLeNet, which I showed you on the first slide, has about five million parameters. The big ones have about 150 million parameters. Now, you have 150 million parameters, and you're training this on, you know, maybe ten thousand, a hundred thousand, or even a million data points. You have many, many more parameters than data points. So that's over-parametrization.
Now, what's the upshot? Why is over-parametrization important? Well, let's look at this table. This table is from "Understanding deep learning requires rethinking generalization" by Zhang et al., 2017. They give an example. They train on the CIFAR dataset, which I think has about 50,000 data points. They use this Inception model, which is a type of neural network. It has 1.5 million parameters, so many more parameters than data points. And they get training accuracy of 100%. Now, 100%.
Maybe this could be viewed as surprising. How do you get training accuracy of 100%? They overfit the data: using the terminology of this talk, they have zero loss. Perhaps this is surprising, right? Because we know that neural networks have a non-convex, complex architecture. Why can optimization actually get us to zero loss on this non-convex problem? In general, such problems are hard to optimize.
And there has been quite a lot of work showing that all local minima of such networks are global. But perhaps, once you think from the over-parametrization point of view, this is not too surprising. Your parameters are variables and the data points are constraints. If the number of variables is much larger than the number of constraints, you can usually solve the system of equations. This, of course, is not always true, but generically there is some sort of mathematical, quote unquote, theorem saying that those things are solvable.
So it's not surprising that we can actually get to zero loss if we have so many parameters. Of course, we cannot get below zero loss, because the way we construct the loss function, it's always positive. So that's a constraint. So perhaps it's not too surprising that all local minima are global. Of course, you have to still prove this in particular cases.
The more important thing is that over-parametrization leads to overfitting pretty easily, relatively speaking. Overfitting and interpolation. Your optimization just goes down to zero loss.
Now, perhaps the more surprising aspect of this table is the following. You train to zero loss, 100% accuracy, yet your test performance is still very good: just under 90%. So how come you are not overfitting in the bad sense? Even though you have this overfitting, you have zero loss, you're still performing very well. Why is it that you're getting good results?
And in fact, there is a nice quote from Ruslan Salakhutdinov, from a tutorial he gave. He is a well-known expert on neural networks. He gave a tutorial at the Simons Institute last year in Berkeley, and he said the following. This is actually a quote: "Basically, you want to make sure you hit the zero training error. Because if you don't, you somehow waste the capacity of the model."
So I was actually in the audience, and it really struck me as something very surprising. Because, you know, we tell our students not to do this when we teach machine learning. This is just saying that there is no penalty, effectively, for fitting the data exactly. And this is somehow, when you train these deep networks, this is somehow the best practice according to Ruslan.
So let me give an overview of this talk. The first point I would like to make is that this very strong overfitted performance is not really special to deep networks. What I am going to show is that classical kernels, classical kernel machines, have something very, very similar to this. And in fact, we can basically observe it in exactly the same form. That's one.
Second, that overfitting is actually hard in the following sense: to be able to fit with zero error is a non-trivial optimization problem for large data. And in some sense, once you make this happen, once you have a good method for dealing with large data, the performance of kernels becomes competitive with neural nets. At least for problems which are not convolutional in structure; for images you need some sort of convolutional features. But for many other things, it seems to be very competitive or maybe better. So that's the hardness of overfitting.
Now, the power of overfitting. It can be shown mathematically that if you are in the overfitted, or more precisely the interpolated, regime, stochastic gradient descent becomes extremely efficient computationally. And in fact, you get a very large benefit from being able to run it. In particular, it's a very good fit to GPUs, and I'll come back to that.
So that's why optimization works as well, or at least that's kind of an explanation, we believe, for why neural networks are optimized so efficiently.
And finally, I'll discuss the challenge of overfitting. And really, there is not so much help from current theory with some exceptions, but very little.
So, moving forward, I will present some analysis of an algorithm which is nearly Bayes optimal and interpolates the data. So at the very least, we can show that it is maybe not as crazy to overfit as traditionally thought.
OK. Now let me very briefly describe kernel machines; a one-slide introduction to kernel machines. With kernel machines, you basically do empirical risk minimization over a space of functions which are linear combinations of kernels of this form. There are other kernels as well, but this is the Gaussian, which is by far the most common in practice.
What you can show is that if you optimize this, your solution has the following form: it's just a sum of kernels centered on your data points. So basically, your classifier is a sum of Gaussians on your data points; a Gaussian mixture, if you wish, for the Gaussian kernel. We will use other kernels as well.
And typically this is done using a regularization term. This is called Tikhonov regularization, which is of the form lambda times the norm of f in H squared. If you put lambda to zero, that forces the output of this algorithm to interpolate the data, so it will fit the data exactly. This is called minimum norm interpolation. And the solution is simply alpha equals K inverse times y, where y are the labels and alpha are the coefficients in this expansion. So you get an explicit solution, which is nice, because you can analyze it mathematically.
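As a concrete illustration, here is a minimal sketch of that minimum-norm kernel interpolation in plain NumPy. The data, the bandwidth sigma, and the sizes are made up for illustration; this is not the speaker's code:

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_kernel(X, Z, sigma):
    """Gaussian kernel matrix: K[i, j] = exp(-||x_i - z_j||^2 / (2 sigma^2))."""
    return np.exp(-cdist(X, Z, "sqeuclidean") / (2 * sigma ** 2))

# Toy data: n points in d dimensions with +/-1 labels (all values illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = rng.choice([-1.0, 1.0], size=300)

sigma = 2.0                      # bandwidth; in practice chosen on a validation set
K = gaussian_kernel(X, X, sigma)

# Minimum-norm interpolant: alpha = K^{-1} y (solve the linear system, don't invert).
alpha = np.linalg.solve(K, y)

# The classifier f(x) = sum_i alpha_i k(x, x_i) fits the training labels
# essentially to numerical precision:
print(np.max(np.abs(K @ alpha - y)))
```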
So these are kernels. And for all generalization analysis, pretty much, you have to assume that lambda is greater than zero. But what we do, we basically put lambda to be zero. You cannot quite put lambda to be zero, but you can take lambda to be infinitesimally small, which results in the solution. Otherwise, your solution is not well defined.
Now, there is a lot of beautiful classical mathematical and statistical theory of kernels. Reproducing kernel Hilbert spaces were introduced by the mathematician Aronszajn. Then there are splines, and more recently kernel machines, well, still 20 years ago. And why is this an attractive setting? It's convex, it's analytically tractable, and also, and this would require a little bit of an explanation, you can view it as a 2-layer neural net. So it's a special case of neural nets.
Now, let's discuss interpolation. This is an actual dataset, this is MNIST. And what we're doing here, we're running these kernels. I'm basically solving the kernel problem the same way I would train a neural network. How do people train neural networks? They run stochastic gradient descent for some number of epochs. That's exactly the same way we're running this kernel.
So this is the number of epochs, and this is the result. These lines are two different types of kernels; you can just concentrate on the blue one, it doesn't really matter so much. What you see is that, as you keep running, the result improves. After one epoch the error is about 1.9%, and then it goes to about 1.1%.
Here is the interesting thing. The interpolation solution is actually this line. And as you can see, it really interpolates the data, meaning that f(x_i) is equal to y_i exactly, well, up to numerical precision; the error is already something like 10 to the minus 26. So this dashed line fits my training data exactly. I don't train anything here. Well, I train, but I just inverted the matrix. That's all I do. For a smaller dataset, you can just invert it; for large ones, of course, you cannot.
So I'm getting the solution. And what I see is that, as the number of epochs increases, I basically converge to that solution. My performance on the test, this is not training, this is test, right, never goes up. Early stopping would be like regularization. So if I wanted to regularize, I would stop here. But if I regularize, I just get results which are worse. This is not completely 100% true: you can see here, it dips just a tiny little bit under the dashed line. But the difference is very, very small. So if there is an effect of regularization, it must be very small. OK.
The sigma here is chosen on some sort of validation set. But we can run the same thing with sigma being twice that or half that, and you'll see the same behavior. The baseline performance just gets a bit worse: instead of, say, 1.5%, you'll get maybe 1.8%.
AUDIENCE: What if sigma is very small?
MIKHAIL BELKIN: If sigma is very small, you get the 1-nearest neighbor classifier. But the performance will be much worse. This is quite far from 1-nearest neighbor. 1-nearest neighbor would give you about 3%. So, yeah, it is clearly not 1-nearest neighbor.
AUDIENCE: OK. Go on. But sigma is like lambda, right?
MIKHAIL BELKIN: No, no. Sigma is not like lambda. Lambda prevents you from fitting the data exactly; sigma doesn't. The error here is 10 to the minus 26. So we're actually fitting the data up to numerical precision.
By the way, notice that the loss function is actually irrelevant here for the interpolation, right? Because interpolation has no loss function; you're just fitting your data exactly. But if you do this kind of iterative update, the loss function will change your path to convergence. It just won't change the actual result.
OK. Now, I'm not going to go over this in any detail, but basically we did it on a bunch of datasets, and it always looks the same. The shape of this curve is the same in every case. In every case, regularization either doesn't give you anything, or gives you a very, very small improvement. This is like six different real datasets.
Yeah?
AUDIENCE: Maybe you said this, but if you're inverting the kernel matrix once and for all, then what do you mean by epochs?
MIKHAIL BELKIN: So I solve it in two ways. One is inversion, direct inversion. The other is an iterative solution. So inversion, of course, there are no epochs. But there are two different ways of solving this.
AUDIENCE: So you're showing the classification?
MIKHAIL BELKIN: This is classification, and this is actually regression error.
AUDIENCE: Because you can have overfitting in the loss and not in classification.
MIKHAIL BELKIN: No, yeah. In the loss, I really do have zero. I have zero loss for regression and for classification. But the classification error on the test is fine.
AUDIENCE: You don't have overfitting in [INAUDIBLE].
MIKHAIL BELKIN: By overfitting, I mean, my overfitting just means zero loss--
AUDIENCE: The standard definition of overfitting means when, during training, the test error increases.
MIKHAIL BELKIN: Yes. That we don't observe. We never observe this. Not to any significant effect. There is a very small effect.
AUDIENCE: Within [INAUDIBLE], we see it all the time. If it's overfitting in the [INAUDIBLE]. And not overfitting [INAUDIBLE].
MIKHAIL BELKIN: That may be special to deep networks. But I will get to that later. Because I think deep networks, they combine a lot of different things. So it's very hard to separate those issues. So that may very well, I totally believe that.
AUDIENCE: There's no explanation. [INAUDIBLE]
MIKHAIL BELKIN: OK. So let me-- OK. What's going on? Generalization with overfitting and interpolation is not unique to deep architectures. That's very clear. What people have observed in deep networks, we are observing exactly the same for kernels.
It can now be examined in a convex, analytical setting. That's nice, because that's actually where our mathematical tools work.
And kernels correspond to 2-layer neural nets. Well, if we don't even understand the setting for 2-layer neural nets, I would argue we cannot have a full theory for general neural nets, which are much more complex.
OK. So let's continue. Now, let me tell you why this is hard and why it actually takes some effort to scale kernel methods to large data. What is hard about it? It's well known that kernel methods work quite well on small data, not so well on large data. What's going on? Is it something intrinsic to kernel methods, or is it because we don't do something correctly?
And basically, we need to address computation here. The good thing about interpolation is that it makes the goal really, really simple. We don't have regularization parameters, we have nothing; we just have this linear system of equations. We have no choice of loss function, no choice of regularization parameters. So we just concentrate on solving this linear system of equations as efficiently as we can. And it turns out, basically, that we get very, very strong performance once we can solve the system for large data.
So why? Let me first point out that, of course, f star has to be found algorithmically for large data, and we basically have to map the computation to a GPU if we want to get good performance. Why? Because a GPU is basically a parallel machine which does matrix-matrix computations very efficiently; it's parallelizable matrix-matrix computation. So that limits the algorithms which [AUDIO OUT].
Basically, any algorithm which has some hope of scaling to large data has to combine some number of matrix-matrix products and a limited amount of other computation. If you don't have this, you have to use a CPU, and a CPU is maybe 10 or 50 times, some large factor, slower. So it's very difficult to get good performance.
So now, matrix inversion. We can, of course, do direct inversion; the cost is n cubed. That's basically impossible: if you have 10 million data points, you have something like 10 to the 21 operations. That's impossible, and it doesn't map onto the GPU well, either. Well, when I say impossible, it may be possible in some circumstances. But essentially, we cannot do it on our machine. Maybe on a supercomputer, I don't know. Very close to impossible.
So now, what's the alternative? The alternative is doing some sort of iterative update. There are other alternatives as well, but this one is, you know, particularly inviting, in a sense, and also parallel to what's happening in neural networks. This is really gradient descent for solving this system. And it goes under several names: gradient descent, Richardson iteration, [INAUDIBLE] iteration. There is, of course, quite a bit of literature on this.
This is good, because it takes only order n squared per iteration, and it is easily GPU compatible. It's matrix-vector multiplication, right? I'm updating my vector of weights iteratively by multiplying it by the kernel matrix K.
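The update being described is, schematically, the following. This is a minimal sketch of gradient descent / Richardson iteration on the system K alpha = y, not the speaker's implementation; the step-size rule is just one standard, safe choice:

```python
import numpy as np

def richardson_iteration(K, y, n_iters, eta=None):
    """Gradient descent / Richardson iteration for solving K alpha = y.

    Each step is one matrix-vector product (O(n^2)), which maps well onto a GPU;
    the catch, discussed next, is how many steps are needed.
    """
    if eta is None:
        # A safe step size: 1 / lambda_max(K). (Computed directly here only
        # because this is a small-scale sketch.)
        eta = 1.0 / np.linalg.eigvalsh(K)[-1]
    alpha = np.zeros_like(y, dtype=float)
    for _ in range(n_iters):
        alpha -= eta * (K @ alpha - y)   # alpha <- alpha - eta * (K alpha - y)
    return alpha
```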
But how many iterations do I need? Right? If the number of iterations [AUDIO OUT], this is not so useful. So, how many iterations? Let's look at this function, the Heaviside step function. Just one-dimensional: you have one here and zero here. That's one of the simplest functions you can consider. After a hundred iterations with the Gaussian kernel, you get something like this. OK, that's maybe not too terrible, and you have some loss; the square loss is about 6 times 10 to the minus 2.
If you do one million iterations, however, you'll see that nothing changes. Literally almost nothing; you don't even gain one bit of precision here. What is going on? Going from 100 to 1 million iterations, you would hope to see something good happen, but it doesn't.
Turns out-- well, before I give the explanation, let me give you a real example of this. We can observe the same thing on actual datasets. These are 10,000 data point subsets of some datasets. Basically, you can see that to get optimal performance, you need more than 10,000 iterations of gradient descent. That's bad, because you're at cubic or worse now.
What is going on? Well, the point is that convergence is controlled by the eigenvalues of the kernel matrix. And you can prove that the eigenvalues of the kernel matrix decay nearly exponentially. That's very bad, because if the number of iterations you need is exponential, then the fact that each iteration doesn't cost very much is not really helpful. And you can prove that the dimension of the space of functions reachable by those iterations is only polylogarithmic. Basically, this is bad news if you want to use that method. At least, this is for Gaussian kernels, I should say. It doesn't necessarily apply to other kernels, but that's the popular type of kernel.
In fact, you can prove that only very smooth functions can be accurately approximated by a smooth kernel in a polynomial number of iterations. And the problem is that we don't really expect classification functions to be smooth.
Now, what's going on? Doesn't this [AUDIO OUT] contradict the classical rate for gradient descent? There is a classical rate, if you've seen it, which is of the form one over epsilon squared, while I'm saying this is like some logarithmic thing. Why is that? Well, it has to do with infinite dimensionality. I can explain that offline, if you would like. But that's basically the issue.
Well, what do you do? How do you solve this? There is a clear problem, right? The problem is that the eigenvalues decay quickly, and what your convergence depends on is the ratio of eigenvalues. So a way to address this is to actually [AUDIO OUT] the spectrum of the matrix. And you can do this by constructing a kind of new kernel, if you wish. Maybe this is the best way to explain it: you look at the spectrum of eigenvalues as a function of the index, and you have some sort of decay like this. What you do is literally flatten the spectrum up to some point. So that's a modified kernel.
How much performance gain do you get? The first k eigendirections you fit in essentially one iteration, and then you get a lambda 1 divided by lambda k plus 1 acceleration for each next one. And that is exponential, well, hopefully exponential. So you're potentially getting an exponential improvement in accuracy.
So that's kind of what it looks like: gradient descent is somewhere here, and this method that we have reaches a much larger space of functions using the same amount of computation. You can see that in the fat-shattering dimension: both are polylogarithmic, but this one has some sort of exponentially larger constant in front of it.
OK. How do you actually do this algorithmically? I don't want to spend a lot of time on this, but basically there is a way to do it. You take a small subsample, construct approximate eigenvectors from that subsample, do a Nystrom extension, and use that to build a preconditioner. That turns out to be efficient and accurate enough to get a very large increase in speed. Of course it's an approximation, but it's a good approximation.
And the nice thing is that it converges to the correct solution. Even if your approximation is not optimal, you still converge to the correct solution. So inaccuracy in constructing the preconditioner is only problematic to the degree that you won't get the full acceleration. Notice that we don't use any regularization in this.
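To make the "flatten the spectrum" idea concrete, here is a toy sketch that clips the top-k eigenvalues of a kernel matrix down to the (k+1)-th one. This is only an illustration of the idea behind EigenPro, not the published algorithm, which builds the preconditioner from the eigenvectors of a small subsample via the Nystrom extension so that it scales to large n:

```python
import numpy as np

def flatten_top_spectrum(K, k):
    """Return a modified kernel matrix with its top-k eigenvalues clipped
    down to the (k+1)-th eigenvalue; the rest of the spectrum is unchanged.

    Toy illustration only: it uses a full eigendecomposition, which you
    would never do at scale.
    """
    vals, vecs = np.linalg.eigh(K)              # ascending eigenvalues
    vals, vecs = vals[::-1], vecs[:, ::-1]      # reorder to descending
    clipped = vals.copy()
    clipped[:k] = vals[k]                       # flatten lambda_1..lambda_k to lambda_{k+1}
    return (vecs * clipped) @ vecs.T

# Gradient descent on the flattened matrix can take a step size roughly
# lambda_1 / lambda_{k+1} times larger, which is where the acceleration
# mentioned above comes from.
```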
So that's the high-level idea of EigenPro. Here are some practical results. I think the interesting part is that these are standard datasets of reasonable size. If you look at the [AUDIO OUT] time which our method requires, for example for this MNIST, this is a large version of MNIST with one million data points, it takes 0.35 hours, which is about 20 minutes.
And if you look at the literature, comparable results, slightly worse, require one hour on 1,300 vCPUs on AWS. So it's really a very different level of computational expenditure. We're running it on a single GPU, so it's maybe 1,000 times faster, or not quite 1,000 times, but a lot faster. The same here: 25 hours for one million data points on TIMIT versus 0.8 for us. Well, that's a particularly easy dataset.
So, yeah, you compare the amount of time it takes, and this is drastically faster than everything available. There is actually another fast kernel method from Lorenzo Rosasco, FALKON, from last year, which is also quite fast. But this is many times faster even than that.
Now, if you look at deep neural networks, the [AUDIO OUT] you actually get with deep neural networks on TIMIT, which is a speech dataset, is 32.4%, and here we get 32%, and that, you know, takes 20 minutes to get.
So really, the scaling is almost solved. It's not completely solved, because we still cannot do it on, say, 100 million data points; for that we need infrastructure. But to the [AUDIO OUT] degree that we can run a million data points in 20 minutes, it's really not the problem anymore. So, better performance with a far smaller computational budget, far, far less.
Now, this is some very recent work, but basically we can easily match the results from deep neural networks [AUDIO OUT] on speech intelligibility improvement tasks. I am not going to play those, but you can see this is clean speech, this is noisy speech, and you can reconstruct the clean speech. It can be posed as a classification or regression problem. You can do it, and it works really well. And these are the results we get out of the box, more or less; it's not even particularly difficult to tune. The neural network which we are matching against, well, it's not the newest result, but it's a rather complex architecture, and we're using a single parameter.
So if you can scale, if you can address the issue of computation, you do get [AUDIO OUT] results with these shallow networks, these kernel methods, kernel machines.
Now let me talk about the power of overfitting. Why is overfitting so good? And by good, I mean from a purely optimization point of view. I'm not addressing generalization here, just optimization.
Well, let me just very briefly remind you about stochastic gradient descent. Stochastic gradient descent is something like this. You're minimizing a sum of loss functions; the loss is computed at each data point, so it's a sum of terms, for example the square loss. And the natural idea here is to optimize just one term at a time. That's basically what stochastic gradient descent is. Instead of taking the derivative of the full sum, as in gradient descent, you take the derivative of one term. We know that the derivative of the sum is the sum of the derivatives, so why not just take one at a time?
Unfortunately, here is a big issue. Each of these terms is only weakly related to the total sum, because they are somehow very different. And what happens is that after t steps, the kind of bound that you get from [AUDIO OUT] bounds for stochastic gradient descent is one over t. For gradient descent, you get e to the minus t. Right? Gradient descent converges exponentially, or linearly in the optimization literature; stochastic gradient descent converges at this rate, one over t, or even worse.
What is going on? Why would anybody use this when you can do that? Well, you can say, OK, gradient descent is more expensive. Sure. But it's not actually any more difficult to compute; it's just more expensive per iteration. And if this is exponential and this is not exponential, what's the rationale for using the latter? However, all the neural network architectures do use stochastic gradient descent.
Well, so here is a key observation. The key observation is that this changes if you are in the overfitted, or more precisely the interpolated, regime. So if f of w star, where w star is your optimal solution, fits the data exactly, then what you have is that all the individual objective functions align, and the variance of your stochastic gradient descent actually decreases with the error. So what you get is exponential convergence of stochastic gradient descent.
And that has been observed in the literature, well, not in connection with interpolation in particular. Strohmer and Vershynin analyzed, well, the original method goes back to Kaczmarz, but Strohmer and Vershynin actually showed exponential convergence of that.
And once you realize this, you [AUDIO OUT] get fast stochastic gradient descent for free, effectively. If you compare this to some other methods available to accelerate stochastic gradient descent, which are not really widely used in practice, you get that plain stochastic gradient descent is just faster. It's just faster under this condition.
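Here is a minimal numerical sketch of that claim, on an over-parametrized linear least-squares problem where exact interpolation is possible. All sizes and the step-size rule are illustrative assumptions, not settings from the talk:

```python
import numpy as np

# Over-parametrized least squares: more parameters (d) than equations (n),
# so a solution with A w = y exists and SGD runs in the interpolated regime.
rng = np.random.default_rng(0)
n, d = 100, 500
A = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)
eta = 1.0 / (A ** 2).sum(axis=1).max()     # conservative single-sample step size
for epoch in range(20):
    for i in rng.permutation(n):           # minibatch size 1, random order
        w -= eta * (A[i] @ w - y[i]) * A[i]
    # The training loss shrinks by a roughly constant factor each epoch,
    # i.e. geometrically, rather than at the usual 1/t rate.
    print(epoch, np.mean((A @ w - y) ** 2))
```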
And the question, of course-- well, that kind of explains why stochastic gradient descent is good, it's exponential. But it doesn't explain why this is better than gradient descent itself. So why is this better than gradient descent, or what's the connection to gradient descent? It just shows that it's exponential, right? But maybe gradient descent would still have a better exponential rate.
So what we can show is the following. This is our result with Siyuan Ma and Raef Bassily: in effect, minibatch size one gives the best improvement per unit of computation for stochastic gradient descent. So if you are in the classical computational model, and you're counting the number of operations, then stochastic gradient descent with minibatch size one is [AUDIO OUT].
Of course, some of you might say, well, sure, but nobody uses this, right? Nobody uses stochastic gradient descent with minibatch size one. Why not? Well, the thing is, it's inefficient. Why is it inefficient? Because we're running these things on GPUs. GPUs are not sequential machines, right? GPUs are parallel machines.
What happens on GPUs is that you can basically get a small minibatch for the same price as a minibatch of size one. It's not quite the same price, but it's close. So what we need to do is analyze the dependence of performance on the minibatch size.
And we can prove the following. There is some sort of critical point here, such that up to it, one step with a minibatch of size 10 is roughly equivalent to 10 steps with a minibatch of size one. We call this linear scaling. So it really doesn't matter: if you have a fully parallel machine, this will be 10 times cheaper than that, while on a sequential machine it will have the same cost.
Now, beyond this point, beyond this m star, you have saturation. Saturation means you basically don't gain any more; it's a law of diminishing returns, and beyond this point you flatten out. And at m star, one step already gives you at least 1/4 of a full gradient descent step. So if your m star is much, much smaller than the full data size, you gain essentially nothing by going to full gradient descent.
Interestingly enough, this m star is actually almost independent of the data size itself. It depends only on the eigenvalues of the Hessian matrix, and you can get a direct formula for it: it's simply the trace of H divided by lambda 1 of H. So if you take the Hessian at the optimum, the trace divided by lambda 1 is m star. And as you can see, at least for things like kernel methods, this is roughly independent of the data size.
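In code, the formula is a one-liner. The sketch below assumes the kernel least-squares case, where the relevant Hessian is (up to scaling) the kernel matrix itself; the function name is mine, not from the paper:

```python
import numpy as np

def critical_minibatch(H):
    """m* = trace(H) / lambda_1(H), with H the Hessian at the optimum.

    For kernel least squares, H is proportional to the kernel matrix K,
    and the ratio is unchanged by scaling, so K can be passed directly.
    """
    lambda_1 = np.linalg.eigvalsh(H)[-1]   # largest eigenvalue
    return np.trace(H) / lambda_1
```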
That means that, potentially, you can get something like [INAUDIBLE] improvement of SGD over full gradient descent. And this behavior is consistent with linear scaling rules, something which was observed empirically in [AUDIO OUT].
Let me give you one example of this. This is real data, this is MNIST. And we're using a kernel here, not a neural net, because for kernels it's much easier to analyze the eigenvalues of this matrix. It turns out the critical size m star is [AUDIO OUT], and this is 10,000 data points. So you see that the acceleration factor is about 1,000, or maybe even like 10,000 for full MNIST. And what you see is that, indeed, m equals one is optimal. This is computation per epoch; notice that an epoch is a fixed amount of computation in the sequential model.
So what you have is that m equals one is optimal, and then m equals 8, which is its critical value, is very close to optimal. And then when you increase the minibatch size further, you get much worse performance. So it's pretty consistent with our theory.
But the interesting thing is, if [AUDIO OUT] is 8 and you have 60,000 data points, you have about 10 to the 4 acceleration. And that's kind of almost like Moore's law, right? If you have really big data, like 10 to the 6 or 10 to the 7 data points, you can get acceleration which is basically 10 to the 6. So a potentially gigantic acceleration.
So we know this is happening for kernel methods. We don't know for sure this is happening in neural networks, but the efficiency of SGD, and the fact that it actually converges to these interpolated solutions, indicate that probably something like that is happening.
Now we can, in fact, learn kernels for parallel computation. Maybe I'll skip this part. But we can modify the spectrum specifically to address this parallel computation and to optimize it for GPUs.
OK. So that was the power; that was explaining why SGD is so efficient. But now, what about generalization? Where does generalization stand in all of this?
So first I showed you that these overfitted or interpolated classifiers work very well, and that if you are in this regime, your SGD will work very well. But why, actually, does overfitting work?
And here we actually have very little help from theory. Most theory is something like this. If you take a machine learning class, or if you teach a machine learning class, you'll probably see something like this: the training error goes to zero, and you have a validation error which goes down and then up. This is the bias-variance tradeoff, and you are supposed to look for this point. So that's the theory; that's what most theory suggests.
Now, what is practice? Practice is like this: there is low variance, and there is no bias. The bias is zero in our examples, numerically zero.
What is going on? Well, there are a couple of explanations you can suggest. One explanation is a kind of classical one, the following. Suppose your data is linearly [AUDIO OUT]. Then the fact that you fit all the data is sort of OK, because this classifier, this line, separates the data perfectly, and it's a low VC dimension sort of thing. You can analyze it using classical results. If this is the case, then most of the theory already works.
The alternative is something like this: the data points are kind of crazy, and your interpolated classifier looks something like that. And we don't have much theory for this. So what is going on here? Which one is the real data? You can test it.
You can test it in two ways. First, you can do synthetic data. Second, you can do real data and label noise.
Here is how we would test it. You take data, and suppose it is linearly separable. Well, can I modify this data to make it not linearly separable? Very easily: I just flip some of those points at random. These are my random points; I just toss a coin and flip some of those labels. If the first explanation were the true one, then this would really break my classifiers.
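A minimal sketch of that label-flip protocol, for a synthetic two-Gaussian problem, might look like the following. The dimensions, class separation, bandwidth, and noise level here are all illustrative assumptions, not the settings used in the experiments being shown:

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf(X, Z, sigma):
    return np.exp(-cdist(X, Z, "sqeuclidean") / (2 * sigma ** 2))

rng = np.random.default_rng(0)
n, d, flip_frac = 500, 50, 0.01           # illustrative sizes and noise level
mu = 0.3 * np.ones(d)                     # class-mean offset

y_clean = rng.choice([-1.0, 1.0], size=n)
X = rng.normal(size=(n, d)) + y_clean[:, None] * mu    # two Gaussian classes

y = y_clean.copy()
y[rng.random(n) < flip_frac] *= -1        # flip a fraction of the labels at random

sigma = np.sqrt(d)
alpha = np.linalg.solve(rbf(X, X, sigma), y)   # interpolate the *noisy* labels

# Evaluate on fresh, clean test data drawn from the same distribution.
y_test = rng.choice([-1.0, 1.0], size=n)
X_test = rng.normal(size=(n, d)) + y_test[:, None] * mu
pred = np.sign(rbf(X_test, X, sigma) @ alpha)
print("test error:", np.mean(pred != y_test))
```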
Well, let's see what happens. These are two datasets. This is two Gaussians in 50 dimensions, and they're more or less linearly separable. So I run this kernel method, and I get zero error. Zero error, of course, is what you would expect. Now I flip some of the labels, 1% of the labels. You see that the classifiers are still exactly at the Bayes optimal, so they still work perfectly. Maybe 1% is not--
AUDIENCE: Do you fix sigma [INAUDIBLE], what do you choose?
MIKHAIL BELKIN: Sigma is fixed, yeah, in this. Sigma here, I think, is just fixed throughout those experiments. I think it was on the original. I'll need to check, but I think it's on the original dataset without noise. But really makes very little difference.
AUDIENCE: But if sigma is [INAUDIBLE].
MIKHAIL BELKIN: But it's not very small. I know it's not very small.
AUDIENCE: Yes, but it's small compared to the dataset.
MIKHAIL BELKIN: No, it's not 1-nearest neighbor. For any other sigma, I mean, we don't really have theory, so it doesn't really help. But I know it's not a 1-nearest neighbor classifier. It's very far from 1-nearest neighbor.
AUDIENCE: Yeah, I know. But how do [AUDIO OUT]. [INAUDIBLE]
MIKHAIL BELKIN: Yeah, but we don't change sigma. It doesn't change with the noise.
AUDIENCE: So sigma is based on the noiseless data.
MIKHAIL BELKIN: Noiseless, yeah. And when I add 10% noise, it gets slightly worse, but not much worse. It's still a very reasonable result. So clearly, I don't have some sort of crazy explosion of the loss here.
You can also look at the norms and see that the norms of those classifiers increase dramatically, but it doesn't seem to affect anything. A lot of generalization bounds are based on the norm. So you see that performance doesn't track this kind of model complexity here.
We can do it with Gaussians which overlap, so there is some sort of natural noise because of the overlap. Basically the same thing; I don't want to go over it.
Here is a really interesting example. We're adding large amounts of noise here: 20%, 40%, 60%, 80% of label noise [AUDIO OUT]. Even at like 60% label noise, the green is Bayes optimal, so that's the best possible. Because it's a synthetic dataset, I actually know what the best is. The kernel does slightly worse than the optimal, but only a few percent worse. It doesn't, you know, produce nonsense; it's still very good. So clearly, even for very, very noisy data, this interpolated classifier works remarkably well.
In fact, we can prove some theoretical bounds. Maybe I'll skip that.
But here are some [AUDIO OUT] explanations. One is VC dimension and Rademacher complexity types of bounds. I don't think they can work, because they cannot deal with interpolated classifiers when the Bayes risk is nonzero. For zero risk, they work fine. But for nonzero risk, they cannot bound the gap between the empirical risk and the expected loss.
Regularization-type analyses all diverge as lambda, the regularization parameter, goes to zero. Algorithmic stability has a similar issue.
There is some relevant work on classical smoothing methods like Nadaraya-Watson. Most of them don't support interpolation, but there are really two interesting examples. One is just the 1-nearest neighbor classifier, and one is something called the Hilbert regression scheme. Let me skip this.
The 1-nearest neighbor classifier is particularly interesting. It has zero loss, right? 1-nearest neighbor, you just take the nearest neighbor. But it has [AUDIO OUT] performance guarantees: its risk is at most twice the Bayes risk. And this is a sharp bound due to Cover, you know, in the '60s. Interestingly, the analysis is not based on margin assumptions, not based on uniform bounds; it directly estimates generalization, it doesn't try to bound the gap. So that's sort of where we're trying to look for a way forward, that type of analysis.
And let me give you an algorithm which actually has some of these guarantees. It interpolates the data, but it can be shown to be almost optimal. This is joint work with Daniel Hsu and Partha Mitra. Basically, what you do is take your data and triangulate it. It's not a practical algorithm; you wouldn't want to triangulate data, that's a very expensive process. But let's not worry about this. You triangulate the data, and then you just linearly interpolate the values.
It looks something like this. If these four points are red and this is blue, I get this. This value stays at 0, and this is 1. So I have 0.5 around this point and higher on this point. And then I just threshold. So this will be now blue, and the rest will be red. Very simple classification algorithm.
It turns out you can show that this is nearly Bayes optimal in high dimension. This is under some margin condition, the [INAUDIBLE] margin condition, but you can prove it without the margin condition; you just won't get an exponential here. What you have is that, basically, in high dimension, the risk of that classifier, given a sufficient amount of data, is very close to the true Bayes risk. So it's like 1-nearest neighbor, but instead of a factor of 2 here, I have something like 1 plus 1 over 2 to the d. So this is very tight, at least when the dimension is high enough and you have enough data.
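Here is a small sketch of that simplicial-interpolation classifier in two dimensions, using off-the-shelf SciPy routines. The data and the decision boundary are toy choices of mine, and, as noted in the talk, this is an object of analysis rather than a practical algorithm, since triangulation is expensive in high dimension:

```python
import numpy as np
from scipy.spatial import Delaunay
from scipy.interpolate import LinearNDInterpolator

# Toy 2-D data with +/-1 labels (here from a simple linear boundary;
# in general the labels could be noisy).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

tri = Delaunay(X)                        # triangulate the training points
f = LinearNDInterpolator(tri, y)         # piecewise-linear interpolant of the labels

def classify(points):
    """Simplicial interpolation classifier: threshold the interpolant at zero."""
    return np.sign(f(points))            # NaN outside the convex hull of the data

# Two test points on either side of the true boundary.
print(classify(np.array([[0.5, 0.5], [-0.5, -0.5]])))
```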
And [AUDIO OUT] have a blessing of dimensionality: this algorithm actually becomes better as you have more dimensions. That's an unusual property of this algorithm. So that may be something which gives a way forward on how to view interpolation. Maybe something like that is happening.
Well, let me now summarize with a couple of slides. First, I think, clearly, this is kind of the take-home message: overfitting allows for very efficient optimization by stochastic gradient descent, and for generalization, which we don't fully understand why it happens. And it's not a trivial issue to actually overfit the data. When you have a lot of data, solving these systems, or doing this optimization, really requires thinking about what kind of algorithms are involved and how to deal with that. And the question, from a practical point of view, is: can we actually scale those kernels? Because kernels we understand much better than we understand deep networks.
Now, you can maybe understand a part of deep learning in this form. First, you have this over-parametrization, which naturally leads to interpolation. And surprisingly, that generalizes well. But once you believe that this generalizes well, it results in fast SGD. And fast SGD maps naturally onto GPUs, because that's really, like, almost by design. SGD really is designed for GPUs in some sense, or GPUs are designed for SGD, I don't know. I mean, they were developed independently, but it's an extremely good match. So I think there is a lot of serendipity in how these things worked together, and a lot of surprising aspects to that.
There is one aspect of [AUDIO OUT] learning which I didn't talk about at all, the convolutional structure. I think that's a really important aspect which is kind of orthogonal to those optimization and generalization aspects. So that's one.
And finally, I think it's really a fundamental question of high-dimensional inference: why do those classifiers generalize? It has been observed for deep neural networks, and we are now observing it for kernel machines. It was observed for random forests maybe 15 or 20 years ago. For AdaBoost it goes back to '95; people said that AdaBoost doesn't overfit, and clearly the margin explanations are not the complete story there. I think it's actually time to revisit high-dimensional statistics and maybe get some new understanding of these problems.
Thank you, I'll stop here.
[APPLAUSE]