The topology of representation teleportation, regularized Oja's rule, and weight symmetry
Date Posted:
April 3, 2019
Date Recorded:
April 2, 2019
Speaker(s):
Jon Bloom
Brains, Minds and Machines Seminar Series
Description:
Speaker: Dr. Jon Bloom, Broad Institute
Abstract: When trained to minimize reconstruction error, a linear autoencoder (LAE) learns the subspace spanned by the top principal directions but cannot learn the principal directions themselves. In this talk, I'll explain how this observation became the focus of a project on representation learning of neurons using single-cell RNA data. I'll then share how this focus led us to a satisfying conversation between numerical analysis, algebraic topology, random matrix theory, deep learning, and computational neuroscience. We'll see that an L2-regularized LAE learns the principal directions as the left singular vectors of the decoder, providing a simple and scalable PCA algorithm related to Oja's rule. We'll use the lens of Morse theory to smoothly parameterize all LAE critical manifolds and the gradient trajectories between them; and see how algebra and probability theory provide principled foundations for ensemble learning in deep networks, while suggesting new algorithms. Finally, we'll come full circle to neuroscience via the "weight transport problem" (Grossberg 1987), proving that L2-regularized LAEs are symmetric at all critical points. This theorem provides local learning rules by which maximizing information flow and minimizing energy expenditure give rise to less-biologically-implausible analogues of backpropagation, which we are excited to explore in vivo and in silico. Joint learning with Daniel Kunin, Aleksandrina Goeva, and Cotton Seed.
Project resources: https://github.com/danielkunin/Regularized-Linear-Autoencoders
Slides: Jon_Bloom_CBMM_02April2019.pdf
Short Bio: Jon Bloom is an Institute Scientist at the Stanley Center for Psychiatric Research within the Broad Institute of MIT and Harvard. In 2015, he co-founded the Models, Inference, and Algorithms Initiative and a team (Hail) building distributed systems used throughout academia and industry to uncover the biology of disease. In his youth, Jon did useless math at Harvard and Columbia and learned useful math by rebuilding MIT’s Intro to Probability and Statistics as a Moore Instructor and NSF postdoc. These days, he is exuberantly surprised to find the useless math may be useful after all.
PRESENTER: Hello, everyone. I'm [INAUDIBLE]. I'm a postdoc at CBMM working with Tommy Poggio, and it's my pleasure to introduce Jon Bloom today. Jon is an Institute Scientist at the Stanley Center for Psychiatric Research at the Broad Institute, which is part of MIT and Harvard. He co-founded the Models, Inference, and Algorithms Initiative and also a team called Hail that's building distributed systems that are used throughout academia and industry to uncover the biology of disease.
In his previous life, Jon was a mathematician at Harvard and Columbia and then also a Moore instructor and NSF postdoc at MIT in mathematics. And somehow, he has come back to doing some math now, and he's going to be telling us about the topology of representation teleportation, regularized Oja's rule, and symmetric weights. And without further ado, I'll just leave it to him, and let's have fun.
JON BLOOM: All right. Well, I hope this will be fun for you. It's definitely going to be fun for me. I was not expecting to be giving a talk to neuroscientists if you had asked me a couple of months ago. But you'll hear why I'm here today, and I hope you find it interesting. So the heart of this talk is the following observation and then question.
So the observation, which I don't think is very controversial, is that the way that artificial neural nets predict is inspired by how neurons connect in the brain. Now, another question you could ask is whether the way that the brain learns has anything to do with how we train neural networks in a computer, namely with gradient descent through back propagation. And if you look to Stephen Grossberg's fascinating paper from 1987, this is figure eight.
He's showing back propagation in a diagram, and he says that in this diagram, you can see some necessity to transport weights from some neurons to another set of neurons and that because of this, the algorithm cannot represent a learning process in the brain. And if you look a little further in the paper, he says, such a physical transport of weights has no plausible physical interpretation, violates basic properties of locality, and therefore the BP model is thus not a model of a brain process.
So the answer is no, and I hope you enjoyed my talk. Thank you very much. All right, so yeah. This is a problem. Let me see if I can break down a little more clearly what he's talking about. So on the left, I've drawn a pretty simple neural network. There's an input layer, x, an output layer, y, there's one hidden layer h.
We have some weights, a matrix w0 that's going to map us from x to h, another set of weights, w1 that will map us from h to y, and then in the context of back propagation, we're going to need to send signals backward through this let's say linear network. And to do that, we need the transpose of w1, which I've also drawn.
And so the actual update equations are shown there. The important piece is looking at how w0 is updated. It involves the transpose of w1 in the formula. And that's really the whole point that in order to do back propagation in the computer, we reuse the weights that were used in the feed forward pass when we're doing back propagation going backwards, except when you go backwards, you have to transpose the linear map.
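For concreteness, here is a minimal sketch of those updates for the one-hidden-layer linear network just described. The shapes and function name are my own assumptions, not the slide's, but the key point is visible in the last gradient: updating w0 reuses w1 transposed.

```python
import numpy as np

# Minimal sketch: x in R^m, hidden h in R^k, output y in R^n, W0: k x m, W1: n x k.
def backprop_step(W0, W1, x, y, lr=1e-2):
    h = W0 @ x                    # feedforward to the hidden layer
    y_hat = W1 @ h                # feedforward to the output
    e = y_hat - y                 # output error
    dW1 = np.outer(e, h)          # gradient for the output weights
    dW0 = np.outer(W1.T @ e, x)   # the W0 update reuses W1 transposed
    return W0 - lr * dW0, W1 - lr * dW1
```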
Now, in the brain it makes no sense to send the signals backwards over the same neurons. I don't know much about neuroscience, but I know that the axons go to the den-- see, I don't even know that. The dendrites go to the axon. And so that's a problem. Neurons are unidirectional. On the other hand, there are plenty of neurons that might go the other direction, but they're different neurons.
And so the idea of weight transport is, how would you physically take the weights encoded in synapses between neurons going forward over here and induce the same weights in these other neurons over here? That feels very non-local, and there doesn't seem to be any plausible physical biological mechanism. That's the problem.
Now in 2014, and I think this was published in 2016 in Nature Communications, I think Timothy Lillicrap actually accidentally discovered that you could just put random weights, so fix a random matrix b and don't update it and then use exactly the equations on the last slide. So this avoids needing to use the transpose of w1.
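A hedged sketch of that feedback alignment variant, reusing the notation from the sketch above: the only change is that the backward pass uses a fixed random matrix B in place of W1 transpose.

```python
def feedback_alignment_step(W0, W1, B, x, y, lr=1e-2):
    h = W0 @ x
    y_hat = W1 @ h
    e = y_hat - y
    dW1 = np.outer(e, h)
    dW0 = np.outer(B @ e, x)      # B is a fixed random k x n matrix, never updated
    return W0 - lr * dW0, W1 - lr * dW1
```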
And when you do this, it turns out, when you train it on MNIST to recognize handwritten digits, it works. It works pretty well. It learns. And on the right, what you're seeing is a flow showing how w0 and w1 move around in the scalar case and get to that curve in the upper right, which is where you want to be to have good prediction. So the point of the figure on the right is to show that for most of the places you initialize your parameters, if you descend through this feedback alignment algorithm, you will get some amount of alignment of w1 with b, and you will learn.
But not always. A lot of people have run with this idea and tried to come up with other forms of learning rules, update rules that aren't gradient descent but can learn nevertheless. I think Yoshua Bengio just won the Turing Award. He has a number of papers on this issue, Hinton has some, Lillicrap, and also Tommy Poggio. And so there's a paper on whether you actually need weight symmetry, or is it enough if just the signs of the entries are the same? And they show that the signs are plenty good. From that paper, they say that the weight symmetry problem is arguably the crux of the implausibility of back prop.
And then last year at NeurIPS, Hinton and Lillicrap and others-- also Blake Richards, by the way, who's speaking in the next session of the seminar-- put out a benchmark where they asked, well, do all these biologically plausible algorithms, do they scale? They work on MNIST. Do they work on ImageNet, for example, with much deeper networks?
And their main conclusion is no. So something called target propagation and also feedback alignment perform significantly worse than back prop, especially for networks composed of locally connected units, opening questions about whether new architectures and algorithms are required to scale these approaches. OK, so why am I here?
So I proved with a couple of colleagues a math theorem, and it's really just straight up elementary linear algebra in the end, and it's about a super simple machine learning model. It's just a linear autoencoder. So one matrix that encodes and one matrix that decodes. You might know that this is basically principal components analysis.
And what we proved is that if you regularize the problem, if you apply weight decay, then you have the property that at all critical points of this loss landscape, the encoder and decoder are the same, just transposed. So they're symmetric. And that's different than if you do not impose regularization.
If you don't regularize, then they're pseudo inverses. They try to be as much inverse as possible. When you regularize, the behavior is quite different. If you double the first, you would also double the second in the transpose case, rather than halve it. So we put out a preprint. We talked about related algorithms, we talked about some topology, Grassmannians, things like that, and then we got an email from this guy.
Thanks for reaching out. This is interesting. The question of obtaining the transpose is actually pretty important for research on a biologically plausible version of back prop, because if you obtain approximate transposes, then several local learning rules give rise to gradient estimator analogs of back prop. Cheers, Yoshua. So that was exciting.
All right, so I guess that was seven weeks ago. So there were some things to look up on Wikipedia and then on the arXiv, and all those papers, for example, that I showed you, and it's a really fascinating history of people thinking about how symmetry might arise. But in the face of it not arising, do you really need it? Those are the kinds of questions people have been asking.
OK, great. So we got that email from Bengio. So now the answer is yes, by pure logic, the brain must do this. We're done. Is that allowed? I'm a mathematician, is that good enough? I'm looking at my NSF supervisor from back in the day. You're convinced, right?
No comment. Are the neuroscientists convinced? How about the neuroscientists in the room? I mean, have we solved that? Are we done? Probably it's not very clear, and I want to hedge now. So I would say the answer is, well, maybe. Maybe we don't have to be as pessimistic as Grossberg.
And what would be really fun is if we could actually build statements that could be tested. And tested not just in silico, but in vivo, and that's part of why I'm here today, just to think if there might be some conversations around how one might do that. All right, so what I'm going to tell you about today are a family of ways that symmetry might emerge dynamically without needing some spooky physical action at a distance.
And I'm going to coin these as not necessarily biologically plausible. I don't like that term, because I don't think any of these are remotely biologically plausible, but maybe less biologically implausible. And so the two that I'll focus on specifically are on the right. They're called information alignment and symmetric alignment as of two days ago, and instead of b, those random weights, we're going to actually have parameters called w2. And those are going to get updated in learning, but they're going to be updated by local rules that are no less plausible than in feedback alignment.
And what we're going to see is that symmetry is an emergent property. In other words, these are networks that learn much more like back propagation without requiring weight transport. So I mentioned I'm a mathematician by training. That was all the way through a postdoc at MIT, which is the lower left. And the math I used to do I didn't think would be quite relevant to biology. Lately, I'm figuring out maybe it is, specifically because it has something to say about optimization.
And we do a lot of modeling at the Broad, we have a lot of data. And so that's where I moved about three years ago. This is my very fancy animation across the street, which begs the question, why did the mathematician cross the road? Which I could talk about wanting to do more applied things, although the math that I used to do with this fellow is very, very exciting and beautiful.
Mostly it was that my wife is an eternal grad student in neuroscience, and I wasn't going to be able to move with my five-year-old in the picture as well, so I needed a job. That was a big part of it. And also that neuroscience and biology seem really fun and exciting, and I was teaching the introductory probability and statistics course at MIT. So I checked it out and went across.
But of course, that's all not true. The real reason is, the intermediate value theorem. Because you can't actually cross the road-- continuous-- never mind. That's why I crossed the road. That's the theorem. OK, great. A couple of weeks ago, I gave a talk over at Google Brain I guess which is now Google AI. That's another corner of maiden names, and then here I am at CBMM.
It's very exciting. I think there should be some ribbon, something like a grand slam. Like you go around the-- anyway. OK, but let me take you back to the Stanley Center. The Stanley Center is for psychiatric research. They think a lot about the brain, although less from the cognitive aspect and more from the genomic and molecular and therapeutic aspect. And so does anyone know what is in that picture there?
It's a trick question. No, that's a microscope. So yeah, they're sequencers, but the way we use them is sort of as the 21st century microscope. We can use them to actually look inside cells and count RNA and understand what are the molecules that cells contain at any point in time. And so whereas in the past you might need hundreds of years of humankind looking through microscopes at morphology to distinguish different kinds of neurons, we have the opportunity to actually just measure all the molecules in neurons and then just cluster that kind of data.
So that data looks like this, and this is really the starting point for the biological problem that led to this work about autoencoders and then back to the brain. This is thinking about data that looks like a matrix where the columns are cells, the rows are genes, and an entry is a count of how many RNA molecules for that gene are in that cell. So it's just positive integers or non-negative integers.
And if you do non-negative matrix factorization in a clever way, you can do things like get a picture. This is probably [INAUDIBLE], but it's colored according to factors that are learned in non-negative matrix factorization, and they represent things like cell types and activity programs in the cell, which is pretty exciting to just pull all that out of data.
So that kind of taking high dimensional data and looking for latent structure, so representation learning in cells, in machine learning, one approach is called an auto encoder. So you have data on the left, and you take a column which is a cell represented as features, which are these counts across genes, you shove it through an encoder into some lower dimensional space, and then you re-expand it, and you compare.
And you're trying to basically get back what you started with, right? And those functions, f1 and f2, could be arbitrary, deep nonlinear neural nets. We were interested in something a little-- well, I don't know if I want to say more complicated. We were interested in not just learning about cell types. We were more interested in what is the programming language of the cell?
How does a cell take inputs from its environment, do processes internally, different kind of genetic modules operating in different ways, respond to that environment or develop and change through time? And so that would be more like the cell plus plus language. And here, what you see are two encoders, one encoding genes, one encoding cells, and we want to take those representations and bash them back together to recover the entry, how many RNA molecules.
And this is really just a nonlinear singular value decomposition, kind of a deep matrix factorization model. And so we were playing with this for a while, and it was kind of interesting. One thing that's very challenging in this space is that everything is super unsupervised. You have no idea if the results you're getting are good. There's no benchmark except what's in the biologist's head.
You talk to them and they say, this kind of looks reasonable. So it's a hard iterative loop. Keep that in the back of your mind through this story. But one thing we then really do care about is, are the representations we learn, are they meaningful? Are they interpretable? Because that's maybe the best measure of success.
They mean something. They mean a type. They mean a state. This is in the g-phase of mitosis. This neuron is stressed. This kind of thing. And so we focused back in on the simpler case of a linear auto encoder, and this is the main mathematical object of the talk.
So linear auto encoder is very simple. You don't have these deep neural nets encoding or decoding. You just have a matrix, w1, that maps you down, and another matrix, w2, that maps you back up. And so you take your data x whose columns are your individual points on the left, and you get this other matrix, w2 w1x, which is the reconstruction. And the loss function is measuring the difference, x minus w2w1x.
That's the reconstruction error. It's the squared residual error. And we can describe one solution, one minimum to this loss function in terms of the SVD, and that's the Eckart-Young Theorem, 1936. It says that if you take x and you represent it as an orthogonal matrix times a diagonal matrix of singular values times another orthogonal matrix, and those are the left and right singular vectors, then the singular vectors can be used to give you the best low rank matrix approximation.
And so here, xx transpose is the covariance, and the left singular vectors are u, which you might also call the principal directions in the data, and they can be used to construct the encoder, uk transpose, and the decoder, uk. Those are just the top k principal directions. And then the product, uk uk transpose, that map is going to be the orthogonal projection of your data in, let's say, m dimensional space onto some k dimensional subspace that is closest to the data in this squared distance sense.
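A small sketch of that Eckart-Young construction. The function name and the convention that columns of X are (mean-centered) data points are my assumptions; the content is just the SVD.

```python
import numpy as np

def pca_projection(X, k):
    # Assumes X is already mean-centered; its top-k left singular vectors
    # are the top-k principal directions of the data.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    U_k = U[:, :k]
    encoder = U_k.T                  # one choice of W1
    decoder = U_k                    # one choice of W2
    projection = decoder @ encoder   # orthogonal projection onto the top-k subspace
    return encoder, decoder, projection
```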
It's an orthogonal projection. But it doesn't give you any special basis for that projection. In particular, it doesn't actually tell you the individual principal directions or eigenvalues. And to see why, here's a little more complicated picture. Now, I'm going to tell you what all the minima are.
So notice, if you had w1 and w2 as parameters, you could compose w1 with g inverse, some invertible matrix g inverse, and then you could compose w2 with g. And if you took that product, you'd get exactly the same loss, because you'd get exactly the same composition. And so we knew that uk and uk transpose were one solution.
But here is the general solution of minima here. You can multiply w1 by an invertible matrix and w2 by its inverse in this way. So it's a whole manifold of dimension, I guess k squared that constitute the minima of this problem. It's very over parametrized. But there is one nice property that all these minima have, and in fact all critical points of this model have, which is that they are pseudo inverses.
They're not necessarily square matrices, so they might not be invertible. But a pseudo inverse is sort of the closest thing to an inverse for a rectangular matrix, and you can get it by truncating the SVD. Yeah, so as a former topologist, when I see something like this and there's just all these solutions and it just feels like they're all the same solution, I want to quotient it out. I want to get rid of this symmetry.
And to do that as a mathematician, I ask myself, well, what is the actual domain of this problem? Is it really these encoders and decoders? Because I'm thinking about PCA, right? Let me put it this way. If you want to study the distance between points on the earth, you could make a map. And it's very arbitrary. There's lots of ways you can make a map, and it's actually quite hard with a flat map to measure distance.
And if you're only allowed to work with matrices, with sort of Euclidean space, you really can't do anything else. But if you're allowed to work with a sphere, with a globe and you're comfortable with that manifold, well now, everything is very natural. And in the same way, I'm just going to argue that for PCA to make it a very natural learning problem, the domain is not a pair of matrices or a matrix factorization. It's just exactly the thing that the problem asked for.
Find the k plane closest to the data. So what is it? It's the manifold whose points are k planes through the origin. And that thing has a name. It's called the Grassmannian. It's a smooth manifold, and it can be understood in terms of its actual shape, its topology and its geometry, like distances and geodesics, by thinking of it as embedded in a very high dimensional Euclidean space.
So in this case, the Grassmannian of k dimensional planes and m dimensional space, you can think of it as living in the space of m by m matrices as a subset. And that identification is effectively identifying a k plane with the matrix that does the orthogonal projection onto that k plane, which is like the w2w1 that composed map from the last slide.
So I don't want to go deep into the math of this right now. I just want to speak intuitively by giving a couple of pictures. All right, so in this picture, we're considering lines in the plane. So our data is two dimensional, lives in the plane, and we're looking for the first principal component. And through the origin, there's a whole family of lines. In fact, there's a circle's worth of lines, because you can take one line, and you can move it to nearby lines.
But eventually, you go 180 degrees, and you're back at the same line. So it's a one dimensional manifold. It's a closed manifold. It's a circle. It's a circle. It's weird, but a point on the circle means a line. And there's a very natural smooth function on that circle, on that space of lines, which is just the squared distance from the line to those points. As you move the line, those distances will change and their squares will change, and you'll just add them up.
And the top principal component is going to be the line that minimizes those sums of squared distances to the point cloud. And in this picture, it's u1. And you can think of it as the minimum of a smooth function on the circle of lines. So this red point is the top principal component. The blue is the bottom principal component. And moving around the circle is like rotating from one line to the other.
And notice, you can go two ways down, and that's because if you start at the blue, you could rotate clockwise or counterclockwise. And so as a smooth function, you see there's a min and a max. And actually, the values of this function at the min and the max are the eigenvalues of the data, it turns out.
OK, and we can also think about the gradient flow. You're trying to find the minimum value that will be the top principal component. So if you start somewhere at some line, you want to do gradient descent on this manifold. You want to flow down in terms of the loss function until you get to the top principal component, and that's just going to be wherever you are on the circle, you're going to flow down.
You can think of it as the height function. And we can maybe represent that information by saying, well, there's a minimum, there's a maximum, and there's these two flow lines under the gradient connecting them. One more picture. A little fancier, but it just gives you the sense of how this generalizes.
Here, it's lines in three dimensional space, and they're going to be three principal components. And if you think about it, the top principal component, which is closest to the data, if you're in three dimensional space, you could move it like this or you could move it like this. So you have two dimensions to move it around.
So this is a two dimensional manifold. It's actually called the real projective plane. And that means that when you're at the minimum, as you move away, the loss will increase, but it's like a cup. It's like a paraboloid. And so it looks kind of like this. And then similarly, when you're at the bottom principal component, meaning the one that's farthest from the data, as you move it away, you're going to drop down in the loss, and so that's like a cap.
And it turns out that the middle principal component, if you imagine rotating it toward the top principal component, you'll also go down either way you rotate it. And if you rotate it toward the bottom principal component, you'll also go up either way you rotate it. So that's going to be a saddle.
It's going to have directions you can move to go down in the loss and also directions that can go up. But you can't actually connect all these together, because it turns out the real projective plane can't be embedded in three dimensional space without self intersections. Now, we're way in topology. But without going into too much more detail, you can actually draw it as a disk with opposite points identified, and you can begin to draw what these gradient flow lines are, and you can begin to encode how adjacent index saddles-- this is sort of index two, index one, and index zero, how many sort of negative eigenvalues of the Hessian.
You can draw these gradient trajectories between adjacent indices, and it turns out that this actually encodes pretty much all the topology of the space. It's kind of remarkable, and this is where algebraic topology and Morse theory and Morse homology go, if you've ever heard those terms. And I'm not even going to bother with planes in R4, other than to show you it gets more complicated.
So the big picture of this story is, if you think about the Grassmannian, there's a loss function. It's Morse, meaning nondegenerate at all critical points, when the singular values are distinct and positive. You get a critical point for every k plane which is the span of k principal directions in m dimensional space. So that's m choose k different critical points.
Critical values are sums of eigenvalues, which makes this a really good toy model for random matrix theory. And furthermore, you can get between these critical points by rotating one principal direction to another sort of fixing the rest in the plane you have.
AUDIENCE: [INAUDIBLE] that you apply, right?
JON BLOOM: The critical point was--
AUDIENCE: [INAUDIBLE]
JON BLOOM: Yeah, here there's no g. There's no symmetries left at all.
AUDIENCE: [INAUDIBLE] that somehow the [INAUDIBLE] here you say that it is discrete.
JON BLOOM: Here, no. In the first story, in fact almost every critical point is a whole manifold. It's not isolated. There are actually even more of those connected components than there are points in this story, because in this story, it's only the full rank k. You'll see the picture, but what I was describing before was just the minimum in that over parametrized problem was an entire manifold.
Here, the minimum is just one point. There are also manifolds corresponding to all these other critical points.
AUDIENCE: [INAUDIBLE]
JON BLOOM: Because when we think about the plane itself, we're not defining any basis for that plane. That's the main reason. We got a lot of coordinates. It's a geometric object. The point is a plane. It's not a plane with a basis.
That's more or less what's going on. So that story, I know I went kind of quickly through it, but the good news is that the seminar that I organized on Wednesday mornings a few weeks ago, we did a two hour presentation on some of this work, including a full hour on the topological story.
So if you're interested, you can go to the website. You'll see what it is at the end and check that out. But I want to give you a flavor for that, because it gives you a sense not just of what the critical points are, what the minimum is, but how actual optimization is happening. You can feel how these principal directions are moving around, and you can actually say a lot about how critical points are going to relate to each other through gradient trajectories, which more generally is relevant to deep learning when you have, say, any loss function and you want to understand how different minima might relate as well.
And more of that story is actually in the Google Brain talk. OK, so here's where we left off in our over parametrized story. And I want to claim we can do better, even without moving to manifolds and Grassmannians, than this symmetry group of invertible matrices. I want to claim that we can reduce to orthogonal matrices. And I'll tell you how to do that. Orthogonal matrices mean that they preserve lengths.
The rows and columns are pairwise perpendicular and unit length. And really, all this comes down to is a connection between regularizing a problem, meaning penalizing big weights, and orthogonality. And at its essence is the problem of, you've got to build a fence, it's rectangular, and you want to enclose as much area as possible. What do you do? You build a square.
So here are two fancier versions of that statement. One is that if you say that your matrix has to be volume preserving, or that the squared determinant is one, and you ask for the minimum Frobenius norm, meaning the minimum sum of squares of entries, those are exactly the orthogonal matrices. In other words, the cube has the least perimeter, in a sense, of all parallelepipeds that hold one unit of volume.
And the proof is that the left hand side is the sum of the eigenvalues. The right hand side is the product of the eigenvalues. And so you're just trying to find the smallest sum fixing the product. And that will be when everything you're multiplying together is the same. So a second cool fact is that if you require two matrices a and b to be inverses, so you multiply by a and then b like an encoder and decoder, and you want them to actually be inverses, ab equals the identity, and now you're going to try to minimize the regularization, the sum of the Frobenius norms, that actually turns out to be solved by orthogonal matrices, a matrix o and its transpose.
And that's because the left hand side is the sum of the squares of the eigenvalues and the inverses of the eigenvalues. Because inverted matrices have inverse or reciprocal eigenvalues. And so minimizing that, you want all of them to be plus or minus one, the singular values. All of the eigenvalues should be one.
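In symbols, the two facts just described amount to the following. This is a sketch in the notation above, writing sigma_i for the singular values of A.

```latex
\min_{\det(A)^2 = 1} \|A\|_F^2 \ \text{ is attained exactly at orthogonal } A, \qquad
\min_{AB = I} \ \|A\|_F^2 + \|B\|_F^2 \ \text{ is attained at orthogonal } A \text{ with } B = A^\top,
\quad\text{since}\quad
\|A\|_F^2 = \sum_i \sigma_i^2, \qquad \det(A)^2 = \prod_i \sigma_i^2, \qquad \|A^{-1}\|_F^2 = \sum_i \sigma_i^{-2}.
```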
That means length preserving. That means orthogonal. And in particular, a is b transpose then. So this begins to hint at how we might reduce the symmetry to orthogonal matrices. We're going to do it by adding another term, which is penalizing the squares of the weights. And now, if you act by a matrix, it has to preserve the lengths of the columns of w1, so it can't be any old matrix g, it actually has to be orthogonal.
And the beauty of that is it does another thing. It solves this problem of not being able to learn the individual principal directions, because it ends up giving you a composed map at the critical points which has this form. It looks a lot like what we had before, but there's some shrinkage. What happens is, you learn a map that projects you into that same principal subspace, but then it shrinks.
It shrinks the data along principal directions, and the amount that it shrinks relates to the eigenvalue but also how much you regularize. The more you regularize, the more it shrinks. And if you regularize so much that lambda gets bigger than an eigenvalue, it shrinks all the way to zero. You lose all information about that direction.
And that's because you end up getting complex solutions. You just go into complex land. And so the theorem that I mentioned before is that if you regularize, then at all critical points including the minimum, you have this nice transposed relationship instead of pseudo inverse. And we can observe this empirically.
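To make the theorem concrete, here is a hedged sketch, not code from the project repository, that runs plain gradient descent on an L2-regularized reconstruction loss and checks that w1 approaches w2 transpose. The hyperparameters are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n, lam, lr = 10, 3, 200, 0.1, 0.05
X = rng.normal(size=(m, n))                    # columns are data points
W1 = 0.1 * rng.normal(size=(k, m))             # encoder
W2 = 0.1 * rng.normal(size=(m, k))             # decoder

# Loss: (1/n) * ||X - W2 W1 X||_F^2 + lam * (||W1||_F^2 + ||W2||_F^2)
for _ in range(3000):
    R = X - W2 @ W1 @ X                        # reconstruction residual
    dW2 = (-2 / n) * R @ (W1 @ X).T + 2 * lam * W2
    dW1 = (-2 / n) * W2.T @ R @ X.T + 2 * lam * W1
    W1 -= lr * dW1
    W2 -= lr * dW2

print(np.linalg.norm(W1 - W2.T))               # small: encoder and decoder become symmetric
```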
So here, the green line is for the sum regularization, the standard way you regularize. And you can see if we train a model, that w1 and w2 transpose attract each other. And that's a logarithmic scale, so they attract each other pretty darn well. OK, so that's one theorem. But if you look at the form of this minimum here for the decoder-- well, let me tell you something else about the loss landscape, actually.
This is something that, from the PCA story, that topology, the gradient descent story, you can start to feel why it's true. The loss function without regularization is sort of obviously strict saddle, but with the regularization, too, it's strict saddle. That means all the minima are global minima. There's just this one manifold. The set of minima is now an orthogonal group instead of a general linear group, and you always get there.
If you use basically any optimization algorithm with a sufficiently small learning rate, you find your way to the bottom. You never get stuck. And that means that if we train the model, then we can get the top principal directions of x that have eigenvalue greater than lambda just by taking the SVD of the trained decoder. Because the only uncertainty in that decoder is the orthogonal matrix on the right.
That's arbitrary, but no matter what it is, you still have the SVD form. And so the singular values and the left singular vectors of w2 are well-defined. They're the same no matter which minimum you get to. And the left singular vectors are the same as the top principal directions of x, and you can back out the eigenvalues basically through the formula there. And that means you actually get an algorithm for PCA which involves two steps-- train your regularized autoencoder, and then do SVD of the decoder.
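Continuing the training sketch above, the second step is just an SVD of the trained decoder. The eigenvalue back-out formula is in the paper, so it is only indicated here, and the final comparison against a direct SVD of X is my own sanity check, not part of the algorithm.

```python
# Step 2 of the sketched PCA algorithm: SVD of the trained decoder W2.
U2, S2, _ = np.linalg.svd(W2, full_matrices=False)
principal_directions = U2          # left singular vectors of W2: top-k directions of X
# The corresponding eigenvalues can be recovered from S2 and lam via the
# formula in the paper. Sanity check against a direct SVD of the data:
U, S, _ = np.linalg.svd(X, full_matrices=False)
print(np.abs(U[:, :k].T @ principal_directions))   # should be close to the identity up to signs
```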
And the second step is sort of no time at all. I won't have time to go into this today, but there's a really cool connection here, which is that if you don't add regularization, it turns out that this autoencoder is mathematically identical. Its update rule through gradient descent is Oja's rule, which is sort of a stabilized version of Hebbian learning. That's quite old.
So the neurons that fire together wire together. Actually, the autoencoder model is that. Regularizing is not that. It's lazy Oja's rule or sleepy Oja's, or I don't know what to call it. But in any case-- you can think of--
AUDIENCE: [INAUDIBLE]
JON BLOOM: It's when you add l2 regularization. So without regularization, mathematically the update for gradient descent is equivalent to Oja's rule for learning PCA, or the first principal direction, let's say. And this algorithm is different. You also regularize. And it has lower complexity to do PCA than Hebbian learning. So I'll come back to the algorithmic side a little later.
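For reference, here is the classical Oja update for the first principal direction, a sketch of the unregularized rule being referred to, not of the regularized variant.

```python
def oja_step(w, x, lr=1e-2):
    # Hebbian term y * x, stabilized by the decay term y^2 * w,
    # which keeps the weight vector from blowing up.
    y = w @ x
    return w + lr * y * (x - y * w)
```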
I think in order to really give a feeling for why these results hold-- and then we can go back and see what they imply for learning in the brain, at least in the computational neuroscience sense, meaning maybe not the real brain-- let's just look at an even simpler model, which is just a scalar auto encoder. So a scalar auto encoder, the dimension is one at beginning and in the middle and at the end. One, one, one.
And my data will just be a number, x. And my encoder and decoder are just numbers, w1 and w2. So you can now-- it's a fun exercise. You can just solve this problem using first year calculus. Well, you need to be able to take partial derivatives. But this is actually how we did it.
We wrote this down, and we understood what was going on this way, and then we took a while to prove the general statement. But it's a good trick. Just take the simplest thing. And the simplest thing has the other advantage that you can visualize it. So here is the loss landscape of the unregularized autoencoder. And you see that you have these hyperbolas that are the minima.
I can even increase the x squared and make that maybe even more pronounced. Those are the places where w1 times w2 is one. Then, they won't change the value, right? That's the best you can do. If I go to the sum regularization, we still have those hyperbolas. But if I increase lambda, they're going to move in toward each other and eventually connect.
And it's easier now to see what's going on from this perspective. So without regularization, I have this one dimensional sort of hyperbola of minima. When I add regularization-- let me increase that as well-- look what happens. See how I now get these two isolated minima? Because they also require that w1 and w2 are transposes, which actually means they're the same if they're just numbers.
So we're intersecting some shrunken hyperbola with the line y equals x, basically. And that should give you a good sense of what's going on. You can play with that on the project repository. So these are some pictures. For time, I'm not going to talk about the product regularization, but that's more like ridge regression.
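Here is a short sketch you can run to reproduce that scalar picture; the parameter values are arbitrary choices of mine.

```python
import numpy as np
import matplotlib.pyplot as plt

# Scalar LAE loss: (x - w2*w1*x)^2 + lam*(w1^2 + w2^2).
# With lam = 0 the minima form the hyperbola w1*w2 = 1; with lam > 0 the
# minima collapse to two isolated points on the line w2 = w1.
x, lam = 1.0, 0.2
w1, w2 = np.meshgrid(np.linspace(-2, 2, 400), np.linspace(-2, 2, 400))
loss = (x - w2 * w1 * x) ** 2 + lam * (w1 ** 2 + w2 ** 2)
plt.contourf(w1, w2, loss, levels=60)
plt.xlabel("w1"); plt.ylabel("w2"); plt.title("scalar regularized LAE loss")
plt.show()
```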
In the paper, if you're interested, there's a full theorem describing a smooth parametrization of all critical points, all critical manifolds, for a series of different loss functions. And also, just empirically illustrating that the theoretical results, when you train them on real data, match up exactly like you'd hope. And also, a theorem that explains why the variable selection phenomenon we see here, of basically, when lambda is big, losing information about certain directions, is very reminiscent of a model called probabilistic PCA, and in fact there's a formal statement of how those relate in the paper if you're interested.
So now, let's get back to learning in the brain. Since we understand auto encoders pretty well, we're going to consider four different pieces or building blocks for loss functions that we might put on this model. So one of them is the predictive loss, except I think I wrote it wrong. I apologize. So the prediction should be w1w0x, so replace the w2 with w1, right?
And you're going to compare that to the expectation y that you hoped was the answer, and then you have some error. Then, there's something we'll call the information loss, which is where you start here with some h. You map it up, but then you map it back. And you compare. So that should remind you of an encoder and a decoder.
And if you go up and back and get the same thing, and that works for all values of h coming in at least approximately, that's like saying you captured most of the variance. You did not lose information when you went forward, because you could recover it when you went backward. This is just the regularization loss, which biologically you might think of as corresponding to not having the weights be too big because of reasons that, say, myelination is expensive, and you might have evolved to not want enormous amplifiers between your neurons or maybe in the sense of sparse coding.
So the first three really do I think correspond to well studied areas of computational neuroscience, the idea that the brain wants to make predictions for certain tasks. It also wants to take in environmental stimuli and represent it in the brain without losing too much information. And then the idea that we want to do that efficiently.
Then, the last term is a fun one. What would this guy mean? So here, if we have a neuron in h and we pass it through w1 and then w2, the trace is going to have an element-- let's say the one-one element-- that's going to tell us how that was amplified. So if the signal was a, what was the coefficient on a when it came back?
Was it doubled? Was it multiplied by minus seven? That would be the value of the one one entry, and so this is adding up the self amplifications. This is something that this neuron could learn without needing to communicate with the other neurons. That's an important point. And if you think about it, these two would trade off against each other, because this one says you want the weights to be small, and this one says we want the diagonal elements of this product to be really big.
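A hedged sketch of those four building blocks as plain functions. The names, the shapes, and the convention that w2 denotes the learned feedback weights are my assumptions.

```python
import numpy as np

# W0: hidden <- input, W1: output <- hidden, W2: hidden <- output (learned feedback).
def prediction_loss(W0, W1, x, y):
    return np.sum((y - W1 @ W0 @ x) ** 2)             # feedforward prediction error

def information_loss(W1, W2, h):
    return np.sum((h - W2 @ W1 @ h) ** 2)             # map h up and back, then compare

def regularization_loss(W1, W2, lam):
    return lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))  # keep the weights small

def self_amplification(W1, W2):
    return np.trace(W2 @ W1)                          # sum of each h-neuron's round-trip gain
```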
So what we're going to do is just put these building blocks together in different ways. So if you're going to do back propagation, then you just care about prediction. But maybe you might regularize as well. And what I'm omitting here are coefficients. So in practice, you would have some lambda in front of prediction, some lambda in front of regularization. You'd have to weight these.
Those are some knobs you could turn. Those lambdas, by the way, you could also think of them as the balance of learning rates. A bigger lambda just means you're going to take a bigger step for that version of the loss. It's totally equivalent to think about putting lambdas in front of these surfaces, or having different learning rates for these underlying landscapes.
And so what we're calling the information alignment is adding one more term, which is the idea that in addition to making the prediction and keeping the weights from getting too big, we also want to incentivize preserving variance or information as we flow it up through the network. You could imagine that you'd do this maybe more at earlier stages than at later stages, as you'd want to specialize maybe to different tasks.
And then the symmetric alignment uses that self amplification concept instead of the information concept. But it's still a quite local concept. All of these are very modular. The information loss is saying something about how going forward and back and comparing should give you something similar. It doesn't really ask for much more than the notion that you could compare the prediction of mapping forward and comparing to y.
It's the same problem. They're not easy problems to answer how the brain does that comparison. But I would say it's no harder to imagine it doing it here than it is doing it there. And if you go back to that visualization, you can see that each of these elementary building blocks for a loss function have their own characteristic kind of surface, and you can see how when you're adding them together with different coefficients, you're building a loss landscape that in some cases would have a global minimum, but in other cases might have a local minimum.
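Putting the blocks together the way the last few paragraphs describe, and reusing the functions sketched above with all coefficients set to one for brevity (the talk stresses that in practice each term gets its own lambda), the two alignment objectives might look like:

```python
def information_alignment_loss(W0, W1, W2, x, y, h, lam):
    return (prediction_loss(W0, W1, x, y)
            + regularization_loss(W1, W2, lam)
            + information_loss(W1, W2, h))

def symmetric_alignment_loss(W0, W1, W2, x, y, lam):
    return (prediction_loss(W0, W1, x, y)
            + regularization_loss(W1, W2, lam)
            - self_amplification(W1, W2))    # reward, not penalize, self-amplification
```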
And part of what we're interested in here is whether by adding these kinds of terms, we can do better in terms of learning in a kind of robust and stable way than the feedback alignment, which just has some random matrix that you hope that over time, the forward weight will align with it. But I'm going to show you evidence now why that doesn't actually work. Oh, and here's a really fun fact.
Although I tried to motivate these terms biologically, it turns out that their sum is mathematically identical to encouraging symmetry. Which, if you go back, you can see why: if you were to take w1 minus w2 transpose, squared, and FOIL it out, you get four terms. First term with the first term, last term with the last term, and these are the cross terms.
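Written out, the expansion being described is, in the notation above:

```latex
\|W_1 - W_2^\top\|_F^2 \;=\; \|W_1\|_F^2 + \|W_2\|_F^2 - 2\,\operatorname{tr}(W_2 W_1),
```

so, up to coefficients, the regularization term plus the negatively weighted self-amplification term is exactly a penalty on the asymmetry between w1 and w2 transpose.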
AUDIENCE: [INAUDIBLE]
JON BLOOM: I'm saying that the last term is more biologically plausible than that, because it's something that a neuron could learn locally. I'm a neuron. I fire a signal. Something comes back to me, and I want to ask, how did it compare to what I sent out? And I want to see how that is over time, assuming the other neurons are sort of randomly doing their thing.
I'm not claiming I have a deep understanding of how the brain could implement such a circuit, but to say that the brain implements this, that's exactly the weight transport problem. The brain can't do this comparison very easily. That's exactly what Grossberg, thank you, argued. That that's not plausible. And what I'm pointing out is, it's the sum of two things that maybe are.
AUDIENCE: [INAUDIBLE]
JON BLOOM: Because if each thing has a process that can do it--
AUDIENCE: [INAUDIBLE]
JON BLOOM: Right, so the only claim I'm making is that I'm not introducing any mechanisms that aren't already required by feedback alignment. I'm not claiming that feedback alignment is plausible either. But if you solve feedback alignment plausibility, then I claim these would be strictly better. Because they can learn.
That's the next step, right? The question is, can these actually learn? And unfortunately, my talk was supposed to be the 16th, and then it got scheduled two weeks earlier. And so I actually haven't done the ImageNet test yet. But let me show you some encouraging results which suggest that it's probably going to work.
And then I'll let you know what happens. Maybe it won't. We don't know yet. So first here, we're just going to do basically the model I showed. Just one hidden layer. And we're going to try it on MNIST, and we're going to see how these four models learn. It turns out they all learn very well.
But what's interesting is to look at the alignment of the backward weights with w1 transpose, which for back prop is always perfect. That's why the blue is just flat. The next one is what you get from the symmetric alignment, which is really trying to push w1 transpose and w2, expanded into vectors, to basically be the same. The information alignment reduces the angle to 20 degrees, and the feedback alignment only gets down to about 42.
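The alignment number being plotted is presumably something like the following sketch of the metric as I understand it, with the function name my own.

```python
import numpy as np

def alignment_angle_degrees(W1, W_feedback):
    # Angle between W1 transpose and the feedback weights, flattened into vectors.
    a, b = W1.T.ravel(), W_feedback.ravel()
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```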
So gradient descent is when it's zero. So now, we're going to add nonlinearities. This is now empirical. There's no theorems here, right? And we're going to look at six hidden layers. And what you find is that the learning actually-- the back prop is the best. This is only run once, by the way.
This isn't fishing, because I don't have time-- so we got back prop, information alignment, then symmetric, and then feedback. But what's really nice is when you look at how this alignment works within each of these six layers. And what you find, well of course in back prop by definition, is perfect alignment in every layer.
In feedback alignment, only the last layer actually manages to align much at all. And even that layer only gets down to about 70 degrees. It's like this. With the information alignment, the first layer actually has the best alignment, but the other ones, they all get to around 50 to 60. And with the symmetric alignment, it's just boom. Everything down to zero.
So obviously, there are a lot of knobs one could begin to turn here. So I'm going to talk about some of those in a second. So I'm getting pretty close to the end now, because this slide is really meant to represent a sprawling set of things one could imagine doing next. So in a computer, now of course, we should be doing the architecture search. We should take symmetric and information alignment, and we should run them on ImageNet.
There are lots of knobs. You could have different lambdas. You could schedule learning rates. You could say, maybe in information alignment, I want to incentivize the flow of information more in the early layers and then less in the later layers. Maybe it's related to the Information Bottleneck idea. Maybe I want the auto encoding loops to be at a different time clock than the feed forward, for example.
Nothing is set. There are multiple brainwaves at many frequencies. Lots of things could be happening, and there are loops at all scales in the brain, within cortical columns, between brain regions, and so on. I suspect that the information loss actually protects you somewhat against adversarial attacks, if you're into that, and it's interesting to think about multi-task learning.
Now, in terms of mathematics, we studied autoencoders, that means mapping x to x in the linear case. And we have this connection to PCA. What if you used x and y? This turns out to be really nicely related to models like partial least squares and canonical correlation. And in particular, we're thinking more about how to write that down.
I also think it's really interesting to think about these problems that are often done in sort of discrete updates and ask, what happens as you take the learning rate and you shrink it to zero? So with back propagation, that gives you literal gradient flow, the gradient flow that you saw in PCA on the Grassmannian. And on these other ones, you'll get a different flow on the same landscape.
And you could ask, what is that? And you could study its stability. Are you going to end up at a good place? Under what conditions? What is sort of the phase diagram of that? So I think that wouldn't be too bad to work out. Algorithmically, we're actually exploring whether the PCA algorithm I described is competitive with state of the art, meaning randomized SVD. And so that's a collaboration with some friends at Google Brain now to say, can we compete with what Google does internally?
That probably has a low probability of success because everyone's worked on this problem for centuries, but on the other hand, we have the whole deep learning tool kit to throw at the optimization problem, and there's lots of specialized hardware. It could be interesting. And we have some early evidence that in some cases using some combination of first and second order methods, we can do as good and sometimes better.
In vivo, this is the one where I have the least to say, but I'm very, very curious. I think that it would be really fun to think about what experiments one might be able to do, either now or in a year or in 10 years, that could begin to falsify or verify some of these ideas as actually happening in loops in the brain. So more specifically, if you could measure functional connectivity through time under stimulation, you'd get dynamics, and you could see how they correspond to what's predicted by this kind of theory.
Now, coming from the Stanley Center, I'm also quite interested in how connectomics and cognition might factor through genomics and molecules in development and cell types, and that's I think a story where there isn't a lot of communication happening yet, and it would be exciting to see what might happen there. And then, looping all the way back to my mathematical days, I just think there are a lot of these ideas that come from relationships between algebraic structures and topology and geometry that have not been exploited much at all yet in machine learning and optimization.
And I comment more on that in the Google Brain talk. I'm happy to talk to people about it, but some key terms here are Morse homology and topological quantum field theory. And so I'll leave you with three pictures that kind of convey what I mean by bringing some of these ideas in. So three pictures of an autoencoder.
The first is what it usually is thought of as, which is a compressor. You take some high dimensional representation with signal and noise, you compress it, bring it back, and you hope that what happened was you compressed out the noise and kept the signal. And that's what I was talking about at the beginning when I had cells represented across 20,000 genes, and I wanted to compress it to something meaningful, hopefully.
But the same model could be thought of as a transporter. If you have information, let's say, in one part of the brain, like in the cortex, and you want to move it to the cerebellum, in the cortex it's manifested through firing in some set of neurons. That's a representation. And you have a whole bunch of signals coming through that give some variation. Now, you want to take that representation of that information, and you need to represent that information in some other neurons, like maybe in the cerebellum for some behavioral task.
How are you going to get it there? Well, it's not enough just-- you can just have neurons going there, but how will you know you actually preserved the information? What you need is a loop. You need neurons going back as well. And you need to ask, if I map forward and map back, do I get approximately the identity? And if so, then the information definitely got there.
And so I'm interested in whether that interpretation of an auto encoder might be actually more relevant to what the brain is doing in terms of moving information around. And then the final picture might be as a stabilizer. And that was in the idea of in layers of this neural network, sort of stabilizing in the sense of making sure you don't collapse your representation too soon in the interest of just one task and in terms of adversarial training and so on.
And hopefully, when we look at how crazy the complexity is of all these loops in the brain and I just get totally overwhelmed, it's like looking at all the gene names, I just hope that some of these ideas, from Morse Theory, for example, will help us decompose these problems into little pieces, little primitives. That just like a good piece of software, you can compose them in layers of abstraction and then build up all that complexity from rules we can actually make sense of with our feeble minds.
And so I think that I had a thank you slide, but I moved it to the very end, so we're going to skip through a bunch of stuff that was only going to be bonus material if there was time. I was going to make fun of DeepMind. Sorry. Don't get to do that. Talk about how they need everything to be supervised. But what we really need to do is unsupervise Kaggle. Anyway, thank you.
[APPLAUSE]
PRESENTER: So I guess we have time for some questions.
AUDIENCE: So I have a question about the transporter idea, [INAUDIBLE] idea. Is that sort of that my brain is actually somehow reliving the experience, basically seeing my memory of it as accurate? Or how does the f2 part happen?
JON BLOOM: Yeah, so I wasn't thinking in terms of memory, but there might be something there. I was thinking more in terms of, for example, suppose through my eyes I take in sight, and I have layers of visual cortex. And they are measuring different kinds of representations of the visual world. Maybe at some level, there are edge detectors and contrast, but these are getting put together into some higher level more abstract representations, which might actually be objects and faces and so on.
At some point, that representation of the information has to move to some other part of the brain, maybe where memory is, maybe where the decision that if I see this person lunging at me, I want to jump away. How are we going to have an efficient process of taking the information in one representation and ensuring that we can send it into this other physical part of the brain without losing the sort of volume of information that-- we don't know what the next signal is going to be, but we want to make sure we have fidelity on it getting there, that it's not all collapsed.
And to me, the only way that can be done-- I can't relate vectors in different vector spaces without maps. There's no notion of comparison until you have maps. And if you want to really know, if you really want to identify two spaces, you need maps in both directions and they need to be approximately inverses. And so that's the transport idea.
It's that I can learn over time how to make sure that the representation I have here-- not on the whole vector space, but at least where the data is living-- that has to make it over here, and it will look totally different, maybe. But it has to make it there.
AUDIENCE: [INAUDIBLE] would it be a flashback, would it be deja vu? What would be the sort of thing that happens in my brain during that process consciously or subconsciously? I mean, I'm just curious.
JON BLOOM: Yeah, so I don't have a good answer to that. I think it's important to distinguish between thinking about development, early brain development, and sort of learning as an adult on a task. I think what I'm describing is probably more on the level of how retinal waves move to these other parts of the brain or something like that than it is to how in the course of an hour, you learn to do some math problem. But I'm not sure beyond that.
AUDIENCE: Can you tell us something about the computational convenience of working with something like Grassmannian? I mean, yes, if I can do PCA, then I can also write down these topological surfaces corresponding to the Grassmannian. But is there a sense in which I can just shortcut [INAUDIBLE] Grassmannian? Is that somehow the computationally the easier thing to do?
JON BLOOM: So great. It's a great question. So the answer is the following. The primary motivation of the Grassmannian in the story for me is that it is conceptually clarifying. And I think in my brain. I don't think about a basis when I think about a plane. I just think about the plane.
And so there's a sense in which the Grassmannian is natural both in terms of the natural domain of the optimization problem but also in the conceptual way I want to think about the problem. It's hard for me to think about matrices of numbers. Now, coming back to algorithms, the typical trade off is that you could try to write down gradient equations on the Grassmannian using information geometry and Riemannian geometry.
Typically, what happens is that the update rules get so much more complicated that there's not necessarily going to be much benefit.
AUDIENCE: And it's impractical right?
JON BLOOM: Well, no, you can write it down. It's just that it will take fewer steps to converge. But each step involves a lot more flops than in this auto encoder case. The advantage of the writing down the over parametrized auto encoder is you can choose to initialize it, say, with orthogonal matrices. You can choose actually to tie the weights.
I told you that at all critical points, if you regularize, then the critical points are always tied in this way, right? They're symmetric. We also prove that you can just make that assumption a priori. So you don't have w1 and w2. We go back to the original back prop. You just have w2, just w2. And that's the one that's like Oja's rule.
You're going to update w2, and you're just descending. You're just descending. And that sucker is pretty darn efficient. You can also fix w1, and then it turns out it's an exact convex problem to write down the solution for w2. And then you can get that w2, you can go back. That takes far fewer iterations, but of course each one is more intensive. If you do that twice and then you average the results and you then use that average as both the w1 and w2, it converges super fast.
And it basically is because if you're ill conditioned, you're worried about learning rates bouncing you across a valley. And if you can average, you actually find your way to the middle pretty fast. So these are the kinds of algorithmic things we're exploring. But the other nice thing about collaborating with the same engineer at Google is they have a whole bunch of in-house proprietary hyperparameter search, so they can try basically everything that they do in deep learning in an automated way and ask the question, under what circumstances is this beating their in-house tool?
And that would be interesting, because that's like now we're talking about billions of dollars in completely unpatentable economic value. Yeah.
AUDIENCE: [INAUDIBLE]
JON BLOOM: Yeah, we didn't use any convolutional structure in this.
AUDIENCE: So [INAUDIBLE]
JON BLOOM: Yeah, I mean honestly, the feedback alignment story certainly does. I haven't yet even gotten to the step of sitting down to think about, but yes. I'm just going to say yes.
AUDIENCE: So this is probably going to be a really superficial question and probably it will be annoying a little bit to you, but I will go ahead and ask it anyways. You take the back propagation algorithm, which has a number of really nice properties. It's a generic procedure to apply to sort of any problem and it is powerful in a lot of ways.
But it is also, you know, like frustratingly slow, like limiting in critical ways, including the pace by which you can learn things and the fact that it is unable to take into consideration the structure of the domain in which you want to do learning.
JON BLOOM: Yeah, it's local.
AUDIENCE: But then, you want to say, and the research program like the literature that you showed at the beginning, like [INAUDIBLE]. The research program is saying, what I will do is I will try and make these algorithms more biologically plausible, or as you noted less artificial, but based on some perceived implausibilities based on some rough glances over the neuro anatomy or function in the brain.
And then, combined with this literature and some of the results you showed, the algorithms that we get to-- back propagation seems to be still like-- the program is hoping to achieve something at least as well as back propagation, or back propagation is still the gold standard. And if you are approaching that, you are good and you will-- like if your algorithm continues to work--
JON BLOOM: OK, this is not a superficial question. You're asking a deep question about why couldn't the brain do something even better than back propagation.
AUDIENCE: Well, let me get to the end of it. It does feel like there is this gradient which starts from back propagation, makes it less artificial, and maybe that will give us the biological learning algorithms. But I am really suspicious of that program. Perhaps the alternative is, you may want to still start from back propagation or some place else, but you want to take a more engineering perspective and get learning algorithms that are just much better than back propagation in a lot of different ways.
And perhaps that gradient will lead to-- perhaps that has a higher probability that it will get to that biological learning mechanism.
JON BLOOM: Yeah, so can I try to simplify the statement? So if we have, for example, some kind of convex problem, we can do Newton's method. It's much more efficient, right, than gradient descent. So if you're going to propose to me that we're going to have some algorithms in the brain that are more efficient than gradient descent, you must, like you said, be worried about in what context? Because it's hard to imagine another universal algorithm.
Gradient descent is just: go down by a local rule. So do you have in your mind examples of loss landscapes in the brain that would be more amenable to a more global understanding of the potential?
AUDIENCE: [INAUDIBLE] moments of insights, like if you look at it from outside it is just-- [INAUDIBLE] from just [INAUDIBLE].
JON BLOOM: Yeah, so there's a field in mathematics-- Zeeman, Thom, catastrophe theory-- where you end up having these almost phase-change bifurcations in what looks from one direction to be a very smoothly changing space, but from another direction, there's this sort of discrete jump. That's very vague, but I think I'm trying to say that it is possible that discrete changes arise from smooth processes.
And we understand so little about how that's happening in the brain itself that I wouldn't even begin to speculate about anything more myself. Other people might know a lot more about the brain.