A Conversation with Dr. Jon Bloom
Date Posted:
April 8, 2019
Date Recorded:
April 5, 2019
CBMM Speaker(s):
Andrzej Banburski
Speaker(s):
Jon Bloom, Broad Institute
Description:
On April 5, 2019, CBMM Postdoc Andrzej Banburski took the opportunity to sit down and chat briefly with Dr. Jon Bloom of the Broad Institute.
ANDRZEJ BANBURSKI: Hello. I'm Andy Banburski, and I'm a postdoc at the Center for Brains, Minds, and Machines. And today I have the pleasure of talking with Jonathan Bloom, who's a scientist at the Broad Institute. Hello, Jonathan.
JON BLOOM: Thank you for having me.
ANDRZEJ BANBURSKI: As I understand it, Jonathan, you previously worked in theoretical mathematics as a postdoc at MIT. What did you actually do?
JON BLOOM: So at MIT, I did the kind of math that I also did in graduate school, which was called Floer homology and also something called [INAUDIBLE]. And both of these are ways to construct algebraic invariants of things like three- and four-dimensional manifolds and also knots and links embedded inside three-dimensional manifolds.
So for example, you might want to know if a knot that you draw in some diagram on the page, as just some curves that are kind of going over and under, is actually knotted if it were a rope tied up in three-dimensional space. That's a hard problem. But it turns out that the same kinds of techniques that were used to show, for example, that there are multiple different smooth structures on the same topological four-dimensional manifold also connect to ideas like whether this knot is knotted.
And these ideas go back to people like Simon Donaldson, Cliff Taubes, and then Andreas Floer. And my advisor, Tom [INAUDIBLE] at MIT, in particular works on a branch of this called Seiberg-Witten Floer homology. And that is using physics equations, actually, to think about these ideas. Equations for connections on bundles over four-manifolds and so on. But at the end of the day, what you're really interested in are sort of smooth functions on these infinite dimensional spaces and their critical points and the gradient flow that will connect different critical points. And then you want to build things like groups, like algebraic groups, out of these and then do things like category theory to get real structure.
ANDRZEJ BANBURSKI: So this is exactly the kind of stuff that I was very interested in when I was doing my PhD in theoretical physics. But then maybe the natural question now is: as a mathematician doing these kinds of things, how did you end up at the Broad, and what exactly do you do there?
JON BLOOM: Yeah. So I get asked that a lot. So toward the end of my math postdoc, I was getting interested in more applied things, partly because I had been teaching the introduction to probability and statistics for several years. I flipped the classroom and, with a partner, wrote a new curriculum. And that was really my introduction to applied math: teaching statistics to the life scientists. This wasn't the high-level, rigorous course. And from that, I got more interested in machine learning. And when I went to the Broad, I thought I was going to go there and do a bunch of machine learning and data science.
But I actually managed to recruit a friend of mine to go with me. His name's Cotton Seed. He did math with me. Except Cotton had spent a decade doing high performance computing and compiler design after dropping out of MIT earlier. And so he was really an engineer who then decided later in life to do theoretical math. And so we were both postdocs at MIT at the time, and we both went to the Broad and started an engineering team. It's an open source project called Hail. And the idea was to build software that could handle the scale of all the genomic data, which has been tripling every year for a very long time.
And so we have data sets now that are maybe 100 terabytes compressed consisting of tens of thousands of whole genomes. The next one will be hundreds of thousands of whole genomes. And it becomes very hard to do interesting things with that data with, say, even your laptop or just one server.
And so what we've been doing and what my main job has been at Broad is building distributed systems and compilers to give a usable experience sort of in Python to the computational biologists and statistical geneticists who want to learn things about disease by connecting it to genomics where the back end is really a warehouse scale computer, meaning in the cloud at Google or Amazon. And so that's been largely building infrastructure rather than doing mathematics or even real data science. It's just building the tools that others can use.
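To make that concrete, here is a rough sketch of the kind of Python-level workflow Hail exposes, with the distributed execution handled by the back end. This is only a hedged illustration based on the publicly documented Hail 0.2 API, not code from the Broad's pipelines; the bucket path and filter thresholds are made up.

```python
import hail as hl

hl.init()  # can be configured to run on a Spark cluster in the cloud

# Import a block-gzipped VCF into Hail's distributed MatrixTable representation
# (rows are variants, columns are samples). The path here is hypothetical.
mt = hl.import_vcf('gs://my-bucket/genomes.vcf.bgz', reference_genome='GRCh38')

# Standard per-variant and per-sample QC annotations, computed lazily and in parallel.
mt = hl.variant_qc(mt)
mt = hl.sample_qc(mt)

# Keep common, well-called variants (made-up thresholds).
mt = mt.filter_rows((mt.variant_qc.AF[1] > 0.01) & (mt.variant_qc.call_rate > 0.95))

print(mt.count())  # (n_variants, n_samples); this triggers the distributed computation
```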
And that's largely been because we didn't find that the right tools existed for these kinds of problems and this shape of data. So we kind of thought we would come and take off-the-shelf things from the tech industry that are open source now, and we would just write something high level on top. We'd be done. But it turned out that nothing off the shelf was quite the right thing. And so we spent about three years just getting to the point where we had built infrastructure that was the right thing.
And then what's happened in the last year is there's been more mix of infrastructure and now interesting new models or mathematics that you could start to build on top. So in a way, very little of the work that I used to do connects to most of what I've been doing at the Broad directly. But yeah, recently it started to connect in ways I hadn't expected to some of the things I'm thinking about now.
ANDRZEJ BANBURSKI: So picking up on that, is the theoretical work that you used to do something that you see coming back into the research you're doing now, the stuff you're interested in?
JON BLOOM: Yeah. And that's been kind of a surprise. I would say that it's less the things I did in the last five years of my math career, where it was getting increasingly far from reality. But the underlying ideas of what I used to do were that there were these deep connections between geometric structures on the one hand, like manifolds and maps between them, smooth maps, or manifolds and cobordisms between them, meaning a higher dimensional manifold whose boundary is, say, each of these lower dimensional ones. And then you can sort of glue these together and you get these interesting structures. So that's all geometry and topology.
And on the other hand, you have things like algebra. So groups or vector spaces and maps between them. And so at its heart, the work I used to do was thinking about ways of mapping between these two worlds in ways that were kind of not just how the objects relate but also how the relationships between the objects relate.
And I think one way that this is now tying into the things I'm thinking about, as I think more about machine learning and deep learning, is that ultimately in many of these deep learning problems, you're trying to minimize a very high dimensional loss function. It's an optimization problem. And that means there are going to be critical points, and you're looking for minima in many cases.
And the part of the world of, say, algebraic topology I used to work in, topological quantum field theory and Morse theory, is a way of constructing algebraic invariants on spaces by using smooth functions and looking at the critical points and the gradient trajectories of that smooth function that flow between those critical points. And from the critical points and those trajectories, and even just the counts of those trajectories, you can build these groups, this sequence of groups called homology, which is traditionally defined in other ways using [INAUDIBLE] complexes or other versions.
But Ed Witten realized, actually, that you could do it in this other way. And what that means is that, say, in the case of PCA, like in the talk, I talk about the Grassmannian, and that has very interesting topology. And we can think about a smooth function that has a bunch of critical points which correspond to the cell structure of this manifold.
But the converse situation is that in deep learning, the domain of your loss function is just some Euclidean space. You just have a bunch of real numbers. And then your range is R. And so you know there's no interesting topology there. But a loss function will probably have a lot of critical points and probably a lot of minima. And in particular, then, you know that somehow all these minima have to cancel each other out algebraically.
In other words, they have to be connected to each other through saddles, index-one saddles, in a connected way, because R^n is connected. In fact, R^n is contractible, meaning it has no holes in any dimension. You can say even more interesting things about how the saddles should connect to other saddles and so on. And this might also tell you strategies for finding lots of minima, like in ensemble learning. So that's one way that I've gotten kind of excited that thinking more topologically, geometrically about this kind of optimization could be useful.
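For readers who want the bookkeeping behind "the minima have to cancel each other out algebraically," here is a brief sketch of the Morse-homology statement being invoked, written with Z/2 coefficients and under standard assumptions not spelled out in the interview (the loss L: R^n -> R is a Morse function and is coercive enough for the construction to apply):

$$ C_k = \bigoplus_{p \,:\, \mathrm{ind}(p) = k} \mathbb{Z}/2, \qquad \partial_k\, p = \sum_{q \,:\, \mathrm{ind}(q) = k-1} \#\{\text{gradient trajectories } p \to q\}\; q \pmod 2, $$

$$ H_k(C_\bullet, \partial) \;\cong\; H_k(\mathbb{R}^n; \mathbb{Z}/2) \;=\; \begin{cases} \mathbb{Z}/2, & k = 0, \\ 0, & k > 0. \end{cases} $$

Since H_0 is one-dimensional, the image of \partial_1 must identify all of the minima (the index-zero critical points) with one another: the graph whose vertices are the minima and whose edges are index-one saddles, connected via their downward gradient trajectories, is connected, which is exactly the claim above.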
ANDRZEJ BANBURSKI: So is this where you were coming from when you started thinking about the problem that you talk about in your talk?
JON BLOOM: So no, not exactly. It was more that I started with a biology problem, and that led to the machine learning problem, which then got me thinking enough about machine learning that the connections to the math I used to do started to sort of flow in and get very exciting. So the actual origin of thinking about the linear autoencoder in this talk was looking at single cell RNA sequencing data. So we have a bunch of cells, a bunch of neurons, actually, in the brain. And we've measured how much RNA each gene has expressed as a proxy for the molecular activity of that gene in that cell.
And so if you have a matrix like that of cells and genes, then you can try to do dimensional reduction. And you'd like to represent a cell by a smaller number of features. And ideally, those features would not just capture the signal versus the noise, not just be some sort of manifold learning. Ideally the coordinates themselves would be biologically meaningful. They would represent cell type or an activity program in the cell. And so we were thinking about that. Dimensional reduction in that context.
And when you think about not the prediction, a prediction task, but about learning a meaningful representation in biology, you get really worried when you realize there are symmetries. Symmetries in the loss function. Because that means you could find one of a bunch of minima that are all sort of equivalent. And if there's no basis, if there's no special basis, then the coordinates can't be meaningful. Yeah?
And so we got sidetracked to then think more about, well, how can we make these coordinates more meaningful? And the thing we recognized was that by adding L2 regularization, you could make them more meaningful. You could reduce from any invertible linear symmetry to just orthogonal symmetries. And when we realized that, we said, oh, but this actually now gives us an algorithm for doing PCA.
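Here is a minimal numpy sketch of the object being discussed: a linear autoencoder with L2 regularization on both weight matrices, trained by plain gradient descent, whose decoder ends up spanning the top principal subspace of the data. This is only an illustration of the idea, not the paper's code or experiments; the dimensions, regularization strength, learning rate, and step count are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic centered data: d features ("genes"), n samples ("cells"), rank-k signal plus noise.
d, n, k = 10, 500, 3
X = rng.normal(size=(d, k)) @ rng.normal(size=(k, n)) + 0.1 * rng.normal(size=(d, n))
X -= X.mean(axis=1, keepdims=True)

lam, lr, steps = 1e-2, 1e-2, 5000          # made-up hyperparameters
W1 = 0.1 * rng.normal(size=(k, d))          # encoder
W2 = 0.1 * rng.normal(size=(d, k))          # decoder

for _ in range(steps):
    H = W1 @ X                              # latent codes
    R = X - W2 @ H                          # reconstruction residual
    # Gradients of (1/n) ||X - W2 W1 X||_F^2 + lam (||W1||_F^2 + ||W2||_F^2)
    gW2 = (-2.0 / n) * R @ H.T + 2.0 * lam * W2
    gW1 = (-2.0 / n) * W2.T @ R @ X.T + 2.0 * lam * W1
    W2 -= lr * gW2
    W1 -= lr * gW1

# The decoder's column space should align with the top-k principal subspace of X.
U, _, _ = np.linalg.svd(X, full_matrices=False)
Q, _ = np.linalg.qr(W2)
print(np.linalg.svd(U[:, :k].T @ Q)[1])     # cosines of principal angles, all close to 1
```

Roughly in the spirit of the symmetry discussion above: without the regularization term, only the product W2 W1 is pinned down at a minimum, whereas with it the solutions are determined up to orthogonal transformations of the latent space, which is what makes the recovered directions usable for PCA.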
And then we thought, well, OK, the fact that the critical points are symmetric is pretty interesting. That predated having any indication this had to do with neuroscience, actually. We put up the preprint, and there was nothing in the preprint saying this is relevant to the brain. So that's where it started.
ANDRZEJ BANBURSKI: I see. So could you summarize maybe the main results coming out of this work?
JON BLOOM: Sure. Yeah. So I think from the perspective of before realizing there was a connection to neuroscience, or at least computational neuroscience, I view the main result as trying to just give a complete, rigorous theoretical foundation for a very simple model: a regularized linear autoencoder. So it's a theoretical paper that lays out a smooth parameterization of all the critical points and tries to tell an intuitive story. I care a lot about not just saying what the final result is, but what the process was that led us there.
These results don't just come because you write the final paper in order one day. It's because you thought about much simpler problems for a while at a board, and eventually the simple things start to build a bigger story. And if you just give people the final answer, you have left them stranded, without any sense of how they should build their own mental structures to appreciate it, to have the intuitions. In some ways, the intuitions are more important than the proofs.
In fact, in this case, I would argue that the final proof we came to in the general case is just simple manipulation that's inscrutable, and there's a whole other year of exegesis, like what does this actually mean. And our reviewers all admitted that none of them read the proof. Which, coming from mathematics, I'm just like, what does it mean to review a paper if [INAUDIBLE] checked the proof? But in any case, that's just a cultural difference. So I think that's sort of the core contribution of the original intent of the paper.
And then what happened was we started to, A, take more seriously that this algorithm might be useful. And then we got an email. We had sent out the word to a few people who had done related work. So we had cited things on denoising autoencoders, for example. So we sent it to Yoshua Bengio and a few others. And he responded very quickly.
And his response was, by the way, the fact that you get symmetry in the linear auto encoder actually suggests that some of the existing biologically plausible architectures for learning might be improved upon or might have more to them. We can maybe explain what they're doing better by understanding this symmetry result. This was the only correspondence we had with him so far, but that was the hint that there might be some connection to neuroscience.
And so now I would say there's a second piece that I'm excited about, which probably won't go in the first paper. I mean, we'll say in the V2 update, yeah, there's a neuroscience connection. There should be a separate paper. But the idea is that there's a problem going all the way back to 1987 about weight transport, and that, yes, probably there isn't a biologically plausible physical mechanism for taking all the synapse strengths here and just imposing them over here.
That seems crazy. But I would say the key idea on the neuroscience side is that as long as the weights here and here can change through time, then there could be an emergent dynamics that is biologically plausible by which they do align, even though there's no direct physical interaction between them. And then in the talk we propose some of the ways that could occur.
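The interview doesn't spell out the specific update rules proposed in the talk, so the following is only a generic numpy illustration of the phenomenon described here: a separate feedback matrix B carries the error backward in place of W2.T, and because B and W2 each receive a matched, locally available update plus weight decay (a Kolen-Pollack-style rule from the literature, not necessarily the talk's proposal), their difference shrinks geometrically and the two align over training without any weight transport. All dimensions and hyperparameters are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hid, d_out = 8, 6, 4
W1 = 0.1 * rng.normal(size=(d_hid, d_in))   # forward weights, layer 1
W2 = 0.1 * rng.normal(size=(d_out, d_hid))  # forward weights, layer 2
B  = 0.1 * rng.normal(size=(d_hid, d_out))  # feedback weights, initialized independently of W2

T = rng.normal(size=(d_out, d_in))          # linear "teacher" generating targets
lr, decay = 0.01, 0.3                       # made-up hyperparameters

def alignment(B, W2):
    a, b = B.ravel(), W2.T.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for step in range(3000):
    x = rng.normal(size=(d_in, 1))
    h = W1 @ x
    e = W2 @ h - T @ x                       # output error
    # Forward weights: gradient step plus weight decay.
    W2 -= lr * (e @ h.T + decay * W2)
    # The error travels back through B instead of W2.T (no weight transport).
    W1 -= lr * ((B @ e) @ x.T + decay * W1)
    # Feedback weights get the locally available transposed update plus the same decay,
    # so B - W2.T shrinks by a factor of (1 - lr * decay) at every step.
    B  -= lr * (h @ e.T + decay * B)

print(alignment(B, W2))                      # close to 1: B mirrors W2.T without copying it
```

The design point is the shared decay: alignment emerges from the dynamics of two locally updated weight matrices rather than from physically copying one into the other.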
ANDRZEJ BANBURSKI: I see. So what are the next steps that you're most interested in following up on from this?
JON BLOOM: There are a few. So I mean, sort of sprawling, but let's say maybe there are three categories of things. One, I think I'm interested to know whether this approach to computing eigenvectors can be more efficient than existing approaches in certain cases. And so that direction is more thinking about the various ways you can optimize this regularized linear auto encoder.
You can use SGD. You can have learning schedules on the rates. You can do first order sort of gradient methods. You can do second order methods and so on. And so there's sort of a space of things that will be interesting to search. So that's one direction. We'll see what happens.
In the neuroscience direction, the next step is obvious. It's: how will these architectures, the information alignment or the symmetric alignment, perform on ImageNet or other very deep learning tasks, where Hinton and [INAUDIBLE] and Richards and others wrote in [INAUDIBLE] last year that the existing biologically plausible models were not scaling? And so my intuition is that we've introduced a kind of stabilizing dynamics that will allow these to scale in situations where the others have not. And we need to empirically test that. It's an empirical question because there are going to be nonlinearities, and our theorem is only about linear things. So that's the next exciting [INAUDIBLE] step.
The longer term in vivo step is maybe there are people who are developing technologies where you could measure functional connectivity through time in loops in the brain and understand if those dynamics are well explained by the kinds of dynamics coming out of this theory or whether it can be falsified. And so that would be interesting.
And then maybe the third category is the broader one, connecting back to your first question, the math I used to do. So I think that in biology in particular, it's exciting to think about not just ensemble learning, where in a problem that's non-convex the solution you get will depend on where you started, so you might want to average a bunch of models.
I think it's really exciting to think about consensus learning. So the difference is in ensemble, typically you're fitting the model many times and you're hoping that the predictions that you'll make from each model when averaged together will do better than individual predictions because the biases will be independent and law of large numbers will kick in and so on.
The problem is that in ensemble learning, even if you could find more efficient ways to find many models, you have to store those models and you still have to make predictions from all those models just to make one prediction for your task. So it's not very efficient. I think the ideas, at first order, coming from sort of topological quantum field theory and Morse theory, these ideas can potentially give us ways to find many highly uncorrelated minima more quickly than linearly, maybe logarithmically. That would be cool. But you still have the same problem of storing and predicting.
In biology, though, we're often much more interested in the representations than we are in the predictions. And so if we think about, say, non-negative matrix factorization on the single cell data, we might want to do it hundreds of times and look for those factors that show up robustly. And this is called consensus NMF. And then you could take those factors and, for each group that shows up robustly, average it together. You'll get some basis, and that basis has a better chance of being biologically meaningful. Not that everything that's robust is biological.
There's probably technical and batch and other effects. But certainly if it's real, if it's real biology, then it should show up every time. It should be robust. And now there's no prediction left to do and no need to store all those models. You're interested in the consensus model. So at some level, I think this is a really important aspect of how machine learning, when focused on representations, becomes very relevant to a lot of our unsupervised biological questions.
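As a toy illustration of the consensus idea, here is a hedged scikit-learn sketch: fit NMF from many random initializations, pool the unit-normalized gene programs, cluster them, and average within clusters to get a consensus basis. This is not the Broad's pipeline (published consensus-NMF workflows add outlier filtering and other refinements), and the data and settings are invented.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy nonnegative "cells x genes" matrix with k underlying programs.
n_cells, n_genes, k = 300, 50, 4
usage = rng.gamma(2.0, 1.0, size=(n_cells, k))
programs = rng.gamma(2.0, 1.0, size=(k, n_genes))
X = usage @ programs + rng.gamma(1.0, 0.1, size=(n_cells, n_genes))

# Fit NMF many times from different random initializations.
n_runs = 30
all_factors = []
for seed in range(n_runs):
    model = NMF(n_components=k, init='random', random_state=seed, max_iter=500)
    model.fit(X)
    H = model.components_                             # (k, n_genes) gene programs for this run
    H = H / np.linalg.norm(H, axis=1, keepdims=True)  # unit-normalize each program
    all_factors.append(H)
all_factors = np.vstack(all_factors)                  # (n_runs * k, n_genes)

# Cluster the pooled programs and average within clusters to get consensus programs.
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(all_factors)
consensus = np.vstack([all_factors[labels == c].mean(axis=0) for c in range(k)])
print(consensus.shape)                                # (k, n_genes): the consensus basis
```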
And then if I want to be a little speculative more broadly, it would be kind of cool if instead of ensemble predicting by using and storing all the models and making a prediction, there was a way to actually average these models together as well, even in the deep learning context, that would allow us to just throw away the many models and have one much better model. That's more speculative, but that would probably be pretty useful.
So in addition to the Hail project, which is really an engineering project at the Broad, I also founded with another friend, Alex Bloemendal, an initiative called Models, Inference, and Algorithms. And that's really a learning community for the computational folks, both in the Broad but increasingly also outside the Broad. And not just in academia but also in industry. So we come together every week on Wednesday mornings. We usually have a primer, a seminar, a discussion, and all of it is open and all of it is on YouTube.
So at broadinstitute.org/mia, Models, Inference, and Algorithms, we have over 100 hours of video at really the cutting edge of machine learning intersecting with biology. And it's been really exciting to see the community coming together really excited about what's possible now that we have all this data but also being very principled about how we might model it and have that iterative loop with the biologists.
ANDRZEJ BANBURSKI: Well on this note, I would like to thank you for joining me for this amazing interview. And if you would like to see more, you can see Jon's talk on the CBMM website.