Neural Representations of Language Meaning
Date Posted:
September 30, 2014
Speaker(s):
Tom M. Mitchell
Brains, Minds and Machines Seminar Series
Description:
Tom M. Mitchell : E. Fredkin University Professor and Chair of the Machine Learning Department, School of Computer Science at Carnegie Mellon University
Abstract:
How does the human brain use neural activity to create and represent meanings of words, sentences and stories? One way to study this question is to give people text to read, while scanning their brain, then develop machine learning methods to discover the mapping between language features and observed neural activity. We have been doing such experiments with fMRI (1 mm spatial resolution) and MEG (1 msec time resolution) brain imaging, for over a decade. As a result, we have learned answers to questions such as “Are the neural encodings of word meaning the same in your brain and mine?”, “Are neural encodings of word meaning built out of recognizable subcomponents, or are they randomly different for each word?,” and “What sequence of neurally encoded information flows through the brain during the half-second in which the brain comprehends a single word, or when it comprehends a multi-word sentence?” This talk will summarize some of what we have learned, newer questions we are currently working on, and will describe the central role that machine learning algorithms play in this research.
TOM M. MITCHELL: So it's very good to be here. I was an undergraduate here. And so it's always good to come back. Let's see. I guess the title kind of says it all. Suppose you're interested in the brain. Who isn't? And suppose you're interested in how the brain processes language. Then, the question that I really want to consider for the next 45 minutes is if you had access to some brain imaging devices, the brain teaser is how would you study language processing in the brain?
And let me just start there and point out that the work that I'm going to present is due to a team of people, quite a few people, including notably my main collaborator for the last dozen years, Marcel Just, who's a professor in our psychology department and who introduced me to the idea of brain imaging. And so we've worked together, at this point, quite a bit.
And there are a lot of ways you can try to organize yourself if you're interested in studying language processing in the brain. But one way is this chart, which is a way that I now kind of think of-- if I had to summarize our own research agenda in a simple picture, this is it. So we've been presenting to people who are sitting inside brain imaging devices sometimes individual words, like "computer" or "table" or "bottle," and getting pictures of their brain activity. Sometimes, instead of individual words, sentences. Sometimes, instead of sentences, stories. And we'll talk about all three of those today.
And then, kind of trying to understand what are the spatial patterns of neural activity that encode the meanings of those kinds of language stimuli. And when, in terms of time, do those different patterns of activity appear? And how do they evolve at the millisecond level from one millisecond to the next? And of course, we're fundamentally interested in the question of how the brain produces those.
As you'll see in the talk, we know a lot more, at this point, in 2014 about where in the brain these neural activities occur. We know something about when they occur. And we still have a very shallow understanding of how the brain is producing those. So to kind of tip my hand, that's where we are. But this, I think, is a reasonable way to think about what I'm going to present in the talk.
The other thing to think about, since a lot of what I'm going to talk about is really using machine learning methods to analyze brain image data, the second theme of the talk is really, what is the role that machine learning can play in studies of the brain? And I think we're just in the very early days of exploring that. There's a lot more to be done.
OK. So let's start at the beginning. I think most people here are probably familiar with functional MRI. But if you wanted to study how the brain represents language meaning, you might start by putting people in this kind of scanner, showing them stimuli like this, which we do-- sometimes words, sometimes pictures, sometimes both, and then getting them to think about those things.
And what you'll find if you do that is-- for example, if you show them a stimulus like this, then here is part of the three-dimensional picture of fMRI activity of one person thinking about the stimulus "bottle." This is actually four horizontal slices out of the three-dimensional fMRI image. And this is the back of the head, posterior. This is the front of the head, anterior. And you can see this person has blotchy red activity toward the back of their brain when they think about "bottle."
So the first question, of course, you have to ask yourself is, is it any different if they think about mom or apple pie? And I don't have those pictures with me. But I can show you the mean, the average, activity when they think about 60 different words. And that looks like this.
So if you squint, you might think, OK, this looks a little different than the average word activity, although quite similar, too. And maybe, this is noise, or maybe this is real. But in fact, if I subtract this mean out of this, then I can see what's the difference between when they think about "bottle" and when they think about any average other word. And you can see there are some differences.
So unless I say otherwise, in the stuff I'm going to present today, at least for fMRI, we're primarily working with that difference image at the bottom. We're looking at the signals that are different from the average of a number of other stimulus words.
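To make the mean-subtraction step concrete, here is a minimal NumPy sketch with synthetic stand-in data (the array names and sizes are hypothetical, and real fMRI images are 3D volumes rather than flat vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.normal(size=(60, 20000))      # one flattened fMRI image per stimulus word (synthetic)
bottle_index = 12                          # stand-in index for the word "bottle"

mean_image = images.mean(axis=0)           # average activity across all 60 words
bottle_diff = images[bottle_index] - mean_image   # "bottle" relative to the average word
print(bottle_diff.shape)                   # (20000,) one difference value per voxel
```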
OK, so if you were given this instrument to use and you decided to collect this kind of data, one of the first things you might want to do is find out whether you could train a classifier to decode which of the words a person was thinking about based on the brain image that you collected. And so that was the first thing, in fact, that we tried. And it does work. Many people have done this, too. We're not the only ones.
But we basically train a classifier by showing it examples. Here's a brain image when they're thinking about "hammer." Here's another brain image when they're thinking about "bottle." Here's another one where, again, they're thinking about "bottle." We showed a number of training examples like that. And the program does its best to learn what are the patterns of neural activity that are common to "bottle" and to "hammer."
And then, given that, we show it more examples and ask it to tell us which of the two words the person's thinking about. And when you try that, it works. In fact, here, in this case, we were training a classifier for words about tools, like "hammer," "chisel," "screwdriver," or buildings, like "house," "palace," "apartment." And you can see that, for example, for participant number 4, we get a very accurate classifier in terms of its ability to decode subsequent brain images to tell us which of the tool or building stimuli the person is looking at. And for other participants, we get not quite as good, but still better than the 0.5 chance that we would expect.
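A minimal sketch of this kind of decoding analysis, using scikit-learn on synthetic stand-in data (a regularized linear classifier and 5-fold cross-validation are reasonable stand-ins here, not necessarily the exact method used in the study):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: each row is a (mean-subtracted) fMRI image flattened
# to a voxel vector; labels say whether the stimulus was a tool (1) or a building (0).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 2000))    # hypothetical repeated presentations x voxels
y = rng.integers(0, 2, size=120)    # hypothetical tool/building labels

# A regularized linear classifier is a common choice for high-dimensional,
# few-example fMRI data; held-out folds estimate decoding accuracy (chance = 0.5).
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)
print("mean decoding accuracy:", scores.mean())
```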
OK, so this is good. This means fMRI actually has enough resolution and enough signal to distinguish the neural activity associated with at least some types of words. And so this kind of opens up a whole ocean of questions. Because see, what we just did when we trained this classifier is we went from one revolution, which was caused by the invention of fMRI, which suddenly lets us look at the neural activity inside the brain. We went to a different style of work where, instead of asking what is the neural activity in the brain, we can now ask what is the information content in the neural activity.
And you can think of the trained classifier as exactly that. It's a virtual sensor of which of these mental states the brain is in, given the observed neural activity. And so we can use this virtual sensor in a lot of interesting ways. For example, I could train it just on the left side of my brain. And I could ask, can it decode which word I'm thinking about, using only the left or only the right or only the front or only some other little piece?
And to the degree that I could train it and it could successfully decode additional brain images, using just that region of the brain, then I know that the neural activity there actually does encode the information that the classifier is producing as output. So those classifiers that we can train are virtual sensors of information content in the neural signal. So that's a very enabling idea.
We can use it in a lot of ways. One way we can use it is to ask the question, well, are these neural representations the same in your brain and mine? And of course, the way we turn them into a machine learning question is we just ask, could we train on your brain, and then use the trained classifier to decode mine? And if the answer is yes, then we know there must be some correspondence between the neural code your brain is using and the one mine is using.
So we tried that. And in fact, this works. In fact, the black bars here are just like in the first figure. How accurately does the classifier work if we train and test the classifier just on data from a single individual? But the white bars here are, how accurately does the classifier work to decode one person's brain activity if we train, not on any data from that person, but instead train only on data from all the other participants in the study?
And you can see that, on average, the white bars are about as high as the black bars, meaning that, on average, we can decode the neural signal in your brain to tell which of 60 words, in this case 60 concepts, you're thinking about. And we could do that as well if we trained on everybody else in the room as if we trained on your brain.
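The cross-subject test can be sketched as a leave-one-subject-out loop; everything below is a hypothetical stand-in, and it assumes all subjects' images have been registered to a common space:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cross_subject_accuracy(data, labels, held_out):
    """Train on every subject except `held_out`, test on the subject never seen.

    data[s] is a (n_trials, n_voxels) matrix for subject s, registered to a
    common anatomical space (an assumption of this sketch); labels[s] gives the
    stimulus class for each trial.
    """
    train_X = np.vstack([X for s, X in enumerate(data) if s != held_out])
    train_y = np.concatenate([y for s, y in enumerate(labels) if s != held_out])
    clf = LogisticRegression(max_iter=1000).fit(train_X, train_y)
    return clf.score(data[held_out], labels[held_out])

# Synthetic demo with 9 hypothetical subjects and binary stimulus classes.
rng = np.random.default_rng(0)
data = [rng.normal(size=(60, 500)) for _ in range(9)]
labels = [rng.integers(0, 2, size=60) for _ in range(9)]
print(cross_subject_accuracy(data, labels, held_out=0))
```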
So think about that for a minute. That's a remarkable statement about the consistency across you and me and all of us, in terms of the way in which our brains encode meaning, at least of concrete nouns. That's what these were. And of course, we're all very different. But despite that, when it comes to distinguishing "hammer" from "house" and "airplane" from "dog" or "eye," and so forth, we have remarkably similar patterns of neural activity that encode those things.
OK, so we actually did a number of experiments of this type to ask similar questions, not only, is the code the same in your brain or mine? But we could ask, is the code the same across languages? We got bilingual people. And we trained when we presented words in English and decoded when we presented the same concepts, but this time in Portuguese. And in the English-Portuguese bilinguals, the answer is it doesn't really matter.
We also did it with words versus pictures. And to first order, it doesn't really matter. Although, we do find we always consistently get better decoding accuracy when we present pictures instead of words. I think that's because there's more signal in visual regions.
We also tried a number of different types of words. We found emotions are actually at least as decode-able as concrete nouns. So we gave people words like "anxiety" and "fear" and "love" and "hate." We also found that if we gave abstract nouns like "democracy" and "justice," those are much more difficult to decode. And so are verbs if we present them in isolation.
In fact, one of the first things we tried, once we saw that we could do this with nouns, was we just gave a list of verbs to people-- "enter," "push," "eat," "hug," "run." And we found we couldn't classify them at all. It was kind of a surprise.
Later, we found that we could classify them if we put them in a context. So instead of saying "heal" and "cut"-- where we could not distinguish which of those verbs the person was reading-- we tried instead giving them the two-word phrases "surgeons heal" and "surgeons cut." And those were totally discriminable. So verbs have this interesting property that, once you ground them in some context like that, suddenly we can see the difference in the neural code in a repeatable way. But without doing that, we didn't.
So there are many things like this you can start to ask and try to study by training classifiers. But if you think about it, this is really just opening the door to the possibility that we could develop an empirical theory of some kind, a model, of how brains represent word meanings. And if the only thing we do is train classifiers for different words, then what we end up with is a list of what the neural activity pattern is for "hammer" and what it is for "house" and what it is for "cheese" and what it is for "wine" and so forth. But that's just the list. It's not a theory.
So it would be interesting, instead, to have a theory of word representations in the brain. And what is a theory, anyway? A theory is really a logical system that makes predictions about phenomena that you have not yet observed in experiments. So we became interested in the question of whether we could build a system that we could give it any noun, and it would predict the neural representation for that noun.
And we found, after some struggle, that, in fact, this is possible. And the first version of this model that we trained actually predicts the neural activity in a two-step process. If you give it, as input, a word like "telephone," the first step is that it will look this up in a trillion-word collection of text from the web and represent that word as a vector of corpus statistics about "telephone"-- in particular, how often does "telephone" occur with the verb "listen" versus the verb "eat" versus the verb "run," and so forth. And that's simply a lookup. The second step is, then, to predict the neural activity at each of 20,000 locations in the brain as a function of that vector of semantic features that approximates the meaning of the input noun.
So to push on that and show you an example-- here, for example, if you use the noun "celery," you see that it occurs a lot with the verb "eat," a lot with the verb "taste," not very much with the verb "ride." And on the other hand, the noun "airplane" occurs a lot with "ride" and not very much with "rub" and "manipulate." So you can see that these verb frequencies kind of jibe with your own intuition about what the semantics of the noun are. And so that's step one of the model.
Step two, then, is to predict, say, from the corpus statistics for "celery," the neural activity at each location in the brain. And the particular prediction is that the predicted activity at voxel v-- any given voxel, any location in the brain-- is just the sum over those 25 verbs-- "eat," "taste," "fill," et cetera-- of how frequently verb i occurs with the input word "celery," times some coefficient that has to get learned during training and that indicates the degree to which verb i contributes to voxel v.
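Written out, the prediction just described is a simple linear model (the notation here is added for clarity and is not from the talk):

\[ \hat{a}_v(w) \;=\; \sum_{i=1}^{25} c_{v,i}\, f_i(w) \]

where f_i(w) is the corpus co-occurrence frequency of the input noun w with verb i, and c_{v,i} is the coefficient, learned during training, that says how much verb i contributes to voxel v.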
And in fact, this plot here is a plot of those coefficients for the verb "eat" and for the voxel shown here. And so in the end, the model is making an assumption that we could predict the neural activity for any noun, based on adding together a weighted linear combination of-- we had 25 verbs-- 25 different patterns of neural activity corresponding to those 25 verbs, weighted according to how often that noun occurs with that verb. So that's the model.
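A minimal sketch of that two-step model, with ridge regression standing in for whatever regularized fit was actually used, and with synthetic stand-in data:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical shapes, synthetic data:
#   verb_features: (n_words, 25) co-occurrence statistics of each training noun with 25 verbs
#   brain_images:  (n_words, n_voxels) observed fMRI image for each training noun
rng = np.random.default_rng(0)
verb_features = rng.poisson(5.0, size=(58, 25)).astype(float)
brain_images = rng.normal(size=(58, 20000))

# Step 2 of the model: each voxel's activity is a weighted sum of the 25 verb
# features, so a multi-output linear regression learns one coefficient c_{v,i}
# per (voxel, verb) pair. Ridge here is a stand-in for the actual regularization.
model = Ridge(alpha=1.0).fit(verb_features, brain_images)

# For a new noun, step 1 is just a corpus lookup of its 25 verb co-occurrence
# counts; step 2 is this linear prediction of activity at every voxel.
new_noun_features = rng.poisson(5.0, size=(1, 25)).astype(float)
predicted_image = model.predict(new_noun_features)   # shape (1, 20000)
```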
Does it work? Well, one way to answer that is to show you. When we trained the model on 58 words, not including "celery" and "airplane," its predictions for "celery" and "airplane" are shown on top. The actual observed images, never shown to the program, are shown on the bottom. And so you can see that, even though the predictions are imperfect, they're capturing some of the structure that's actually in the observed image in a kind of approximate way.
OK, another way to look at the accuracy more quantitatively is we could ask: if we leave two words out during training, then present them for the system to make predictions about, and then show the system two images without telling it which one is "celery" and which one is "airplane," can it tell us which one is which, given that it didn't see these two words during training? And the answer is, yes, it can, with 79% accuracy. So three times out of four, given two words it hasn't seen before, this model can distinguish which of these two brain images corresponds to which of those two words.
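The leave-two-out test itself reduces to a simple matching rule; correlation is used below as a stand-in similarity score, and the function name is made up:

```python
import numpy as np

def matches_correctly(pred_a, pred_b, obs_a, obs_b):
    """Leave-two-out check: is the correct pairing of predicted to observed
    images more similar (by correlation) than the swapped pairing?"""
    def corr(x, y):
        return np.corrcoef(x, y)[0, 1]
    correct = corr(pred_a, obs_a) + corr(pred_b, obs_b)
    swapped = corr(pred_a, obs_b) + corr(pred_b, obs_a)
    return correct > swapped

# Over all held-out word pairs, the fraction of True results is the matching
# accuracy (79% in the experiment described above, versus 50% chance).
```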
So this is-- well, this is interesting for a couple of reasons. But fundamentally, what this model is assuming-- what it's based on-- is the idea that neural encodings for nouns are actually not just random hash codes. They're built up out of more primitive semantic components, and, furthermore, whatever those semantic primitives are, we span them pretty well with this 25-dimensional space of 25 verbs we happened to make up.
So to the degree that the model is successful-- in three times out of four, it can tell you, for two novel words it hasn't seen, which brain image goes with each. To the degree that it's successful, it's confirming these two assumptions on which the model is built. So really, the main lesson here is that, wow, these neural codes, they're not just random hash codes. They really are composed. There's a structure to the neural codes for meaning that's built up out of more primitive semantics.
OK, so since then, we've pushed on this by asking, well, we just made up those 25 verbs-- that's probably not the best set of features we could've used. And so we've experimented with alternative corpus statistics and found, for example, that our current favorite set of corpus statistics is based on doing a dependency parse of tens of billions of sentences and representing each given word, like "mango," by how frequently it occurs with a particular dependency edge attached to another word.
So for example, how often is "mango" the subject of "grows," or how often does it have the adjective "ripe," and so forth. So this is a very high-dimensional set of features. But it tends to be more complete and more accurate than the model that we originally used.
We also found that we could replace these corpus statistics features, the original verbs, not only with these more detailed parse features, but also with behavioral features. And in fact, Dean Pomerleau made up a wonderful set of 218 features that will remind you of the game "20 Questions," and we showed that, if we got people to answer these questions for each of the words, then we could use those 218 features in place of the verbs-- meaning that people's subjective judgments about whether nouns satisfy these different features are also a good basis for predicting the neural activity that encodes different words.
OK, but the reason I wanted to push on this is that it turns out the best, most accurate model we've found so far is actually one where the semantic features themselves were learned, because you can reformulate the question of what the best set of semantic features is as a machine learning question. And in fact, Indra Rustandi, one of our PhD students, did that and found a model that works even more accurately. And I want to talk about this for a moment because this is actually an indicator of the direction that we're heading more and more, in terms of trying to use machine learning methods to build up models.
One thing that Indra did was he said, well, in the original model, we trained a separate model for each person. I'm going to, instead, train a model simultaneously, using data from 20 different fMRI sessions. In fact, in half of those sessions, people saw the 60 stimuli as words only, in text. And in the other half of the sessions, they saw both a word and a picture together. But it was, nevertheless, the same 60 stimuli-- "hammer," "house," et cetera.
So in building the model, his first step was to use a method called canonical correlation analysis to learn a latent vector representation that captures the variation across the 60 stimuli in each of these different data sets. And canonical correlation analysis is actually just a fairly basic method that finds one latent component at a time-- 20 latent features, in the end. For each component, it learns a linear projection-- a linear function-- from each of those 20 data sets that gives you a single number from each brain image.
But for each of those 20 data sets, it learns a different linear function. And it learns those 20 linear functions in a way that makes those individual scalars maximally correlated across the 60 different stimuli. So you can think of it in the following way. If I have, say, only two brains, the green brain and the red brain, then for the red brain, each column in this matrix represents one fMRI image. And there are several images corresponding to different stimuli.
Similarly for the green brain, which might be a different-sized brain-- a different number of voxels, for example. We just learn two vectors that represent linear functions and that project each data set down to just five numbers, corresponding to the five brain images. And WX and WY are learned to maximize the correlation between those two. And we just do this repeatedly.
First, we do it and get the first latent component. Then, we find the second latent component in a similar way. But it has to be uncorrelated with the first component. So this gives us a latent representation that has the interesting property that there's a linear mapping from that latent representation to each of these 20 subjects.
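For intuition, here is what one two-view canonical correlation analysis looks like in scikit-learn, on synthetic stand-in data; the actual work used a multi-set variant spanning all 20 sessions, and real fMRI dimensions would call for regularization:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Synthetic stand-ins: X and Y hold images of the same 60 stimuli from two sessions,
# possibly with different numbers of voxels (the "red brain" and "green brain" above).
# Tiny voxel counts keep plain CCA well-conditioned for this sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 40))
Y = rng.normal(size=(60, 55))

# Each component learns one linear projection per data set (WX, WY), chosen so the
# projected scores correlate maximally across the 60 stimuli; later components are
# constrained to be uncorrelated with earlier ones.
cca = CCA(n_components=5)
X_scores, Y_scores = cca.fit_transform(X, Y)
print(X_scores.shape, Y_scores.shape)   # (60, 5) latent scores per data set
```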
OK, then the second step in training the model is actually to connect up the words to it. For that, what Indra did was he used those 218 behavioral questions, like "is it hairy" and "is it bigger than a bread box," and he just learned a regression to predict each of those 20 latent features from the behavioral data known for the stimulus words. And then, in a final step-- these linear projections are going the wrong direction if we want to go from a word to a brain image. So in the final step, he just inverts those.
So now, he has a feed-forward model that, for any word, predicts the neural activity in each of 20 different people. But it predicts it, and it's trained in this way that, I think, has a very important feature. Some of the parameters learned in this model are subject independent. The parameters learned here are not specific to any subject in the data.
Other parameters are very subject specific. And they show how we go from the person-independent, abstract characterization of neural activity and project down to the person-specific, brain-specific activity that we'll actually observe. And so to me, the key property of this model is that it shows us how we can start to integrate data, not just across different subjects in the same experiment-- in this case, actually across slightly different variants of the experiment, where we have, in some cases, words plus pictures as stimuli, and in other cases, just words.
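One simple way to realize that final feed-forward direction is with two stacked regressions, standing in for the learned projections and their inversion (all names and data below are hypothetical):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical pieces: 218 behavioral answers per stimulus, a shared 20-dimensional
# latent representation (e.g., from CCA), and per-session voxel images of varying size.
rng = np.random.default_rng(0)
behavioral = rng.normal(size=(60, 218))
latent = rng.normal(size=(60, 20))
session_imgs = [rng.normal(size=(60, 500 + 100 * s)) for s in range(3)]

# Subject-independent parameters: map word features to the shared latent space.
word_to_latent = Ridge(alpha=1.0).fit(behavioral, latent)

# Subject-specific parameters: one map per session, from latent scores to that
# session's voxels (a stand-in for inverting the learned linear projections).
latent_to_brain = [Ridge(alpha=1.0).fit(latent, imgs) for imgs in session_imgs]

# Feed-forward prediction of every session's activity for a new word.
new_word = rng.normal(size=(1, 218))
predictions = [m.predict(word_to_latent.predict(new_word)) for m in latent_to_brain]
print([p.shape for p in predictions])
```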
And thereby, it gives us a chance to start bringing to bear more and more data. We could just keep adding more and more data sets here and get more and more reliable estimates of what this latent representation should be. And this is actually one of the keys for, I think, the future in using machine learning methods to train useful, reliable models. We have to find ways to train them using data from multiple experiments and multiple people.
Otherwise, we're going to be fundamentally data starved. The brain is complicated. We don't have enough data from one simple experiment, from one single person. And so the only way we're going to be able to build sophisticated models is if we can gang up the data from many experiments and many people. And this is just a step in that direction. There's a lot more to be done. But the net result here is that the system, in fact, is our most accurate system for modeling this kind of data.
It also leads to an interesting answer to an interesting neuroscience question. So one question you might be curious about-- I was; I still am-- is what's the difference really in the neural activity when you see a picture of a hammer versus just the word H-A-M-M-E-R? And it turned out that one of the subjects here in this word-and-picture study was also one of the subjects here in the word-only study.
So one thing we could do is look at, let's say, the dominant feature in this latent representation for that person. And we can look at what the mapping is from that latent feature to neural activity when they saw a word with a picture, versus the same person seeing only a word. So let me show you what that looks like.
Here is the latent feature in the case that-- well, let me say it this way. The first latent feature, if you ask, what does it mean, what I can tell you is that, out of the 60 words, the five that most heavily weight this latent feature are "apartment," "church," "closet," "house," and "barn." It's kind of interesting. You might think, oh, those are things I can take shelter inside of. Those are things I can walk inside of.
If you ask me what are the most negatively weighted-- so non-zero, but negative weights-- it's these words, to which you might think, oh, those are things I cannot walk inside of, but that I can actually hold in my hand or pet with my hand. And in fact, if you look at the projected activity associated with this latent feature, you can see evidence of that.
So for example, this little red blotch here is the parahippocampal gyrus, which we often see when we see words about buildings and things you can walk inside of, whereas this blue activity over here is premotor cortex. Blue means negative weight, which means that it's accentuated for these words. And so premotor cortex activity is actually predicted by this model if the word happens to be one of these-- that is, something that you use your hand with. So you can see this kind of structure in the data.
But the reason I wanted to show you this in detail is I can now show you-- this is the projection from that first latent feature onto the neural activity when the person saw the stimuli as a word with a picture. I can show you the same diagram, but when the stimuli was a word without a picture. How do you think that will differ from this?
Here's how it differs. OK? I'll go back and forth. This is word only; word and picture. Interesting. So what I see here is that when there's a word with a picture, we get a lot of posterior activity. Remember, up is the back of the head, close to visual areas, including fusiform gyrus. That's what this red strip is. Whereas when I go to a word only, I see much of the same neural activity, but not as much in the posterior regions of the brain.
On the other hand, once you get forward from this place, it's pretty much unchanged. So the neural activity is actually quite similar. Like if you look at this spot and I go back and forth, the neural encodings are actually quite similar, except in the back part of the brain. So this is a way of starting to look at what's the difference in the salient neural encoding of the same stimuli when it appears either as a word only or as a word with an accompanying picture.
OK, so that's all I really want to say about words and where the neural activity is associated with those for now. But another kind of interesting question is what is the temporal component of this? So with fMRI, the nice thing is we can get nearly one millimeter spatial resolution, but we cannot get very good time resolution. With fMRI, you might take an image every second. But the neural signal is actually-- the impulse response to an impulse of neural activity in fMRI blurs things out over several seconds.
And several seconds is way too slow to study your comprehension of words. Do you know how fast you comprehend words? So when you read the morning newspaper, most people read about three words per second. So with fMRI, we can't look inside and see what's going on. But with MEG, another brain imaging method which we've been using quite a bit recently, we can get a one millisecond time resolution movie starting when the word appears on the screen and watch what happens in your brain.
And in fact, by most people's measure, the nominal amount of time it takes to comprehend a word is 400 milliseconds. So I want to show you a movie of one person's brain looking at the word "hand." And this movie starts at 20 milliseconds before the word "hand" appears on the screen. And it goes up to around 600 milliseconds. So I'll just show you. And so that you can watch the activity here, I'll just read off the times for you. OK, so here we go.
What? This is the strangest-- OK. This person is dead.
[LAUGHTER]
I'm going to try that again. That was pretty funny. Not intended. It was supposed to be a real movie, not-- oh, man. How sad. OK, well, my apologies. But--
AUDIENCE: Can you play it from the source instead of the embedded version?
TOM M. MITCHELL: Yeah, I bet I can. OK, don't look. There's highly private references coming out here. Let me try it this way. It worked when I gave this talk recently at Williams, so I'm going to play it off of that set of slides. Oh, no. That's wrong. Let me try this one more-- darn.
OK, here's our last best chance. Let's hope for the best. OK, my apologies. It appears to be running, but there's nothing going on. OK, so never mind. We're going to skip that movie. And I will summarize it verbally, which is a great way to-- there's a lot of dynamics. There's activity running all over the brain.
[LAUGHTER]
I have no idea what happened there. OK, so if you saw that movie or if you trust me that there's a lot of neural activity running all over the brain, it leads to an obvious question of, well, how is it that the nice, pretty, static picture of spatial distribution of neural activity evolves over time?
And so there are a couple ways you can get at the answer to that. But what Gus Sudre did, one of our PhD students, is he decided he would train about 1 million different classifiers, each one looking at a different piece of the brain-- he divided the brain into 72 anatomically defined regions-- and each one looking at a 100-millisecond time window in the one-millisecond resolution movie of brain activity.
And so he would ask, well, if the word appears on the screen at time 0, what about the interval from 250 to 350 milliseconds over here? Can we decode the word there? But that was even too crude for him. So instead of asking, could we decode the word, he asked, could we decode any of those 218 features that were the 20-questions-style features? Like, could we decode whether this stimulus word is bigger than a bread box by looking over here at that time window? And could we decode whether it's graspable by looking over here at some other time window?
So he tried all time windows, all 72 brain regions, all 218 semantic features, trained a lot of classifiers, and found out, with cross-validation, how accurately they could successfully decode subsequent stimuli not included in the training data. And most of them failed miserably, because most of your brain does not encode most of the semantic features most of the time.
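The search itself is just three nested loops, one cross-validated classifier per (region, time window, feature); the data layout below is a hypothetical stand-in, and the statistical thresholding needed to call any single result significant is omitted:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def search_regions_times_features(meg, feature_labels, window=50, step=50):
    """Cross-validated decoding accuracy for every (region, time window, feature).

    meg[region] is a (n_trials, n_sensors, n_timepoints) array at 1 ms resolution;
    feature_labels[name] is a binary label per trial (e.g., "bigger than a bread box").
    Both layouts are hypothetical stand-ins for the real data structures.
    """
    results = {}
    for region, data in meg.items():
        for start in range(0, data.shape[2] - window + 1, step):
            # Flatten the sensors-by-time window into one feature vector per trial.
            X = data[:, :, start:start + window].reshape(len(data), -1)
            for name, y in feature_labels.items():
                acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=4).mean()
                results[(region, start, name)] = acc
    return results   # most entries hover at chance; the interesting ones rise above it
```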
But some of them succeeded. And so he produced a kind of movie that looks like this that shows-- here, in color, are some of the brain regions he was looking at. This movie shows when and where different features of the stimulus are encoded. So for example, here at 100 milliseconds, you see that the classifier could successfully decode the number of letters in the word-- the word length-- at 100 to 150 milliseconds. Sorry, these were 50-millisecond time windows: 100 to 150 milliseconds.
And if it was a line drawing, it could also decode some of the perceptual features of a line drawing, but nothing about the meaning of the word. Same at 150. Once you get up to 200, it could decode its first semantic feature. Is it hairy, which I now think is actually a synonym for, is it an animate object that can threaten you? Later on, as time passed, there are more and more features that could be decoded. And you can see information flowing around the brain in a rather interesting way.
So this answers, for the first time, a couple questions. Like before I showed you this, I should have asked you, well, what do you think goes on in there when you look at a letter string, H-A-M-M-E-R? Do you think that your brain works and works and works, finally figures out that it's H-A-M-M-E-R, indexes into something which then activates the semantics simultaneously? Well, apparently not.
Apparently, instead, these semantic features trickle in over time, some earlier, some later. And they have a particular duration. They last for a certain amount of time. And then, they disappear, or they show up elsewhere in the brain. And so I just want to show you in a little bit more detail some of that data so you can really see what's going on.
So the way to think of this plot is it shows here horizontally different brain regions. And it shows here, vertically, time since word onset. And the color here indicates how accurately the classifier in this brain region at this time could decode the number of letters in the word, word length.
And what you see very clearly here is that, at around 100 to 150 milliseconds, suddenly in half a dozen different brain regions, it's possible to independently decode the word length in each of those half dozen brain regions. And also, you can see that, by the time we get down here to say 350 milliseconds post word onset, none of that information is decodable anymore.
OK? What if we look forward? So this was basically at 100 to 150 milliseconds. If you go forward to animacy, which turns out to be the earliest semantic feature we found encoded-- that is, is the stimulus word you're looking at something that's alive? Again, here you see, with the same kind of plot, several different brain regions that simultaneously, suddenly support decoding whether this stimulus word is animate or not.
And furthermore, that animacy feature hangs around-- that is, it can still be decoded later at, say, 400 milliseconds. So it hangs around for hundreds of milliseconds, generally past the 400-millisecond time mark where most people think that the semantics of the word has been understood. And by the way, there's this one interesting region, the left supramarginal gyrus, that happens to encode that feature, not at the same time that the others did, but only at 400 milliseconds. Kind of interesting.
OK, we can look forward to other features. Grasping happens even later. This is primarily between 200 and 300 milliseconds. But again, half a dozen different brain regions simultaneously, suddenly encode, is the stimulus word something that's graspable? And interestingly, the left supramarginal gyrus again, at around 400 milliseconds, encodes it, just like it did the other feature.
And we can go on. Size-- is it bigger than a car? And again, several regions encode this at the same time. And again, the supramarginal gyrus encodes it only at 400 milliseconds. So with this kind of a technique, we can start to build not just an activity flow map of the brain, but an information flow map of what is being encoded in that neural activity over time, over space, during the brain's comprehension of that word in the 400-millisecond window.
So there's a lot more to be discovered here. But I think this kind of suggests a way of starting to try to understand the information flow. We still don't know what the algorithm is that the brain is using. But at least we can start to see what some of the variables are that are being encoded in neural activity at different times and places.
Let's skip over a little. OK, so really everything that I've said so far is really about single word comprehension. And we looked with fMRI primarily at the spatial distribution. And the key result there, the key surprise there was that the spatial codes actually have a structure. They have a substructure. They're built up out of those more primitive, semantic codes.
And then, with MEG, we're starting to be able to look at those semantic primitives, like-- is it an animate thing? Is a graspable thing? Is it an edible thing? And see where those different semantic features are showing up and when they are showing up during word comprehension.
We still don't know much about how. And I'll leave that for now because I want to just wrap up by telling you a few things about some newer work looking at phrases and sentences and stories. Because of course, what makes language interesting mostly is that we speak in sequences of words, not single words. And there's much to be asked about what's going on up there when we read phrases or sentences or stories.
So let me start with some work by one of our students, Alona Fyshe, who's looking at adjective-noun phrases. The question she's interested in is, if I show you a sequence of words like "tasty tomato" or "rotten tomato," how is the meaning of that phrase assembled out of that sequence of word perceptions?
And so she's been presenting to people, in the MEG scanner, these word pairs, where she'll first present the adjective, then the noun, with 500 milliseconds between words. And of course, when we say "tasty," your brain presumably comprehends the word "tasty." Then, it gets "tomato." Now, it's comprehending the word "tomato," but it's also somehow building up the composed "tasty tomato" meaning. And so she's interested in studying this.
One way to study it is to, again, go back to training classifiers. And you could train one over the entire time window, starting when "tasty" appears on the screen through when "tomato" appears on the screen. And in fact, that's the first thing she did. The first thing she did was she said, let's try to train an adjective classifier, and we'll give it the entire time window from here through here.
Well, actually, she said, first, let's see if we can decode adjectives. We hadn't done adjectives before. And the answer is you can decode adjectives with about the same accuracy as nouns. So that's fine. Then, she said, what about here? Can we decode the noun? Yes, you can, about the same.
But then, she said, what if we look here? Can we decode what the adjective was? And if so, where and when is the neural activity during this time window, encoding what the adjective was?
So I could show you the answer by building a plot. On the vertical axis are going to be the 306 MEG sensors. So each column in this plot will correspond to one snapshot of the MEG reading. And the horizontal axis will be time, starting with the onset of the adjective, which is on the screen for 500 milliseconds. And then, there's 300 milliseconds of dead space. And then, there's the noun for another 500 milliseconds.
And what you're looking at here is each column corresponds to one point in time. But the colors here are not reflecting the raw MEG activity. These are, instead, the learned weights of a classifier that's trying to decode, from the MEG activity, what the adjective was. And so you see that this classifier is using-- red means high positive weights. Blue means high negative weights. This innocuous color means weights close to zero.
So you can see that it's using a lot of the information during the time when the adjective was on the screen to decode the adjective, but that it also can use certain aspects of the activity while the noun is on the screen to decode what the adjective was.
AUDIENCE: Clarification?
TOM M. MITCHELL: Yes.
AUDIENCE: Can you do the same if the two words do not compose, like if you just have--
TOM M. MITCHELL: Oh, like if I say "tasty cloud"?
AUDIENCE: Sure, or just two nouns--
TOM M. MITCHELL: I have no idea.
AUDIENCE: --the preceding noun.
TOM M. MITCHELL: Oh, we did not try that. The question was, what if you have two nouns that don't really compose? Can we still decode? From other data, I think the answer is probably no. But we didn't try the nonsense combinations. We've tried other combinations.
So the interesting thing here is that the neural activity is actually modulated, during the time the noun is on the screen, by what that adjective was, and that we can decode, with some accuracy, looking just at this time window, what the adjective was. So it's in there. OK, so now, this leads to kind of an interesting question about the encoding scheme itself.
See, what this is telling us is that we can decode what the adjective was here if we train a classifier on this part. And separately, we can train a classifier on this part and decode what the adjective was. But is it the same neural code? Or is the brain using a different code here than the way it was encoding the adjective here?
So if I wanted to answer that, again, I can turn it into a machine learning question of the form, what if I train my adjective decoder here at this time? Can it successfully decode what the adjective is at this time? And if the answer is yes, then I know it's using the same code. If the answer is no, apparently not.
So Alona did this. And in fact, what she did was she did it for all times. So I could show you a plot of-- on this axis will be what time we train at, from 0 milliseconds forward. On this axis will be what time we test that trained classifier at. And then, the intensity of color will be how accurately we decode if we train at this time and test at that time.
AUDIENCE: Is it trained at a fixed time or of a rate for a sequence?
TOM M. MITCHELL: In what I'll show you, it's a 100-millisecond window. And it looks like this. OK, so now, this is a little-- requires getting your head around it a little to think about it. But what is this saying? Again, here's when the adjective's on the screen, then blank time, then the noun's on the screen. This axis is telling us when we train. This axis is telling us when we test.
The color is telling us-- if it's solid blue, it means we don't get a statistically significant decoding. If it's non-blue, then it tells us that we're getting some decoding. So the main thing that you notice here is this big blotch. What that means is-- well, let's look at the diagonal. The diagonal, it's just put here in blue, but it's like hyper red.
The diagonal is, what if we train and test at the same time, again, with cross validation, using different repetitions? But then, we can decode the adjective pretty much any time in the adjective window. And similarly, we can decode the adjective early on during the noun window, a little bit less well here, and then very well again here, at the end of when the noun window occurs.
Furthermore-- this is interesting-- after the phrase is gone totally, the noun and the adjective gone, we can still decode what that adjective was. So it appears that there's a kind of consolidation, wrap-up burst of neural activity that we see in the brain after the adjective-noun phrase, during which the adjective is decodable.
The other thing you notice here is that if we train at any time when the adjective is actually on the screen-- even though we can decode at other times when the adjective is on the screen-- if we try to use that to decode what the adjective was when the noun is on the screen, it totally fails. It totally fails. That means even though we can decode the adjective during the noun, it's not using the same neural code that we would get if we trained when the adjective was there. It's a different--
The adjective is modulating the neural activity during the time the noun is on the screen. But it's not modulating it in a way that encodes that adjective with the same pattern of neural activity. Otherwise, we'd see some decodability here.
On the other hand, if you look up in this corner, it's actually quite interesting. That corner says that if we train after the phrase is off the screen, during this burst of wrap-up neural activity, and then test on MEG activity from when the adjective had been on the screen, we can still decode it to some degree. So there is some sharing of the neural encoding-- that burst of neural wrap-up activity after the phrase disappears from the screen is somehow using a similar neural encoding, similar enough that we get some decodability there.
So this lets us start to look at these questions of not only when and where is it decodable, but to what degree is it the same encoding. And that helps us tease apart one of the problematic things in this whole methodology, which is that the fact that we can decode the adjective might or might not mean that the brain is using that neural signal for a particular purpose. It just means that the adjective is modulating the activity in a systematic way. But here, we can at least ask if it's the same systematic way-- if it's the same encoding.
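That train-at-one-time, test-at-another analysis can be sketched as a temporal generalization matrix; the data layout and the sliding step below are assumptions, with the 100-millisecond window taken from the talk:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def temporal_generalization(meg, y, train_idx, test_idx, window=100, step=25):
    """Train an adjective decoder at each time window, test it at every other window.

    meg: (n_trials, n_sensors, n_timepoints) array at 1 ms resolution (hypothetical layout);
    y:   adjective label per trial. Returns a (train time x test time) accuracy matrix;
    above-chance off-diagonal cells mean the same neural code is reused across times.
    """
    starts = list(range(0, meg.shape[2] - window + 1, step))
    acc = np.zeros((len(starts), len(starts)))
    for i, t_tr in enumerate(starts):
        X_tr = meg[train_idx, :, t_tr:t_tr + window].reshape(len(train_idx), -1)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y[train_idx])
        for j, t_te in enumerate(starts):
            X_te = meg[test_idx, :, t_te:t_te + window].reshape(len(test_idx), -1)
            acc[i, j] = clf.score(X_te, y[test_idx])
    return acc
```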
AUDIENCE: On top of that diagonal, in that upper-left corner, there is still this diagonal track. Is it just because of how we measure, or is that a real diagonal track?
TOM M. MITCHELL: Oh, those are real-- well, it's both actually. So this sequence of four dots could be-- we're using 100-millisecond time window, and we're sliding it. And so if there's a lot of neural activity at this particular time, we'll pick it up for several sliding windows in a row. And that's one reason we see the diagonals.
But I have to tell you, these diagonals, they're much more substantial than that. And I could show you other pictures that look much more like this, where it appears that there's an alpha, 10.5-Hertz variation. We see this across most of our subjects-- that if you train at one time and then you want to do as poorly as possible, you should test 50 milliseconds later.
And if you want to do well again, you should test either at the time you trained or 100 milliseconds later. So it does appear in the data across most of our subjects that there's some weird interaction with 10.5 Hertz variation here. So what that means--
AUDIENCE: What were you asking your subjects to do as they read these?
TOM M. MITCHELL: Oh, wow. I don't remember for this particular experiment to be honest. For some of them, we were asking them to do a one-back test to tell if there were repetitions, just so that they would attend to it. And for others, we were giving them a more semantic task, like is this a sensible or a nonsense combination? Although, all the data we trained on here were sensible combinations. I don't remember is the short answer.
AUDIENCE: Have you tried this in Portuguese or Spanish? In Spanish, it could be first or second--
TOM M. MITCHELL: Yeah.
AUDIENCE: --might be different things.
TOM M. MITCHELL: Yeah, "casa blanca" and "white house." That would be perfect. No, we haven't, but that's a wonderful idea.
AUDIENCE: But in Spanish, you could put it in either order. It means slightly different things, but they both make sense together.
TOM M. MITCHELL: Oh, I didn't-- I see. I didn't realize that, in Spanish, you could go either way. But yeah, that would be a really good idea. One thing Alona is trying is some adjectives like "large" and "small," which have this interesting property that if I say to you, "large mouse," "small car," you still think that the small car is probably bigger than the large mouse.
And so those kinds of adjectives, whose final meaning is partly relative to the typical noun-- right? A small car is still bigger than a large mouse. Those kinds of adjectives allow Alona the opportunity to try to decode both what the word was-- was it "large" or "small"-- and what the actual size is. Is it three meters or five meters?
And so by setting it up that way, she's trying to tease out the difference between when the neural signal is encoding the token itself, "large," versus the composed meaning of "large car" or "small mouse." So she's got some interesting experiments of that type that she's working on now. But I really like your idea of flipping the noun-adjective order.
AUDIENCE: Question?
TOM M. MITCHELL: Uh-huh.
AUDIENCE: Early on, you presented some data of some people who were particularly easy to predict what they were saying. I suspect you haven't gotten there yet. But it might be interesting to ask if you can pick out people who are like that. In some sense, they're really clear thinkers in that sense that you can--
[LAUGHTER]
It would be very interesting to see what behavioral or personality characteristics make some brain waves more predictable.
TOM M. MITCHELL: The highest order term is how still they are inside the scanner.
[LAUGHTER]
Which might be correlated with the property you're interested in. I don't know.
AUDIENCE: Oh, that's terrible.
[LAUGHTER]
TOM M. MITCHELL: But you're right. So let's see. I have another six or seven hours, but I don't want to do that. But I do want to show you-- let me stop here on the adjective-nouns. And let me say that we have found a very similar kind of effect to this when we provide people not just adjective-noun phrases, but simple sentences-- noun-verb-noun sentences like, "The student attended the lecture."
What we find is we can decode the individual words from MEG data when they're on the screen, just like we could, here, decode the adjective and the noun. But very interestingly, we also find that after the sentence is over, during the next 800 milliseconds, there's a burst of neural activity that is actually larger than the neural activity at any point during the sentence. And during that burst of neural activity post-sentence, we can decode the verb that was in the sentence with almost, not quite, the same accuracy as when the verb was on the screen.
So it appears that, not only for these adjective-noun phrases is there a burst of neural activity-- a kind of wrap-up, where maybe the brain is thinking about the composed meaning-- but we see the same thing in these simple noun-verb-noun sentences. And it's a very robust effect that we see in those sentences, provided you actually give people a second or two after the sentence and don't just give them another word right away after that.
So we don't really know quite what's going on there. But what I'm hoping is that obviously, somewhere along the line, the brain is assembling the meaning of the proposition, "attend student lecture." And somewhere in there, there must be a neural code for that proposition, or we couldn't have thought that thought.
And what I'm hoping is that, maybe in that wrap-up period, we'll have a chance to decode the proposition. We have not been able to do it yet. But we have been able to decode what the verb was, which is kind of the predicate of the proposition. Yeah.
AUDIENCE: Have you performed any experiments with varying the modality at which the stimulus is presented, verbally versus visually?
TOM M. MITCHELL: Not verbally. We've only done text and pictures in the case of single words. We have not done auditory presentation. That would be a very interesting thing to do. Yes.
AUDIENCE: Tom, I'll wait.
TOM M. MITCHELL: That's good. We can--
AUDIENCE: OK, so let's do--
AUDIENCE: Have you done any--
TOM M. MITCHELL: OK.
[APPLAUSE]
AUDIENCE: OK, questions?
TOM M. MITCHELL: Yes.
AUDIENCE: Have you done any experiments on blind people, on blind subjects?
TOM M. MITCHELL: We have not done blind people experiments. I think Rebecca Saxe has done some blind people experiments. You might talk with her. Yes.
AUDIENCE: So I was really interested in that first set of MEG studies that you showed different classifiers at different times because one of the things I've been interested in the course, in language, we have lots of different levels of representation, [INAUDIBLE] probably more [INAUDIBLE] and massive amounts of top-down [INAUDIBLE]. Right, I mean, [INAUDIBLE] given a sentence may have several thousand possible interpretations, most of which we know to be-- basically, none of which we know is the one that we use.
TOM M. MITCHELL: Right.
AUDIENCE: And we've been able to establish something about the time dynamics of that. And you are activating some of the other possible interpretations, but they're gone within a few hundred milliseconds. We know, in some cases, where we can measure through, like, eye tracking or something, a little bit about shifting interpretations as the sentence is coming through. But there's not been any real good way of getting really fine-grained detail of what on Earth is going on as we're trying to work out what the exact interpretation is.
And I was curious if you thought about using that method for exactly this sort of thing, so with, "The card players were in the midst of a game of bridge." And you apparently initially get the bridge over troubled water interpretation. But that's gone fairly rapidly. You could, for instance, build classifiers from sort of neutral contexts and neutral words and then, maybe learn something about how the correct interpretation is-- what interpretations are activated and the dynamics that suppress the wrong ones.
TOM M. MITCHELL: Yeah. Those are great questions. We haven't had time or wisdom to look at any of them yet. But I agree with you. They're dynamite questions. We are in the process of collecting a lot of data on a lot of sentences. And so probably, even inadvertently, we have data that might be interesting to look at from that angle.
And yeah, so it would be really interesting. Maybe we could talk. And I would love to get what you think are prototypically great sentences for exhibiting this kind of temporary confusion or multiple meanings and then resolution, because it would be really interesting to see if we could get any evidence of that from the encoding.
AUDIENCE: I just have a technical question about this. So one of the things you focused on about the MEG data, that frequently you have decodeability emerge in many regions simultaneously. And MEG doesn't have good spatial resolution. And any subdivision you make of MEG data is an artificial subdivision imposed on a much smoother data set. So I just wondered, how do you establish the independence of your estimates of those different brain regions that gives you the confidence to say that is, in fact, many different brain regions representing that information simultaneously, rather than many artificial subdivisions imposed on one brain region--
TOM M. MITCHELL: Yeah.
AUDIENCE: --at one time.
TOM M. MITCHELL: Yeah. So you're right. So the point here is a very important one. What MEG gives you is a bunch of sensors around the outside of your head listening for magnetic fields coming out of the brain. Reconstructing the three-dimensional set of neural activity that gave rise to this two-dimensional percept is an underdetermined problem. So we have to make assumptions, like minimum norm assumptions, which guess what that three-dimensional volume of neural activity was, based on the assumption that it had the smallest total energy in the brain, for example-- which is what we do.
So then, that leaves open the question of, did you get this right, or are you just incorrectly assuming that these six regions are coding it because you misassigned what the actual three-dimensional neural activity is? So we don't-- let's see. I don't have anything like a proof to go with that. I do say some of these brain regions that simultaneously encode the activity are very distant from one another. So I take that as some kind of good sign that maybe this is a real phenomenon.
And on the other hand, we also see that several of these brain regions that are implicated in, say, simultaneously encoding one feature will also simultaneously encode some second feature. And so that could cut either way. It could be because, in fact, those regions do work together in some interesting, coordinated way. Or it could be a sign that we're consistently misassigning the actual internal neural activity to those six regions.
AUDIENCE: Let's see. I think the food will evaporate.
[LAUGHTER]
So in order to avoid this phenomenon--