Neural decoding of spike trains and local field potentials with machine learning in Python
Date Posted:
April 3, 2019
Date Recorded:
April 2, 2019
Speaker(s):
Omar Costilla Reyes
Description:
Speaker: Omar Costilla Reyes, PhD
Neural decoding has applications in neuroscience, from understanding neural populations to building brain-computer interfaces. In this computational tutorial, I will introduce neural decoding principles from a machine learning perspective using the Python programming language. The tutorial will focus on data preprocessing, model selection, and optimization for decoding neural information from spike trains and local field potentials. The studied dataset contains neural information from six cortical areas of the macaque brain spanning from the frontal to the occipital lobe.
Speaker Bio
Omar Costilla-Reyes is a postdoctoral researcher at the Miller Lab, MIT. Omar’s research focuses on developing machine learning methodologies to understand neural dynamics in cognitive neuroscience.
Additional Info
To follow along with the tutorial, bring a laptop with Anaconda Python 3.7 installed. Download for Windows/MacOS/Linux: https://www.anaconda.com/distribution/.
Computational tutorial references and videos can be found on our stellar site (https://stellar.mit.edu/S/project/bcs-comp-tut/index.html) or on the CBMM learning hub (https://cbmm.mit.edu/learning-hub/tutorials#block-views-learning-hub-blo...)
PRESENTER: Today, we have Omar from the Miller Lab who's going to be talking about neural decoding of spike trains in local field potentials with machine learning in Python.
OMAR COSTILLA REYES: Hello, guys. Thank you. Do I need to speak here with my--
PRESENTER: Yeah, you do.
OMAR COSTILLA REYES: All right. So thank you all for coming. So today, I'm going to present to you a tutorial on neural decoding. We're going to do that with spikes and LFPs. So we're going to be using Python, and also machine learning. So let's get started.
So first, I'm going to give you some introduction on neural decoding, what neural decoding is, and also some background. And then we're going to jump into a Python tutorial with LFPs and spikes. And then if we have time, we are going to play with all of this and change some parameters, change the data, change the model, and then see what happens. We might even break the code. But it's OK.
Can you guys raise your hand if you have experience with Python, experience with Scikit-learn? Who's doing only MATLAB? All right. OK, so we have a majority of Python. But there are a couple also doing MATLAB.
So Python is actually very similar to MATLAB. But if you don't understand some of the commands we're going to present, don't be shy. Ask questions. This tutorial is for everybody. So if you have any questions, or feel that you don't know the background or anything, I'll be happy to explain it for you. And if we run over time, we'll just explain as far as we can get.
So just an introduction to spikes and local field potentials. The example that you can see here is that we are recording from certain areas of the brain. And then we extract a wideband signal. And then the signal can be high-pass filtered to obtain spike signals. And if we low-pass filter, we can obtain the local field potentials.
So this recording is very special because we can listen to the neurons that are near the probe that we're inserting in the brain. And then based on that, we can discriminate the spikes by their waveforms, and also by the structure we see here. And then we have also the low-pass-filtered signal that can help us obtain the summed electric current flowing from multiple nearby neurons. So the local field potential is more like a summation of what's going on around this probe here, while the spikes are single neurons that we can isolate from this wideband signal.
So who's working with spikes or LFP here? OK. That's great. Yeah, great to hear, guys. So you're going to learn a lot today.
So this is what spikes usually look like. So we have here a spike train going over time. So just here, you can see we're listening to some neuron in the brain. And then the way we can process these spikes is to obtain a firing rate here. And we obtain that by binning over time.
So what we do is we take some bin width. And then we go over time and count the spikes in each period. So it's basically like computing a histogram here. But sometimes, we want a continuous signal. So something we can do is approximate the firing rate by using a Gaussian window of, for example, 50 milliseconds.
So first, we get the binned counts. And then from here, we can pass to the firing rate like this. So this continuous signal will be useful for certain types of processing that we can do with the signal. And that's actually the signal that we're going to use today for this tutorial.
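The binning and smoothing the speaker describes can be sketched in a few lines of NumPy and SciPy. The function name, bin width, and toy spike train below are illustrative assumptions, not taken from the tutorial's code:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def firing_rate(spike_times_ms, t_start, t_stop, bin_ms=10.0, sigma_ms=50.0):
    """Bin spike times (in ms), convert to spikes/s, then smooth with a
    Gaussian window. Illustrative sketch, not the tutorial's preprocessing."""
    edges = np.arange(t_start, t_stop + bin_ms, bin_ms)
    counts, _ = np.histogram(spike_times_ms, bins=edges)
    rate = counts / (bin_ms / 1000.0)  # spike counts -> spikes per second
    smoothed = gaussian_filter1d(rate.astype(float), sigma=sigma_ms / bin_ms)
    return edges[:-1], smoothed

# Toy spike train: one spike every 20 ms for one second (about 50 Hz)
spikes = np.arange(0, 1000, 20.0)
t, rate = firing_rate(spikes, 0, 1000)
```

The histogram step is the binning in the slides; the Gaussian filter turns the jagged counts into a continuous rate estimate like the one used in the rest of the tutorial.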
So here is an example of the spike rate that we're going to analyze today. So each color here represents one area of the brain. So we have PFC and V4. And you can see here that in some cases, the spike rate is very high. And for other cases, it's very low, depending on how the neuron is reacting to certain stimulus. So that's the example of the signal.
And here, we have another example, for the LFP. So you can see it almost looks like noise here. But there are actually some frequency components going on that my lab usually analyzes to try to infer what's going on in the brain. And again, there are two areas, which are PFC and V4.
So I'll explain a little bit what neural decoding is. Neural decoding applies across a wide range of brains; it's not only these animals or some special types of species. Every single brain that exists today does some kind of encoding.
So we have here, for example, this image, which is very famous in computer science. And then we encode it in our brains as a certain pattern of activity with neurons over time. And then for different conditions, we obtain different patterns.
So this happens automatically in our brain. So the image goes from here, and then our brains express this image like that. So our job as neuroscientists, and also as engineers, is to go from here and try to go back by somehow predicting the stimulus from just reading the neural activity here.
So being a little bit more specific, here we have the type of stimulus that we're going to analyze today. So it is the color or motion of a stimulus here. And then we have arrows that represent how these patterns are moving on a screen while a monkey is watching. So during this tutorial, we're going to analyze this type of stimulus. And then, going from here, we will try to go back and see how much of this we can decode, based on the activity in the brain.
So expanding the analysis and expanding what I've been saying about decoding, for example, suppose that those are two neurons of the primate brain in a certain area. And then we show a spider. And we can see here that we have certain selectivity of neuron 2 for the spider. So we can see that there is activity here.
And then when the Leaning Tower of Pisa is shown, we can see here that neuron 1 is more selective to this one. So there is a certain pattern on here. And our job is to say in which cases in which time this happens.
So something that we can do is just to plot the firing of the neuron 1 by the firing of the neuron 2, and then try to come up with a [INAUDIBLE] of what's happening here. So this is a very easy way to do that. And then we can express the trials this way.
So each dot here is a trial. And it's represented in a two-dimensional space. So then we can use a machine learning model that can process this data and come up with some classifier that can actually separate these two types of activity. And then we can say with a certain accuracy how well we're doing at classifying these two types of patterns when they happen in the brain.
So also, I want to say why neural decoding is important. So in neuroscience, the first types of analysis were single-neuron analyses. So Professor Miller built his career with single-neuron analysis. And that's a few decades ago now. And now we are evolving to systems that can actually record many, many neurons at the same time with a high level of accuracy.
So an example of what I'm showing here is Neuropixels. So it's a special type of probe that can acquire many neural recordings at the same time. So for example, we have here 741 neurons that were acquired at the same time from the mouse brain. So here, we have the visual cortex, the hippocampus, and the thalamus.
So look at how many neurons we can acquire today. So how are we going to be able to make sense of this data? So we have all this huge data. And the larger it is, the harder it gets to analyze. But we have an advantage today, which is that we also have advanced computing systems, computer systems that are getting faster and faster. And people are also coming up with new ways of analyzing this data, thanks to open libraries like those in Python.
So what's the pipeline of decoding analysis? So first of all, we have the data, as I explained before. Here, we have the LFPs and the spikes. And then we need to do some type of preprocessing with the data. So we can do dimensionality reduction. Or we can average the signal. Or we can downsample the signal.
So actually, the data that we're going to use today has been downsampled. So you can actually play with it and it's easier to process. And you will be able to run it on your computer. So that's some preprocessing that I have already done for you. But you will also need to do it for your own datasets.
We're not going to go over that step today because we have limited time. But feel free to reach out to me and I will also explain to you how to do it. But this is one of the stages that we have to do for neural decoding.
Then we need to go to model selection. So we can choose from linear models, nonlinear models, or ensemble models. And we're going to go into some detail with some of these models. And then we can go to model validation and evaluation. So we have our model, we have the data, we have done some preprocessing. And now I want to obtain some accuracy score, that is, how well I'm doing at predicting a certain behavior from the data that I have. It can be spikes or LFPs, which is going to be the focus for today.
So to explain a little bit more on model selection, I'm going to go over a little bit of how that's done. So imagine that we have a fish/cat dataset. So what has happened here is that a monkey was shown a picture of a fish or a picture of a cat. And then we were able to record five neurons' responses, as you can see here. I don't know if you can read it. But these are the neural population responses.
And so we have here from N1 to N5. And then the shade of the color here says how much activation I had for a certain neuron. And then my dataset goes from here all the way to here, so having either a fish or a cat, all the way down here.
So then my job is, if I want to predict a certain variable, if I want to build a model that says there is a color for fish, I have to divide my dataset into training and testing, to train my model and then test if my hypothesis was correct, what I was assuming. And then I need to select this model here. And then I have to pass the training labels and the training data to this classifier to train it. But I have to select it from the family of linear, non-linear, or ensemble learning models. And then I have a trained classifier at the end.
So once I have this guy here, then we can pick up the test set. So here is, again, our trained classifier. And then the test data goes in here. And then the test labels come in here only after the prediction is made, because I'm waiting for the prediction to be made. And then I have the ground truth here, and then the predicted labels, and then I can compute some statistics on how well it's doing.
So in this tutorial, we are going to go over all this process, but in a better way. And how is that a better way? We're going to do cross-validation. So how does cross-validation work? Instead of having just training and testing data, I have several folds where I can train and test my assumptions.
So doing the same with the cat and fish dataset, here, I can separate by folds. So I have one fold, two folds, three folds. So I can pick up neural data from anywhere here, and I can build my splits here.
Then once I have them, I select two of them for training and one of them for testing. So I do the same for all three folds. Then at the end of the day, I'm going to get three accuracy scores from the three folds. And then what I can do is average the scores. And then I can obtain an average score for the representation here.
One of the problems with this kind of approach compared to what I showed before is that here, we need to train three models instead of just one. So it's more computation, rather than just training one model. So that's one of the disadvantages. But this approach offers more: the classification accuracy that we get is less biased than training and testing with just one classifier.
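The fold-by-fold procedure just described can be written out explicitly with scikit-learn. The fish/cat data here is synthetic stand-in data, with the class means shifted apart so the two classes are separable:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
# 60 trials x 5 "neurons": class means shifted apart (made-up data)
X = np.vstack([rng.randn(30, 5) + 2.0, rng.randn(30, 5) - 2.0])
y = np.array([0] * 30 + [1] * 30)  # 0 = fish, 1 = cat (arbitrary)

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X, y):
    clf = LinearSVC().fit(X[train_idx], y[train_idx])  # a new model per fold
    scores.append(clf.score(X[test_idx], y[test_idx]))
mean_score = float(np.mean(scores))  # the averaged cross-validated accuracy
```

Three models get trained, one per fold, which is exactly the extra computation the speaker points out; the payoff is the less biased averaged score.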
So now we are jumping to the dataset. Any questions so far? Yes.
AUDIENCE: Yeah. I see you're splitting the dataset completely into multiple cross-validation sets.
OMAR COSTILLA REYES: Yes.
AUDIENCE: Is there any advantage to that over, for example, random selection, multiple times, even though they're meant to overlap?
OMAR COSTILLA REYES: Yeah. No, we don't do any kind of overlap. So we separate the folds, and the data never sees each other. Because we want to keep the folds independent, so that the accuracy estimates we get are essentially independent. And then we can compare the overall accuracy there.
All right, guys. So now, we are going to go into the dataset that we're going to be using and playing with today. So here is the task. So it's a motion and color categorization dataset. So basically, what the monkey has to do is to categorize either motion or color when a stimulus is shown.
So this is the experimental execution. So we're going to go over this from the notebook that I have for this tutorial. You really need to understand this.
So first, the fixation lasts 0.5 seconds. So the monkey just needs to fixate at the center of the screen. And then a cue is shown after this fixation. And the cue gives some indication that the monkey has to categorize either motion or color here. And then once the cue has been shown and the stimulus is shown, the monkey has 3 seconds to decide whether it was a motion or a color sample. So that's the entire experiment.
I'm going a little bit more into detail on what's going on in the experiment. So after the fixation has happened, the cue is shown. But there are four types of cues. There are two for motion and two for color. So when either of the two of those is shown, the monkey already knows whether it's going to be a motion trial or a color trial. So depending on the configuration here of the color and the motion, the monkey has to decide either to look left or look right.
So to understand this a little bit more clearly, here we have the response. So supposing that a motion cue was shown. Suppose that this X was shown. So and suppose that this pattern of the arrow up is shown here. So then the monkey has to categorize this as left/right because it's right here, and then go back here. And then if you look at the up arrow here, it says left when it's a motion trial. So it's the same, depending on the side of the screen here. So you're going to be left or right, depending on the configurations here.
But the stimulus here is always shown at the center. So this is not actually-- it's only for you to guide what's going to happen with the response. But the type of arrows and the type of color is always shown at the center of the [INAUDIBLE]. Any questions so far? Because we're going to go here and there.
AUDIENCE: So what is the point of using two different kinds of rule keys for the same rule?
OMAR COSTILLA REYES: To dissociate the stimulus.
AUDIENCE: I mean, you have the cross and the star shape for the motion.
OMAR COSTILLA REYES: Yeah.
AUDIENCE: What was the point of using two for one kind of rule?
OMAR COSTILLA REYES: Well, the reason is that it's less probable that the monkey is going to infer the rule from just one cue per rule. So we have more types of cues that we can show to the monkey. So it's four instead of just two. So the monkey has a harder time memorizing the patterns. So it's cleaner when we have two cues, because the monkey has to make a greater effort to learn that the cue means either a motion or a color categorization.
All right. Anything else, guys? Yeah.
AUDIENCE: So in this kind of cue, what's the interpretation? If the monkey makes the wrong choice, is it because it took one cue wrong, or is it because the monkey took both cues and then couldn't decide which? Say the monkey thinks both the motion and color are important, but then I don't know which way I should go.
OMAR COSTILLA REYES: Yes. So for example, the cue is always shown. So there is always one cue shown in any trial. And then if we combine this with the pattern that is shown, we have this plus the pattern. Then, the monkey has to decide either to look left or look right. And if the monkey doesn't do it right, as the pattern is showing here, then that's an incorrect trial. We label it as incorrect, because it's not doing what it was trained to do.
So this is after the monkey has been trained. How long does it take to train these monkeys?
PRESENTER: 2 years.
OMAR COSTILLA REYES: Yeah. So it takes 2 years for them to learn these kinds of rules. So it's hard. It's hard for them. And it's a lot of effort that they have to--
PRESENTER: It's hard for the post doc, too. It's hard for the post doc, too, who's training the monkey.
OMAR COSTILLA REYES: Yeah. So this is the way the experiment works. But we're going to focus entirely on categorizing, on decoding the motion or the color stimulus, based on the correct trials. So we're going to eliminate the wrong trials here.
PRESENTER: Just to answer your question, the goal of the Miller Lab is to study task switching. That's why they have this really more complicated set of cues. So it's not just picking up a cue and making a decision. They want to see if the monkeys can mix and match a few different types of sensory inputs and come up with a decision. So they have to, during the task, switch from one rule to another. So that's why they have this kind of more complicated setup.
OMAR COSTILLA REYES: Yeah. That's true. So we like to perform very complex experiments, which, at the end of the day, are harder for us to do. But it pays off at the end if we can get this kind of paper published. But anyway, moving on a little bit.
So if you remember, I showed you this image at the beginning, showing you these are the spike rates and things like that. But now we can understand what's going on over time. So now we have labels of what's happening in time.
So first, we have the fixation. This is in milliseconds here. So you have fixation from 0 to 500 milliseconds. Then from 500 milliseconds to 1,500 the cue is shown. And then the stimulus is shown after that.
So once the stimulus goes off here, the monkey has 3 seconds to respond. Let me show you the response very quickly. So it was like 200 milliseconds after that.
So these are the brain readings that were recorded for the task that I just showed. And we're going to focus on PFC and V4. So this is probably one of the biggest datasets that my lab has. We are generating more at the moment. But of the ones we have finished recording, this is probably one of the biggest, because it has several areas of the brain.
And I'm doing some research with this dataset right now, and I'm going to publish something about it soon. But for the case of the tutorial here, we're just going to focus on two areas so we can have time to discuss what's going on, and for you to also run the experiments on your laptop if you want to do it.
So what are the objectives of this tutorial? So as I have explained to you, we're going to do spikes and LFPs decoding. And we're going to focus on two areas, V4 and PFC. And we're going to do that in a single experimental session. So it's a very small dataset compared to the 44 sessions in total that we have for this task.
And then we're going to perform cross-validation. And then we're going to evaluate how we are doing with cross-validation using a metric called the F-score. If you want to know more, this F-score is a combination of precision and recall. It's a harmonic mean. And if you want to find out more, I have all these links you can check out. And we're going to share the slides and the code with you so you can do this at home, as well.
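As a quick check of the definition, scikit-learn's F-score matches the harmonic mean of precision and recall computed by hand on a toy prediction (the labels below are made up):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up ground truth and predictions for a binary decoding problem
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]

p = precision_score(y_true, y_pred)  # of the predicted 1s, fraction correct
r = recall_score(y_true, y_pred)     # of the true 1s, fraction recovered
f1 = f1_score(y_true, y_pred)
harmonic_mean = 2 * p * r / (p + r)  # the definition the speaker mentions
```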
So as the title of my talk suggested, I wanted to focus on Python. And I really want to justify why I think Python is the best language to do this kind of analysis. So here, we have the different types of packages that we have in Python. So something very important to acknowledge is that all these packages are being developed by people like me, or somebody that is not being directly paid to develop these packages.
So we have here something like Astropy or SunPy. This is for astronomy, and this is for solar physics. So all these packages that we're going to jump into in a second are very open source. So you can go and see what's going on in the package. So you can just look at the source code, and everything is open.
But there is a very lively environment of people who are developing all these tools. And they're always looking for ways to improve the analysis of anything. So for example, Astropy is very active. So they have a lot of versions going forward, in terms of how to process in parallel, or whether to use GPUs; there are people working on different aspects. And I think that this is the best way you can construct environments for data analysis, compared to other environments, like R or MATLAB, which are more limited in the type of operations that you can do, or in the scope of the language, and offer less flexibility than Python in the things that you can do.
For example here, for the tutorial that we're going to do today, we're going to focus on these libraries. So Matplotlib is for plotting. Pandas is for data processing. SciPy is for processing, as well. We have Scikit-learn, Jupyter, and you can see that just for this tutorial we're using all these libraries. So we are making use of the ecosystem just by doing our research and by doing neural decoding.
So I also wanted to jump in a little bit before running the code. I wanted to explain to you some of the basics of the machine learning models, in case you don't have much experience with them. So for this tutorial, we're going to focus on Scikit-learn, which is the most widely used machine learning library in Python.
So we are going to start by doing modeling with a linear support vector machine. So here, what you can see on the screen down here is that we have two types of classes. One class, you see in white. And one class, you see in black.
And then the way the SVM works is that we construct a hyperplane that can separate these two classes. And then it takes the training samples that are very close to the boundary to create support vectors that define the hyperplane and separate the data as much as it can. So this is a very important advantage, because the way the SVM works lowers the generalization error of the classifier. So it will just generalize better.
So this algorithm is widely used in many applications and is fast. It's linear, and it has been proven to be one of the best in machine learning. So that's the reason why we want to focus on it.
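A minimal sketch of the idea in scikit-learn, on made-up two-cluster data: after fitting, only the training points near the boundary are kept as support vectors, and the separating hyperplane is given by the learned weights.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two separable clusters of 20 points each (synthetic data)
X = np.vstack([rng.randn(20, 2) + 3.0, rng.randn(20, 2) - 3.0])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear").fit(X, y)
n_support = clf.support_vectors_.shape[0]   # boundary points kept by the SVM
w, b = clf.coef_[0], clf.intercept_[0]      # hyperplane: w . x + b = 0
```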
But I also wanted to show you an ensemble learning algorithm, which, in this case, is called Extra Trees. This is very fast compared to another ensemble learning algorithm that's called Random Forest, that is also very widely used. If you have never seen any ensemble learning algorithms before, the way it works is that it constructs trees like this.
And then we pass the data to the trees. And the trees have to make some decision on the data. And then we obtain the predictions out. And then by some voting mechanism, we can take what these trees came up with. And then the majority decides the final prediction.
So if we go back to the SVM, there, without the trees, the data just goes into a single model. And then we get a single accuracy score. But here in ensemble learning, we're taking several trees. And then at the end, we have some voting function that can take the prediction from each of the trees.
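The voting mechanism can be made explicit with scikit-learn's ExtraTreesClassifier: each tree predicts, and the majority of the votes gives the final label. The data is synthetic, and the vote counting is spelled out only for illustration; in practice forest.predict does this internally (via averaged probabilities):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(1)
X = np.vstack([rng.randn(30, 5) + 1.5, rng.randn(30, 5) - 1.5])
y = np.array([0] * 30 + [1] * 30)

# 50 randomized trees, each fit with randomly chosen split points
forest = ExtraTreesClassifier(n_estimators=50, random_state=1).fit(X, y)

# One prediction per tree, then a majority vote across the trees
votes = np.stack([tree.predict(X) for tree in forest.estimators_])
majority = (votes.mean(axis=0) > 0.5).astype(int)
```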
So going back a little bit to cross-validation, I'm going to use the cross_validate function from Scikit-learn. This is very new. So if you were using an older version of Scikit-learn, I recommend that you update your library, because this is a brand new function for cross-validation.
And the way it works is that we pass all the data, we pass all the labels, and then some scoring function, in our case, some motion labels and color labels. And then we can split the data with some strategy, for example, two folds, three folds, or however many folds we want. And for example, one of the advantages of doing these kinds of things is that you can set in this parameter, n_jobs, the number of CPUs you want to use for the computation.
So it's just as easy as putting -1 in this parameter here to say that I want to use all the processors in my computer. And this is very good because, as I said, you need to train a model for each of the folds. So if you train in parallel on a computer with five processors, you can send each fold to one of the processors in your computer. So this is going to speed up the computation.
And this is done automatically. So there is a library called Joblib. But it runs under the hood. So you don't really see how it's working. But it's sending the jobs to the processors in your computer.
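Putting the last few slides together, here is a hedged sketch of the cross_validate call on synthetic data; the actual notebook passes the session's LFP or spike arrays and its own scoring definition:

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(30, 5) + 2.0, rng.randn(30, 5) - 2.0])
y = np.array([0] * 30 + [1] * 30)

# cv=3 trains one model per fold; n_jobs=-1 lets joblib use every core,
# dispatching each fold's fit to a separate worker
result = cross_validate(LinearSVC(), X, y, cv=3, scoring="accuracy", n_jobs=-1)
test_scores = result["test_score"]
```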
So now it's time to do the tutorial. So can you show hands if you're going to follow the tutorial with me? OK, great. So who has not downloaded all the files yet? OK.
PRESENTER: So there's a link in email that went out right before this. There's a link to a [INAUDIBLE] page that has a link to GitHub with the data. So that's one way to get it. Or, you could go to the link up there.
OMAR COSTILLA REYES: Yeah. So I have this link here. I mean, whoever is not yet with us downloading the files, please follow this link. It's just bit.ly/spikes-lfp-decoding.
AUDIENCE: What kind of system is best for running Anaconda? What kind of system would you [INAUDIBLE] Windows, Linux, Mac?
OMAR COSTILLA REYES: Well, the first system it was developed for-- well, Anaconda is Python, right? So what Anaconda does is compile all these packages together, so you don't need to install all the scientific packages yourself.
So I think Python was first written on Linux. So I think it will be better optimized on Linux, because that was where the core of the language was developed. And then they started moving to Windows and Mac OS. So if you really want to optimize your code and things like that, I think Linux is the best way to go.
So you follow the link. You should be here in my GitHub web page. So you can click here on the links here from Download LFP Data. And then we have also the spike data.
So you can use any of the two links. I put two because sometimes it doesn't work. So you have two links you can follow for each of the data sets once you are in this web page. So don't type the links. Just go to the GitHub web page.
And then once you have the files-- for those of you that already have the files-- just go to the directory that you downloaded the files to. And then just unzip the files. So if I have this file here, I can right-click, then Unzip, Extract Here, for both of them.
So who's using Anaconda for the first time? OK. So if you are using Anaconda, you can just launch Jupyter. Actually, if you type Jupyter in Windows, it's just going to open the Jupyter notebook.
Actually, let me just show you what you have to do. So whoever is using Anaconda, you just have to go to the Anaconda prompt. So there should be a link either on your desktop, or if you search in Windows for Anaconda prompt. You should click on that. So then, it should open a terminal. It should open something like this, when you click on the Anaconda prompt.
And then once you are in the terminal, just type jupyter notebook. And then it should open something like this in your web browser. So is everybody here? Is anybody having any problems so far?
So if you are in Jupyter now, you can just go to the place where you downloaded the files. Go to that folder. Go to the folder on your computer where the files were downloaded. And then we're going to open the LFP notebook first. So once you enter the-- let me see if I can make this bigger. OK.
So here, I am inside the LFP decoding folder, right? And I can just click here on this notebook to open it, once I have unzipped it. So here is the zip file, LFPdecoding.zip. And I can see both the LFP and the spikes. And here, I have the two folders. Then I enter. And then I need to click this guy to open the first notebook.
Anybody having any problems so far? All right. So if you have never used Jupyter before, the way it works is that I have already written the code for you. So what we're going to do now is just run what I have written for the tutorial.
So one way to do that-- so can everybody see? All right. Yeah. So what I'm going to do now is, you can click Run here to start running the code. So when you click Run, you go one cell at a time. And also, a shortcut is to press Shift-Enter. So if you press Shift-Enter, you will run one of the cells here.
So is everybody who wants to follow here already? Or do you need any help so far to get here?
AUDIENCE: Sorry. It's [INAUDIBLE] I see Import Random Forest Classifier. But it doesn't appear in the following code that is actually used in your decoding. Or, was it--
OMAR COSTILLA REYES: Yeah. Well, the reason why I am including this is because if you want to try it, we can try it later on, if we have time. But we're not going to use it right now. The reason why it's there is that maybe somebody will say, can you run the Random Forest? And then we can go and do it. But it's not part of any of the code right now.
Anything else, so far? All good? OK, guys. So let's get first in the LFP notebook.
So here, we have NumPy, which is the scientific computing library of Python; Matplotlib, which is the plotting library; and OS, which is for some operating system things that we're going to do. And it's going to become clear what's going on. And all of these are part of Scikit-learn. So all of this is the machine learning library that we're going to be using today.
So the ensemble algorithms are here, stored in this variable and this method. Then we're going to use also preprocessing and some metrics, and model selection. Here is the cross_validate function that I showed before. So this is the way we import it from Scikit-learn. And then we also have here the F-score that I talked about a little during the presentation.
So if we want to import all of this, I just press Shift-Enter. And then it has executed this line already, as you can see here. All right.
So then we go and define our scoring function. So what does this mean? It means that we have the f-score as a metric that we're going to assign to the labels of 0 and the labels of 1. The labels of 0 are going to be the motion labels. And the labels of 1 are going to be the color labels. And then where it says here that we're not going to average anything, what this means is that we want a single score for label 0. So we don't want the model to combine anything. So that's the reason why the average argument says None.
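A runnable sketch of a scoring setup like the one described, assuming scikit-learn's make_scorer and f1_score. The dictionary keys ("f1_motion", "f1_color") are illustrative names, not necessarily the ones in the notebook; fixing pos_label gives a single, un-averaged score per class:

```python
# Hypothetical sketch of the per-class f-score setup; names are illustrative.
from sklearn.metrics import make_scorer, f1_score

scoring = {
    # f-score for label 0 (motion) only -- nothing is averaged across classes
    "f1_motion": make_scorer(f1_score, pos_label=0),
    # f-score for label 1 (color) only
    "f1_color": make_scorer(f1_score, pos_label=1),
}
```

Each entry returns one scalar per fold for its own class when passed to cross_validate, which is the single-score-per-label behavior described.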
So as I said, we are just going to work with one session of the dataset that I showed you. So this is the name of the session. And then we import it. This is just the way to specify a string here, for those people that are doing MATLAB. This is the same in MATLAB. Yeah, OK.
So now we go ahead and load the files. So as you can see here, this is the color file for PFC and V4, and for LFP. So we're going to decode local field potentials. And we are [INAUDIBLE] of the session. And here is the entire name of the file.
And then we load that into this variable called X_color. So if you go back here, those are the files that are in the directory. So we have color and motion, and color and motion.
So those are all data and all labels. And the way we do that in machine learning is we give an X variable to the data and then a Y variable to the labels. So we are just loading all the data right here. All right.
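A minimal, runnable sketch of this loading pattern. The file name, session directory, and shapes below are placeholders made up for the example; in the tutorial the .npy files already sit in the session folder:

```python
import os
import tempfile

import numpy as np

# Stand-in session file so the sketch runs anywhere (hypothetical name/shape)
session_dir = tempfile.mkdtemp()
color_file = os.path.join(session_dir, "session_color_lfp.npy")
np.save(color_file, np.zeros((47, 272, 800)))  # (electrodes, trials, time)

X_color = np.load(color_file)                   # data goes into an X variable
y_color = np.array(["PFC"] * 36 + ["V4"] * 11)  # labels go into a y variable
print(X_color.shape)
```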
So for example, if I just bring here the variable, I can obtain the shapes of the array. So the way this array is organized is that I have this number of electrodes, this number of trials, and then this number of features. So if we go back to the presentation, I showed you, here, the way the time domain is organized.
So here, the limit is 2,500 milliseconds. And you may ask, how did we go from 2,500 to 800? So this was one of the preprocessing steps I did. So I downsampled the time domain by a factor of three because the data was too large for me to give to you. It was like 600 or 700 megabytes. So I decided just to go and subsample the data so you can actually run it on your laptops.
So that was one of the preprocessing steps. It actually didn't affect much the [INAUDIBLE] or any of the processing that we're going to do later on.
So this is the way the array is organized. And then we can check if any of the data has any NaNs. So we check here if there are any NaN values. And we can see that there are no NaNs in the data.
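The NaN check can be done in one line with NumPy, for example:

```python
import numpy as np

X = np.array([[0.1, 0.2], [np.nan, 0.4]])
print(np.isnan(X).any())            # True: this toy array contains a NaN
print(np.isnan(np.zeros(3)).any())  # False: a clean array passes the check
```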
And then we have a look at the labels. So here, you can see that we have PFC and V4 labels. So we have 47 of them in total, for both PFC and V4. So for PFC, we have 36. And for V4, we have 11. So this is the number of electrodes that we have in this file.
So is everything here clear so far? So this is the number of electrodes for later. And then here is the number of labels for electrodes. And those are separated by PFC and V4.
So once I finish loading the color samples, we can just go ahead and load the motion samples. So I can load now the motion samples. And then I can print what's going on inside the array.
So here, I have my motion samples. And I have 47 electrodes, 370 trials, and 800 time features. So everything is compressed into a different axis. So I don't have to load one file for each of the data here. So I can just express that as a single array.
And then I can check again if there are any NaNs in the data. So this is the way to verify that you are not trying to model with any NaN values inside. And then I check, again, the labels of the motion data. And I can see here that I have 47 electrode labels, divided by PFC and V4 here. And also, I have the same number of electrodes for motion because it was captured from the same session at the same time.
So I have now this plot function. So what it does is just take the data, the labels, the number of electrodes, the number of trials, and the time. And then it's going to generate some nice plots of the data.
So for example, here is the way to find the time vector here, which is going to be your first axis. And then we can select which electrode I want to plot and the number of trials that I want to plot from the electrodes. And I have the rule for color and the rule for motion. So I passed the color here. And I passed the motion here.
So let's execute this command. And you can see, here it's the same data that I showed you before. So here, we have 10 trials for color. And we have 10 trials for motion. And this is the shape of the arrays. So we have the shape for PFC and the shape for V4. And this is the way the data looks.
AUDIENCE: You're making me dizzy, man.
OMAR COSILLA REYES: Sorry?
AUDIENCE: You're making me dizzy.
OMAR COSILLA REYES: So if we change here-- for example, for electrode 5-- and then we select, here, the number of trials to be 20, we can replot again. Let me see if this is better.
So we can see now that the data has changed. So this is a different number of electrodes and a different number of trials. For example, I can just put 1 here. If I change to 1, you only see one signal here, the plot for motion and for color. And this is all being done here with Matplotlib library.
I just called plot. And then I passed the time trials. And then I say the color. I assign a label. And then I just create some grids, legends, and titles. So we have accomplished something now, which is just getting some plots for later. Any questions so far?
So as I said, the code is online. So you can just play at home. And you can actually use your own data. So if you have something in this way, the array in this way-- so for example, you have the trials, the electrodes, and the time-- you can use this function to plot your own data.
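A small sketch of a plotting helper in that spirit, assuming data shaped (electrodes, trials, time). It uses matplotlib's non-interactive Agg backend so it runs headless; the names and the random data are illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 20, 800))  # (electrodes, trials, time), toy data
time = np.arange(X.shape[2])

electrode, n_trials = 0, 10
fig, ax = plt.subplots()
for trial in range(n_trials):
    ax.plot(time, X[electrode, trial], alpha=0.5)  # one line per trial
ax.set_xlabel("time (samples)")
ax.set_ylabel("LFP (a.u.)")
ax.set_title(f"electrode {electrode}, first {n_trials} trials")
ax.grid(True)
fig.savefig("lfp_trials.png")
```

Changing `electrode` or `n_trials` replots a different slice of the array, just like in the demo above.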
So we can keep going now. So here, we are taking the label array of the electrodes for color, and then we are printing the unique values that are inside. So we can have either PFC or V4, the two areas of the brain that we're interested in analyzing.
So what I said first is I want to focus on PFC. So I execute PFC first. And then I said that I want to try my classifiers and do my cross-validation in a time window of 50 milliseconds.
So what I do here is I start sampling from time in 50 milliseconds windows. And that's the way I'm going to pass it to my classifiers. So I take 50 milliseconds of that window and then run the entire algorithm, and see if I can decode from that window. And that's the plan of the cross-validation and the way we do that learning here.
And so for evaluation, I can do five folds. As I said before, we can change this to 10, or to two, or to three. So we can try that later on when we're done with the modeling. But for now, let's just try five, which is the default.
And then I want to use support vector machines. So I can put here linear, as well, to say that I'm using a support vector machine that is linear. And then I want to create a folder. So inside this folder, I'm going to save all the data that I'm going to simulate here.
And the reason why I'm doing this is because sometimes it's very expensive to run experiments. So the reason why I'm creating this folder and saving all the results there is because I can just pull it up later on if I want to generate some graphs or anything. And I don't need to run this again. And this is very useful when you're dealing with large datasets. You don't want to run a simulation of three days again and again because you need to change your plot. So in this way, we can just save the data somewhere and then pull the data out whenever we need it in the future.
So here is where the operating system import is being used. It's actually used to make a folder. So then we execute. And then our folder should be ready now. So if you look back at your folders, it should be there.
So it's been created, right? So we have now a new folder that has been created by our program. And you can Enter, and there is nothing inside there right now.
So then we define the support vector machine model that we want to use. And for that, we use this SVM module. So if you want to go back to the beginning, you can see from Scikit-learn, we just import SVM.
So then where were we? Here. OK.
So then I can define my classifier that way. And then, for example, if I want to see, I can put this question mark here to obtain more information about the classifier. So if I do that, then Jupyter is going to tell me this is the type of the variable that you're analyzing. And this is the format inside it. This is what it is. This is the package that it comes from. And there is the description.
It says linear support vector machine. This class supports both dense and sparse input, and the multiclass support is handled according to a one-versus-the-rest scheme. So you have all of this information about all the parameters, anything about that classifier. So you can just pull that up by adding the question mark here. So when we leave this empty, we don't declare anything inside here, right? If you see, back here, there are some parameters that we can set. But we have left them at their defaults, which you do by just putting the parentheses here.
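With all defaults, the classifier is one line (presumably scikit-learn's LinearSVC, whose docstring matches the help text quoted above; in Jupyter, `svm.LinearSVC?` shows the same description). The tiny toy problem below is just to show the fit/predict cycle:

```python
import numpy as np
from sklearn import svm

clf = svm.LinearSVC()  # empty parentheses: every parameter at its default

# Toy 1-D, linearly separable problem (illustrative data, not the LFP set)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf.fit(X, y)
print(clf.predict([[0.5], [2.5]]))
```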
All right. So let's keep going with our modeling. So now here, what I'm doing is just printing what's inside the motion shape. So we have neurons, trials, and the time domain, as we have seen before. So now here, something very interesting has happened. Because we now have selected the cortical area that we want to analyze.
So we're interested in PFC. So then here, we declare the PFC selection. And then we extract from the motion samples only the PFC electrodes. And this is the resulting vector that we obtain from there.
So then we roll the axis. So what happens here is that we change these axis to here, and then these two here. And the reason for that is because we're going to pass this through our machine learning model. So we want this to be the first axis that's going to take the samples in the machine learning model.
And then we're going to collapse these two axes at some point. Because that's the way we pass vectors through our support vector machine. It has to be samples by features. And here, we have three dimensions. So we want to collapse it to two.
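The axis move described here can be done with np.rollaxis (a sketch with placeholder shapes):

```python
import numpy as np

X_motion = np.zeros((36, 270, 800))  # (neurons, trials, time), toy shape
X_motion = np.rollaxis(X_motion, 1)  # bring the trials axis to the front
print(X_motion.shape)                # (270, 36, 800): samples come first now
```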
Any questions so far, guys?
AUDIENCE: Yeah. Well, you're making an assumption when you collapse those two dimensions. Because you have you have an autocorrelation function in time for any neural signal. And that autocorrelation function basically says that nearby points are related to one another. And you can actually see, as a function of time, how related they are.
And so nearby points that are next to each other within a trial are going to be strongly autocorrelated points that are not neighboring in time, i.e. different trials are going to be basically not correlated at all. So you're sort of confounding that, perhaps, by collapsing across those two things.
And in fact, correct me if I'm wrong, but does it even matter if those things are in time or out of time, the features are actually in time or out of time?
OMAR COSILLA REYES: We can draw what's going on. Suppose that I have a 50 millisecond window. That's the window I want to take for modeling. So then, suppose that I have four neurons. So what I do, I have four neurons. And suppose that I have 100 trials.
So this is the way my data is being organized. And if we go back to the format that I had before, the first axis is the trials, the second axis is the neurons, and the last axis is the time domain, right? So this is a window. So what I do there is I take four neurons and I multiply by their time domain, which is 50 milliseconds. So the total is 200 features.
So these 200 values are going to be the features of our model. So now I have an array of the form 100 by 200. So here are the samples. And then here are the features.
AUDIENCE: So again, the question is that there's a native meaning of nearby points in feature space.
OMAR COSILLA REYES: But when you say nearby points, you mean from the same neuron? Or, what exactly do you mean by that?
AUDIENCE: Yes, exactly. So different milliseconds that are from the same neuron are going to have a particular correlation function, so those measurements. That is very different from different neighboring time points from another neuron. So in other words, the question is that you have an autocorrelation. You have a correlation structure in your data. But the way that you're collapsing is not respecting that autocorrelation structure.
So then my question is, does that matter for the sorts of results that you might get? Or, maybe it doesn't. Maybe it doesn't even matter.
OMAR COSILLA REYES: Well, as far as I know, it doesn't matter for the experiments I've done so far.
AUDIENCE: So in other words, you can randomize the order of that second dimension of the matrix, and it would come out exactly the same.
OMAR COSILLA REYES: What do you mean by randomize?
AUDIENCE: The 200, the features with 200. So right now, they're ordered from 1 to 200. And in that order of 1 to 200, the elements 1 to 50 have a particular autocorrelation. And the elements 51 to 100 have a different autocorrelation structure, and so on and so forth in blocks of 50. Because 50 is the number of milliseconds.
OMAR COSILLA REYES: Yeah.
AUDIENCE: So nearby points in the feature space between 1 and 50 are related. Nearby Points between 51 and 100 are related, and so on and so forth. And so what you're telling me is that that relationship doesn't actually matter for the performance of the algorithm. I could just as easily randomize, take a random input of those orders, and get the same classification accuracy. Is that correct?
OMAR COSILLA REYES: Yes. But as I said, you need to take into consideration the context in which we're doing this. So first of all, it's one session. So the neurons here were acquired at the same time. So if we go back to the first image, they were filtered from the same wideband signal.
So I'm not sure if this holds on what you were saying. But if it does, if the autocorrelation also has a [INAUDIBLE] in that context, then it doesn't matter, I think.
AUDIENCE: No. This is not something that's a feature of the signal processing. This is something that's a feature of any neural signal. Neural signals in space and in time are related to one another. And across--
[INTERPOSING VOICES]
AUDIENCE: You have to design a specific algorithm that use [INAUDIBLE] correlation, perhaps you could use autocorrelation to make sense of these models. But these don't use that feature. So I don't see how you could-- I mean, if you can do SVM and gain further insight using the autocorrelation function--
AUDIENCE: No, SVM doesn't care if the order is--
[INTERPOSING VOICES]
AUDIENCE: It doesn't care at all.
[INTERPOSING VOICES]
AUDIENCE: The SVM doesn't care about what is the structure that you already know. It's saying it can extract whatever structure it sees. So if you already know what the structure is, then give it the way you want to give it. So it's OK.
So you might see differences if you structured it the way you want. But as for the rest, SVM is concerned in saying whatever you give me, I'm trying to find a structure. That's it.
OMAR COSILLA REYES: Yeah. So that's very particular to the algorithm. So it takes the feature space. And it can do any kind of transformation, according to the optimization algorithm that we have. The goal here is to find the maximum-margin hyperplane, the best hyperplane that can maximize the separation of the two classes.
But yeah, good question. Anything else, so far? OK, cool.
So we have now processed our motion data. So now what we need to do is-- I have my motion array. And I need to generate some labels for that motion array. So the way I do that is to use this NumPy repeat function.
So I said, OK, motion is going to be my 0 label. And then I take this shape 0, which is the trials, and then I give it that many repetitions based on this number obtained from here. So now I can say that my motion label is this label, repeated this many times, according to the trials here.
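In code, that label construction is one np.repeat call (shapes here are placeholders):

```python
import numpy as np

X_motion = np.zeros((270, 36, 50))          # (trials, neurons, time), toy
y_motion = np.repeat(0, X_motion.shape[0])  # label 0, once per trial
print(y_motion.shape)                       # (270,)
print(np.unique(y_motion))                  # [0]
```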
So now, we go to the color processing. And we need to do exactly the same thing. So I don't need to go over explaining here. But we're just rolling the axis from here. The trials are going to be first. Then we have here the neurons. And then we have here, over time, what's going on.
And then we use the same function. We use the repeat function. I'm saying you have the shape 0, which are going to be our samples. And they're going to be the ones that I want to repeat.
OK. So I think something happened here. We then roll here. I think I might have skipped that to rolling the array. Or, I may have executed that two times. But let's go back.
The trials have to be first. So they are not first. So I'm rolling this again. So for example, if I execute this line two times, this is going to roll this and then go back again. So I might not have done that. I don't know.
But we need the trials at the first dimension. Then we generate the labels from that dimension. I'm not sure why it's not-- OK. There it is.
So here is our first dimension. And then we generate the labels from that dimension. And then we just keep going. And then we have our color, which is 1, our motion, that is 0 right there.
So here are the shapes for both of them. So we have 270 for motion, 272 for color. Those are the trials. And then we concatenate the array. So this dimension here gets concatenated here. And then since we have the same dimensions in everything else, then we can do the concatenation by this axis.
So this is something very important to consider, as well. When you use this concatenate function, the only different axis that you can have is the one to concatenate. Everything else has to be the same.
For example, one of the problems that we're having right now is that over several sessions, you will have different types of structures in the data. So for example, here, I have 56 neurons. But in another session, I might have 50 neurons. So what's the best way to concatenate those two things? Because the trials are changing, but the neurons are also changing.
So I either have to fill in with zeros, or I have to throw out some of the data that I have. So those are things you also need to think about. So how can I deal with arrays that are not the same length when I'm trying to combine them?
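The concatenation rule in action, with toy shapes: arrays may differ only along the axis being joined, so joining motion and color along the trials axis works, while any other axis raises an error:

```python
import numpy as np

X_motion = np.zeros((270, 36, 50))  # 270 motion trials (toy shapes)
X_color = np.ones((272, 36, 50))    # 272 color trials; other axes match

X = np.concatenate([X_motion, X_color], axis=0)  # join along trials
y = np.concatenate([np.repeat(0, 270), np.repeat(1, 272)])
print(X.shape, y.shape)  # (542, 36, 50) (542,)

try:
    np.concatenate([X_motion, X_color], axis=1)  # trial counts differ
except ValueError:
    print("axis=1 fails: only the concatenation axis may differ")
```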
So yeah, this is the way the motion and color data look like now that it is concatenated. And then I also concatenate the labels. So I have motion and color labels that are this length. If we want just to see how the array looks, I can add here a print function. And then I can print the contents. So here are motion and color labels.
So I have created my own array of data. And I have created my own array of labels. And then I'm ready to start training my algorithms. Because I have my motion and color data and labels ready to be passed to a model.
But that's not the whole story. If we go back to the motion and color, to this array, we still have the 800 millisecond time frame. So that's from the beginning to the end. And I said, I want to take 50 milliseconds windows. So what's the best way to do that?
And one way to do that is to use this function called Array Split. So what happens here is it's part of NumPy. So what happens here is that I pass this data here. And then I say OK. So let me just print this so you can understand better what's going on here.
So I have hardcoded a time window of 50. We defined this earlier. And then this shape, number 2, is the time domain. So this is the length of the array from 0 to 100.
So then I want to tell it to generate splits over the time, using the number of windows I have here. So I pass this here this way-- I just pass these two, and then I also say by which axis I want the array to be split. So that's all the information I need to pass to this function in order to generate the windows here.
So I pass that into the function. And then we can examine what has been created in this new variable. So the variable is called Split Array Windows. And then we can look at the information inside this variable.
So the type of the variable is a list. And it's of length 16. So we can print the length here. And as you can see here, look at the elements inside. It has already split our array evenly by the number of windows that we need.
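A sketch of the windowing step with np.array_split (toy shapes): 800 time samples cut into 50-sample windows gives 16 windows:

```python
import numpy as np

window = 50
X = np.zeros((542, 36, 800))                 # (trials, neurons, time), toy
n_windows = X.shape[2] // window             # 800 // 50 = 16
split_array_windows = np.array_split(X, n_windows, axis=2)

print(len(split_array_windows))       # 16 -- a plain Python list
print(split_array_windows[0].shape)   # (542, 36, 50)
```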
So now we have everything ready to start training. Because I have my labels. All the labels are ready. And all my windows are ready, as well, to put into an [INAUDIBLE].
And then I define, here, my time vector. So here, it has to start at the first window. So the first window is 50. So that's where I start, because the first prediction is going to be after the first window, which is 50 milliseconds. And it's going to end at 2,500.
So then I generate evenly spaced numbers between this range. And the number is actually the length of our windows. So what I'm doing here is I want this many evenly spaced numbers to be created between 50 and 2,500, so we can see exactly at which point in time some stuff is happening, or some events of the decoding. And this is the way the array looks.
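That time vector is a single np.linspace call: one evenly spaced point per window, from the end of the first window (50 ms) to the end of the trial (2,500 ms):

```python
import numpy as np

n_windows = 16
time = np.linspace(50, 2500, n_windows)  # one time stamp per window
print(time[0], time[-1], len(time))      # 50.0 2500.0 16
```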
So now we are ready to start training our machine learning algorithm. So we generate the lists. These lists are going to be where we store the f-score and the standard deviation. So we're going to plot decoding over time. And we're also going to plot the standard deviation for color and motion. So I'm generating these lists in advance to say to the algorithm, I want to save, inside the lists, the scores that the algorithm is going to return for me.
So now I'm going to iterate. So this for loop says, for each window that's inside the split array windows, do all of this. And what's happening inside here is the collapsing line that I was talking about, that we had the discussion about with Andre. So we take the first axis.
The way this works to collapse the window is to use the reshape function. So I take the array, which is the window. And then I say my first axis is going to stay the same. But my second axis has to be reshaped. So what I do here is I collapse the neurons and time dimensions with this function.
So now after I execute that line, I have the data ready to be passed to the algorithms. And as I said, it needs an array of the form samples by features. So once I have a 2D array-- it has to be 2D. So once I have this array, which is X window here, then I'm ready to cross-validate with that array.
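The collapsing line itself is one reshape (toy shapes): neurons and time are flattened into a single feature axis so the window becomes 2-D, samples by features:

```python
import numpy as np

X_window = np.zeros((542, 36, 50))                  # one 50-sample window
X_window = X_window.reshape(X_window.shape[0], -1)  # keep trials, flatten rest
print(X_window.shape)                               # (542, 1800)
```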
So what I do here is I pass the array, I pass the labels, I pass the scoring function that I defined at the beginning. And if you want to look at it, we can just go back up there. And the scoring function is here, the score for motion and for color. And then we assign the labels 0 and 1 for motion and color, which we already created. You saw how we created the labels for the two arrays.
And then we go all the way down to where we were, here. So this is the cross-validation function back again, here. And then I said here, I want five folds, which we assigned to the [INAUDIBLE] variable before. I don't want to return the train score. So for us, the train scores are useless because they are not out-of-sample scores. So I don't want them to be returned.
And then we assign the n_jobs parameter that I was talking about before, to assign the work to all the cores in my computer. So when I execute this line, the program will automatically scan for the cores of my computer and send the jobs there for each of the folds. One fold's going to be assigned to one of the processors of my computer, just by assigning that to minus 1.
So then what I do here is, out of this, I get a dictionary of values. And the last two values of that dictionary are from the scoring function that I declared before, which is the f-score for motion and the f-score for color. So what I'm doing here is, for the last element and the second-to-last, I want to save those values here, the motion f-score and the standard deviation.
And the way I do this, I average the values there. Because it is the result of the folds. So I have the test score of five folds. So it's five values. So I average over them. And then I also obtain the standard deviation there. And then I save that into these lists: the f-score for the value itself, and then the standard deviation, for the motion and for the color, which are my two scoring functions that I defined before.
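A runnable sketch of this whole cross-validation step, assuming scikit-learn's cross_validate. The scorer names and the toy data are illustrative; the returned dict holds one array per scorer, keyed `test_<name>`, with one value per fold, and the mean and standard deviation over those folds are what get appended to the lists:

```python
import numpy as np
from sklearn import svm
from sklearn.metrics import make_scorer, f1_score
from sklearn.model_selection import cross_validate

# Toy two-class data standing in for one collapsed 50 ms window
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = np.repeat([0, 1], 50)
X[y == 1] += 1.0  # shift class 1 so there is something to decode

scoring = {
    "f1_motion": make_scorer(f1_score, pos_label=0),  # illustrative names
    "f1_color": make_scorer(f1_score, pos_label=1),
}
results = cross_validate(svm.LinearSVC(), X, y, cv=5, scoring=scoring,
                         return_train_score=False, n_jobs=-1)

motion = results["test_f1_motion"]  # five values, one per fold
print(motion.mean(), motion.std())  # what gets appended to the lists
```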
And then what I do here is I just transform the lists to NumPy arrays, because in order to do some matrix manipulation with the results, I need to pass them back to NumPy arrays. The reason why I use lists is because I wanted to append everything into a single array-- those lists are dynamic. At each iteration, there is a new value being pushed into the list. And then at the end, when the algorithm has finished, I just turn them back. I turn these lists back into NumPy arrays.
So as I said before, it's also very important to save all results. So what I'm doing here is just using the savetxt function, part of NumPy, in order to save the results from my simulation. So what I'm about to execute is going to be very fast to show it to you guys. But sometimes, as I said, it can take days for some simulations to run. So it's very important also to save the results so you can later on manipulate them or see what's going on.
So the folder we created before is going to be used to save the score function here and the score STD for the f-score. And then the arrays for f-score for motion and color will also be returned, to keep doing more manipulation later on.
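Saving and reloading with NumPy looks like this (the folder and file names are placeholders):

```python
import os

import numpy as np

folder = "results_demo"  # placeholder results folder
os.makedirs(folder, exist_ok=True)

f_score_motion = np.array([0.51, 0.62, 0.70])  # toy per-window scores
path = os.path.join(folder, "f_score_motion.txt")
np.savetxt(path, f_score_motion)

reloaded = np.loadtxt(path)  # later: reload instead of re-simulating
print(np.allclose(reloaded, f_score_motion))  # True
```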
So this is going to be a function. And as an input, we're going to get the splits of the array windows-- all the windows are here-- the labels, the CV folds, the scoring, the model itself, the linear SVM, and then the folder, and then the area here.
All right. So we have the final function now. So I'm going to execute. And now this function is ready to be called somewhere else. So I can reuse this function anywhere else in my code now that I have defined it. So now let's pass our windows, our labels, our CV, everything else that you see here. Let's execute this. So now we're going to train our model.
As you can see now, it's training. So let's take a look at the values. So what's happening here is inside the function, I have put a print function that iterates over the windows iteration and over the validation values. And that's the output that we see here.
So for window iteration 1, we can see the cross-validation values from 1 to 4. So the first two are the time it takes. So it's always going to return these values. So for our simulation right now, it took 0.5659 seconds to execute the first fold. And also, the second fold was around that time. This is the time it took for each of the folds to be executed. And then this is the time it took to test, once the model was trained. So as you can see, it's very, very small compared to these numbers here.
And then we have the value for the motion and the value for color as well here. And those are the ones that I'm actually saving into my array. So when you saw up there the average, I'm averaging over these scores. And then I'm also obtaining the standard deviation over the folds here.
So if later on you want to test a bigger window or more folds, we can do that as well. But right now, this is the way it goes for all the windows, all the windows here. So we go to the last one. You can see here, the windows iteration is number 15. So this is the last one, because I think it starts at 0. Yeah. Yeah. Here are the 16.
So here are the 16 window iterations, now done. So as I have defined before, it returns the values that I wanted as NumPy arrays. So I have the score for motion. I have the score for color. And then I also have both the standard deviations for motion and color as an output. So I can print the results. So this is, for each of the windows, what was the score on each of the windows here, and the length, the 16, which is the number of the windows that we have.
Any questions so far? Because now we're going to plot the results, so you can see how it looks. But if you have any questions, let me know.
So here, we have a plot function. So in order for you to see something, I'm just going to go and run it. But basically, it takes the mean and the standard deviation to plot the values that you're going to see here. So I use plot to just generate the average waveform, and then the fill_between function fills in between the standard deviations of the values, by passing the standard deviation.
And I also have to pass the color. So for example, green is here. And then green has to be here for both of them. And then I have magenta here and magenta here for motion and color. And there's more details about the grids, the labels, but let me just execute so you can see how it looks.
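A sketch of that plot: the mean curve with plot, and a standard-deviation band with fill_between, in one color per condition. The score curve below is synthetic, and the Agg backend keeps it headless:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

time = np.linspace(50, 2500, 16)
mean = 0.5 + 0.2 * np.exp(-((time - 700) / 400) ** 2)  # toy f-score curve
std = np.full_like(mean, 0.05)

fig, ax = plt.subplots()
ax.plot(time, mean, color="m", label="motion")  # average waveform
ax.fill_between(time, mean - std, mean + std,
                color="m", alpha=0.3)           # standard-deviation band
ax.set_xlabel("time (ms)")
ax.set_ylabel("f-score")
ax.grid(True)
ax.legend()
fig.savefig("decoding_over_time.png")
```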
So now we can finish and see our plot. So this is the way it looks for our model and for the LFP data. So we have first, before the cue right here-- I can use the pointer here. So before the cue-- actually, the cue exists at random, right? So there is no information about motion or color at this point because the monkey's just fixating at one point. So he doesn't know anything about anything.
So then I start showing the cue here. And as you can see here, there's a sharp increase. And then it starts going down and going down. And then once the stimulus is shown, I'm able to decode more of the motion and color information. And this is for one of the areas.
So for example, one of the research interests that I have is to investigate how these waveforms behave across areas, or what the [INAUDIBLE] will be like, the delays, or how much decoding I get. So that's something that you can do with this type of analysis.
But yeah. So this is the plot that I get for PFC. And now I can move on to V4. But since we already generated our function for cross-validation and to generate the plots, it's just calling the functions and putting the array there. Because it's already defined there.
But do you have any questions at this point?
AUDIENCE: So from here, you use 50 data samples for each window. Is there an optimal number? Does it matter if you have fewer samples per window?
OMAR COSILLA REYES: I think traditionally, people in the neuroscience field tend to use very small windows. So for example, they use a 1 millisecond window. And I feel that that's pretty slow. But that's very precise. And you can actually decode information with 1 millisecond.
AUDIENCE: Well, it'll depend on the problem, right? So if you're Michael T, it'll be 1 millisecond. If you're people like us, doing virtual neuroscience, 15 milliseconds is a good time window. So it depends on your problem. And that will change the training time, et cetera, based on how many windows [INAUDIBLE].
There is no fixed number, as such. You can try a broad number of different-- you can go from 50 to 100 and see if the system is performing any better.
OMAR COSILLA REYES: Actually, that's a good question. So for example, something we can do right now is to take a smaller window and see what happens once we're done.
AUDIENCE: Sometimes, if the window is too narrow, you have maybe more like three data points per window. Will that give you an accurate performance at all, after training?
OMAR COSILLA REYES: So if you have 1 millisecond, what happens?
AUDIENCE: For example, in your case, you have 2,500 milliseconds. Assuming you have 2,500 data points, every 1 millisecond, you have one data point. If you narrow down that window to 1 millisecond, which means every window, I only have one number. So then when you're training the data using that one number window, will give you more accuracy? Because you don't have enough data per window.
OMAR COSILLA REYES: Well, I think that's a good question. But what happens is that what you're passing to the algorithm is not just one point. So you have one point, but then you have all these samples.
For example, if I have 200 samples, I can take one, two, three, or 500 points as features. So what the algorithm is going to do is just look at the relationship between the samples. Because you have, say, 1,000 samples and then you have 1,000 features. So then the model will try to exploit that feature across all of them.
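To make that concrete, here is a minimal sketch, with made-up array sizes and a hypothetical `window_features` helper rather than the tutorial's actual variables, of how a time window across units flattens into one feature vector per trial:

```python
import numpy as np

# Hypothetical shapes: 4 units, 100 trials, 2,500 one-millisecond time points.
rng = np.random.default_rng(0)
data = rng.poisson(0.1, size=(4, 100, 2500)).astype(float)

def window_features(data, start, width):
    """Flatten one time window into a (trials, units * width) feature matrix."""
    n_units, n_trials, _ = data.shape
    win = data[:, :, start:start + width]            # (units, trials, width)
    return win.transpose(1, 0, 2).reshape(n_trials, n_units * width)

X = window_features(data, start=500, width=50)
print(X.shape)  # (100, 200): 4 units x 50 ms = 200 features per trial
```

With a 1-millisecond window the same call yields only 4 features per trial, which is the situation the question is describing.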
AUDIENCE: So basically, it's just determining how long it takes to run [INAUDIBLE]
AUDIENCE: Not necessarily. Because the example that he gave at 4 neurons and 50 milliseconds, it's 200 features. What you're saying is now four features. Right? That's what you're saying now.
AUDIENCE: Yeah.
AUDIENCE: The system might just perform so badly that you can say that at that level, you don't have a good classifier. That's another way to test the limits of the classifier, and at what point you start seeing the system finding representations that are useful. So you can do a sweep, going from 1 millisecond all the way to whatever is optimal. Probably, it's like a U curve of some sort. And then you can penalize. You can do so many other things.
AUDIENCE: The overall shape should remain the same, though, especially for a linear classifier.
AUDIENCE: Yeah. But if it's just one feature--
AUDIENCE: Any one given feature would be, on average, poorer. But the shape should be approximately--
AUDIENCE: Yeah. But that's another way we evaluate whether the model is the right model.
AUDIENCE: Sure.
AUDIENCE: So if you're getting a U curve or something like that, then you know that the linear classifier is not the way to go. You'll have to do some other modification.
OMAR COSILLA REYES: Yeah. So for example, another way to approach your question is what I'm interested in. So for example, here, I have a really specific time when the cue is happening. So for example, between 500 milliseconds and 600 milliseconds, that's exactly when the cue is shown. So that time frame, I think, is very important for me. Because I want to know what happens in that time window.
So I want to have some resolution there. But as we have seen here, once that has passed, I might not be so interested anymore, because the stimulus has been shown. For example, I think a good answer for you would be: take it as far as you can get. Because as I said, the smaller the window, the harder it's going to be to train the models. If you're decreasing the window size, you're increasing the number of models that you have to train, basically.
AUDIENCE: This brings back Andre's point, in a way, which is that unlike traditional machine learning settings where people assume they don't know anything about the data-generating process, here we kind of do have some handle. Because as you said, 100 milliseconds after the cue, I'm going to see my stimulus coming. And we know very well that in PFC or V4 you're going to see a spike in the data, or not, showing the selectivity and all that stuff.
So it kind of matters, in a way, to say we don't care about the first few milliseconds, or the next 100 milliseconds, or whatever. And if I don't see a rise in the firing rate and all that stuff, then it won't be reflected in terms of selectivity for color or motion.
OMAR COSILLA REYES: Yeah.
AUDIENCE: So you can use the structure that you already have and the experiment that you have set up. And you know something about these neurons. These machine learning systems don't care about those things, but you can impose it. It might be easier for you to pick what you think is the right time window or feature size than to just run a sweep from 1 to 800 or 2,500.
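As a rough illustration of the sweep idea discussed above, the sketch below scores a linear SVM at a few window widths. The data, the injected 500-700 ms selectivity, and the `score_window` helper are all synthetic and invented for this example; it is not the tutorial's actual pipeline.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the recordings: 4 units, 120 trials, 2,500 ms,
# with weak class-dependent activity injected 500-700 ms after trial start.
rng = np.random.default_rng(1)
n_trials = 120
y = rng.integers(0, 2, n_trials)                 # e.g. two motion directions
data = rng.normal(0.0, 1.0, size=(4, n_trials, 2500))
data[:, y == 1, 500:700] += 0.5                  # injected selectivity

def score_window(width, start=500):
    """Cross-validated accuracy for one window size at a fixed start time."""
    win = data[:, :, start:start + width]
    X = win.transpose(1, 0, 2).reshape(n_trials, -1)
    return cross_val_score(LinearSVC(max_iter=5000), X, y, cv=5).mean()

for width in (1, 10, 50, 100):
    print(width, round(score_window(width), 2))
```

Wider windows pool more of the injected signal, so accuracy generally rises with width here; on real data the sweep would trace out whatever curve the recordings actually support.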
AUDIENCE: Thanks.
OMAR COSILLA REYES: OK. Any other question, guys? OK.
So I think we have been through the basics of what's going on in the code. So from now on, we're going to take these blueprints that we have developed. We have a cross-validate function that can take some window and some labels, and then generate scores based on the time window. So we're going to reuse those functions for the rest of the tutorial.
This is going to go very fast now. So now I change the area to V4. Then I generate a new folder. And let's put linear here as well, for five folds, and a time window of 50.
And then I generate my folder. And then I do exactly the same as what I've done before. Everything that we've done before is going to be exactly the same. The only line that changes is the area here. And I'm going to show you where that is reflected. And it's reflected right here.
So what we're doing here is, based on the electrodes of one area, selecting the data from the electrodes that belong to that area. And then I select that from the dataset. And then I do all these repeats again for motion and color. And then I have my split area window here, my time. I think I need to change this to 50 again, because that's where our first prediction comes, at 50 milliseconds.
And so once the processing has been done, I can execute that cell here. So everything has been generated again with our split area window array, right here. And then I can just call the scoring function that I defined up there, which takes my windows, the folder, the CV, the area. And then I can run again. The big difference is it's exactly the same blueprint, but I have designed it in such a way that I can just change the area label, rerun the algorithm, and obtain my accuracy scores, but now from another area, following the same procedure that I showed you before.
So it has finished iterating over the windows and over the folds. So I have my results returned here, the f-score for motion, color, the standard deviation. And I can go and plot now for V4.
So if you want to go back to your folder and look at the folder we created-- so for example, one is for PFC-- you go inside. And you can see these nice plots there. Because I have saved it in the plotting function. And if you want to open the scores, you can see here the 16 windows here.
So here are the scores. You don't need to run it again. You can just pull the data up from these files, including the standard deviation here. Say you create a new program and say, OK, I want to plot it again. You can just load these files without the need to run things again.
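The save-once, replot-later pattern being described can be sketched like this; the folder and file names below are invented for the example, not the tutorial's actual naming:

```python
import os
import tempfile
import numpy as np

# Stand-in for the results folder the tutorial creates per area/classifier.
folder = tempfile.mkdtemp()
scores = np.linspace(0.5, 0.9, 16)      # e.g. f-scores for the 16 windows
np.save(os.path.join(folder, "scores_motion.npy"), scores)

# A later script can load and replot without rerunning the cross-validation:
loaded = np.load(os.path.join(folder, "scores_motion.npy"))
print(loaded.shape)  # (16,)
```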
There, you have your image. Then you go back. And then for V4, so we can put them one to one. So the one on the left is V4 and the one on the right is PFC. And those are the quality of the decoded predictions that we can get from either PFC or V4 using just the raw LFP.
And so as I said, the next step will be spikes. But I have built this framework in such a way that the functions we analyzed just now will be reused again, but now for the spikes. Because I can put the array in the same shape as the LFP, and then run again all the blueprints that I generated before.
But any questions so far? OK. So let's just go to spikes very quickly now. So we go to the notebook here, go back, Spikes. Run Spikes. All right.
So it's the same thing, same libraries here, same scoring function here. The difference is in the data. So now, we are calling the spikes here, spike data and spike labels. And now it's very interesting here, because we have the same three-dimensional array that we had before. But now the first axis is neurons, not electrodes. That's the big difference with the spikes.
So each of the elements in the first axis is one single neuron, instead of the LFP channels that we had in the previous notebook. And we have the same number of trials, because as I said, it's from the same session. So we obtain a wideband signal, as I showed at the beginning. And from that wideband signal, you can extract either the LFP or, by spike sorting, the spikes.
AUDIENCE: But this is based on a continuous measure of the firing, right? This is the Gaussian smooth [INAUDIBLE].
OMAR COSILLA REYES: Yes. I'm going to show it in a second. So here, we have the neurons, the trials, the features. We check again whether there is any problem with the data. Does it have any NaNs? No. OK.
So then, we have our labels here, again, PFC and V4. Actually, what happens here, guys, is-- I'm going to be honest with you. PFC in the original dataset had around 100 neurons. And I decided to cut it down because it was too big and I couldn't show you the demo here, and to make it a little bit more equal.
But there were more neurons, in this case. I reduced it for you guys to be able to follow the tutorial. But it doesn't matter, because we now have the same numbers in both PFC and V4.
So again, we now load the motion data. So we have loaded the color data here, and now the motion data right here. And now, to address Andre's question, we can see here the shape of the waveforms, right? So we plot the spike data here. It's just the spike rate. And those are the figures that I showed at the beginning of the tutorial.
And then we can play with the function as well. So for example, if I want neuron 4, then I want five trials instead of 10. So we can see here. And here's another example of selectivity. For example, for neuron 4, we can see that there is more PFC activity than before. Let's just select another one so you can see. Number 1, so it's one. One and one, and then I can plot 20.
And as I said, all this code is going to be available for you guys. So feel free to try with your own data. And if you have any problems, just send me an email and I'll try to see how I can help you. But the thing is, I hope this kind of code and this kind of approach can really help you to start to think this way to solve your neural decoding problems, or anything that you want solved.
So here, we just plot the sample data. But the rest of the things are exactly the same. So we have the area, and then we change the name of the folder to Spikes. We generate the folder. We generate the linear SVM. We check the original shape. And then we select only for that area. This is the key line, because we are just changing the area.
For example, in the analysis that I'm doing, I'm not only iterating over these two. I reuse this code in a very big for loop that iterates over all the areas. I have the six areas I showed before. So I can just change this variable, and then dynamically extract what I want from the array.
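The big for loop over areas might look roughly like this; the area-to-channel mapping below is invented for the sketch, whereas the real analysis would read it from the session metadata:

```python
import numpy as np

# Hypothetical mapping from area name to channel indices (made up here).
areas = {"PFC": [0, 1, 2], "V4": [3, 4, 5]}
data = np.arange(6 * 10 * 50, dtype=float).reshape(6, 10, 50)  # (channels, trials, time)

shapes = {}
for name, idx in areas.items():
    area_data = data[idx]        # dynamically slice this area's channels
    shapes[name] = area_data.shape
print(shapes)  # {'PFC': (3, 10, 50), 'V4': (3, 10, 50)}
```

In the real pipeline, the body of the loop would then call the same cross-validation blueprint on `area_data` for each area in turn.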
Then we keep going. We keep going. We keep going. We keep going. We keep going. We have done this before-- split the area windows. The windows are here. So now we have the trials here for the split area windows. We have the neurons instead of the LFP. And now we have the same window of 50 milliseconds.
And we keep going. Oh, sorry. Here, it has to be 50. The time vector for our predictions has to start at 50. Then we run the prediction again, iterating over the classification function. So this is the cross-validate function that we examined before.
And then here, the results are being generated. It's still executing here. As I said, it takes some time. But now our results are ready. I can print here the results. And now we can generate our beautiful plot, but now for the spikes.
So we follow the same approach. So just to summarize what I did: I did the same as with the LFP, but I have switched the electrode axis to the neuron axis. That's everything that has changed here.
Obviously, I had to do some preprocessing before. Spikes are more complicated than the LFPs, because you need to calculate the spike rates per trial. So just imagine running a function for each of the trials of all your sessions, which is going to take some time. So it takes more time for preprocessing. But here, you can see the decoding accuracy.
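One common way to compute that per-trial rate, roughly in the spirit of the Gaussian smoothing mentioned in the question earlier, is to bin the spike times and convolve with a normalized Gaussian kernel. The bin width and kernel sigma below are assumptions for the sketch, not the tutorial's actual values:

```python
import numpy as np

def spike_rate(spike_times_ms, t_max_ms, bin_ms=1, sigma_ms=25):
    """Bin spike times at bin_ms resolution and smooth with a Gaussian kernel.
    Returns a rate trace in spikes per second; sigma_ms is an assumed width."""
    edges = np.arange(0, t_max_ms + bin_ms, bin_ms)
    counts, _ = np.histogram(spike_times_ms, bins=edges)
    half = int(4 * sigma_ms / bin_ms)                 # kernel support: +/- 4 sigma
    t = np.arange(-half, half + 1) * bin_ms
    kernel = np.exp(-t**2 / (2.0 * sigma_ms**2))
    kernel /= kernel.sum()                            # preserve total spike count
    return np.convolve(counts, kernel, mode="same") / (bin_ms / 1000.0)

rate = spike_rate(np.array([510.0, 540.0, 555.0, 600.0]), t_max_ms=2500)
print(rate.shape)  # (2500,)
```

Running this once per neuron and trial across every session is exactly the repeated work that makes spike preprocessing slower than the LFP path.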
Now we go into the other area. And there we go, guys. So that's for V4. And then if we go back to our folders, it should be now [INAUDIBLE] decoding inside here. It should be the result right there. So we just open the folders. Here is PFC. Here is V4.
So here are PFC and V4 now, but with the spike rate. So if you want to compare the LFP and the spikes, we can do that. You can just open the image for the PFC. You can see it here, right?
So on the right is the LFP. And on the left is the spike rate. So that's the type of decoding that we get out of the signals we just input into our models that we have created in real time. Any questions so far, guys? Yeah.
AUDIENCE: So you used a specific decoding algorithm, right? What happens if you make a different choice of decoding algorithm from Scikit-learn?
OMAR COSILLA REYES: Probably, you're going to get a different type of decoding prediction. I think the main differences appear when we change the family of the algorithm. For example, if I stay within the family of ensemble algorithms, I'd probably get a similar shape of decoding, just a different level of accuracy. And if I stay with the linear classifiers, the same kind of curve, but different accuracy levels as well.
So the more important thing here, I think, is the family of algorithms that we're taking the model from. Yes?
AUDIENCE: Just a general question-- so you made the point that you recommend Python. But is there any reason to not use MATLAB for these types of analyses? Everything we did today can also be implemented using different packages.
OMAR COSILLA REYES: Well, that's a good question. So for example, in Scikit-learn, I can just change the model here. Everything you can see in Scikit-learn, I can plug in here. For example, I can take the linear SVM here, comment it out, and then run with the extra-trees classifier instead.
So those are two completely different algorithms, because one is an ensemble and one is linear, right? Two completely different things. But the way this library is organized, you can call exactly the same methods on the two algorithms. So you can just plug and play different parts of your pipeline.
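A small sketch of that plug-and-play property, on synthetic data: the same `cross_val_score` call works unchanged whether the estimator is a linear SVM or an ensemble of extremely randomized trees, because both expose the same fit/predict interface.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Synthetic problem: labels driven by the first feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Identical cross-validation code for two very different model families;
# only the estimator object changes.
means = {}
for clf in (LinearSVC(max_iter=5000),
            ExtraTreesClassifier(n_estimators=100, random_state=0)):
    means[type(clf).__name__] = cross_val_score(clf, X, y, cv=5).mean()
print({k: round(v, 2) for k, v in means.items()})
```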
So for example, here, I can just pass this in, and then the cross-validation function is going to look for a fit method that's going to fit the data. And that method exists in both of them. I don't know if anybody here with MATLAB experience knows if that exists in MATLAB, but I personally don't know.
AUDIENCE: They're getting there.
[LAUGHTER]
No, I'm not joking. Especially with neural net stuff, they're getting there. They make it pretty simple.
OMAR COSILLA REYES: Yeah. Well, another problem I have with MATLAB personally is that these datasets-- for example, we played with a very small dataset right now. But the sessions in this neural data are very large. I have one that is somewhere around 17 gigabytes. And in Python, that session takes me-- I don't know-- suppose it takes me 5 minutes to load. If it takes me 5 minutes in Python, it would take me 20 minutes in MATLAB, just to load the file.
Because that's actually what I do. My lab has all the files in MATLAB. So I first have to load the MATLAB file, to extract some information or anything, and then I have to put it into Python. So for me, that's the first problem I have with MATLAB. The loading times for the files are huge compared to Python.
AUDIENCE: Another question-- just practically, about how you have thought about putting sessions together. So you just mentioned that you have lots and lots of sessions. So would you get separate model fits for every session, and then average across sessions? Or would you take all of the data into consideration?
OMAR COSILLA REYES: That's a good question. For me, the best way to go is to take all the sessions together into a single model.
AUDIENCE: Into a single model?
OMAR COSILLA REYES: Right.
AUDIENCE: How would you do that?
OMAR COSILLA REYES: We're working on it. As I said, there are two main problems with that approach. One is the difference in the lengths of the arrays. If I want to concatenate two sessions, it's not as easy as just concatenating two axes, because they have different lengths. There are different neurons and different numbers of trials, right? So I need to find a way to either cut, or fill with zeros, or something. I don't know.
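One crude option for that length mismatch, zero-padding the neuron axis so sessions align before concatenating trials, could look like the sketch below. The shapes are invented, and as the discussion notes, resampling or fitting per-session models are alternatives:

```python
import numpy as np

def pad_neurons(session, n_max):
    """Zero-pad the neuron axis so sessions with different unit counts align.
    session: (neurons, trials, time)."""
    n, t, s = session.shape
    out = np.zeros((n_max, t, s))
    out[:n] = session
    return out

a = np.ones((30, 100, 50))   # hypothetical session with 30 units
b = np.ones((45, 80, 50))    # hypothetical session with 45 units
n_max = max(a.shape[0], b.shape[0])
merged = np.concatenate([pad_neurons(a, n_max), pad_neurons(b, n_max)], axis=1)
print(merged.shape)  # (45, 180, 50)
```

Note the padded rows carry no information, which a downstream model has to tolerate; that is one reason this is only one of several options.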
AUDIENCE: You've got to resample it, right?
OMAR COSILLA REYES: Yeah, some technique to resampling. I don't know. I have to think about it.
AUDIENCE: You just use it in certain ways.
OMAR COSILLA REYES: So suppose that I'm done with that problem. The second problem is RAM. Because now I have to fit all the sessions into a single model in real time. So first of all, computers have 128 gigabytes as the limit. I don't know if we have access to a cluster that has more RAM than that. But that's my personal limit.
AUDIENCE: A lot of these steps are independent. If you're running a cross-validation, each fold is independent. So you could read that from memory. Each time is independent. You don't need all the time all the things in memory.
OMAR COSILLA REYES: Yeah. For example, just consider one fold. In one fold, I have to fit one fifth of the data, or something like that.
AUDIENCE: But there's also the time dimension. You can cut that into 1/20 and one fifth. And now you have 1/100 of the data in independent samples.
OMAR COSILLA REYES: Any comments, Harry, about this?
AUDIENCE: Two orders of magnitude reduction is not too bad, right?
OMAR COSILLA REYES: Yeah. Do we have any recommendations of how to optimize the--
HARRY: So I want to speak on behalf of the deep learning community. Typically, if we want to train a neural network with a lot of data, we don't actually load all the data into memory unless we're actually using it, because we're training in a batch-by-batch fashion.
We only need the batch to be in memory when we need it to be. So we just load the data on the fly, instead of loading all the data into memory. But it's true that we don't need all of the samples from those 2.5 seconds. We just need 100 of them.
But that's going to depend on the setup. For example, if you have some other set of experiments that would require you to load 1 second of the 2.5 seconds at a time, then you're in trouble. So that's how we do it.
OMAR COSILLA REYES: Yeah. So as Andre and Harry have said here today, there should be a smarter way to load partial data into memory and then, as I say, save your results and combine everything together. So that's another approach to that.
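A minimal sketch of that partial-loading idea using NumPy's memory mapping: the array stays on disk and only the trials a given fold needs are materialized in RAM. The file layout and fold indices below are made up for the example:

```python
import os
import tempfile
import numpy as np

# Save one (channels, trials, time) session to disk as a stand-in.
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "session.npy")
np.save(path, np.arange(4 * 10 * 100, dtype=np.float32).reshape(4, 10, 100))

data = np.load(path, mmap_mode="r")          # nothing read into RAM yet
fold_trials = [0, 2, 4]                      # e.g. the test trials of one CV fold
batch = np.asarray(data[:, fold_trials, :])  # only these trials are materialized
print(batch.shape)  # (4, 3, 100)
```

Because each fold and each time window is independent, this kind of lazy slicing is enough to keep only a small fraction of the data in memory at once.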
So just to wrap up, guys, this code is online. You can play with it. You can change the time windows. You can change the classifier. You can play with a lot of things here. So I have made it in such a way that you can just change parameters and see what happens.
For example, when we played with the neuron plotting here, you can just change the neurons here, and the trials here. And you can see the change here. So almost all the code is like that. So you can actually take your data, try to put it in the same format as here, and then get your decoding done.
So unfortunately, we didn't test anything else. But I encourage you to try that. For those people that are using MATLAB, there is something called the Neural Decoding Toolbox. So please, go and check that out, because it has been developed by Ethan Meyers. If you go to that link there for the CBMM tutorial, there is more information about decoding. And he has also explained his research. He's doing a lot of research on decoding. So check his papers. Check those videos there. And if you want to check MATLAB, just go to those links.
And also, if you felt that the Scikit-learn part here was too heavy for you, one of the initial developers of Scikit-learn just released a higher-level library that's easier to deal with. It's called the Data Analysis Baseline Library. So you can check that out as well, if you felt that Scikit-learn was too heavy or anything. But for me, I think Scikit-learn is easy enough.
But anyway, guys, thanks so much for coming. We have 5 minutes for questions, if you have any. But thank you.
[APPLAUSE]
AUDIENCE: So I do have a question, [INAUDIBLE] choosing the right model. So usually, you end up with perhaps more neurons than you really need. Although, I don't understand what that means. So you have a lot of neurons, and you're trying to figure out: do I need all of them? Or, how can I pick and choose which ones I want to use? Or, say I come up with a more complicated model than I actually need to fit the data. How do I, in a systematic manner, understand how good the fit is, compared to how many parameters I needed to fit? Do you get what I'm saying?
Well, what if you have a 30-parameter model versus a five-parameter model? Even though the five-parameter model is doing 5% worse than the 30-parameter model, you could perhaps make a case that the five-parameter model is better. So is there a systematic way to analyze that using Scikit-learn, something like the information criteria that people use, the [INAUDIBLE] or the Bayes? Is there a systematic way of doing that here?
OMAR COSILLA REYES: You mean to check how the parameters affect the model and how that affects accuracy, or the ability to decode?
AUDIENCE: Yes. So you can imagine adding more parameters to your model is going to make your model potentially better, as long as you have enough data. So how do you systematically evaluate how good the model is, given the number of parameters you put into the model?
OMAR COSILLA REYES: Well, I think that's a good question. But just to step back from what you said: for example, let's suppose that I just evaluate the accuracy. Let's just stay with accuracy. So one of the problems I've had in my research is that, for example, in the graph that I showed, when the monkey is fixating, I should not be able to decode anything, because it's the fixation period.
On some of the models I've run, the accuracy goes to 80% there. I'm like, OK. So I'm able to decode here? Why? So that's a clear example that the model is overfitting and not really understanding your data. So that would be an example of how you can debug these kinds of questions.
So I have this very fancy model that can take all this data, and it has all these parameters. But then once you test it on a part of your experiment where you're sure you shouldn't be able to decode anything at all, and it's able to decode some stuff, I think that's where you can say--
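One systematic version of that sanity check is a label-shuffling control: scikit-learn's `permutation_test_score` compares the real cross-validated score against scores obtained with permuted labels. On pure-noise "fixation period" features, synthetic here, the score should hover near chance:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import permutation_test_score

# Synthetic fixation-period features: no label information by construction.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 40))
y = rng.integers(0, 2, 100)

score, perm_scores, pval = permutation_test_score(
    LinearSVC(max_iter=5000), X, y, cv=5, n_permutations=50, random_state=0)
print(round(score, 2), round(pval, 2))  # score should sit near chance (0.5)
```

If the same check on real fixation-period data returns a score far above the permutation distribution, that points to a leak or an overfit pipeline rather than genuine decoding.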
AUDIENCE: There are no other reasons, right? That would mean that [INAUDIBLE] you could predict what kind of trial it is even before the trial starts. And that's concerning for other reasons. But if you average enough, as long as there's no information in the brain, no matter what you do, you shouldn't be able to get it out on the other side. Do you agree?
OMAR COSILLA REYES: Yeah.
AUDIENCE: In theory, as long as you're randomizing the trial properly--
OMAR COSILLA REYES: Yeah. Suppose that the pipeline is correct.
AUDIENCE: Right, assuming all of that, you should not be able to extract information from it. But I'm going a step beyond that. Once you have a family of models, which model should be the canonical "best model?" It could be a simple linear regression model that has fewer parameters than a tree classifier. So how do you choose that?
OMAR COSILLA REYES: Yeah. It depends what you want to get out of it. For example, if you want to explain something about the model, then you should stick to linear models. If explanation of the model doesn't matter, then you can use all these deep neural nets and all these things.
So for example, I went to my supervisor and I told him, I have this model and it's working. And he told me, so can you explain the model to me? Can you explain to me what's going on? I could not, because it was a deep neural net. So how can I make any inferences about that? You can try with some attention mechanisms, but that's more complicated.
But if what matters to you is interpretability, you should always stick to linear models. And then if the information you want to decode is what is most important, then you can start looking at other kinds of families. But the rule of thumb is always to start small and very simple. And then if you cannot solve it with that, you start looking at other models or other things that can solve your problem.
All right, guys. Thanks so much for coming.
[APPLAUSE]