AUSTRALIAN FRONTIERS OF SCIENCE, 2008
The Shine Dome, Canberra, 21-22 February
Classification using gene expression data
by Dr Ian Wood
Ian Wood is a Lecturer in Statistics in the Department of Mathematics at the University of Queensland. He completed a PhD in 2004 at the University of Queensland on learning in Boltzmann machines. He then worked for three years as a research fellow at Queensland University of Technology on problems in statistical genomics and genetics and Bayesian statistics, including collaborations with researchers at Genetic Solutions Pty. Ltd. and the Queensland Institute of Medical Research. He then returned to the University of Queensland, working on the analysis of gene expression data in the Institute for Molecular Biosciences. His interests include analysis of gene expression and other genetic data, discriminant analysis, Monte Carlo methods, mixture models and machine learning.
Today I am mainly talking about gene expression data, and the classification of that, particularly for cancer diagnosis and that sort of thing.
I will talk about gene expression and microarrays, which is the most common way it is measured en masse, at present anyway, and classification tasks you can do with it. Dominating my interest in it is estimating error rates. What you want your classifier to do is to put new data into classes that you have defined, with low error in a prediction context. You want to be able to do that accurately, and you want to know how well your classifier is likely to work in that sort of setup.
I am going to talk a little bit about variability, which is sacred to statisticians: we should not give estimates without some measure of the variability on them. And I will just talk about a couple of biases that can creep in, mainly due to the high dimensionality.
With regard to gene expression (here the biologists will have to forgive me a little bit), my understanding is that humans have around 20,000 protein-coding genes. These are coming out of the DNA and producing a protein through a sequence of steps as shown here. You would notice that along those steps the DNA is getting transcribed; some mRNA is floating around and is heading out into a ribosome and getting translated into protein. The bit where we can measure gene expression most easily is at the RNA level, so the messenger RNA can be intercepted and converted into something more easily measured by using a microarray.
To do that, you have to convert the messenger RNA into something a bit more stable, cDNA, which survives a bit longer. It has got the same information in it; it is just a more stable chemical complement.
A microarray is basically, say, 20,000 spots, maybe even more, each in a little well that contains material representing each gene. For example, if you wanted to do a microarray across the human genome, you would have a spot for each human gene and a representative DNA probe attached to the glass underneath, I believe. What you would then do is to take a tissue sample from a tissue of interest, and squirt some of that into every one of these wells. (First of all you would have converted it from mRNA to cDNA and attached fluorescent markers.) Then the cDNA from your tissue sample will be hybridising to the representative probes in each well, and they will be keeping some of that fluorescent marker attached. Eventually you will clean the slide, wash out a lot of the stuff, and then you will basically be measuring what level of hybridisation has happened, using lasers. You will shine a laser on this, get some fluorescence and measure that, and take photographs.
At the right of this slide you see a couple of photographs taken in different situations. We are looking at the relative levels and that is producing colours, but a lot of data I am talking about is really just one colour and the brightness of the spot is a proxy for the level of gene expression. It is not a complete linear match, but it is close enough in a lot of cases.
Why would you want to measure gene expression and do something with it? There are a lot of medical tasks that may have something to do with gene expression levels in cells, such as diseases that are difficult to diagnose, like breast cancer, especially early on. You may be able to find some sort of pattern in a few of the gene expression levels which lets you classify people into those with breast cancer or not, or those who have a good prognosis or not, over the next few years. Or you may have a known tumour but you are not quite sure what type it is and you want to put it into one of a few classes. So the classes are really human-defined, and they need to be things that humans can determine the class of.
You may not know whether someone is going to get cancer later on but you have made a measurement, and then you can wait and see whether they did or not. And then you would get your answer. Eventually you want to be sure which class they ended up in. Once you have done that for a few cases you can build up a training set and build a classifier and see how it goes. And for new data, you can just do the measurement of, say, the gene expression, using a microarray or something similar, and you should be able to get an instant prediction from that.
The question is: is that going to be good enough in quality? Hopefully, it will be cheaper than waiting or getting the clinicians to do an in-depth analysis.
So that is the goal, and it is quite dependent on the level of pricing of the microarrays.
There are other clinical indicators that may be more useful in some circumstances.
Gene expression levels: we don't really understand what all these genes are doing. You would just hope that there is some sort of signature, as people tend to call it, in, say, a cancer or one tumour type versus another, that is detectable if you have seen enough data and you know which class that data belonged to.
Microarrays: just to do one sample typically costs about $1000, including the use of reagents and labour. But that is falling, so there is hope that this will get a lot cheaper with time. The problem is that that sort of cost limits the statistical samples available. There may have been only about 100 tissues measured, or sometimes fewer. But the number of genes being measured in each observation is, say, 20,000, or it can be much higher, say 60,000, if you have a few probes per gene. That is the 'large p, small n' problem, p being the number of different predictors (the gene expression levels) and n being the number of observations. That is not very amenable to standard statistics, so a lot of the work in analysing this is really just to cut down the number of predictors you are going to use, so that you can use more standard methods. And that has some risks.
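To put that in symbols (my notation, not taken from the slides), the data matrix is very wide and short:

$$
X = \begin{pmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{pmatrix} \in \mathbb{R}^{n \times p},
\qquad n \approx 100, \quad p \approx 20{,}000 \text{ to } 60{,}000, \quad p \gg n,
$$

so a classifier with one or more parameters per gene has far more parameters than observations unless the dimensionality is reduced first.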
Other than prediction, you may also be very interested in which genes were picked up by a classifier. So if you had 20,000 genes to start with and you have cut it down to, say, seven to 20 or something like that, and you get a good error rate, that means that those genes are discriminating well between the classes of interest, so probably those genes are involved. There may be other ways to find that out, but that may be worth further investigation. Sometimes biologists are more interested in just what can discriminate between classes, rather than how well you can do it for prediction.
We assume that we have that sort of data and that it has been carefully and correctly labelled by clinicians, and we just want to build a classifier and minimise the error rate for prediction of the class, which could be 'cancer or not', or tumour type. And then we also want to estimate what the error rate is for those predictions.
For the standard setup I have also got an equation. It is a bit like Rob Hyndman's, but slightly different.
We assume that the observations have been collected at random from a population of interest that we also want to apply the classifier to in future. Where the data have been collected in one situation and the classifier is going to be applied in a different one, it may not be as effective, so you always need a reasonable design that matches where you have been collecting the data to where you are going to use the prediction.
We have assumed that the data distribution is fixed and that there are no time series effects or dependencies between observations, that these are independent.
We assume that the data is coming from a mixture distribution. Say we have got just two classes, breast cancer and not: then we can assume that the breast cancer data is coming from one probability distribution and the gene expression data for the people who have not got breast cancer is coming from another distribution. We don't know those distributions, but we can at least talk about them.
So these are the f_g terms here. There would be just two classes in that case, so f_g is the probability distribution specific to that class g. There is a weight on those, which is the frequency with which that particular class occurs: if you were measuring in a clinic where people were coming because they were worried about breast cancer, the people with breast cancer might be fairly common, but in the general population that frequency is not so high.
Anyway, all the observations put together are drawn from the sum of those weighted distributions for the two classes.
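The equation on the slide is not reproduced here, but the standard mixture form being described is

$$
f(x) \;=\; \sum_{g=1}^{G} \pi_g \, f_g(x), \qquad \sum_{g=1}^{G} \pi_g = 1,
$$

where $f_g$ is the class-conditional distribution of the gene expression vector and $\pi_g$ is the class proportion (here $G = 2$).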
As I said, we don't know the true class distributions, but at the top left here is a case where you have got data from two classes, you know the class labels, and this is a sample size of around 200, which is about as good as you can hope for with microarrays. The black circles are of one class and the green ones are of another. A straightforward model of each class independently is just multivariate Gaussian, or bivariate Gaussian in this case. This is just a pair of gene expression levels; how we get to that sort of pair I will talk about later. If we just got down to two genes we could have the sort of data shown at the bottom left here, and we could do a normal model, basically an ellipse (well, it is more than an ellipse, it goes on forever), but at a certain level of probability density its boundary overlaps with the other one.
In the plot at the top right it is a bit easier to separate the two classes. (I am afraid these are artificial examples.) But these normal distributions would go forever as well, and so there is overlap.
The question really is that in classification you are not too worried about modelling these; you are more worried about where you should put the line that separates them. Here at the bottom right it is pretty easy: anywhere in the middle will do. At the bottom left it is a bit harder: there is a line that can be defined where the probability of being in either class is 0.5, and where that decision boundary goes depends on the type of modelling you have done.
Some classifiers just focus on where to put that boundary, and some focus on modelling and then derive the boundary from it. In this instance, anything on the right side of the boundary in a classifier would be put into the green class, and anything on the left side would be put into the blue class. So that is a sort of classification task.
We never really find out the true distribution, and the less data we have, the less we know about it. So if the true distribution had funny little bumps in it, we would never notice on a small sample size. We would probably have trouble modelling that, anyway.
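As a concrete illustration of this kind of model, here is a minimal sketch, on made-up two-gene data, of fitting one Gaussian per class and classifying a new point by which side of the implied boundary it falls on. It uses scikit-learn's quadratic discriminant analysis; none of the numbers come from the talk.

```python
# Sketch: one bivariate Gaussian per class, then classify by posterior probability.
# The data are synthetic; in practice the two columns would be two selected genes.
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)
n_per_class = 100

# Two classes of "expression level" pairs, each drawn from its own Gaussian
class0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], n_per_class)
class1 = rng.multivariate_normal([2.0, 2.0], [[1.0, -0.2], [-0.2, 1.0]], n_per_class)
X = np.vstack([class0, class1])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Fit a Gaussian to each class; the decision boundary sits where the posterior is 0.5
model = QuadraticDiscriminantAnalysis()
model.fit(X, y)

new_sample = np.array([[1.0, 1.2]])      # a new pair of expression levels
print(model.predict(new_sample))         # which side of the boundary it falls on
print(model.predict_proba(new_sample))   # estimated posterior probability per class
```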
By the Bayes optimal rule, if we did know those true distributions there would still be only a limited accuracy we could get in the prediction. So that is just the best you could get. If you knew those distributions exactly, you would classify according to this rule: you would pick the class g for which the relevant term (its weighted density divided by the sum over all of them) is highest, and that would be Bayes optimal. So there is often a limit to how accurate you can become. We try to get as close as we can to it, and it is easier with more data.
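In the mixture notation above, the rule being referred to is the standard Bayes classifier,

$$
\hat{g}(x) \;=\; \arg\max_{g} \; \frac{\pi_g f_g(x)}{\sum_{h} \pi_h f_h(x)} \;=\; \arg\max_{g} \; \pi_g f_g(x),
$$

and its error rate, the Bayes error, is the lowest error any classifier can achieve for that problem.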
There are many different classifiers out there. Often the performances of each of them are not that different. Some of them are based on statistics, some of them are based on machine-learning things and others are nonlinear methods. I don't think the details are crucial to my talk, so I will skip over that. But many of those work very well. Support vector machines have been very good performers recently, but they are a bit of a black box and some of these others are more easily interpretable. In the end, it depends what you want out of the model.
As regards error rates, what we are trying to do with the classifier is to minimise the error rate for new observations to predict which class it goes into. We might have an overall error rate of, say, 0.2. Or we can look at the error rate specific to a class. If we know that it should have been in class 1 and it wasn't, well, the rate might be 0.1 for that class, and similarly for class 2 you might have a different error rate. It depends on how important the errors of that type are. They may be false negatives and false positives with regard to being told you have got AIDS or not, or cancer or not, and you have to work out how bad that is. You can put that in, and build your classifier including those sorts of costs.
The aim is then to minimise that weighted combination of error rates; its expected value is the risk.
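A standard way to write that down (this is a reconstruction, not the slide itself) is

$$
R(\hat{g}) \;=\; \sum_{g} \pi_g \sum_{h \neq g} C(h \mid g)\; P\big(\hat{g}(X) = h \,\big|\, X \text{ in class } g\big),
$$

where $C(h \mid g)$ is the cost of predicting class $h$ when the true class is $g$; with equal costs this reduces to the overall error rate.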
One way to estimate error rates is to look at the apparent error rate, which is the error rate on the data you have already used to train it on. That is no good, because you know too much. It is like looking at the answers and saying, 'This is a fair test of me.' It is not. Really, to test a classifier you need new data, ideally data that you have never used before, never seen before. You just try it on that and see how you go.
However, when you have got a shortage of data that is not very easy. So there are methods such as cross-validation. This is a well-known one which makes very efficient use of the data.
I will show you now a rough scheme of that. It is where you take the data and split it into segments like these. You use a certain amount of the data, most of it, for training, and you test on another bit. Rather than doing that once (because if you only had about 20 points, you might have only about two for the test, and that wouldn't be very good), you would do this a range of times. You use this N/K as a test set first, and train on the rest; then you take another bit out and use the rest for training; and you keep rotating through. Then you get an estimate of the accuracy, eventually, by combining all these tests.
For the final classifier you are actually free to use all the original data, and use that for prediction. So it is not quite the same as any of the other classifiers, and the interesting thing with microarray analysis is that the genes that end up being used in a variable selection setup are typically very different for each of these multiple classifiers. So if you have got 10 classifiers here, each trained on N(K-1)/K observations in this cross-validation, there may be a totally different list of genes being used for predictive purposes in each of them versus the final one. People can get upset about that or they can say, 'Well, they're probably all on the same pathway anyway and very much proxies for each other, and it doesn't matter.' But it is something to be aware of, that there is not that stability unless you have got a very large sample size.
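For concreteness, here is a minimal sketch of a K-fold cross-validation error estimate with scikit-learn; the data are random stand-ins with toy sizes, and the nearest-centroid classifier is just one illustrative choice.

```python
# Sketch: K-fold cross-validation estimate of the prediction error rate.
# X is an (n_tissues, n_genes) expression matrix, y the known class labels.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2000))     # 100 tissues, 2000 genes (toy sizes)
y = rng.integers(0, 2, size=100)     # two classes, e.g. cancer or not

# Rotate through K = 10 folds: train on (K-1)/K of the data, test on the rest
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
fold_accuracy = cross_val_score(NearestCentroid(), X, y, cv=cv)

print("estimated error rate: %.2f (+/- %.2f across folds)"
      % (1 - fold_accuracy.mean(), fold_accuracy.std()))
```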
Cross-validation makes efficient use of small datasets, and that is one of the reasons people like to use it. But they tend to go a bit wrong with it.
We don't have much time for this slide, so I will move on.
You have to select a classifier type, and that is often based on opinion (you haven't necessarily got time to consider all the classifiers), but the variable selection is a bit of an issue. You can't really pick the 'best' genes; you can use some sub-optimal way of getting there and pick a reasonable set of genes of any number you want. But it has to be done; most of the time there are just too many parameters in a classifier that uses all the genes, for the small datasets that you have got, so you have to cut down the dimensionality you are considering, one way or another.
Once you have done that, fitting the classifier parameters is usually straightforward. Then you want to estimate error rates, and that is where you have to be careful. Again I would advise the use of a pristine test set, or at least in cross-validation, the maintenance of that sort of principle.
The classic selection bias mistake that people have made is to select the variables they want to use based on the entire dataset, maybe by looking at the correlation with the response of interest, or the known class. There is a famous example by van't Veer, who has commercialised a classifier for predicting whether breast cancers will metastasise or not. They started with about 24,000 so-called genes and they cut that back to about 70. If you were to work with just the 70 genes that they found to be highly correlated with the response, you could convince yourself that you were getting an error rate of around, say, 10 or 12 per cent. But that, unfortunately, is based on a selection bias.
To do that selection they did look at all of the answers, basically; they looked at all of the responses. And so, having done that, they no longer had any pristine data for a test. If you use all that data, you can never really believe your results. So that is not really on, and cross-validation needs to be done, as I was showing before, with a re-selection of the genes of relevance.
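A minimal sketch of the point, using scikit-learn and random data with no real class signal (the gene counts and classifier choice are mine, purely for illustration): if the gene selection is redone inside every training fold, the estimated error is honest; if the genes are selected once on the full dataset and only the classifier is cross-validated, the estimate comes out optimistically low.

```python
# Sketch: gene selection inside vs. outside cross-validation.
# Random data with no class signal, so an honest error estimate should be near 0.5.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 5000))      # 80 tissues, 5000 genes, pure noise
y = rng.integers(0, 2, size=80)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)

# Unbiased: the 70 "best" genes are re-selected within each training fold
pipeline = make_pipeline(SelectKBest(f_classif, k=70), NearestCentroid())
honest_error = 1 - cross_val_score(pipeline, X, y, cv=cv).mean()

# Biased: genes chosen once using all the data, including the future test folds
X_reduced = SelectKBest(f_classif, k=70).fit_transform(X, y)
biased_error = 1 - cross_val_score(NearestCentroid(), X_reduced, y, cv=cv).mean()

print("selection inside CV :", round(honest_error, 2))   # roughly 0.5
print("selection before CV :", round(biased_error, 2))   # misleadingly low
```

The second number is the same kind of over-optimistic figure being warned about here.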
I have run out of time, but just as an example of the error rates involved in that case, people would be able to report zero error rates when they hadn't taken account of this, and error rates of, say, 20 per cent with some error bars when they did.



